CN112989844A - Model training and text recognition method, device, equipment and storage medium

Info

Publication number
CN112989844A
CN112989844A (application CN202110262745.5A)
Authority
CN
China
Prior art keywords
text recognition
target
network layer
result
recognition result
Prior art date
Legal status
Pending
Application number
CN202110262745.5A
Other languages
Chinese (zh)
Inventor
丁建平
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202110262745.5A
Publication of CN112989844A
Legal status: Pending

Classifications

    • G06F40/35: Handling natural language data; Semantic analysis; Discourse or dialogue representation
    • G06F18/2431: Pattern recognition; Classification techniques; Multiple classes
    • G06F40/211: Natural language analysis; Parsing; Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/295: Natural language analysis; Recognition of textual entities; Named entity recognition
    • G06N3/045: Neural networks; Architecture; Combinations of networks
    • G06N3/08: Neural networks; Learning methods

Abstract

The embodiment of the invention provides a model training and text recognition method, device, equipment and storage medium, wherein the method comprises the following steps: acquiring text data to be recognized; taking the first conversion network layer of a text recognition model obtained by pre-training as a target network, taking the text data to be recognized as target data, and inputting the target data into the target network for feature extraction to obtain a feature extraction result; inputting the feature extraction result into an output classifier connected with the target network, and judging whether the obtained target recognition result meets a preset condition; if so, taking the target recognition result as the text recognition result of the text data to be recognized; and if not, taking the next conversion network layer of the text recognition model as the target network, taking the feature extraction result as the target data, and returning to the step of inputting the target data into the target network for feature extraction. In this way, when the output result of any layer in the text recognition model meets the condition, the text recognition result is output in advance, which reduces the computation time.

Description

Model training and text recognition method, device, equipment and storage medium
Technical Field
The invention relates to the field of intelligent recognition, and in particular to a model training and text recognition method, device, equipment and storage medium.
Background
In the field of intelligent recognition, text recognition is a research direction with great application value, and many real-life applications are closely related to text. For example, when OTT (Over The Top, internet television) and voice assistant services interact with user speech, text information is extracted from the user's speech, and the text then needs to be recognized so that the user's intention can be understood and a corresponding result returned.
Generally, text recognition includes intent recognition and slot filling: intent recognition classifies the sentences in a text into corresponding intent categories, and slot filling labels each word in a given sentence with a corresponding tag.
In the prior art, a BERT (Bidirectional Encoder Representations from Transformers) model can be used for recognizing text, but the BERT model contains at least 12 transformer network layers, so inference with such a text recognition model takes a long time and cannot meet the QPS (Queries Per Second) requirement of a service.
Disclosure of Invention
Embodiments of the present invention provide a model training and text recognition method, device, equipment and storage medium, so as to reduce the computation time of a text recognition model and meet the QPS requirements of a service. The specific technical solutions are as follows:
in a first aspect of the present invention, there is provided a model training method, including:
acquiring sample text data;
inputting the sample text data into a pre-training model for text recognition processing to obtain a first predicted text recognition result, wherein the pre-training model comprises a plurality of layers of intermediate transformation network layers, a last layer transformation network layer and an output classifier connected with the last layer transformation network layer;
aiming at each intermediate conversion network layer, inputting the output result of the sample text data in the intermediate conversion network layer to a preset classifier connected with the intermediate conversion network layer to obtain a second predicted text recognition result;
judging whether a loss value between the second predicted text recognition result and the first predicted text recognition result is smaller than a preset loss value threshold value or not;
and if the loss value is not smaller than the preset loss value threshold, performing iterative adjustment on the preset classifier connected with the intermediate conversion network layer; if the loss value is smaller than the preset loss value threshold, taking the preset classifier after the iterative adjustment as an output classifier corresponding to the intermediate conversion network layer to obtain a text recognition model, wherein the text recognition model comprises the pre-training model and the output classifier connected with each intermediate conversion network layer.
Optionally, before the sample text data is input to a pre-training model for text recognition processing to obtain a first predicted text recognition result, the method further includes:
inputting the sample text data into a preset model for text recognition processing to obtain an initial predicted text recognition result, wherein the preset model comprises a plurality of layers of intermediate transformation network layers, a last layer transformation network layer and an output classifier connected with the last layer transformation network layer;
and judging whether the initial predicted text recognition result is converged, if not, performing iterative adjustment on the preset model, and if so, taking the preset model after iterative adjustment as the pre-training model.
Optionally, the determining whether a loss value between the second predicted text recognition result and the first predicted text recognition result is smaller than a preset loss value threshold includes:
calculating K-L divergence between a second predicted text recognition result output by a preset classifier connected with the intermediate conversion network layer and the first predicted text recognition result;
and calculating the K-L divergence corresponding to the intermediate conversion network layer and the sum of the K-L divergences corresponding to each intermediate conversion network layer before the intermediate conversion network layer to obtain a loss value between the second predicted text recognition result corresponding to the intermediate conversion network layer and the first predicted text recognition result.
Optionally, the output classifier includes an intention classification classifier and/or a slot identification classifier.
In a second aspect of the present invention, there is also provided a text recognition method, including:
acquiring text data to be identified;
taking a first layer of conversion network layer of a text recognition model obtained by pre-training as a target network, taking the text data to be recognized as target data, and inputting the target data into the target network for feature extraction to obtain a feature extraction result, wherein the text recognition model comprises a plurality of layers of conversion network layers, and each layer of conversion network layer is respectively connected with a corresponding output classifier obtained by pre-training;
inputting the feature extraction result to an output classifier connected with the target network, classifying and identifying the feature extraction result, and judging whether the obtained target identification result meets a preset condition;
if so, taking the target recognition result as a text recognition result of the text data to be recognized;
and if not, taking the next layer of conversion network layer of the text recognition model as a target network, taking the feature extraction result as target data, and returning to the step of inputting the target data into the target network for feature extraction.
Optionally, the determining whether the obtained target recognition result meets a preset condition includes:
calculating the instability of the obtained target recognition result, wherein the instability characterizes the degree to which the obtained target recognition result may deviate beyond a preset error range;
judging whether the instability is smaller than a preset instability threshold value or not;
if the instability is smaller than the preset instability threshold, judging that the target recognition result meets the preset condition; and if the instability is not smaller than the preset instability threshold, judging that the target recognition result does not meet the preset condition.
Optionally, the following formula is adopted to calculate the instability of the obtained target recognition result:
$$\text{Instability} = \frac{\sum_{i=1}^{N} P_s(i)\,\log P_s(i)}{\log\frac{1}{N}}$$
wherein $P_s$ is the output distribution of the target recognition result, and $N$ is the number of intent classes of the target recognition result.
In a third aspect of the present invention, there is also provided a model training apparatus, including:
the sample acquisition module is used for acquiring sample text data;
the first prediction module is used for inputting the sample text data into a pre-training model for text recognition processing to obtain a first prediction text recognition result, wherein the pre-training model comprises a plurality of layers of intermediate transformation network layers, a last layer of transformation network layer and an output classifier connected with the last layer of transformation network layer;
the second prediction module is used for inputting the output result of the sample text data in the middle conversion network layer to a preset classifier connected with the middle conversion network layer aiming at each middle conversion network layer to obtain a second prediction text recognition result;
the judging module is used for judging whether the loss value between the second predicted text recognition result and the first predicted text recognition result is smaller than a preset loss value threshold; if the loss value is not smaller than the preset loss value threshold, performing iterative adjustment on the preset classifier connected with the intermediate conversion network layer; and if the loss value is smaller than the preset loss value threshold, taking the preset classifier after the iterative adjustment as the output classifier of the intermediate conversion network layer to obtain a text recognition model, wherein the text recognition model comprises the pre-training model and the output classifier connected with each intermediate conversion network layer.
In a fourth aspect of the present invention, there is also provided a text recognition apparatus, including:
the data acquisition module is used for acquiring text data to be identified;
the feature extraction module is used for taking a first layer of conversion network layer of a text recognition model obtained through pre-training as a target network, taking the text data to be recognized as target data, inputting the target data into the target network for feature extraction, and obtaining a feature extraction result, wherein the text recognition model comprises a plurality of layers of conversion network layers, and each layer of conversion network layer is respectively connected with a corresponding output classifier obtained through pre-training;
the output module is used for inputting the feature extraction result to an output classifier connected with the target network, classifying and identifying the feature extraction result, and judging whether the obtained target identification result meets a preset condition; if so, taking the target recognition result as a text recognition result of the data to be recognized;
and the feature extraction module is further configured to, when the target recognition result does not satisfy a preset condition, use a next-layer transformation network layer of the text recognition model as a target network, use the feature extraction result as target data, and return to the step of inputting the target data to the target network for feature extraction.
In another aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and a processor for implementing any of the above text recognition methods when executing the program stored in the memory.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform any of the above-described text recognition methods.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the text recognition methods described above.
According to the model training and text recognition method, device, equipment and storage medium provided by the embodiment of the invention, the text data to be recognized is obtained; taking a first layer of conversion network layer of a text recognition model obtained by pre-training as a target network, taking the text data to be recognized as target data, and inputting the target data into the target network for feature extraction to obtain a feature extraction result, wherein the text recognition model comprises a plurality of layers of conversion network layers, and each layer of conversion network layer is respectively connected with a corresponding output classifier obtained by pre-training; inputting the feature extraction result to an output classifier connected with the target network, and judging whether the obtained target identification result meets a preset condition; if so, taking the target recognition result as a text recognition result of the data to be recognized; and if not, taking the next layer of conversion network layer of the text recognition model as a target network, taking the feature extraction result as target data, and returning to the step of inputting the target data into the target network for feature extraction. In the text recognition process, the output result of each layer of the conversion network layer in the text recognition model is input to the corresponding output classifier, and under the condition that the target recognition result output by the output classifier meets the condition, the target recognition result is output in advance as the text recognition result and does not enter the next conversion network layer for calculation any more, so that the calculation time consumption of the text recognition model is reduced, and the QPS requirement of the service is met.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart illustrating steps of a model training method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a model training scheme according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating steps of a text recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a model training apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present invention;
fig. 6 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
In the related art, text is generally recognized using a BERT model, but the BERT model contains at least 12 transformer network layers, so inference with such a text recognition model takes a long time and cannot meet the QPS requirement of a service.
In order to solve the above problems, embodiments of the present invention provide a model training method and a text recognition method, and the following generally describes the model training method and the text recognition method provided in embodiments of the present invention.
The model training method comprises the following steps:
acquiring sample text data;
inputting sample text data into a pre-training model for text recognition processing to obtain a first predicted text recognition result, wherein the pre-training model comprises a plurality of layers of intermediate transformation network layers, a last layer transformation network layer and an output classifier connected with the last layer transformation network layer;
aiming at each intermediate conversion network layer, inputting the output result of the sample text data in the intermediate conversion network layer to a preset classifier connected with the intermediate conversion network layer to obtain a second predicted text recognition result;
judging whether the loss value between the second predicted text recognition result and the first predicted text recognition result is smaller than a preset loss value threshold value or not;
and if the loss value is not smaller than the preset loss value threshold, performing iterative adjustment on the preset classifier connected with the intermediate conversion network layer; if the loss value is smaller than the preset loss value threshold, taking the preset classifier after the iterative adjustment as an output classifier corresponding to the intermediate conversion network layer to obtain a text recognition model, wherein the text recognition model comprises the pre-training model and the output classifier connected with each intermediate conversion network layer.
The text recognition method comprises the following steps:
acquiring text data to be identified;
taking a first layer of conversion network layer of a text recognition model obtained by pre-training as a target network, taking text data to be recognized as target data, and inputting the target data into the target network for feature extraction to obtain a feature extraction result, wherein the text recognition model comprises a plurality of conversion network layers, and each conversion network layer is respectively connected with a corresponding output classifier obtained by pre-training;
inputting the feature extraction result into an output classifier connected with a target network, classifying and identifying the feature extraction result, and judging whether the obtained target identification result meets a preset condition;
if so, taking the target recognition result as a text recognition result of the text data to be recognized;
and if not, taking the next layer of conversion network layer of the text recognition model as a target network, taking the feature extraction result as target data, and returning to the step of inputting the target data into the target network for feature extraction.
As can be seen from the above, in the model training method and the text recognition method provided in the embodiments of the present invention, in the text recognition process, the output result of each layer of the transform network layer in the text recognition model is input to the corresponding output classifier, and when the target recognition result output by the output classifier satisfies the condition, the target recognition result is output in advance as the text recognition result, and does not enter the next layer of the transform network layer for calculation, so that the time consumed by the calculation of the text recognition model is reduced, and the QPS requirement of the service is satisfied.
Referring to fig. 1, a flowchart illustrating steps of a model training method according to the present application is shown, which may specifically include the following steps:
S101: Sample text data is obtained.
The sample text data can be text data crawled from the internet. After the sample text data is obtained, it can be preprocessed to check for and remove erroneous manual labels; in addition, it can be normalized so that the format of the sample text data is more uniform and the trained text recognition model is more accurate.
S102: and inputting the sample text data into a pre-training model for text recognition processing to obtain a first predicted text recognition result.
The pre-training model comprises a plurality of intermediate transformation network layers, a last transformation network layer, and an output classifier connected with the last transformation network layer. The intermediate transformation network layers comprise all transformation network layers other than the last one; that is, the transformation network layers from the first layer to the second-to-last layer of the pre-training model are all intermediate transformation network layers. The preset classifier comprises an intent classification classifier and/or a slot identification classifier: the intent classification classifier classifies the text into different intent types according to the intent of the sentences in the text, and the slot identification classifier labels each word of each sentence in the text with a corresponding tag.
For example, the slot identification classifier can assign a slot tag to each character of the sample text data. Slot tags fall into three types, B, I and O: B indicates that the current character begins a slot, I indicates that the current character continues the slot of the previous character, and O indicates that the current character does not belong to any slot; data in the BIO format is thus obtained. The slot identification classifier learns these tags through data labeling and model training: first, an ontology library is defined, i.e., the entity types to be recognized, such as television station and video title; then text data containing entity words is crawled and labeled; then features are extracted from the input text data to obtain feature vectors of tag-count dimension and the dependencies between slot tags; and finally the slot tag corresponding to each character is obtained. For example:
我想看芒果台的快乐大本营 ("I want to watch Happy Camp on Mango TV")
O O O B-TV I-TV I-TV O B-TITLE I-TITLE I-TITLE I-TITLE I-TITLE
Here, the three characters 我想看 ("I want to watch") and the character 的 do not belong to any slot; the slot of 芒果台 ("Mango TV") is a television station; the slot of 快乐大本营 ("Happy Camp") is a video title; and the intent of the sentence is to search for a video from that television station.
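To make the BIO format concrete, the following Python sketch decodes character-level BIO tags back into slots (the function and variable names are ours, not the patent's):

```python
def decode_bio(chars, tags):
    """Collect (slot_type, text) spans from character-level BIO tags."""
    slots, current_type, current_chars = [], None, []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):           # a new slot begins
            if current_type:
                slots.append((current_type, "".join(current_chars)))
            current_type, current_chars = tag[2:], [ch]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_chars.append(ch)       # the slot continues
        else:                              # "O" or an inconsistent tag ends the slot
            if current_type:
                slots.append((current_type, "".join(current_chars)))
            current_type, current_chars = None, []
    if current_type:
        slots.append((current_type, "".join(current_chars)))
    return slots

chars = list("我想看芒果台的快乐大本营")
tags = ["O", "O", "O", "B-TV", "I-TV", "I-TV", "O",
        "B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE", "I-TITLE"]
print(decode_bio(chars, tags))  # [('TV', '芒果台'), ('TITLE', '快乐大本营')]
```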
In the present application, the pre-training model is a previously acquired model that can be used for text recognition. The sample text data is processed in turn by the multiple intermediate transformation network layers, the last transformation network layer, and the output classifier connected with the last transformation network layer of the pre-training model, yielding the first predicted text recognition result of the sample text data. The first predicted text recognition result has a certain credibility and can be used in the subsequent loss calculation to verify the accuracy of text recognition.
The pre-training model may be a previously acquired trained model. In another implementation, a preset model may first be trained to obtain the pre-training model, and the sample text data is then input into the pre-training model for text recognition processing to obtain the first predicted text recognition result. Specifically, the sample text data is first input into a preset model for text recognition processing to obtain an initial predicted text recognition result, wherein the preset model comprises a plurality of intermediate transformation network layers, a last transformation network layer and an output classifier connected with the last transformation network layer; it is then judged whether the initial predicted text recognition result has converged; if not, the preset model is iteratively adjusted; if so, the iteratively adjusted preset model is taken as the pre-training model.
The pre-training model may be a BERT model, which contains at least 12 transformer network layers.
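For illustration, the per-layer outputs of such a model can be obtained as follows; this sketch uses the Hugging Face transformers library and the bert-base-chinese checkpoint, which are our choices of tooling and are not specified by the patent:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)
model.eval()

inputs = tokenizer("我想看芒果台的快乐大本营", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: the embedding output plus one tensor per layer
print(len(outputs.hidden_states))      # 13 for a 12-layer BERT
print(outputs.hidden_states[1].shape)  # output of the first transformer layer
```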
S103: and aiming at each intermediate conversion network layer, inputting the output result of the sample text data in the intermediate conversion network layer to a preset classifier connected with the intermediate conversion network layer to obtain a second predicted text recognition result.
Each intermediate transformation network layer is connected with its own preset classifier (there may be one or more preset classifiers). The preset classifier connected with each intermediate transformation network layer has the same function as the output classifier connected with the last transformation network layer: it is trained to process the output result of the intermediate transformation network layer, so as to perform intent classification and/or slot identification on the sample text data and obtain the second predicted text recognition result.
S104: judging whether the loss value between the second predicted text recognition result and the first predicted text recognition result is smaller than a preset loss value threshold; if not, execute S105; if so, execute S106.
In this step, determining whether a loss value between the second predicted text recognition result and the first predicted text recognition result is smaller than a preset loss value threshold may include the following steps:
firstly, calculating K-L divergence between a second predicted text recognition result and a first predicted text recognition result output by a preset classifier connected with the intermediate conversion network layer.
By way of example, the following formula may be employed:
$$D_{KL}(P_s \parallel P_t) = \sum_{i=1}^{N} P_s(i)\,\log\frac{P_s(i)}{P_t(i)}$$
wherein $P_s$ is the output distribution of the preset classifier connected with the intermediate conversion network layer, i.e., the second predicted text recognition result of that layer; $P_t$ is the output distribution of the first predicted text recognition result; $N$ is the number of classes determined according to the output distribution of the preset classifier connected with the intermediate conversion network layer; and $D_{KL}$ is the K-L divergence between the second predicted text recognition result and the first predicted text recognition result.
And then, calculating the K-L divergence corresponding to the intermediate conversion network layer and the sum of the K-L divergences corresponding to each intermediate conversion network layer before the intermediate conversion network layer to obtain a loss value between the second predicted text recognition result and the first predicted text recognition result corresponding to the intermediate conversion network layer.
By way of example, the following formula may be employed:
$$\mathrm{Loss} = \sum_{s=1}^{L-1} D_{KL}\left(P_s \parallel P_t\right)$$
wherein $L$ is the total number of conversion network layers, including the intermediate conversion network layers and the last conversion network layer, in the text recognition model.
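A minimal PyTorch sketch of this distillation loss, written directly from the two formulas above (the function and argument names are ours):

```python
import torch.nn.functional as F

def distillation_loss(student_logits_per_layer, teacher_logits):
    """Sum over intermediate layers of D_KL(P_s || P_t), where P_s is an
    intermediate classifier's distribution and P_t the last layer's."""
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    loss = 0.0
    for logits in student_logits_per_layer:
        p_s = F.softmax(logits, dim=-1)
        # F.kl_div(input, target) returns sum(target * (log target - input)),
        # i.e. D_KL(target || exp(input)); here that is D_KL(P_s || P_t).
        loss = loss + F.kl_div(log_p_t, p_s, reduction="batchmean")
    return loss
```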
S105: and iteratively adjusting the preset classifier connected with the intermediate transformation network layer.
That is, the model parameters of the connected preset classifier are adjusted according to the currently calculated loss value to obtain an adjusted preset classifier; the adjusted preset classifier then processes the newly input output result of the intermediate conversion network layer to obtain a corresponding second predicted text recognition result, and the process returns to the step of judging whether the loss value between the second predicted text recognition result and the first predicted text recognition result is smaller than the preset loss value threshold.
S106: and taking the preset classifier after the iterative adjustment as an output classifier corresponding to the intermediate conversion network layer to obtain a text recognition model, wherein the text recognition model comprises a pre-training model and an output classifier connected with each intermediate conversion network layer.
Fig. 2 is a schematic diagram of the model training scheme in the embodiment of the present invention. In this example, the text recognition model includes a pre-training model and an output classifier connected to each intermediate transformation network layer in the pre-training model; the pre-training model includes 11 intermediate transformation network layers, 1 last transformation network layer, and an output classifier connected to the last transformation network layer. As shown in Fig. 2, transformation network layers 1 to 11 are intermediate transformation network layers, and transformation network layer 12 is the last transformation network layer. In the text recognition model, the transformation network layers are connected in sequence: the output of each layer serves as the input of the next layer and is also fed to the output classifier connected to that layer to obtain the corresponding predicted text recognition result. For example, as shown in Fig. 2, transformation network layer 1 is connected both to output classifier 1 and to transformation network layer 2; transformation network layer 2 is connected to output classifier 2 and to transformation network layer 3; and so on.
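The structure just described can be sketched as a minimal PyTorch module (our own illustration: the hidden size, head count, class count, and the use of nn.TransformerEncoderLayer as a stand-in for a real BERT layer are all assumptions):

```python
import torch.nn as nn

class EarlyExitTextRecognizer(nn.Module):
    """12 stacked transformation network layers, each with its own classifier."""
    def __init__(self, hidden=768, heads=12, num_layers=12, num_intents=10):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                       batch_first=True)
            for _ in range(num_layers))
        # one output classifier per transformation network layer
        self.classifiers = nn.ModuleList(
            nn.Linear(hidden, num_intents) for _ in range(num_layers))

    def forward(self, x):
        # x: (batch, seq_len, hidden) embedded input; embedding layer omitted
        logits_per_layer = []
        for layer, clf in zip(self.layers, self.classifiers):
            x = layer(x)                           # feature extraction result
            logits_per_layer.append(clf(x[:, 0]))  # classify the first token
        return logits_per_layer
```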
In the training process, fine-tuning training is performed first: the sample text data is input into the pre-training model as an input sequence, and the model parameters of transformation network layer 1, transformation network layer 2, …, transformation network layer 12 and output classifier 12 shown in Fig. 2 are updated.
For example, the sample text data is first input into transformation network layer 1 for feature extraction; the output of transformation network layer 1 is input into transformation network layer 2 for further feature extraction; the output of transformation network layer 2 is input into transformation network layer 3; and so on, until the output of transformation network layer 12 is obtained, which can be input into output classifier 12 connected to it to obtain an initial predicted text recognition result. It is then judged whether the initial predicted text recognition result has converged; if not, the preset model is iteratively adjusted, that is, the model parameters of transformation network layer 1, transformation network layer 2, …, transformation network layer 12 and output classifier 12 are adjusted by a back-propagation algorithm according to the error between the initial predicted text recognition result and the expected text recognition result of the sample text data, until the predicted text recognition result converges.
Then, distillation training is performed: the obtained model parameters of transformation network layer 1, transformation network layer 2, …, transformation network layer 12 and output classifier 12 are kept fixed, and the model parameters of each layer's output classifier are adjusted according to the difference between that classifier's output and the output of the last layer's classifier, so that each layer's output fits the last layer's output well. The model parameters of output classifier 1, output classifier 2, …, output classifier 11 shown in Fig. 2 are thereby updated; that is, the output classifiers connected with each intermediate transformation network layer are trained. After training, every layer's output classifier has the capability of intent classification and slot identification.
For example, after the model parameters of transformation network layer 1, transformation network layer 2, …, transformation network layer 12 and output classifier 12 have been determined, the output of transformation network layer 1 can be input into the connected output classifier 1 to obtain the second predicted text recognition result of transformation network layer 1. It is then judged whether the loss value between this second predicted text recognition result and the first predicted text recognition result output by the trained output classifier 12 is smaller than the preset loss value threshold; if not, output classifier 1 is iteratively adjusted, that is, its model parameters are adjusted by a back-propagation algorithm according to the loss value between the first and second predicted text recognition results, until the loss value is smaller than the preset loss value threshold. In the same way, the model parameters of output classifier 1, output classifier 2, …, output classifier 11 can be determined, thereby obtaining a text recognition model that includes transformation network layer 1, transformation network layer 2, …, transformation network layer 12 and output classifier 1, output classifier 2, …, output classifier 12.
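The two training stages can be outlined as follows, reusing the EarlyExitTextRecognizer and distillation_loss sketches above (train_loader, the cross-entropy task loss, and the learning rates are our assumptions, not values from the patent):

```python
import torch
import torch.nn.functional as F

model = EarlyExitTextRecognizer()

# Stage 1 (fine-tuning): update the backbone and the last-layer classifier.
opt = torch.optim.Adam(model.parameters(), lr=1e-5)
for x, intent in train_loader:            # train_loader: assumed to yield
    logits = model(x)[-1]                 # embedded batches and intent labels
    loss = F.cross_entropy(logits, intent)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2 (distillation): freeze the backbone and output classifier 12,
# then fit output classifiers 1..11 to the last layer's distribution.
for p in model.layers.parameters():
    p.requires_grad_(False)
for p in model.classifiers[-1].parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad),
                       lr=1e-4)
for x, _ in train_loader:
    logits_per_layer = model(x)
    loss = distillation_loss(logits_per_layer[:-1],
                             logits_per_layer[-1].detach())
    opt.zero_grad(); loss.backward(); opt.step()
```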
As can be seen from the above, in the model training method provided in the embodiment of the present invention, after the pre-training model is obtained, the preset classifiers connected to each conversion network layer are respectively trained to obtain the text recognition model, so that in the text recognition process, the output result of each conversion network layer in the text recognition model is input to the corresponding output classifier, and in the case that the target recognition result output by the output classifier satisfies the condition, the target recognition result is output in advance as the text recognition result, and does not enter the next conversion network layer for calculation, thereby reducing the time consumption of the text recognition model and satisfying the QPS requirement of the service.
Referring to fig. 3, a flowchart illustrating steps of a text recognition method according to the present application is shown, which may specifically include the following steps:
S301: Acquiring text data to be recognized.
The text data to be recognized may be text data extracted from recognized user speech, text data input by a user, or text data crawled from the internet; it is not specifically limited here.
In this step, after the text data to be recognized is obtained, it may be preprocessed to check for and remove erroneous manual labels; in addition, it may be normalized so that its format is more uniform and the text recognition result is more accurate.
S302: and taking a first layer of conversion network layer of the text recognition model obtained by pre-training as a target network, taking the text data to be recognized as target data, and inputting the target data into the target network for feature extraction to obtain a feature extraction result.
The text recognition model comprises a plurality of transformation network layers, and each transformation network layer is connected with a corresponding output classifier obtained by pre-training. The output classifier comprises an intent classification classifier and/or a slot identification classifier: the intent classification classifier classifies the text into different intent types according to the intent of the sentences in the text, and the slot identification classifier labels each word of each sentence in the text with a corresponding slot tag.
S303: and inputting the feature extraction result into an output classifier connected with a target network, classifying and identifying the feature extraction result, and judging whether the obtained target identification result meets a preset condition. If yes, S304 is executed, and if no, S305 is executed.
In one implementation, the step of judging whether the obtained target recognition result meets a preset condition includes: calculating the instability of the obtained target recognition result; and judging whether the instability is smaller than a preset instability threshold. If the instability is smaller than the preset instability threshold, it is judged that the target recognition result meets the preset condition; otherwise, it is judged that the target recognition result does not meet the preset condition. The instability characterizes the degree to which the obtained target recognition result may deviate beyond a preset error range.
The instability of the obtained target recognition result can be calculated by adopting the following formula:
$$\text{Instability} = \frac{\sum_{i=1}^{N} P_s(i)\,\log P_s(i)}{\log\frac{1}{N}}$$
wherein $P_s$ is the output distribution of the target recognition result, and $N$ is the number of categories of the target recognition result.
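A short PyTorch version of this measure (our sketch; it returns 0 for a one-hot prediction and 1 for a uniform one):

```python
import math
import torch

def instability(probs: torch.Tensor) -> torch.Tensor:
    """Normalized entropy: 0 for a one-hot prediction, 1 for a uniform one."""
    n = probs.shape[-1]
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return entropy / math.log(n)

p = torch.tensor([0.90, 0.05, 0.03, 0.02])
print(instability(p))  # tensor(0.3088): confident enough for a 0.5 threshold
```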
S304: and taking the target recognition result as a text recognition result of the text data to be recognized.
In this way, when the target recognition result output by the output classifier meets the condition, the target recognition result is output in advance as the text recognition result, and no further conversion network layer calculation is performed, so that the computation time of the text recognition model is reduced.
S305: Taking the next conversion network layer of the text recognition model as the target network, taking the feature extraction result as the target data, and returning to S302 to input the target data into the target network for feature extraction.
That is, when the target recognition result output by the output classifier does not satisfy the condition, the next conversion network layer of the text recognition model is used as the target network, the feature extraction result is used as the target data, and the new target data is input to the new target network for feature extraction, so that a new feature extraction result is obtained. In this way, the accuracy of the text recognition result of the text data to be recognized can be maintained at a high level, without causing a significant reduction in the recognition accuracy of the text recognition model in order to reduce the calculation time.
In the embodiment of the present invention, the deeper a conversion network layer is in the text recognition model, the more stable its output result and the lower its instability value. The higher the instability threshold is set, the easier it is to output a result in advance, the fewer layers are used for inference, the faster the speed, and the lower the accuracy; conversely, the lower the instability threshold, the more layers are used for inference, the slower the speed, and the higher the accuracy. For example, with an instability threshold of 0.1, the speed increases by a factor of 1.5 while the accuracy hardly decreases. In this way, the recognition accuracy of the text recognition model can be maintained while the computation time is reduced.
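Putting S302 to S305 together, early-exit inference can be sketched as follows, reusing the EarlyExitTextRecognizer and instability sketches above (single-sample input and the 0.1 threshold are assumptions):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recognize(model, x, threshold=0.1):
    """Run the layers one by one; exit as soon as a layer's result is stable."""
    for layer, clf in zip(model.layers, model.classifiers):
        x = layer(x)                               # S302/S305: feature extraction
        probs = F.softmax(clf(x[:, 0]), dim=-1)    # S303: classify layer output
        if instability(probs).item() < threshold:  # S303: check preset condition
            return probs.argmax(dim=-1)            # S304: output the result early
    return probs.argmax(dim=-1)                    # all layers used, incl. the last
```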
As can be seen from the above, in the text recognition process, the output result of each layer of the transform network layer in the text recognition model is input to the corresponding output classifier, and when the target recognition result output by the output classifier meets the condition, the target recognition result is output in advance as the text recognition result and does not enter the next layer of the transform network layer for calculation, so that the calculation time of the text recognition model is reduced, and the QPS requirement of the service is met.
Referring to fig. 4, a block diagram of a model training apparatus according to the present application is shown, and the apparatus may specifically include the following modules:
a sample obtaining module 401, configured to obtain sample text data;
a first prediction module 402, configured to input the sample text data into a pre-training model for text recognition processing to obtain a first predicted text recognition result, where the pre-training model includes multiple layers of intermediate transformation network layers, a last layer of transformation network layer, and an output classifier connected to the last layer of transformation network layer;
a second prediction module 403, configured to, for each intermediate conversion network layer, input an output result of the sample text data in the intermediate conversion network layer to a preset classifier connected to the intermediate conversion network layer, so as to obtain a second prediction text recognition result;
a determining module 404, configured to determine whether a loss value between the second predicted text recognition result and the first predicted text recognition result is smaller than a preset loss value threshold; if the loss value is not smaller than the preset loss value threshold, iteratively adjust the preset classifier connected with the intermediate conversion network layer; and if the loss value is smaller than the preset loss value threshold, take the preset classifier after the iterative adjustment as the output classifier corresponding to the intermediate conversion network layer to obtain a text recognition model, wherein the text recognition model comprises the pre-training model and the output classifier connected with each intermediate conversion network layer.
In one implementation, before the inputting the sample text data into a pre-training model for text recognition processing to obtain a first predicted text recognition result, the method further includes:
inputting the sample text data into a preset model for text recognition processing to obtain an initial predicted text recognition result, wherein the preset model comprises a plurality of layers of intermediate transformation network layers, a last layer transformation network layer and an output classifier connected with the last layer transformation network layer;
and judging whether the initial predicted text recognition result is converged, if not, performing iterative adjustment on the preset model, and if so, taking the preset model after iterative adjustment as the pre-training model.
In one implementation, the determining whether a loss value between the second predicted-text recognition result and the first predicted-text recognition result is smaller than a preset loss value threshold includes:
calculating K-L divergence between a second predicted text recognition result output by a preset classifier connected with the intermediate conversion network layer and the first predicted text recognition result;
and calculating the K-L divergence corresponding to the intermediate conversion network layer and the sum of the K-L divergences corresponding to each intermediate conversion network layer before the intermediate conversion network layer to obtain a loss value between the second predicted text recognition result corresponding to the intermediate conversion network layer and the first predicted text recognition result.
In one implementation, the output classifier includes an intent classification classifier and/or a slot identification classifier.
As can be seen from the above, in the model training device provided in the embodiment of the present invention, after the pre-training model is obtained, the preset classifiers connected to each conversion network layer are respectively trained to obtain the text recognition model, so that in the text recognition process, the output result of each conversion network layer in the text recognition model is input to the corresponding output classifier, and in the case that the target recognition result output by the output classifier satisfies the condition, the target recognition result is output in advance as the text recognition result, and does not enter the next conversion network layer for calculation, thereby reducing the time consumption of the text recognition model and satisfying the QPS requirement of the service.
Referring to fig. 5, a block diagram of a text recognition apparatus according to the present application is shown, and the apparatus may specifically include the following modules:
a data obtaining module 501, configured to obtain text data to be identified;
the feature extraction module 502 is configured to use a first layer of transformation network layer of a text recognition model obtained through pre-training as a target network, use the text data to be recognized as target data, input the target data to the target network for feature extraction, and obtain a feature extraction result, where the text recognition model includes multiple layers of transformation network layers, and each layer of transformation network layer is connected to a corresponding output classifier obtained through pre-training;
an output module 503, configured to input the feature extraction result to an output classifier connected to the target network, perform classification and identification on the feature extraction result, and determine whether the obtained target identification result meets a preset condition; if so, taking the target recognition result as a text recognition result of the data to be recognized;
the feature extraction module 502 is further configured to, when the target recognition result does not meet a preset condition, use a next-layer transformation network layer of the text recognition model as a target network, use the feature extraction result as target data, and return the target data to the step of inputting the target data to the target network for feature extraction.
In one implementation, the output module 503 is specifically configured to:
calculating the instability of the obtained target recognition result;
judging whether the instability is smaller than a preset instability threshold value or not;
if the instability is smaller than the preset instability threshold, judging that the target recognition result meets the preset condition; and if the instability is not smaller than the preset instability threshold, judging that the target recognition result does not meet the preset condition.
In one implementation, the output module 503 is specifically configured to calculate the instability of the obtained target recognition result by using the following formula:
$$\text{Instability} = \frac{\sum_{i=1}^{N} P_s(i)\,\log P_s(i)}{\log\frac{1}{N}}$$
wherein $P_s$ is the output distribution of the target recognition result, and $N$ is the number of intent classes of the target recognition result.
As can be seen from the above, in the text recognition process, the text recognition device provided in the embodiment of the present invention inputs the output result of each layer of the transform network layer in the text recognition model to the corresponding output classifier, and outputs the target recognition result as the text recognition result in advance when the target recognition result output by the output classifier satisfies the condition, and does not enter the next layer of the transform network layer for calculation any more, so that the time consumed by calculating the text recognition model is reduced, and the QPS requirement of the service is satisfied.
An embodiment of the present invention further provides an electronic device, as shown in fig. 6, including a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 complete mutual communication through the communication bus 604,
a memory 603 for storing a computer program;
the processor 601 is configured to implement the following steps when executing the program stored in the memory 603:
acquiring sample text data;
inputting sample text data into a pre-training model for text recognition processing to obtain a first predicted text recognition result, wherein the pre-training model comprises a plurality of layers of intermediate transformation network layers, a last layer transformation network layer and an output classifier connected with the last layer transformation network layer;
aiming at each intermediate conversion network layer, inputting the output result of the sample text data in the intermediate conversion network layer to a preset classifier connected with the intermediate conversion network layer to obtain a second predicted text recognition result;
judging whether the loss value between the second predicted text recognition result and the first predicted text recognition result is smaller than a preset loss value threshold value or not;
and if the loss value is not smaller than the preset loss value threshold, performing iterative adjustment on the preset classifier connected with the intermediate conversion network layer; if the loss value is smaller than the preset loss value threshold, taking the preset classifier after the iterative adjustment as an output classifier corresponding to the intermediate conversion network layer to obtain a text recognition model, wherein the text recognition model comprises the pre-training model and the output classifier connected with each intermediate conversion network layer.
Or the following steps are realized:
acquiring text data to be identified;
taking a first layer of conversion network layer of a text recognition model obtained by pre-training as a target network, taking the text data to be recognized as target data, and inputting the target data into the target network for feature extraction to obtain a feature extraction result, wherein the text recognition model comprises a plurality of layers of conversion network layers, and each layer of conversion network layer is respectively connected with a corresponding output classifier obtained by pre-training;
inputting the feature extraction result to an output classifier connected with the target network, classifying and identifying the feature extraction result, and judging whether the obtained target identification result meets a preset condition;
if so, taking the target recognition result as a text recognition result of the text data to be recognized;
and if not, taking the next layer of conversion network layer of the text recognition model as a target network, taking the feature extraction result as target data, and returning to the step of inputting the target data into the target network for feature extraction.
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The Memory may include a Random Access Memory (RAM), or may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, there is further provided a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the text recognition method described in any of the above embodiments.
In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the text recognition method as described in any of the above embodiments.
In the above embodiments, the implementation may be realized wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
The embodiments in this specification are described in an interrelated manner; identical or similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly since they are substantially similar to the method embodiments; for relevant details, reference may be made to the corresponding description of the method embodiments.
The above description is only of preferred embodiments of the present invention and is not intended to limit its scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (11)

1. A method of model training, the method comprising:
acquiring sample text data;
inputting the sample text data into a pre-training model for text recognition processing to obtain a first predicted text recognition result, wherein the pre-training model comprises a plurality of intermediate transformation network layers, a last transformation network layer, and an output classifier connected with the last transformation network layer;
for each intermediate transformation network layer, inputting the output result of the sample text data at the intermediate transformation network layer into a preset classifier connected with the intermediate transformation network layer, to obtain a second predicted text recognition result;
judging whether a loss value between the second predicted text recognition result and the first predicted text recognition result is smaller than a preset loss value threshold;
and if not, performing iterative adjustment on the preset classifier connected with the intermediate transformation network layer; if so, taking the iteratively adjusted preset classifier as the output classifier corresponding to the intermediate transformation network layer, so as to obtain a text recognition model, wherein the text recognition model comprises the pre-training model and the output classifier connected with each intermediate transformation network layer.
2. The method of claim 1, wherein before the inputting the sample text data into a pre-training model for text recognition processing to obtain a first predicted text recognition result, the method further comprises:
inputting the sample text data into a preset model for text recognition processing to obtain an initial predicted text recognition result, wherein the preset model comprises a plurality of intermediate transformation network layers, a last transformation network layer, and an output classifier connected with the last transformation network layer;
and judging whether the initial predicted text recognition result has converged; if not, performing iterative adjustment on the preset model, and once it has converged, taking the iteratively adjusted preset model as the pre-training model.
3. The method of claim 1, wherein the judging whether the loss value between the second predicted text recognition result and the first predicted text recognition result is smaller than the preset loss value threshold comprises:
calculating a K-L divergence between the second predicted text recognition result output by the preset classifier connected with the intermediate transformation network layer and the first predicted text recognition result;
and summing the K-L divergence corresponding to the intermediate transformation network layer with the K-L divergences corresponding to each intermediate transformation network layer preceding it, to obtain the loss value between the second predicted text recognition result corresponding to the intermediate transformation network layer and the first predicted text recognition result.
4. The method of any of claims 1 to 3, wherein the output classifier comprises an intention classification classifier and/or a slot recognition classifier.
5. A method of text recognition, the method comprising:
acquiring text data to be recognized;
taking the first transformation network layer of a pre-trained text recognition model as a target network, taking the text data to be recognized as target data, and inputting the target data into the target network for feature extraction to obtain a feature extraction result, wherein the text recognition model comprises a plurality of transformation network layers, and each transformation network layer is connected with a corresponding pre-trained output classifier;
inputting the feature extraction result into the output classifier connected with the target network for classification and recognition, and judging whether the obtained target recognition result meets a preset condition;
if so, taking the target recognition result as the text recognition result of the text data to be recognized;
and if not, taking the next transformation network layer of the text recognition model as the target network, taking the feature extraction result as the target data, and returning to the step of inputting the target data into the target network for feature extraction.
6. The method according to claim 5, wherein the determining whether the obtained target recognition result satisfies a preset condition includes:
calculating the instability of the obtained target recognition result, wherein the instability characterizes the degree to which the obtained target recognition result may be in error beyond a preset error range;
judging whether the instability is smaller than a preset instability threshold;
if so, determining that the target recognition result meets the preset condition, and if not, determining that the target recognition result does not meet the preset condition.
7. The method according to claim 6, wherein the instability of the obtained target recognition result is calculated by using the following formula:
Instability = ( Σ_i P_s(i) · log P_s(i) ) / log(1/N)
wherein P_s is the output distribution of the target recognition result, and N is the number of intention classes of the target recognition result.
8. A model training apparatus, the apparatus comprising:
the sample acquisition module is used for acquiring sample text data;
the first prediction module is used for inputting the sample text data into a pre-training model for text recognition processing to obtain a first predicted text recognition result, wherein the pre-training model comprises a plurality of intermediate transformation network layers, a last transformation network layer, and an output classifier connected with the last transformation network layer;
the second prediction module is used for, for each intermediate transformation network layer, inputting the output result of the sample text data at the intermediate transformation network layer into a preset classifier connected with the intermediate transformation network layer, to obtain a second predicted text recognition result;
the judging module is used for judging whether a loss value between the second predicted text recognition result and the first predicted text recognition result is smaller than a preset loss value threshold; if not, performing iterative adjustment on the preset classifier connected with the intermediate transformation network layer, and if so, taking the iteratively adjusted preset classifier as the output classifier of the intermediate transformation network layer, so as to obtain a text recognition model, wherein the text recognition model comprises the pre-training model and the output classifier connected with each intermediate transformation network layer.
9. A text recognition apparatus, characterized in that the apparatus comprises:
the data acquisition module is used for acquiring text data to be recognized;
the feature extraction module is used for taking the first transformation network layer of a pre-trained text recognition model as a target network, taking the text data to be recognized as target data, and inputting the target data into the target network for feature extraction to obtain a feature extraction result, wherein the text recognition model comprises a plurality of transformation network layers, and each transformation network layer is connected with a corresponding pre-trained output classifier;
the output module is used for inputting the feature extraction result into the output classifier connected with the target network for classification and recognition, and judging whether the obtained target recognition result meets a preset condition; if so, taking the target recognition result as the text recognition result of the text data to be recognized;
and the feature extraction module is further configured to, when the target recognition result does not meet the preset condition, take the next transformation network layer of the text recognition model as the target network, take the feature extraction result as the target data, and return to the step of inputting the target data into the target network for feature extraction.
10. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor, for implementing the method steps of any one of claims 1 to 7 when executing the program stored in the memory.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
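As a reading note on claim 7: interpreting the formula as the normalized entropy used by the FastBERT paper cited below (an assumption, with purely illustrative numbers), a quick check works out as follows. For N = 4 intention classes and a uniform distribution P_s = (0.25, 0.25, 0.25, 0.25), the numerator Σ P_s·log P_s equals log 0.25 and the denominator log(1/4) also equals log 0.25, so the instability is 1, its maximum: the classifier is maximally unsure, and the loop of claim 5 proceeds to the next transformation network layer. For a near one-hot distribution such as (0.97, 0.01, 0.01, 0.01), the numerator is about -0.168 (natural log) against a denominator of about -1.386, so the instability is about 0.12, which would fall below a preset instability threshold of, say, 0.3 and trigger an early exit.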
CN202110262745.5A 2021-03-10 2021-03-10 Model training and text recognition method, device, equipment and storage medium Pending CN112989844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110262745.5A CN112989844A (en) 2021-03-10 2021-03-10 Model training and text recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112989844A (en) 2021-06-18

Family

ID=76334936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110262745.5A Pending CN112989844A (en) 2021-03-10 2021-03-10 Model training and text recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112989844A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020221298A1 (en) * 2019-04-30 2020-11-05 北京金山云网络技术有限公司 Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN111611377A (en) * 2020-04-22 2020-09-01 淮阴工学院 Knowledge distillation-based multi-layer neural network language model training method and device
CN111539227A (en) * 2020-07-06 2020-08-14 北京百度网讯科技有限公司 Method, apparatus, device and computer storage medium for training semantic representation model
CN111554268A (en) * 2020-07-13 2020-08-18 腾讯科技(深圳)有限公司 Language identification method based on language model, text classification method and device
CN112257432A (en) * 2020-11-02 2021-01-22 北京淇瑀信息科技有限公司 Self-adaptive intention identification method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEIJIE LIU: "FastBERT: a Self-distilling BERT with Adaptive Inference Time", arXiv:2004.02178, pages 2-4 *

Similar Documents

Publication Publication Date Title
CN108073568B (en) Keyword extraction method and device
CN111967264B (en) Named entity identification method
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN113297366B (en) Emotion recognition model training method, device, equipment and medium for multi-round dialogue
CN111914564B (en) Text keyword determination method and device
CN112347760A (en) Method and device for training intention recognition model and method and device for recognizing intention
CN113326702B (en) Semantic recognition method, semantic recognition device, electronic equipment and storage medium
CN112667782A (en) Text classification method, device, equipment and storage medium
CN112035645A (en) Data query method and system
CN114218945A (en) Entity identification method, device, server and storage medium
CN110795942B (en) Keyword determination method and device based on semantic recognition and storage medium
CN111581388A (en) User intention identification method and device and electronic equipment
CN113204956B (en) Multi-model training method, abstract segmentation method, text segmentation method and text segmentation device
CN111737990A (en) Word slot filling method, device, equipment and storage medium
CN113569021B (en) Method for classifying users, computer device and readable storage medium
CN116662555B (en) Request text processing method and device, electronic equipment and storage medium
WO2021217866A1 (en) Method and apparatus for ai interview recognition, computer device and storage medium
CN116644183B (en) Text classification method, device and storage medium
CN112163415A (en) User intention identification method and device for feedback content and electronic equipment
CN110837732A (en) Method and device for identifying intimacy between target people, electronic equipment and storage medium
CN112989844A (en) Model training and text recognition method, device, equipment and storage medium
CN111831823B (en) Corpus generation and model training method
CN113656566A (en) Intelligent dialogue processing method and device, computer equipment and storage medium
CN114254622A (en) Intention identification method and device
CN114117037A (en) Intention recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination