CN115512375A - Training method of text error correction model, text recognition method and related equipment


Info

Publication number
CN115512375A
Authority
CN
China
Prior art keywords
text
error correction
target image
result
image
Prior art date
Legal status
Pending
Application number
CN202110632820.2A
Other languages
Chinese (zh)
Inventor
胡蒙
黄川�
贾珏
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile IoT Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile IoT Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile IoT Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202110632820.2A
Publication of CN115512375A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a training method of a text error correction model, a text recognition method and related equipment, and relates to the technical field of text recognition, wherein the training method of the text error correction model comprises the following steps: performing text detection on the first sample image to obtain a first target image and a second target image, wherein the first target image is a partial image of the first sample image including a text region, and the second target image is an image of the first target image without background information; performing text recognition on the first target image to obtain a first text recognition result and text features corresponding to the first text recognition result; and inputting the second target image, the first text recognition result and the text characteristics corresponding to the first text recognition result into a text error correction model, and training the text error correction model based on the output of the text error correction model, wherein the output of the text error correction model comprises an error correction result and a confidence coefficient corresponding to the error correction result. The method and the device can improve the accuracy of the trained text error correction model.

Description

Training method of text error correction model, text recognition method and related equipment
Technical Field
The invention relates to the technical field of text recognition, in particular to a training method of a text error correction model, a text recognition method and related equipment.
Background
With the development of information processing technology, optical character recognition based on deep learning has improved greatly. Optical character recognition needs to perform text error correction on the text recognition result to ensure its accuracy. At present, when the text error correction model is trained, the text error correction model and the text recognition model are in a decoupled state, so the accuracy of the trained text error correction model is low.
Disclosure of Invention
The embodiment of the invention provides a training method of a text error correction model, a text recognition method and related equipment, which are used for solving the problem that the accuracy of the trained text error correction model is low because the text error correction model and the text recognition model are in a decoupling state when the existing text error correction model is trained.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a method for training a text correction model, where the method includes:
performing text detection on a first sample image to obtain a first target image and a second target image, wherein the first target image is a partial image of the first sample image, which includes a text region, and the second target image is an image of the first target image, from which background information is removed;
performing text recognition on the first target image to obtain a first text recognition result and text features corresponding to the first text recognition result;
inputting the second target image, the first text recognition result and the text features corresponding to the first text recognition result into a text error correction model, and training the text error correction model based on the output of the text error correction model, wherein the output of the text error correction model comprises an error correction result and a confidence coefficient corresponding to the error correction result.
In a second aspect, an embodiment of the present invention provides a text recognition method, where the method includes:
acquiring a second text recognition result of the image to be processed;
and performing text error correction on the second text recognition result by adopting a text error correction model, wherein the text error correction model is the text error correction model of the first aspect.
In a third aspect, an embodiment of the present invention provides a device for training a text correction model, where the device includes:
the detection module is used for performing text detection on a first sample image to obtain a first target image and a second target image, wherein the first target image is a partial image of the first sample image that includes a text region, and the second target image is an image of the first target image from which background information is removed;
the recognition module is used for performing text recognition on the first target image to obtain a first text recognition result and text features corresponding to the first text recognition result;
and the training module is used for inputting the second target image, the first text recognition result and the text features corresponding to the first text recognition result into a text error correction model, and training the text error correction model based on the output of the text error correction model, wherein the output of the text error correction model comprises an error correction result and a confidence coefficient corresponding to the error correction result.
In a fourth aspect, an embodiment of the present invention provides a text recognition apparatus, where the apparatus includes:
the acquisition module is used for acquiring a second text recognition result of the image to be processed;
and the error correction module is used for performing text error correction on the second text recognition result by adopting a text error correction model, wherein the text error correction model is the text error correction model of the first aspect.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor and a program stored on the memory and executable on the processor, the program implementing the steps in the training method of the text correction model according to the first aspect when executed by the processor; alternatively, the program implements the steps in the text recognition method according to the second aspect when executed by the processor.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored; when executed by a processor, the computer program implements the steps of the training method of the text error correction model according to the first aspect, or the steps of the text recognition method according to the second aspect.
In the embodiment of the invention, text detection is performed on a first sample image to obtain a first target image and a second target image, wherein the first target image is a partial image of the first sample image that includes a text region, and the second target image is an image of the first target image from which background information is removed; text recognition is performed on the first target image to obtain a first text recognition result and text features corresponding to the first text recognition result; the second target image, the first text recognition result and the text features corresponding to the first text recognition result are input into a text error correction model, and the text error correction model is trained based on its output, where the output comprises an error correction result and a confidence corresponding to the error correction result. Therefore, when the text error correction model is trained, the text error correction model and the text recognition model are tightly coupled through the second target image and the text features corresponding to the first text recognition result, which can improve the accuracy of the trained text error correction model and reduce the cost of data annotation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive labor.
FIG. 1 is a flowchart of a training method of a text error correction model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a detection submodel according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a recognition submodel according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a text error correction model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of computing the BERT network input according to an embodiment of the present invention;
FIG. 6 is a flow chart of a text recognition method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a training apparatus for text error correction models according to an embodiment of the present invention;
FIG. 8 is a second schematic structural diagram of a training apparatus for text error correction models according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a training method of a text error correction model according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:
step 101, performing text detection on a first sample image to obtain a first target image and a second target image, where the first target image is a partial image of the first sample image that includes a text region, and the second target image is an image of the first target image without background information.
The text recognition model may include a detection submodel and a recognition submodel. Any network structure that can achieve text detection may be used as the network structure of the detection submodel; for example, the detection submodel may be an EAST (Efficient and Accurate Scene Text detector) model. Text detection may be performed on the first sample image through the detection submodel. Taking the detection submodel as an EAST model as an example, the output of the EAST model may be the four corner points of each text box in the first sample image, the inclination angle of the text box, and a score indicating whether each position point in the first sample image is a text region. The first target image can be obtained from the four corner points of the text box and the inclination angle of the text box, and the second target image can be obtained from the first target image and the per-point text-region scores.
As a specific embodiment, the EAST model may be a fully convolutional network and may include a plurality of convolution (conv) layers and a plurality of unpooling (unpool) layers. As shown in fig. 2, the EAST model may include eight convolution layers, from the first to the eighth, and three unpooling layers, from the first to the third. The input of the first convolution layer is the first sample image; the input of each of the second to fourth convolution layers is the output of the previous convolution layer; the input of the first unpooling layer is the output of the fourth convolution layer; the input of the fifth convolution layer is the concatenation (concat) of the output of the first unpooling layer and the output of the third convolution layer; the input of the sixth convolution layer is the concatenation of the output of the second unpooling layer and the output of the second convolution layer; the input of the seventh convolution layer is the concatenation of the output of the third unpooling layer and the output of the first convolution layer; and the input of the eighth convolution layer is the output of the seventh convolution layer.
In addition, through the convolution and unpooling operations of fig. 2, a feature map with the same size as the first sample image can be obtained. The feature map is passed through 1×1 convolution kernels with 1, 4 and 1 channels respectively to obtain a score map, text boxes and a text angle. The score map indicates the score of whether each position point in the first sample image is a text region, the text boxes indicate the distances from a position point to the four sides of its text box, and the text angle indicates the inclination angle of the text box at that position point. An inclined text detection box can be obtained from the text boxes and the text angle, and the first target image may be the result of cropping the first sample image according to the text detection box. The value of each position point in the first sample image whose score-map value is smaller than a preset score is set to 0, i.e., non-text positions are set to 0, and the image is then cropped according to the text detection box to obtain the second target image; this removes the background information in the first target image and thus the influence of background factors. The preset score may be 0.4, 0.6, 0.8, etc., and may be, for example, 0.5. When generating the first target image and the second target image, the picture may be rotated according to the angle of each text detection box to make the text box horizontal before cropping.
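The background-removal step above can be sketched as a per-pixel mask against the score map (a minimal sketch assuming the score map is already at image resolution; the function and variable names are illustrative):

```python
def remove_background(image, score_map, threshold=0.5):
    """Set pixels whose score-map value is below the preset score to 0,
    keeping only positions judged to belong to a text region."""
    return [
        [pixel if score >= threshold else 0
         for pixel, score in zip(image_row, score_row)]
        for image_row, score_row in zip(image, score_map)
    ]

# Toy 2x3 grayscale crop and its per-pixel text-region scores.
crop = [[200, 180, 120],
        [150,  90,  60]]
scores = [[0.9, 0.8, 0.2],
          [0.6, 0.1, 0.3]]
print(remove_background(crop, scores))  # [[200, 180, 0], [150, 0, 0]]
```

The same thresholded image would then be cropped along the text detection box to yield the second target image.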
It should be noted that the detection submodel may be trained on a certain real data set, so that retraining or joint training is not required. The data for training the detection submodel can come from the sample set, and in order to improve the accuracy of sample labeling in the sample set, the sample image with the intersection ratio of the text box detected by the detection submodel and the labeled text box lower than 0.7 can be deleted from the sample set.
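The 0.7 intersection-over-union filter on the sample set can be sketched as follows (a sketch under the assumption of axis-aligned boxes; the dictionary keys are illustrative):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def filter_by_iou(samples, threshold=0.7):
    """Drop samples whose detected box overlaps the labeled box too little."""
    return [s for s in samples if iou(s["detected"], s["labeled"]) >= threshold]

samples = [
    {"detected": (0, 0, 10, 10), "labeled": (0, 0, 10, 10)},    # IoU 1.0, kept
    {"detected": (0, 0, 10, 10), "labeled": (20, 20, 30, 30)},  # IoU 0.0, dropped
]
print(len(filter_by_iou(samples)))  # 1
```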
And 102, performing text recognition on the first target image to obtain a first text recognition result and text features corresponding to the first text recognition result.
Any network structure that can achieve text recognition may be used as the network structure of the recognition submodel; the recognition submodel may include, for example, a CRNN (Convolutional Recurrent Neural Network) and a CTC (Connectionist Temporal Classification) model. The first target image may be text-recognized by the recognition submodel. Taking a recognition submodel including CRNN and CTC models as an example, the CNN extracts a feature sequence from the first target image, the RNN predicts the label (true value) distribution of the feature sequence obtained from the convolutional layers, and CTC converts the label distribution obtained from the recurrent layers into the first text recognition result through operations such as de-duplication and merging.
As shown in fig. 3, taking as an example a first sample image whose text content includes "the same kind of product produced by sales members": through the CTC output, the RNN outputs at t2 and t3, at t6 and t7, and at t12 and t13 are each determined to be the same character, so the RNN inputs of those time steps are concatenated together as the feature sequence of the corresponding character; for characters at other positions, the RNN input of the corresponding time step is taken as the feature sequence. Each feature sequence is denoted fi, where i is the character index, each character corresponds to one feature sequence, and f1 to fn are the text features corresponding to the first text recognition result.
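The de-duplication and blank-removal that CTC applies to the per-timestep outputs (e.g. merging t2 and t3 above) can be sketched as follows; the blank symbol and the frame sequence are illustrative:

```python
BLANK = "-"  # CTC blank symbol (illustrative choice)

def ctc_collapse(frames):
    """Collapse repeated consecutive labels, then drop blanks, mirroring
    how CTC merges per-timestep RNN outputs into the recognized text."""
    out = []
    prev = None
    for label in frames:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

print(ctc_collapse(list("--hh-e-lll-lo-")))  # hello
```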
It should be noted that the recognition submodel may be trained on a certain real data set, so that retraining or joint training is not required. The data for training the recognition submodel may be from a real data set and data generated by the detection submodel, and in order to make the recognized text length the same as the real text length, sample data for which the recognized text length is not equal to the real text length in the data generated by the recognition submodel may be deleted.
Step 103, inputting the second target image, the first text recognition result and the text features corresponding to the first text recognition result into a text error correction model, and training the text error correction model based on the output of the text error correction model, where the output of the text error correction model includes an error correction result and a confidence corresponding to the error correction result.
Wherein, the loss value of the error correction result can be calculated, and the model parameter of the text error correction model is reversely updated according to the loss value of the error correction result; or the loss value of the error correction result and the loss value of the confidence coefficient may be calculated respectively, and the model parameter of the text error correction model is updated reversely based on the loss value of the error correction result and the loss value of the confidence coefficient, for example, the loss value of the error correction result and the loss value of the confidence coefficient may be weighted and averaged, and the loss value of the error correction result and the loss value of the confidence coefficient may be used as the output loss value of the text error correction model to update the model parameter of the text error correction model reversely. The loss value can be calculated in a cross-entropy manner.
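The combined-loss option above can be sketched as follows (a minimal sketch; the equal 0.5 weighting is an assumption, as the text only requires a weighted average of the two loss terms):

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy loss for a single prediction: -log p(target)."""
    return -math.log(probs[target_index])

def combined_loss(correction_loss, confidence_loss, weight=0.5):
    """Weighted average of the error-correction loss and the confidence loss;
    the result is back-propagated as the model's output loss."""
    return weight * correction_loss + (1.0 - weight) * confidence_loss

# Loss for one confidently correct character plus a small confidence loss.
ce = cross_entropy([0.05, 0.9, 0.05], 1)
print(combined_loss(ce, 0.2))
```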
It should be noted that the trained text error correction model can be used for text recognition, and the accuracy of optical character recognition can be improved. And acquiring a second text recognition result of the image to be processed, and performing text error correction on the second text recognition result by adopting a trained text error correction model. The obtaining of the second text recognition result of the image to be processed may include performing text detection on the image to be processed to obtain a third target image and a fourth target image, where the third target image is a partial image of the image to be processed, which includes a text region, and the fourth target image is an image of the third target image, which is obtained after background information is removed; and performing text recognition on the third target image to obtain a second text recognition result and text features corresponding to the second text recognition result. The performing text correction on the second text recognition result by using the trained text correction model may include inputting the fourth target image, the second text recognition result, and text features corresponding to the second text recognition result into the trained text correction model for text correction. The output of the text error correction model may include an error correction result and a confidence corresponding to the error correction result, and when the confidence is lower than a preset threshold, the second text recognition result may be used as a final text recognition result; and when the confidence coefficient is higher than the preset threshold value, taking the error correction result as a final text recognition result.
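The confidence-gated choice between the raw recognition result and the error-corrected result can be sketched as (names and the 0.5 default threshold are illustrative):

```python
def final_text(recognized, corrected, confidence, threshold=0.5):
    """Pick the error-corrected text only when the model's confidence
    clears the preset threshold; otherwise keep the raw recognition."""
    return corrected if confidence >= threshold else recognized

print(final_text("qroduct", "product", confidence=0.93))  # product
print(final_text("product", "prodact", confidence=0.10))  # product
```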
In the embodiment of the invention, text detection is performed on a first sample image to obtain a first target image and a second target image, wherein the first target image is a partial image of the first sample image that includes a text region, and the second target image is an image of the first target image from which background information is removed; text recognition is performed on the first target image to obtain a first text recognition result and text features corresponding to the first text recognition result; the second target image, the first text recognition result and the text features corresponding to the first text recognition result are input into a text error correction model, and the text error correction model is trained based on its output, where the output comprises an error correction result and a confidence corresponding to the error correction result. Therefore, when the text error correction model is trained, the text error correction model and the text recognition model are tightly coupled through the second target image and the text features corresponding to the first text recognition result, which can improve the accuracy of the trained text error correction model and reduce the cost of data annotation.
Optionally, the text error correction model includes a first text error correction network and a second text error correction network;
the inputting the second target image, the first text recognition result and the text feature corresponding to the first text recognition result into a text correction model includes:
inputting the text characteristics corresponding to the first text recognition result into the first text error correction network to obtain a first sub error correction result;
inputting the second target image and the first text recognition result into the second text error correction network to obtain a second sub error correction result and the confidence coefficient;
wherein the error correction result is determined based on the first sub error correction result and the second sub error correction result.
The first sub-error-correction result may be a probability distribution of characters at each position point of the text, or may be a feature value of the characters at each position point of the text. The second sub-correction result may be a probability distribution of the characters at each position point of the text, or may be a feature value of the characters at each position point of the text. The characters of each position point of the text can be determined through probability distribution or characteristic values. The error correction result may be a weighted average of the first sub-error correction result and the second sub-error correction result, and the error correction result Q may be, for example:
Q_i = α * P_i + (1 - α) * T_i
where P_i is the first sub-error-correction result, T_i is the second sub-error-correction result, i ranges from 1 to n, n is the number of characters in the error correction result, and α is a preset value. Illustratively, α may take a value of 0.1.
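The weighted fusion above can be sketched element-wise over the two sub-results, treated here as plain lists of per-character scores (names are illustrative):

```python
def fuse_sub_results(p, t, alpha=0.1):
    """Q_i = alpha * P_i + (1 - alpha) * T_i for each character position i."""
    return [alpha * p_i + (1.0 - alpha) * t_i for p_i, t_i in zip(p, t)]

# Per-character scores from the first (P) and second (T) error-correction networks.
p = [0.2, 0.8, 0.5]
t = [0.6, 0.4, 0.5]
print(fuse_sub_results(p, t))  # approximately [0.56, 0.44, 0.5]
```

With α = 0.1 the second sub-result dominates, which matches the second network carrying both the image information and the language-model signal.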
Additionally, the first text error correction network may include a convolutional layer, a fully-connected layer connected to the convolutional layer, and a normalization layer connected to the fully-connected layer; alternatively, as shown in FIG. 4, a convolutional layer and a normalization layer connected to the convolutional layer; alternatively, a network structure having the same effect may be used as the network structure of the first text error correction network, which is not limited in this embodiment. As shown in fig. 4, the second text error correction network may include a BERT (Bidirectional Encoder Representations from Transformers) network and a convolutional network (CNN); alternatively, a network structure having the same effect may be used as the network structure of the second text error correction network, which is not limited in this embodiment.
In the embodiment, the text error correction model is constructed by the first text error correction network and the second text error correction network, and an error correction model structure of the text error correction model tightly coupled with the text recognition model is provided, so that a better text error correction effect can be obtained.
Optionally, the second text error correction network includes a BERT network and a convolutional network, and the inputting the second target image and the first text recognition result into the second text error correction network to obtain a second sub-error correction result and the confidence level includes:
inputting the second target image into the convolution network to obtain a convolution vector;
and inputting the mark embedding vector and the position embedding vector corresponding to the first text recognition result and the convolution vector into the BERT network to obtain a second sub-error correction result and the confidence coefficient.
The Token Embedding vector corresponding to the first text recognition result may be a word vector, and the first element of the word vector may be an E[CLS] flag, which is used to distinguish whether the text needs to be corrected. Position Embedding vectors may be used to characterize the learned position features. As shown in fig. 5, taking the first text recognition result "sales member produced commodity product" as an example, the input of the BERT network may be a superposition of the Token Embeddings vector, the Position Embeddings vector and the convolution vector. The convolution vector can be superimposed on E[CLS] in the Token Embeddings vector, so that the image information constrains whether the text needs error correction. The second sub-error-correction result may include T1 to Tn, which may be probability distributions over the characters at the respective position points of the text.
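The input construction above can be sketched as a per-position sum of token and position embeddings, with the image convolution vector added only at the [CLS] slot (dimensions and values are toy illustrations):

```python
def build_bert_input(token_embeddings, position_embeddings, conv_vector):
    """Sum token and position embeddings position-wise, then superimpose the
    convolution vector of the second target image on position 0 ([CLS])."""
    summed = [
        [t + p for t, p in zip(tok, pos)]
        for tok, pos in zip(token_embeddings, position_embeddings)
    ]
    summed[0] = [x + c for x, c in zip(summed[0], conv_vector)]
    return summed

tok = [[1.0, 0.0], [0.5, 0.5]]   # [CLS] and one character token
pos = [[0.1, 0.1], [0.2, 0.2]]   # learned position embeddings
conv = [0.3, 0.3]                # CNN output for the second target image
print(build_bert_input(tok, pos, conv))
```

Only position 0 receives the image signal; the character positions E1 to En remain a plain token-plus-position sum, matching fig. 5.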
Taking a second text error correction network including the BERT network and the convolutional network as an example, as shown in fig. 4, the text content in the second target image may be "a similar product produced by a sales member". The second target image is processed by the CNN and then superimposed on E[CLS] in the Token Embeddings vector as the [CLS] input of the BERT network, and the tokens obtained by superimposing the Token Embeddings vector and the Position Embeddings vector corresponding to the first text recognition result "a commercial product produced by the sales member" are used as the E1 to En inputs of the BERT network. The BERT network deletes the unnecessary Segment Embeddings vectors, and outputs the confidence and the second sub-error-correction results T1 to Tn. The second sub-error-correction results T1 to Tn and the first sub-error-correction results, obtained by normalizing the text features f1 to fn corresponding to the first text recognition result, are weight-averaged position by position, and the loss value is calculated. During training, 50% of the samples whose first text recognition result differs from the real text are retained, and in the other samples at least one character is replaced with a character whose feature-vector similarity is higher than the preset similarity.
In the embodiment, the convolution vector obtained by the second target image through the convolution network is input into the BERT network, so that the image information in the text detection process can be applied to text error correction, and the accuracy of the trained text error correction model can be improved.
Optionally, the first text error correction network includes a convolutional layer, a fully-connected layer connected to the convolutional layer, and a normalization layer connected to the fully-connected layer.
The text features f1 to fn corresponding to the first text recognition result are each processed through the convolutional layer, the fully-connected layer connected to the convolutional layer, and the normalization layer connected to the fully-connected layer, to obtain a first sub-error-correction result with the same dimensionality as the second sub-error-correction result.
In this embodiment, the text feature corresponding to the first text recognition result is processed by the convolution layer, the full-link layer connected to the convolution layer, and the normalization layer connected to the full-link layer to obtain the first sub-error correction result, so that the error correction result can be determined from the first sub-error correction result.
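The normalization layer at the end of the first error-correction network can be sketched as a numerically stable softmax over each character's feature vector, turning raw features into a probability distribution comparable to T1 to Tn (a sketch, not the patented implementation):

```python
import math

def softmax(features):
    """Turn one character's raw feature vector into a probability
    distribution over the vocabulary (numerically stable form)."""
    m = max(features)
    exps = [math.exp(f - m) for f in features]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(round(sum(probs), 6))  # 1.0
```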
Optionally, before performing text detection on the first sample image, the method further includes:
extracting a feature vector corresponding to each word in preset text content;
replacing at least one word in the preset text content based on the feature vector;
fusing the replaced preset text content based on the preset background image to obtain a second sample image;
adding the second sample image into a sample set to obtain an expanded sample set;
wherein the first sample image is any one sample image in the extended sample set.
The feature vector corresponding to each word in the preset text content can be extracted through a font model. Replacing at least one word in the preset text content based on the feature vector may be replacing at least one word in the preset text content with a word whose feature-vector similarity to it is higher than a preset similarity. The preset text content may be a sentence randomly selected from a pre-stored set of semantically complete sentences. For example, a portion of the semantically complete sentence set may be randomly selected, and one or more of its words replaced with other words with similar glyphs. For each word replaced in the preset text content, a label indicating that the word is wrong can be set. The fusion processing of the replaced preset text content based on the preset background image may be Poisson fusion. For example, the replaced preset text content may be superimposed on a background image that is the same as or similar to the sample images in the sample set, and subjected to basic transformations such as rotation, tilt, and color jitter; the superimposed image can be binarized and then dilated and eroded to obtain a mask for the replaced preset text content; and the binarized image and the preset background image may be Poisson-fused in normal clone (NORMAL_CLONE) mode based on the mask to obtain a second sample image, with the replaced preset text content and its text position in the second sample image recorded as labels.
In this embodiment, the replaced preset text content is fused with a preset background image to obtain a second sample image, and the second sample image is added to the sample set, so that the sample set can be expanded and the influence of background information on the text error correction model can be reduced.
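The glyph-based word replacement step above can be illustrated with the following sketch. It is an assumption-laden illustration: the cosine-similarity measure, the threshold value, and the function name are not specified by the patent, which only requires replacing a word with one whose feature-vector similarity exceeds a preset similarity.

```python
import numpy as np

def replace_similar_word(sentence, glyph_vecs, threshold=0.8, rng=None):
    """Replace one word with its most glyph-similar neighbour.

    sentence   : list of words (the preset text content)
    glyph_vecs : dict mapping word -> feature vector (e.g. from a
                 font model, as described in the embodiment)
    A word is only replaced when its best neighbour's cosine
    similarity exceeds `threshold` (the preset similarity).
    Returns the new sentence and the index of the replaced word
    (-1 when nothing qualified); the index can serve as the
    "this word is wrong" label.
    """
    rng = rng or np.random.default_rng(0)
    i = int(rng.integers(len(sentence)))       # pick a word to corrupt
    w = sentence[i]
    v = glyph_vecs[w]
    best, best_sim = None, threshold
    for cand, cv in glyph_vecs.items():
        if cand == w:
            continue
        sim = float(v @ cv / (np.linalg.norm(v) * np.linalg.norm(cv)))
        if sim > best_sim:
            best, best_sim = cand, sim
    if best is None:
        return sentence, -1
    out = list(sentence)
    out[i] = best
    return out, i
```

The subsequent Poisson fusion of the rendered text onto a background image corresponds to OpenCV-style seamless cloning in NORMAL_CLONE mode, which is not reproduced here.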
Referring to fig. 6, fig. 6 is a flowchart of a text recognition method according to an embodiment of the present invention, and as shown in fig. 6, the method includes the following steps:
step 201, acquiring a second text recognition result of an image to be processed;
step 202, performing text error correction on the second text recognition result by using a text error correction model, where the text error correction model is the text error correction model according to the embodiment of the present invention.
In the embodiment of the invention, a second text recognition result of the image to be processed is obtained, and text error correction is performed on the second text recognition result by using a text error correction model, where the text error correction model is the one described in the embodiments of the present invention. In this way, because the text error correction model and the text recognition model are tightly coupled during training through the second target image and the text features corresponding to the first text recognition result, the accuracy of the trained text error correction model can be improved; applying the trained text error correction model to optical character recognition can therefore improve the accuracy of optical character recognition.
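At inference time, the error correction result is determined from the two sub error correction results. A minimal sketch of one plausible combination follows; the equal weighting and the argmax readout are assumptions, since the patent states only that the result is determined from a weighted average of the first and second sub error correction results.

```python
import numpy as np

def correct_text(p_first, p_second, vocab, w=0.5):
    """Combine the two sub error correction results.

    p_first, p_second : (n, vocab_size) per-token distributions from
                        the first and second error correction networks
    vocab             : list mapping indices back to words
    w                 : mixing weight (an assumption; not fixed by
                        the patent)
    The error correction result is read off the weighted average.
    """
    mixed = w * p_first + (1 - w) * p_second
    return [vocab[int(k)] for k in mixed.argmax(axis=1)]
```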
Referring to fig. 7, fig. 7 is a schematic structural diagram of an apparatus for training a text correction model according to an embodiment of the present invention, and as shown in fig. 7, the apparatus 300 includes:
a detection module 301, configured to perform text detection on a first sample image to obtain a first target image and a second target image, where the first target image is a partial image of the first sample image that includes a text region, and the second target image is an image of the first target image without background information;
the recognition module 302 is configured to perform text recognition on the first target image to obtain a first text recognition result and a text feature corresponding to the first text recognition result;
a training module 303, configured to input the second target image, the first text recognition result, and text features corresponding to the first text recognition result into a text error correction model, and train the text error correction model based on an output of the text error correction model, where the output of the text error correction model includes an error correction result and a confidence corresponding to the error correction result.
Optionally, the text error correction model includes a first text error correction network and a second text error correction network;
the training module 303 is specifically configured to:
inputting the text characteristics corresponding to the first text recognition result into the first text error correction network to obtain a first sub error correction result;
inputting the second target image and the first text recognition result into the second text error correction network to obtain a second sub error correction result and the confidence coefficient;
training the text error correction model based on the output of the text error correction model;
wherein the error correction result is determined based on the first sub error correction result and the second sub error correction result.
Optionally, the second text error correction network includes a BERT network and a convolutional network, and the training module 303 is further specifically configured to:
inputting the second target image into the convolution network to obtain a convolution vector;
and inputting the mark embedding vector and the position embedding vector corresponding to the first text recognition result and the convolution vector into the BERT network to obtain a second sub-error correction result and the confidence coefficient.
Optionally, the first text error correction network includes a convolutional layer, a fully-connected layer connected to the convolutional layer, and a normalization layer connected to the fully-connected layer.
Optionally, as shown in fig. 8, the apparatus 300 further includes:
the extracting module 304 is configured to extract a feature vector corresponding to each word in the preset text content;
a replacing module 305, configured to replace at least one word in the preset text content based on the feature vector;
the processing module 306 is configured to perform fusion processing on the replaced preset text content based on a preset background image to obtain a second sample image;
an adding module 307, configured to add the second sample image to the sample set, so as to obtain an expanded sample set;
wherein the first sample image is any one sample image in the extended sample set.
The training apparatus for text error correction model can implement each process implemented in the embodiment of the method in fig. 1, and is not described here again to avoid repetition.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a text recognition apparatus according to an embodiment of the present invention, and as shown in fig. 9, the apparatus 400 includes:
the obtaining module 401 is configured to obtain a second text recognition result of the image to be processed;
an error correction module 402, configured to perform text error correction on the second text recognition result by using a text error correction model, where the text error correction model is the text error correction model according to the embodiment of the present invention.
The text recognition apparatus can implement each process implemented in the method embodiment of fig. 6, and is not described here again to avoid repetition.
As shown in fig. 10, an embodiment of the present invention further provides an electronic device 500, including: a processor 501, a memory 502, and a program stored in the memory 502 and executable on the processor 501. When executed by the processor 501, the program implements each process of the above embodiment of the training method for a text error correction model, or each process of the above embodiment of the text recognition method, and can achieve the same technical effects, which are not described herein again to avoid repetition.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements each process of the above embodiment of the training method for a text error correction model, or each process of the above embodiment of the text recognition method, and can achieve the same technical effects, which are not described herein again to avoid repetition. The computer-readable storage medium is, for example, a ROM, a RAM, a magnetic disk, or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solutions of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A method for training a text correction model, the method comprising:
performing text detection on a first sample image to obtain a first target image and a second target image, wherein the first target image is a partial image of the first sample image, which includes a text region, and the second target image is an image of the first target image, from which background information is removed;
performing text recognition on the first target image to obtain a first text recognition result and text features corresponding to the first text recognition result;
inputting the second target image, the first text recognition result and the text features corresponding to the first text recognition result into a text error correction model, and training the text error correction model based on the output of the text error correction model, wherein the output of the text error correction model comprises an error correction result and a confidence coefficient corresponding to the error correction result.
2. The method of claim 1, wherein the text correction model comprises a first text correction network and a second text correction network;
the inputting the second target image, the first text recognition result and the text feature corresponding to the first text recognition result into a text correction model includes:
inputting the text characteristics corresponding to the first text recognition result into the first text error correction network to obtain a first sub error correction result;
inputting the second target image and the first text recognition result into the second text error correction network to obtain a second sub error correction result and the confidence coefficient;
wherein the error correction result is determined based on the first sub error correction result and the second sub error correction result.
3. The method of claim 2, wherein the second text correction network comprises a BERT network and a convolutional network, and wherein inputting the second target image and the first text recognition result into the second text correction network to obtain a second sub-correction result and the confidence level comprises:
inputting the second target image into the convolution network to obtain a convolution vector;
and inputting the mark embedding vector and the position embedding vector corresponding to the first text recognition result and the convolution vector into the BERT network to obtain a second sub-error correction result and the confidence coefficient.
4. The method of claim 2, wherein the first text error correction network comprises a convolutional layer, a fully-connected layer connected to the convolutional layer, and a normalization layer connected to the fully-connected layer.
5. The method of claim 1, wherein prior to the text detection of the first sample image, the method further comprises:
extracting a feature vector corresponding to each word in preset text content;
replacing at least one word in the preset text content based on the feature vector;
fusing the replaced preset text content based on the preset background image to obtain a second sample image;
adding the second sample image to the sample set to obtain an expanded sample set;
wherein the first sample image is any one sample image in the extended sample set.
6. A method of text recognition, the method comprising:
acquiring a second text recognition result of the image to be processed;
performing text correction on the second text recognition result by using a text correction model, wherein the text correction model is the text correction model in any one of claims 1 to 5.
7. An apparatus for training a text correction model, the apparatus comprising:
the detection module is used for performing text detection on a first sample image to obtain a first target image and a second target image, wherein the first target image is a partial image of the first sample image that includes a text region, and the second target image is an image of the first target image with background information removed;
the recognition module is used for performing text recognition on the first target image to obtain a first text recognition result and text features corresponding to the first text recognition result;
and the training module is used for inputting the second target image, the first text recognition result and the text features corresponding to the first text recognition result into a text error correction model, and training the text error correction model based on the output of the text error correction model, wherein the output of the text error correction model comprises an error correction result and a confidence coefficient corresponding to the error correction result.
8. The apparatus of claim 7, wherein the text correction model comprises a first text correction network and a second text correction network;
the training module is specifically configured to:
inputting the text characteristics corresponding to the first text recognition result into the first text error correction network to obtain a first sub error correction result;
inputting the second target image and the first text recognition result into the second text error correction network to obtain a second sub error correction result and the confidence coefficient;
training the text error correction model based on the output of the text error correction model;
wherein the error correction result is determined based on the first sub error correction result and the second sub error correction result.
9. A text recognition apparatus, the apparatus comprising:
the acquisition module is used for acquiring a second text recognition result of the image to be processed;
a text correction module, configured to perform text correction on the second text recognition result by using a text correction model, where the text correction model is the text correction model according to any one of claims 1 to 5.
10. An electronic device, comprising: a memory, a processor and a program stored on the memory and executable on the processor, the program, when executed by the processor, implementing the steps in the training method of the text correction model according to any one of claims 1 to 5; alternatively, the program realizes the steps in the text recognition method according to claim 6 when executed by the processor.
CN202110632820.2A 2021-06-07 2021-06-07 Training method of text error correction model, text recognition method and related equipment Pending CN115512375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110632820.2A CN115512375A (en) 2021-06-07 2021-06-07 Training method of text error correction model, text recognition method and related equipment

Publications (1)

Publication Number Publication Date
CN115512375A true CN115512375A (en) 2022-12-23

Family

ID=84500469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110632820.2A Pending CN115512375A (en) 2021-06-07 2021-06-07 Training method of text error correction model, text recognition method and related equipment

Country Status (1)

Country Link
CN (1) CN115512375A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116882383A (en) * 2023-07-26 2023-10-13 中信联合云科技有限责任公司 Digital intelligent proofreading system based on text analysis



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination