CN111339910B - Text processing and text classification model training method and device

Info

Publication number: CN111339910B
Application number: CN202010111039.6A
Authority: CN
Inventors: 李哲; 李若愚
Original assignee: Alipay Labs Singapore Pte Ltd
Current assignee: Alipay Labs Singapore Pte Ltd
Priority date: 2020-02-24
Filing date: 2020-02-24
Publication date: 2023-11-28
Anticipated expiration: 2040-02-24
Also published as: CN111339910A

Abstract

The embodiments of this specification provide a text processing method and a text classification model training method and device. The text processing method comprises: acquiring target OCR text data of a target certificate; for the text content of each text line or text column in the target OCR text data, identifying the data types to which the text content may belong using a text classification model; and determining the data type to which the text content of each text line or text column in the target OCR text data belongs according to those data types and a type determination model. The text classification model is trained on the sample OCR text data set corresponding to each certificate, where the sample OCR text data set comprises correct sample OCR text data, error sample OCR text data, and the data type of the text content of each text line or text column in the correct and error sample OCR text data.

Description

Text processing and text classification model training method and device

Technical Field

The application relates to the technical field of computers, in particular to a text processing and text classification model training method and device.

Background

With the rapid development of computer and internet technologies, optical character recognition (OCR) technology has come into wide use. When a user must be authenticated while handling a service, the user usually scans or uploads a certificate; the certificate image is then recognized by OCR in the background and converted into machine-readable text. Finally, the information required for the authentication, such as the name and certificate number, must be extracted from the text obtained by OCR recognition.

Therefore, there is a need to propose a technical solution to reliably extract the required information from OCR text.

Disclosure of Invention

An object of the embodiments of the present disclosure is to provide a method and apparatus for text processing and training a text classification model, so as to reliably extract required information from OCR text.

In order to solve the above technical problems, the embodiments of the present specification are implemented as follows:

The embodiments of this specification provide a text processing method, comprising:

acquiring target OCR text data of a target certificate;

identifying, for text content of each text line or text column in the target OCR text data, a data type to which the text content may belong using a text classification model; the text classification model is trained based on a sample OCR text data set corresponding to certificates arranged in various formats, and the sample OCR text data set comprises correct sample OCR text data and error sample OCR text data, and the data types of text contents of each text row or text column in the correct sample OCR text data and the error sample OCR text data;

and determining the data type to which the text content of each text line or text column in the target OCR text data belongs according to each of those data types and a type determination model.

The embodiment of the specification also provides a training method of the text classification model, which comprises the following steps:

determining a layout arrangement template corresponding to each certificate based on layout arrangement of each certificate;

aiming at certificates arranged in each format, configuring a sample OCR text data set for a format arrangement template corresponding to the certificate; the sample OCR text data set comprises correct sample OCR text data and error sample OCR text data, and the data type of the text content of each text row or text column in the correct sample OCR text data and the error sample OCR text data;

training the text classification model based on the sample OCR text data set corresponding to each certificate.

The embodiment of the specification also provides a text processing device, which comprises:

the acquisition module acquires target OCR text data of the target certificate;

an identification module for identifying, for each text line or text column of text content in the target OCR text data, a data type to which the text content may belong using a text classification model; the text classification model is trained based on a sample OCR text data set corresponding to certificates arranged in various formats, and the sample OCR text data set comprises correct sample OCR text data and error sample OCR text data, and the data types of text contents of each text row or text column in the correct sample OCR text data and the error sample OCR text data;

and a first determining module for determining the data type to which the text content of each text line or text column in the target OCR text data belongs according to each of those data types and a type determination model.

The embodiment of the specification also provides a training device of the text classification model, which comprises:

the second determining module is used for determining a layout arrangement template corresponding to each certificate based on layout arrangement of the certificates;

the configuration module is used for configuring a sample OCR text data set for the layout configuration templates corresponding to the certificates according to the certificates arranged in each layout; the sample OCR text data set comprises correct sample OCR text data and error sample OCR text data, and the data type of the text content of each text row or text column in the correct sample OCR text data and the error sample OCR text data;

and the training module is used for training the text classification model based on the sample OCR text data set corresponding to each certificate.

a processor; and

a memory arranged to store computer executable instructions that, when executed, cause the processor to:

Acquiring target OCR text data of a target certificate;

The embodiments of this specification also provide a training device for the text classification model, comprising:

a processor; and

The present description also provides a storage medium for storing computer-executable instructions that, when executed, implement the following:

acquiring target OCR text data of a target certificate;

According to the technical solution provided by the embodiments of this specification, a trained text classification model identifies the data types to which the text content of each text line or text column in the target OCR text data may belong, and the data type to which the text content belongs is then determined from those candidate data types using a type determination model. When the text classification model is trained, a corresponding sample OCR text data set is configured for each of the certificates of different layouts, so the trained model can recognize certificates of various layouts, and OCR text data from such certificates can be processed together. In addition, errors that may occur during OCR recognition are included in training as error sample OCR text data, so that even erroneous OCR text data can be processed. This improves the applicability of the text classification model, handles problems specific to OCR scenarios better, and improves the accuracy of data type recognition, so that the text content of a required data type can be extracted accurately.

Drawings

In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a first flowchart of the text processing method provided in the embodiments of the present disclosure;

FIG. 2 is a second flowchart of the text processing method provided in the embodiments of the present disclosure;

FIG. 3 is a schematic diagram of a layout template in the text processing method provided in the embodiments of the present disclosure;

FIG. 4 (a) is a first schematic diagram of sample OCR text data in the text processing method provided in the embodiments of the present disclosure;

FIG. 4 (b) is a second schematic diagram of sample OCR text data in the text processing method provided in the embodiments of the present disclosure;

FIG. 4 (c) is a third schematic diagram of sample OCR text data in the text processing method provided in the embodiments of the present disclosure;

FIG. 5 is a third flowchart of the text processing method provided in the embodiments of the present disclosure;

FIG. 6 is a flowchart of the training method for a text classification model provided in the embodiments of the present disclosure;

FIG. 7 is a schematic diagram of the module composition of the text processing device provided in the embodiments of the present disclosure;

FIG. 8 is a schematic diagram of the module composition of the training device for a text classification model provided in the embodiments of the present disclosure;

FIG. 9 is a schematic structural diagram of the text processing device provided in the embodiments of the present disclosure.

Detailed Description

In order to make the technical solution of the present application better understood by those skilled in the art, the technical solution of the present embodiment will be clearly and completely described in the following description with reference to the accompanying drawings in the embodiments of the present specification, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, shall fall within the scope of the application.

The idea of the embodiments of this specification is to use a text classification model to identify the data types to which the text content of each text line or text column in OCR text data may belong. The text classification model is trained on sample data sets corresponding to certificates of various layouts, so certificates of various layouts can be recognized by the same model and the influence of differing layouts is avoided. The sample data sets also include error sample data generated from errors that may occur during OCR recognition, which improves the applicability of the text classification model. Based on this, the embodiments of this specification provide a text processing method, apparatus, device, and storage medium for processing OCR text data, which are discussed in detail below.

In a specific application scenario, the text processing method provided in the embodiments of the present disclosure may be applied to an authentication device, that is, the execution body of the method may be an authentication device, and specifically, may be a text processing apparatus installed on the authentication device. The authentication device may be an authentication client or an authentication server.

FIG. 1 is a first flowchart of the text processing method provided in the embodiments of the present disclosure. The method shown in FIG. 1 includes at least the following steps:

Step 102, obtaining target OCR text data of the target certificate.

The target certificate may be an identity card, a passport, a driving license, or another certificate.

In a specific embodiment, when the identity of the user needs to be verified, a certificate image of the user's target certificate is acquired, and the certificate image is recognized by an OCR module to obtain the target OCR text data corresponding to the target certificate.

In addition, in the embodiments of the present disclosure, the layout of the target certificate is not changed when OCR recognition is performed on it; that is, the layout of the obtained target OCR text data is kept consistent with the layout of the target certificate.

For example, if the target certificate is arranged in rows, the recognized target OCR text is also arranged in rows, the order and content of each row are consistent with the target certificate, and the number of rows is unchanged. If the target certificate is arranged in columns, the recognized target OCR text is also arranged in columns, the order and content of each column are consistent with the target certificate, and the number of columns is unchanged.

Step 104, for each text line or text column of text content in the target OCR text data, identifying the data type to which the text content may belong by using a text classification model; the text classification model is trained based on a sample OCR text data set corresponding to certificates arranged in various formats, and the sample OCR text data set comprises correct sample OCR text data and error sample OCR text data, and the data types of text contents of each text row or text column in the correct sample OCR text data and the error sample OCR text data.

Wherein, the data types refer to fields such as 'name', 'gender', 'certificate number', 'address', and the like.

For the text content of each text line or text column in the target OCR text data, type recognition is performed using the text classification model. The output of the text classification model comprises the possible data types corresponding to each text content and the probability that the text content belongs to each of those data types. In a specific implementation, for each text content, a set number of data types may be selected from the possible data types output by the text classification model, in order of probability from high to low, as the data types to which the text content may belong.

For ease of understanding, the following examples are presented.

For example, for text content a of a text line in the target OCR text data, one possible recognition result obtained by the text classification model is as follows:

the probability that text content A belongs to "name" is 96%;

the probability that text content A belongs to "gender" is 28%;

the probability that text content A belongs to "ethnicity" is 65%;

the probability that text content A belongs to "certificate number" is 54%;

the probability that text content A belongs to "address" is 38%.

The data types identified for text content A are sorted by probability from high to low, giving the order: name, ethnicity, certificate number, address, gender. The first 3 data types are then taken from the sorted order as the data types to which text content A may belong; that is, text content A may belong to name, ethnicity, or certificate number.

Of course, the data types, probability values, and number of intercepted data types in the above examples are merely exemplary illustrations, and do not constitute limitations of the embodiments of the present description.
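The selection in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `top_k_types` and the dictionary representation of the classifier output are hypothetical and are not taken from the specification.

```python
def top_k_types(probabilities, k=3):
    """Return the k data types with the highest probability, best first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [data_type for data_type, _ in ranked[:k]]


# Probabilities from the example above for text content A.
scores_a = {
    "name": 0.96,
    "gender": 0.28,
    "ethnicity": 0.65,
    "certificate number": 0.54,
    "address": 0.38,
}

candidates_a = top_k_types(scores_a, k=3)
print(candidates_a)  # ['name', 'ethnicity', 'certificate number']
```

A larger `k` keeps more candidates for the later type determination step at the cost of more combinations to evaluate.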

The layout arrangement of the certificates refers to the arrangement of each text content in the certificates on the certificates.

In addition, in the embodiment of the present disclosure, when the text classification model is generated, the sample OCR data used is a sample OCR text data set corresponding to each document arranged in each format. Since multiple layout arrangements may exist for the same document, multiple sample OCR text data sets may occur for the same document. Thus, the generated text classification model can be suitable for certificates with various layout arrangements, and therefore, the identification of the certificates with various layout arrangements can be realized.

In addition, when character recognition is performed by using OCR technology, there often occurs a case where a character line is missed or similar character recognition errors occur. For example, in some cases, N in the document may be identified as M, or S in the document may be identified as 5, etc. Therefore, in the embodiment of the specification, when the sample OCR text data set corresponding to the certificate is acquired, some error sample OCR text data can be generated based on errors frequently occurring in OCR recognition, and the text classification model is trained based on the correct sample OCR text data and the error OCR text sample data, so that the obtained trained text classification model can better handle special situations of missing or wrong detection of the OCR text, and the applicability of the text classification model is improved.

Step 106, determining the data type to which the text content of each text line or text column in the target OCR text data belongs according to each of those data types and the type determination model.

In the embodiment of the present specification, after determining the data type to which the text content of each text line or text column in the target OCR text data belongs, the text content of the specified data type is extracted from the target OCR text data according to the data type to which each text content belongs.

For example, in one embodiment, a document number of the user needs to be acquired, and after the data type of each text content in the target OCR text data corresponding to the target document is identified, a text line or text column corresponding to the "document number" is found, where the text content corresponding to the text line or text column is the document number.

Optionally, in step 106, based on the data types to which each text content may belong and the probability of belonging to each of them, the data type with the highest probability may be selected for each text content as the data type to which that text content belongs. If two text contents are assigned the same data type in this way, the second-highest probabilities of the two text contents are compared, and the data type corresponding to the larger of these probabilities is used as the data type of the text content concerned.

In a specific embodiment, in the step 106, the determining a data type to which the text content of each text line or text column in the target OCR text data belongs according to each data type and type determining model specifically includes the following steps:

combining the data types possibly belonging to each text content to obtain a plurality of possible data type combination sequences corresponding to the target OCR text data; inputting a plurality of possible data type combination sequences into a type determination model for processing, and determining one data type combination sequence output by the type determination model as a data type combination sequence corresponding to target OCR text data; and determining the data type of each text content according to the data type combination sequence corresponding to the target OCR text data.

Generally, a certificate contains text content of multiple data types; an identity card, for example, must contain "name", "gender", "ethnicity", "address", "identity card number", and so on. The text contents of different text lines or text columns have different data types. If the data type of each text content is determined separately, considering only the probability that each text content belongs to each data type, two or more text contents may be assigned the same data type. Therefore, in the embodiments of the present disclosure, the data types to which each text content in the target OCR text data may belong are combined to obtain the data type combination sequences to which the target OCR text data may belong. That is, the data types are determined by treating the target OCR text data as a whole, so that the situation in which two or more text contents correspond to the same data type can be avoided.

The characters in a certificate are arranged according to a certain rule (for example, the first line is the name, the second line is the gender, and so on), and the arrangement of the text contents in the OCR text data obtained by OCR recognition is kept identical to that of the original certificate. Therefore, to make it easy to determine the data type of each text content from the data type combination sequence corresponding to the target OCR text data, in a specific embodiment the data types in the data type combination sequence may be arranged in the order in which the text contents appear in the target OCR text data.

For example, text contents including three text lines in the target OCR text data are respectively noted as text content 1, text content 2, and text content 3, and text content 1 is arranged in the preceding line of text content 2, and text content 2 is arranged in the preceding line of text content 3. So when generating the data type combination sequence, the plurality of data type combination sequences to which the target OCR text data may belong may be generated in the order of the data type to which the text content 1 belongs, the data type to which the text content 2 belongs, and the data type to which the text content 3 belongs.

To facilitate an understanding of the specific process of data type combination described above, the following will be exemplified.

For example, in one embodiment, the target OCR text data includes text content A, text content B, and text content C, arranged in the first, second, and third rows of the target OCR text data respectively. The data types to which text content A, text content B, and text content C may belong, as identified by the text classification model, are as follows:

text content A may belong to "name", "gender", and "ethnicity";

text content B may belong to "gender" and "name";

text content C may belong to "ethnicity" and "gender".

Combining the possible data types of text content A, text content B, and text content C gives the following possible data type combination sequences for the target OCR text data:

Sequence 1: name, gender, ethnicity

Sequence 2: name, name, ethnicity

Sequence 3: name, gender, gender

Sequence 4: name, name, gender

Sequence 5: gender, gender, ethnicity

Sequence 6: gender, gender, gender

Sequence 7: gender, name, ethnicity

Sequence 8: gender, name, gender

Sequence 9: ethnicity, gender, ethnicity

Sequence 10: ethnicity, gender, gender

Sequence 11: ethnicity, name, ethnicity

Sequence 12: ethnicity, name, gender

Each of the obtained data type combination sequences is then input into a pre-trained type determination model, which selects one of them as the data type combination sequence corresponding to the target OCR text data.
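The enumeration of combination sequences in the example above amounts to a Cartesian product of the per-line candidate lists, taken in line order. A minimal Python sketch follows; the variable names are illustrative, and the order in which the sequences are produced may differ from the numbering in the listing.

```python
from itertools import product

# Candidate data types per text line, in line order (from the example above).
candidates = [
    ["name", "gender", "ethnicity"],  # text content A (first row)
    ["gender", "name"],               # text content B (second row)
    ["ethnicity", "gender"],          # text content C (third row)
]

# Each sequence assigns one candidate type to each line, preserving line order.
sequences = list(product(*candidates))

print(len(sequences))  # 12
print(sequences[0])    # ('name', 'gender', 'ethnicity')
```

The number of sequences is the product of the candidate counts per line (3 x 2 x 2 = 12 here), which is one reason to cap the number of candidates kept for each text content.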

Fig. 2 is a second flowchart of a method for processing text according to an embodiment of the present disclosure, where the method shown in fig. 2 includes at least the following steps:

Step 202, obtaining target OCR text data of a target certificate.

Step 204, for the text content of each text line or text column in the target OCR text data, identifying the data types to which the text content may belong using a text classification model.

Step 206, combining the data types to which the text contents may belong to obtain a plurality of possible data type combination sequences corresponding to the target OCR text data.

Step 208, determining the data type combination sequence corresponding to the target OCR text data by using the trained type determination model according to the plurality of possible data type combination sequences.

Step 210, determining the data type of each text content according to the data type combination sequence corresponding to the target OCR text data.

In practice, the text classification model needs to be trained before the method provided by the embodiments of the present specification is performed, and therefore, before performing step 102 to obtain the target OCR text data of the target document, the method provided by the embodiments of the present specification further includes the following steps:

determining a layout template corresponding to each certificate based on the layout of the certificate; for the certificates of each layout, configuring a sample OCR text data set for the layout template corresponding to the certificate; and training a text classification model based on the sample OCR text data set corresponding to each certificate. The sample OCR text data set comprises correct sample OCR text data, error sample OCR text data, and the data type of the text content of each text line or text column in the correct sample OCR text data and the error sample OCR text data.

The layout template indicates the text line in which the text content of each data type is located in the certificate. A schematic diagram of a layout template is shown in FIG. 3.

In implementation, if a certificate has multiple layouts, a layout template is determined for each layout of the certificate. That is, in the embodiments of this specification, one layout corresponds to one template. For example, if certificate A has three layouts, denoted layout 1, layout 2, and layout 3, then a sample OCR text data set is configured for the template corresponding to layout 1, another for the template corresponding to layout 2, and another for the template corresponding to layout 3.

In the embodiment of the present disclosure, a name and address database is pre-established, and when a sample OCR text data set is configured for each layout template, information such as a name and an address may be selected from the name and address database to configure the sample OCR text data set.
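A rough Python sketch of configuring correct samples per layout template is given below. Everything in it is an assumption made for illustration: the template representation (a list of data types in line order), the tiny in-memory stand-in for the name and address database, and the way certificate numbers are synthesized are not specified in the source.

```python
import random

# Hypothetical layout templates: the data type expected on each text line.
LAYOUT_TEMPLATES = {
    "layout_1": ["name", "certificate number", "address"],
    "layout_2": ["certificate number", "name", "address"],
}

# Hypothetical stand-in for the pre-established name and address database.
NAME_ADDRESS_DB = {
    "name": ["ALICE TAN", "BOB LEE", "CAROL NG"],
    "address": ["12 EXAMPLE ROAD", "34 SAMPLE STREET"],
}


def make_value(data_type, rng):
    """Pick a value from the database, or synthesize one for other types."""
    if data_type in NAME_ADDRESS_DB:
        return rng.choice(NAME_ADDRESS_DB[data_type])
    # Assumption: fields not in the database are filled with random digits.
    return "".join(rng.choice("0123456789") for _ in range(9))


def make_correct_sample(template, rng):
    """Build one labeled sample: a list of (text, data_type), one per line."""
    return [(make_value(data_type, rng), data_type) for data_type in template]


rng = random.Random(0)
dataset = {
    layout: [make_correct_sample(template, rng) for _ in range(5)]
    for layout, template in LAYOUT_TEMPLATES.items()
}
print(dataset["layout_2"][0])
```

Keeping the label next to each line of text means the data type annotation survives any later transformation applied to the sample.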

In addition, in the embodiment of the present disclosure, when configuring the sample OCR text data set for the layout template, it is necessary to configure correct sample OCR text data for the layout template, and also to configure error sample OCR text data based on errors that often occur in the OCR recognition process.

In an embodiment of the present disclosure, the error sample OCR text data includes at least one or more of the following sample data:

deleting sample data obtained by text content of at least one text row or text column in the correct sample OCR text data;

sample data resulting from replacing characters in correct sample OCR text data with similar characters.

In implementation, for the correct sample data corresponding to each layout template, a set number of correct sample OCR text data may be selected, based on the probability of a missed line during OCR recognition, to generate error sample OCR text data.

For ease of understanding, the following examples are presented.

For example, in one specific embodiment, 1000 correct sample OCR text data are configured for a layout template. If the probability of a missed line during OCR recognition is 5%, one or more lines of text content are deleted from 50 of the correct sample OCR text data. In the remaining correct sample OCR text data, characters that are frequently misrecognized are replaced with similar characters. For example, if the character M appears in a correct sample, it may be replaced with the character N to obtain an error sample; if the character S appears, it may be replaced with the digit 5; and if the digit 8 appears, it may be replaced with 9.

A schematic diagram of correct sample OCR text data is shown in FIG. 4 (a); a schematic diagram of error sample OCR text data obtained by simulating a missed line is shown in FIG. 4 (b); and a schematic diagram of error sample OCR text data obtained by substituting similar characters (o replaced with 0, 5 with S, and 9 with 8) is shown in FIG. 4 (c).
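The two kinds of error sample described above (missed lines and similar-character substitution) might be generated along the following lines. This is a sketch only: the function names, the sample representation as a list of (text, data type) pairs, and the choice to drop exactly one line per affected sample are assumptions; the confusion pairs are the ones named in the example.

```python
import random

# Assumed confusion pairs, taken from the examples in the text.
SIMILAR_CHARS = {"M": "N", "S": "5", "8": "9"}


def drop_lines(sample, rng, n_drop=1):
    """Simulate missed lines by deleting n_drop randomly chosen lines."""
    n_drop = min(n_drop, len(sample) - 1)  # always keep at least one line
    dropped = set(rng.sample(range(len(sample)), n_drop))
    return [line for i, line in enumerate(sample) if i not in dropped]


def substitute_similar(sample):
    """Simulate misrecognition by replacing characters with look-alikes."""
    table = str.maketrans(SIMILAR_CHARS)
    return [(text.translate(table), data_type) for text, data_type in sample]


def make_error_samples(correct_samples, miss_rate, rng):
    """Drop lines in a miss_rate share of samples; substitute in the rest."""
    n_missing = round(len(correct_samples) * miss_rate)
    shuffled = correct_samples[:]
    rng.shuffle(shuffled)
    missing = [drop_lines(s, rng) for s in shuffled[:n_missing]]
    substituted = [substitute_similar(s) for s in shuffled[n_missing:]]
    return missing, substituted


rng = random.Random(0)
correct = [[("SAM SMITH", "name"), ("S1234568M", "certificate number")]] * 1000
missing, substituted = make_error_samples(correct, miss_rate=0.05, rng=rng)
print(len(missing), len(substituted))  # 50 950
print(substituted[0])  # [('5AN 5NITH', 'name'), ('51234569N', 'certificate number')]
```

Because each line carries its label, the error samples remain annotated with the data type of every surviving line.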

The data type of the text content of each line in the correct sample OCR text data and the error sample OCR text data is then labeled to obtain the sample OCR text data set corresponding to each layout template, and the text classification model is trained on these sample OCR text data sets.

In the embodiments of the present disclosure, the text classification model is a bidirectional long short-term memory recurrent neural network (BiLSTM) text classification model. Other existing text classification models may also be adopted; any model capable of text classification can be applied to the embodiments of this specification, which do not limit the specific text classification model used.

In addition, in the embodiments of this specification, the type determination model described above is a Markov probability model. Its training process is as follows:

After the sample OCR text data set corresponding to each layout template is obtained, a data type combination sequence is generated for each sample OCR text data in the set (including correct sample OCR text data and error sample OCR text data), based on the data type of the text content of each text line or text column.

For example, for the OCR text data shown in FIG. 4 (a), the corresponding data type combination sequence is: name, certificate number, year and month of birth, address, date of issuance.
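The source states only that the type determination model is a Markov probability model trained on such data type combination sequences. One plausible reading, shown here purely as an assumption, is a first-order Markov chain whose transition counts are estimated from the labeled sequences and then used to score candidate sequences. The start marker, the add-alpha smoothing, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
import math

START = "<start>"  # hypothetical marker for the beginning of a certificate


def train_markov(type_sequences):
    """Count transitions between consecutive data types."""
    transitions = defaultdict(Counter)
    for sequence in type_sequences:
        previous = START
        for data_type in sequence:
            transitions[previous][data_type] += 1
            previous = data_type
    return transitions


def log_score(transitions, sequence, vocabulary_size, alpha=1.0):
    """Log-probability of a sequence with add-alpha smoothing (an assumption)."""
    total = 0.0
    previous = START
    for data_type in sequence:
        counts = transitions.get(previous, Counter())
        numerator = counts[data_type] + alpha
        denominator = sum(counts.values()) + alpha * vocabulary_size
        total += math.log(numerator / denominator)
        previous = data_type
    return total


def best_sequence(transitions, candidates, vocabulary_size):
    """Return the candidate sequence the model considers most likely."""
    return max(candidates, key=lambda s: log_score(transitions, s, vocabulary_size))


training = [
    ("name", "gender", "ethnicity"),
    ("name", "gender", "ethnicity"),
    ("name", "ethnicity"),  # an error sample with the "gender" line missing
]
model = train_markov(training)
candidates = [
    ("name", "gender", "ethnicity"),
    ("name", "name", "gender"),
    ("gender", "gender", "gender"),
]
print(best_sequence(model, candidates, vocabulary_size=3))
# ('name', 'gender', 'ethnicity')
```

Including sequences derived from error samples in the training data lets the model assign reasonable probability to orderings that arise when a line is missed.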

In order to facilitate understanding of the method provided by the embodiments of the present disclosure, the method provided by the embodiments of the present disclosure will be described below in conjunction with a specific application scenario. In a specific application scenario, when authenticating the user a, the certificate number of the user a needs to be extracted from the certificate of the user a. Based on the application scenario, fig. 5 shows a third method flowchart of the text processing method provided in the embodiment of the present disclosure, and the method shown in fig. 5 at least includes the following steps:

Step 502, collecting a certificate image of the user A, and performing OCR (optical character recognition) on the certificate image to obtain OCR text data of the certificate of the user A.

Step 504, for the text content of each text line in the OCR text data, using a pre-trained BiLSTM classification model to identify each data type corresponding to the text content and the probability that the text content belongs to each data type.

Step 506, for each text content, intercepting a set number of data types from high to low according to the probability that the text content belongs to each data type as the data type to which the text content may belong.

Step 508, combining the data types to which each text content may belong to obtain a plurality of possible data type combination sequences corresponding to the OCR text data of user A.

Step 510, determining the data type combination sequence corresponding to the OCR text data by using a pre-trained Markov probability model according to the plurality of possible data type combination sequences.

Step 512, determining the data type of the text content of each text line according to the data type combination sequence corresponding to the OCR text data.

Step 514, determining the text line corresponding to the certificate number based on the data type of the text content of each text line, and extracting the text content of the text line.
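Steps 506 to 514 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the classifier output `line_probs`, the transition table `transitions`, and the helper names are hypothetical, and the candidate sequences are scored using Markov transition probabilities only, since the specification does not say whether classifier probabilities also enter the score.

```python
import math
from itertools import product


def top_k(probs, k):
    """Step 506: keep the k most probable data types for one text line."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


def sequence_score(seq, transitions, start="<start>", floor=1e-6):
    """Log-probability of a data type sequence under a first-order Markov model."""
    score, prev = 0.0, start
    for label in seq:
        score += math.log(transitions.get(prev, {}).get(label, floor))
        prev = label
    return score


def resolve_types(line_probs, transitions, k=2):
    """Steps 506-512: enumerate candidate sequences and keep the best-scoring one."""
    candidates = [top_k(p, k) for p in line_probs]
    return max(product(*candidates), key=lambda s: sequence_score(s, transitions))


def extract(lines, types, wanted):
    """Step 514: return the text of the lines whose resolved type is `wanted`."""
    return [text for text, t in zip(lines, types) if t == wanted]


# Hypothetical classifier output; the second line is misclassified at rank 1.
lines = ["ZHANG SAN", "1234567890", "1 EXAMPLE ROAD"]
line_probs = [
    {"name": 0.7, "address": 0.2, "certificate number": 0.1},
    {"address": 0.5, "certificate number": 0.45, "name": 0.05},
    {"address": 0.8, "name": 0.15, "certificate number": 0.05},
]
# Hypothetical transition probabilities of a trained Markov model.
transitions = {
    "<start>": {"name": 0.9, "certificate number": 0.05, "address": 0.05},
    "name": {"certificate number": 0.8, "address": 0.15, "name": 0.05},
    "certificate number": {"address": 0.9, "name": 0.05, "certificate number": 0.05},
    "address": {"address": 0.1, "name": 0.1, "certificate number": 0.1,
                "date of issuance": 0.7},
}

types = resolve_types(line_probs, transitions)
numbers = extract(lines, types, "certificate number")
```

In this example the classifier ranks "address" first for the second line, but the sequence name, certificate number, address has the highest transition score, so the second line is resolved as the certificate number. Enumerating every combination grows exponentially with the number of lines; that is acceptable for the handful of lines on a certificate, which is the assumption made here.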

The OCR text processing method provided by the embodiment of the specification identifies, based on a trained text classification model, the data types to which the text content of each text line or text column in the target OCR text data may belong, and then determines, from these candidate data types, the data type to which the text content belongs. When the text classification model is trained, a corresponding sample OCR text data set is configured for the certificates of each layout arrangement, so that the trained text classification model can recognize certificates of various layout arrangements, and the OCR text data of certificates of various layout arrangements can be processed together during text processing. In addition, errors that may be encountered during OCR recognition are included in model training as error sample OCR text data, so that even erroneous OCR text data obtained during OCR recognition can be processed. This improves the applicability of the text classification model, handles the special problems of various OCR scenarios better, and improves the accuracy of data type recognition, so that the text content of a required data type can be extracted accurately.

Corresponding to the text processing method provided in the embodiment of the present specification, the embodiment of the present specification further provides a training method for a text classification model, and the trained text classification model is applied to the embodiments shown in fig. 1 to 5. Fig. 6 is one of the method flowcharts of the training method of the text classification model according to the embodiment of the present disclosure, and the method shown in fig. 6 includes at least the following steps:

step 602, determining a layout arrangement template corresponding to each certificate based on the layout arrangement of the certificate;

step 604, for the certificates of each layout arrangement, configuring a sample OCR text data set for the layout arrangement template corresponding to the certificates; the sample OCR text data set comprises correct sample OCR text data and error sample OCR text data, and the data type of the text content of each text row or text column in the correct sample OCR text data and the error sample OCR text data;

step 606, training a text classification model based on the sample OCR text data set corresponding to each certificate.

Specifically, in the embodiment of the present disclosure, configuring, in step 604, a sample OCR text data set for the layout arrangement template corresponding to the certificates of each layout arrangement includes the following steps:

According to a pre-established sample user database, a plurality of sample user data are configured for each layout arrangement template to obtain a plurality of correct sample OCR text data corresponding to the layout arrangement template; the correct sample OCR text data are processed according to a set rule to obtain a plurality of error sample OCR text data corresponding to the layout arrangement template; the plurality of correct sample OCR text data, the plurality of error sample OCR text data, and the data types to which the text content of each text line or text column in the correct sample OCR text data and the error sample OCR text data belongs are combined into a sample OCR text data set.
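The generation of correct sample OCR text data from layout arrangement templates can be sketched as below. The representation is an assumption: a template is modeled as an ordered list of data types, a sample user record as a dictionary, and a sample as a list of (text, data type) pairs; the names `templates`, `users`, and `render_sample` are hypothetical.

```python
def render_sample(template, user):
    """Fill one layout arrangement template with one sample user record.

    Returns labelled lines as (text, data type) pairs in template order.
    """
    return [(user[field], field) for field in template]


# Hypothetical layout arrangement templates: the order of data types on the certificate.
templates = {
    "layout_a": ["name", "certificate number", "address"],
    "layout_b": ["certificate number", "name", "address"],
}

# Hypothetical records standing in for the pre-established sample user database.
users = [
    {"name": "ZHANG SAN", "certificate number": "A1234567", "address": "1 EXAMPLE ROAD"},
    {"name": "LI SI", "certificate number": "B7654321", "address": "2 SAMPLE STREET"},
]

# One list of correct, labelled samples per layout arrangement template.
correct_samples = {
    layout: [render_sample(fields, user) for user in users]
    for layout, fields in templates.items()
}
```

Since every line is generated from a known field, the data type label of each line comes for free, which is what makes the samples usable for supervised training.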

Specifically, the processing the correct sample OCR text data according to the set rule includes:

deleting text content of at least one text line or text column in the correct sample OCR text data;

and/or,

replacing characters in the correct sample OCR text data with similar characters.

That is, processing the correct sample OCR text data according to the set rule covers at least the following three implementations:

deleting the text content of at least one text line or text column in the correct sample OCR text data;

replacing characters in the correct sample OCR text data with similar characters;

deleting the text content of at least one text line or text column in the correct sample OCR text data and replacing characters in the correct sample OCR text data with similar characters.
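The three implementations above can be illustrated with the following minimal sketch, which derives error samples from one correct sample given as a list of (text, data type) pairs. The confusion table `SIMILAR`, the replacement rate, and the function names are assumptions for illustration; the specification does not fix which characters count as similar or how many lines are deleted.

```python
import random

# Hypothetical confusion table of visually similar characters.
SIMILAR = {"0": "O", "O": "0", "1": "l", "l": "1",
           "5": "S", "S": "5", "8": "B", "B": "8"}


def delete_lines(sample, rng, n=1):
    """Drop n labelled text lines, simulating lines the OCR engine missed."""
    n = max(0, min(n, len(sample) - 1))  # always keep at least one line
    dropped = set(rng.sample(range(len(sample)), n))
    return [line for i, line in enumerate(sample) if i not in dropped]


def replace_similar(sample, rng, rate=0.3):
    """Swap characters for look-alikes with probability `rate`; labels stay unchanged."""
    def corrupt(text):
        return "".join(
            SIMILAR[c] if c in SIMILAR and rng.random() < rate else c
            for c in text
        )
    return [(corrupt(text), label) for text, label in sample]


def make_error_samples(sample, seed=0):
    """Return three variants: deletion only, replacement only, and both combined."""
    rng = random.Random(seed)
    return [
        delete_lines(sample, rng),
        replace_similar(sample, rng),
        replace_similar(delete_lines(sample, rng), rng),
    ]


correct = [
    ("ZHANG SAN", "name"),
    ("A1050518", "certificate number"),
    ("1 EXAMPLE ROAD", "address"),
]
errors = make_error_samples(correct)
```

The data type labels are carried over unchanged, so each error sample remains a labelled training example even though its text or line count no longer matches the correct sample.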

The specific implementation process of each step in the embodiments of the present disclosure may refer to the embodiments shown in fig. 1 to 5, and will not be described herein again.

In the training method of the text classification model provided by the embodiment of the specification, a corresponding sample OCR text data set is configured for the certificates of each layout arrangement, so that the trained text classification model can recognize certificates of various layout arrangements, and the OCR text data of certificates of various layout arrangements can be processed together during text processing. In addition, errors that may be encountered during OCR recognition are included in model training as error sample OCR text data, so that even erroneous OCR text data obtained during OCR recognition can be processed. This improves the applicability of the text classification model, handles the special problems of various OCR scenarios better, and improves the accuracy of data type recognition, so that the text content of a required data type can be extracted accurately.

Based on the same concept, the embodiment of the present disclosure also provides a text processing device, which is used for executing the text processing method provided by the embodiments shown in fig. 1 to 5. Fig. 7 is a schematic block diagram of a text processing device according to an embodiment of the present disclosure, where the device shown in fig. 7 includes at least the following modules:

an acquisition module 702 for acquiring target OCR text data of a target document;

an identification module 704 for identifying, for each text line or text column of text content in the target OCR text data, a data type to which the text content may belong using a text classification model; the text classification model is trained based on a sample OCR text data set corresponding to certificates arranged in various formats, and the sample OCR text data set comprises correct sample OCR text data and error sample OCR text data, and the data types of text contents of each text row or text column in the correct sample OCR text data and the error sample OCR text data;

a first determining module 706, configured to determine, according to the respective data types and a type determination model, the data type to which the text content of each text line or text column in the target OCR text data belongs.

Optionally, the first determining module 706 includes:

the combination unit is used for combining the data types possibly belonging to each text content to obtain a plurality of possible data type combination sequences corresponding to the target OCR text data;

the first determining unit is used for inputting the plurality of possible data type combination sequences into the type determination model for processing, and determining the one data type combination sequence output by the type determination model as the data type combination sequence corresponding to the target OCR text data;

and a second determining unit for determining the data type of each text content according to the data type combination sequence of the target OCR text data.

Optionally, the apparatus provided in the embodiments of the present specification further includes:

the third determining module is used for determining a layout arrangement template corresponding to the certificates based on layout arrangement of the certificates;

the configuration module is used for configuring a sample OCR text data set for the layout configuration template corresponding to the certificates aiming at the certificates arranged in each layout; the sample OCR text data set comprises correct sample OCR text data and error sample OCR text data, and the data type of the text content of each text row or text column in the correct sample OCR text data and the error sample OCR text data;

Optionally, the error sample OCR text data includes at least one or more of the following sample data:

Optionally, an apparatus provided in an embodiment of the present disclosure further includes:

and the extraction module is used for extracting the text content with the specified data type from the target OCR text data according to the data type corresponding to each text content in the target OCR text data.

Optionally, the text classification model is a bidirectional long short-term memory recurrent neural network (BiLSTM) model;

the type determination model is a Markov probability model.

It should be noted that the text processing device provided in the embodiment of the present disclosure is based on the same inventive concept as the text processing method provided in the embodiments shown in fig. 1 to 5, so for the implementation of this embodiment, reference may be made to the implementation of the foregoing text processing method; repeated description is omitted.

The text processing device provided in the embodiments of the present disclosure identifies, based on a trained text classification model, the data types to which the text content of each text line or text column in the target OCR text data may belong, and then determines, from these candidate data types, the data type to which the text content belongs. When the text classification model is trained, a corresponding sample OCR text data set is configured for the certificates of each layout arrangement, so that the trained text classification model can recognize certificates of various layout arrangements, and the OCR text data of certificates of various layout arrangements can be processed together during text processing. In addition, errors that may be encountered during OCR recognition are included in model training as error sample OCR text data, so that even erroneous OCR text data obtained during OCR recognition can be processed. This improves the applicability of the text classification model, handles the special problems of various OCR scenarios better, and improves the accuracy of data type recognition, so that the text content of a required data type can be extracted accurately.

Corresponding to the method provided in the embodiment shown in fig. 6 of the present specification, and based on the same idea, the embodiment of the present specification further provides a training device for a text classification model, which is used for executing the method provided in the embodiment shown in fig. 6. Fig. 8 is a schematic block diagram of the training device for a text classification model provided in the embodiment of the present specification, and the device shown in fig. 8 includes at least:

a second determining module 802, configured to determine a layout arrangement template corresponding to the certificates based on the layout arrangement of the certificates;

the configuration module 804 is configured to configure, for the certificates of each layout arrangement, a sample OCR text data set for the layout arrangement template corresponding to the certificates; the sample OCR text data set comprises correct sample OCR text data and error sample OCR text data, and the data type of the text content of each text row or text column in the correct sample OCR text data and the error sample OCR text data;

a training module 806 is configured to train the text classification model based on the sample OCR text data set corresponding to each document.

Optionally, the configuring module 804 includes:

the configuration unit is used for configuring a plurality of sample user data for each layout configuration template according to a pre-established sample user database to obtain a plurality of correct sample OCR text data corresponding to the layout configuration templates;

The processing unit is used for processing the OCR text data of the correct sample according to the set rule to obtain a plurality of error sample OCR text data corresponding to the layout arrangement template;

and a combination unit for combining the plurality of correct sample OCR text data, the plurality of error sample OCR text data, the correct sample OCR text data and the data type of the text content of each text line or text column in the error sample OCR text data as a sample OCR text data set.

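Optionally, the processing unit is specifically configured to:

delete text content of at least one text line or text column in the correct sample OCR text data;

and/or,

replace characters in the correct sample OCR text data with similar characters.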

It should be noted that the training device for the text classification model provided in the embodiment of the present disclosure is based on the same inventive concept as the training method for the text classification model provided in the embodiment shown in fig. 6, so for the implementation of this embodiment, reference may be made to the implementation of that training method; repeated description is omitted.

In the training device for the text classification model provided by the embodiment of the specification, a corresponding sample OCR text data set is configured for the certificates of each layout arrangement, so that the trained text classification model can recognize certificates of various layout arrangements, and the OCR text data of certificates of various layout arrangements can be processed together during text processing. In addition, errors that may be encountered during OCR recognition are included in model training as error sample OCR text data, so that even erroneous OCR text data obtained during OCR recognition can be processed. This improves the applicability of the text classification model, handles the special problems of various OCR scenarios better, and improves the accuracy of data type recognition, so that the text content of a required data type can be extracted accurately.

Further, based on the methods shown in fig. 1 to 5, the embodiment of the present disclosure further provides a text processing device, as shown in fig. 9.

The text processing device may vary considerably in configuration or performance. It may include one or more processors 901 and a memory 902, and one or more applications or data may be stored in the memory 902. The memory 902 may be transient storage or persistent storage. An application stored in the memory 902 may include one or more modules (not shown in the figure), each of which may include a series of computer-executable instruction information for the text processing device. Further, the processor 901 may be configured to communicate with the memory 902 and to execute, on the text processing device, the series of computer-executable instruction information in the memory 902. The text processing device may also include one or more power supplies 903, one or more wired or wireless network interfaces 904, one or more input/output interfaces 905, one or more keyboards 906, and the like.

In a particular embodiment, the text processing device includes a memory and one or more programs, wherein the one or more programs are stored in the memory and may include one or more modules, each module may include a series of computer-executable instruction information for the text processing device, and the one or more programs are configured to be executed by one or more processors and include computer-executable instruction information for:

Acquiring target OCR text data of a target certificate;

Optionally, when the computer executable instruction information is executed, determining a data type to which the text content of each text line or text column in the target OCR text data belongs according to each data type and type determining model, including:

combining the data types possibly belonging to each text content to obtain a plurality of possible data type combination sequences corresponding to the target OCR text data;

Inputting a plurality of possible data type combination sequences into a type determination model for processing, and determining one type combination sequence output by the type determination model as a data type combination sequence to which target OCR text data belong;

and determining the data type of each text content according to the data type combination sequence of the target OCR text data.

Optionally, the computer executable instruction information, when executed, may further perform the following steps before obtaining the target OCR text data of the target document:

determining a layout arrangement template corresponding to the certificates based on layout arrangement of the certificates;

for certificates arranged in each format, configuring a sample OCR text data set for a format arrangement template corresponding to the certificate; the sample OCR text data set comprises correct sample OCR text data and error sample OCR text data, and the data type of the text content of each text row or text column in the correct sample OCR text data and the error sample OCR text data;

a text classification model is trained based on the sample OCR text data set corresponding to each document.

Optionally, when the computer executable instruction information is executed, the error sample OCR text data includes at least one or more of the following sample data:

Optionally, when the computer executable instruction information is executed, after determining the data type of the text content of each text line or text column in the target OCR text data according to each data type and the probability corresponding to the data type, the following steps may be further executed:

and extracting the text content with the specified data type from the target OCR text data according to the data type corresponding to each text content in the target OCR text data.

Optionally, when the computer executable instruction information is executed, the text classification model is a bidirectional long short-term memory recurrent neural network (BiLSTM) model;

the type determination model is a Markov probability model.

Further, based on the methods shown in fig. 1 to 5, the embodiment of the present disclosure further provides a training device for a text classification model, where the specific structure of the training device may refer to the text processing device shown in fig. 9.

In a specific embodiment, the training device of the text classification model includes a memory and one or more programs, wherein the one or more programs are stored in the memory and may include one or more modules, each module may include a series of computer-executable instruction information for the training device of the text classification model, and the one or more programs are configured to be executed by one or more processors and include computer-executable instruction information for:

Optionally, when the computer executable instruction information is executed, configuring, for the certificates of each layout arrangement, a sample OCR text data set for the layout arrangement template corresponding to the certificates includes:

according to a pre-established sample user database, configuring a plurality of sample user data for each layout template to obtain a plurality of correct sample OCR text data corresponding to the layout template;

processing the OCR text data of the correct sample according to a set rule to obtain a plurality of OCR text data of the error sample corresponding to the layout arrangement template;

the plurality of correct sample OCR text data, the plurality of error sample OCR text data, the correct sample OCR text data and the data type to which the text content of each text line or text column in the error sample OCR text data belongs are combined as a sample OCR text data set.

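Optionally, when the computer executable instruction information is executed, processing the correct sample OCR text data according to a set rule includes:

deleting text content of at least one text line or text column in the correct sample OCR text data;

and/or,

replacing characters in the correct sample OCR text data with similar characters.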

In the training device for the text classification model provided by the embodiment of the specification, a corresponding sample OCR text data set is configured for the certificates of each layout arrangement, so that the trained text classification model can recognize certificates of various layout arrangements, and the OCR text data of certificates of various layout arrangements can be processed together during text processing. In addition, errors that may be encountered during OCR recognition are included in model training as error sample OCR text data, so that even erroneous OCR text data obtained during OCR recognition can be processed. This improves the applicability of the text classification model, handles the special problems of various OCR scenarios better, and improves the accuracy of data type recognition, so that the text content of a required data type can be extracted accurately.

Further, based on the methods shown in fig. 1 to 5, the embodiment of the present disclosure further provides a storage medium for storing computer executable instruction information. In a specific embodiment, the storage medium may be a USB flash drive, an optical disc, a hard disk, or the like, and the computer executable instruction information stored in the storage medium can implement the following flow when executed by a processor:

Acquiring target OCR text data of a target certificate;

Optionally, the computer executable instruction information stored in the storage medium, when executed by the processor, determines a data type to which the text content of each text line or text column in the target OCR text data belongs according to the respective data type and type determination model, including:

Inputting a plurality of possible data type combination sequences into the type determining model for processing, and determining one data type combination sequence output by the type determining model as a data type combination sequence corresponding to the target OCR text data;

Optionally, the storage medium stores computer executable instruction information that, when executed by the processor, further performs the steps of:

Optionally, when the computer executable instruction information stored in the storage medium is executed by the processor, the error sample OCR text data includes at least one or more of the following sample data:

Optionally, when the computer executable instruction information stored in the storage medium is executed by the processor, after determining, according to each data type and the probability corresponding to the data type, the data type to which the text content of each text line or text column in the target OCR text data belongs, the following steps may be further executed:

Optionally, when the computer executable instruction information stored in the storage medium is executed by the processor, the text classification model is a bidirectional long short-term memory recurrent neural network (BiLSTM) model;

the type determination model is a Markov probability model.

When the computer executable instruction information stored in the storage medium provided in the embodiments of the present disclosure is executed by a processor, the data types to which the text content of each text line or text column in the target OCR text data may belong are identified based on a trained text classification model, and the data type to which the text content belongs is then determined from these candidate data types. When the text classification model is trained, a corresponding sample OCR text data set is configured for the certificates of each layout arrangement, so that the trained text classification model can recognize certificates of various layout arrangements, and the OCR text data of certificates of various layout arrangements can be processed together during text processing. In addition, errors that may be encountered during OCR recognition are included in model training as error sample OCR text data, so that even erroneous OCR text data obtained during OCR recognition can be processed. This improves the applicability of the text classification model, handles the special problems of various OCR scenarios better, and improves the accuracy of data type recognition, so that the text content of a required data type can be extracted accurately.

Further, based on the method shown in fig. 6, the embodiment of the present disclosure further provides a storage medium for storing computer executable instruction information. In a specific embodiment, the storage medium may be a USB flash drive, an optical disc, a hard disk, or the like, and the computer executable instruction information stored in the storage medium can implement the following flow when executed by a processor:

Optionally, when the computer executable instruction information stored in the storage medium is executed by the processor, configuring, for the certificates of each layout arrangement, a sample OCR text data set for the layout arrangement template corresponding to the certificates includes:

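Optionally, when the computer executable instruction information stored in the storage medium is executed by the processor, processing the correct sample OCR text data according to a set rule includes:

deleting text content of at least one text line or text column in the correct sample OCR text data;

and/or,

replacing characters in the correct sample OCR text data with similar characters.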

When the computer executable instruction information stored in the storage medium provided by the embodiment of the specification is executed by the processor, a corresponding sample OCR text data set is configured for the certificates of each layout arrangement, so that the trained text classification model can recognize certificates of various layout arrangements, and the OCR text data of certificates of various layout arrangements can be processed together during text processing. In addition, errors that may be encountered during OCR recognition are included in model training as error sample OCR text data, so that even erroneous OCR text data obtained during OCR recognition can be processed. This improves the applicability of the text classification model, handles the special problems of various OCR scenarios better, and improves the accuracy of data type recognition, so that the text content of a required data type can be extracted accurately.

In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually manufacturing integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by simply programming the method flow into an integrated circuit using the hardware description languages described above.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may also be regarded as structures within the hardware component. Or the means for performing various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being functionally divided into various units. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.

In this specification, the embodiments are described in a progressive manner. Identical and similar parts of the embodiments may be understood by reference to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant parts, refer to the description of the method embodiments.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims

1. A method of processing text, the method comprising:

acquiring target OCR text data of a target certificate;

for the text content of each text row or text column in the target OCR text data, identifying, by using a text classification model, each data type to which the text content may belong; wherein the text classification model is trained based on sample OCR text data sets corresponding to certificates of various layouts, and each sample OCR text data set comprises correct sample OCR text data, error sample OCR text data, and the data types of the text content of each text row or text column in the correct sample OCR text data and the error sample OCR text data;

determining the data type of the text content of each text row or text column in the target OCR text data according to the identified data types and a type determining model;

wherein the determining the data type of the text content of each text row or text column in the target OCR text data according to the identified data types and the type determining model comprises:

combining the data types possibly belonging to the text contents to obtain a plurality of possible data type combination sequences corresponding to the target OCR text data;

and determining the data type of each text content according to the data type combination sequence corresponding to the target OCR text data.

2. The method of claim 1, wherein prior to acquiring the target OCR text data of the target certificate, the method further comprises:

3. The method of claim 2, wherein the error sample OCR text data comprises at least one of the following sample data:

sample data obtained by deleting the text content of at least one text row or text column in the correct sample OCR text data;

sample data obtained by replacing characters in the correct sample OCR text data with similar characters.

4. The method of claim 1, wherein after determining the data type to which the text content of each text line or text column in the target OCR text data belongs according to the respective data type and type determination model, the method further comprises:

extracting text content of a specified data type from the target OCR text data according to the data type of each text content in the target OCR text data.

5. The method of claim 1, wherein the text classification model is a bi-directional long-short-term memory recurrent neural network BiLSTM model;

the type determination model is a Markov probability model.

6. The method of claim 2, wherein the configuring, for each layout of certificate, the sample OCR text data set for the layout template corresponding to the certificate comprises:

according to a pre-established sample user database, configuring a plurality of sample user data for each layout configuration template to obtain a plurality of correct sample OCR text data corresponding to the layout configuration template;

processing the correct sample OCR text data according to a set rule to obtain a plurality of error sample OCR text data corresponding to the layout arrangement template;

and taking the plurality of correct sample OCR text data, the plurality of error sample OCR text data, and the data types of the text content of each text line or text column in the correct sample OCR text data and the error sample OCR text data together as the sample OCR text data set.

7. The method of claim 6, wherein said processing said correct sample OCR text data according to a set rule comprises:

and/or,

replacing characters in the correct sample OCR text data with similar characters.
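
By way of a non-limiting illustration, the two rules for deriving error sample OCR text data recited above (deleting the text content of a text line, and replacing characters with similar characters) might be sketched as follows. The table `SIMILAR_CHARS`, the function names, the `rate` parameter, and the sample data are all hypothetical and are introduced only for this example; the source does not specify how similar characters are chosen or how many error samples are produced per correct sample.

```python
import random

# Hypothetical table of visually similar characters (an assumption for
# illustration; the source does not enumerate any such pairs).
SIMILAR_CHARS = {"0": "O", "O": "0", "1": "l", "l": "1",
                 "5": "S", "S": "5", "8": "B", "B": "8"}


def delete_line(sample, rng):
    """Return a copy of a labeled sample with one text line removed."""
    if len(sample) <= 1:
        return list(sample)
    drop = rng.randrange(len(sample))
    return [item for i, item in enumerate(sample) if i != drop]


def replace_similar(sample, rng, rate=0.3):
    """Return a copy in which some characters are swapped for look-alikes.

    The data type label of each line is kept unchanged.
    """
    out = []
    for text, label in sample:
        chars = [
            SIMILAR_CHARS[c] if c in SIMILAR_CHARS and rng.random() < rate else c
            for c in text
        ]
        out.append(("".join(chars), label))
    return out


def build_sample_set(correct_samples, seed=0):
    """Combine correct samples with error samples derived from them."""
    rng = random.Random(seed)
    error_samples = []
    for sample in correct_samples:
        error_samples.append(delete_line(sample, rng))
        error_samples.append(replace_similar(sample, rng))
    return list(correct_samples) + error_samples


# Made-up sample: each entry is (text content of a line, data type label).
sample = [("Name", "name_key"), ("ZHANG SAN", "name_value"), ("1058", "id_value")]
dataset = build_sample_set([sample])
```

In this sketch every line carries its data type label through both transformations, so the error samples remain usable as labeled training data alongside the correct samples.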

8. A text processing apparatus, the apparatus comprising:

an acquisition module, configured to acquire target OCR text data of a target certificate;

an identification module, configured to identify, by using a text classification model and for the text content of each text row or text column in the target OCR text data, each data type to which the text content may belong; wherein the text classification model is trained based on sample OCR text data sets corresponding to certificates of various layouts, and each sample OCR text data set comprises correct sample OCR text data, error sample OCR text data, and the data types of the text content of each text row or text column in the correct sample OCR text data and the error sample OCR text data;

a first determining module, configured to determine the data type of the text content of each text row or text column in the target OCR text data according to the identified data types and a type determining model;

wherein the first determining module includes:

a combination unit for combining the data types possibly belonging to the text contents to obtain a plurality of possible data type combination sequences corresponding to the target OCR text data;

a first determining unit, configured to input a plurality of possible data type combination sequences into the type determining model for processing, and determine one data type combination sequence output by the type determining model as a data type combination sequence corresponding to the target OCR text data;

and the second determining unit is used for determining the data type of each text content according to the data type combination sequence corresponding to the target OCR text data.

9. The apparatus of claim 8, the apparatus further comprising:

10. A text processing apparatus comprising:

a processor; and

acquiring target OCR text data of a target certificate;

11. The apparatus of claim 10, further comprising:

the executable instructions, when executed, cause the processor to:

12. A storage medium storing computer-executable instructions that when executed implement the following:

acquiring target OCR text data of a target certificate;

13. The storage medium of claim 12, the executable instructions when executed further implement the following:

training a text classification model based on the sample OCR text data set corresponding to each certificate.
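
By way of a non-limiting illustration, the type determination recited in the claims (combining the candidate data types of each text content into possible data type combination sequences, selecting one sequence with a Markov probability model, and then extracting text content of a specified data type) might be sketched as follows. The candidate data types are assumed to have already been produced by the text classification model; the function names, the `floor` parameter, the data type labels, and all probabilities are hypothetical values chosen for this example and do not come from the source.

```python
import math
from itertools import product


def best_type_sequence(candidates, start_prob, trans_prob, floor=1e-6):
    """Pick the most probable data type sequence under a first-order Markov chain.

    candidates: one list of candidate data types per text line.
    start_prob: {data_type: probability of appearing on the first line}.
    trans_prob: {(previous_type, next_type): transition probability}.
    floor: assumed small probability for events missing from the tables.
    """
    if not candidates:
        return []
    best_seq, best_score = None, -math.inf
    # Enumerate every combination sequence of the per-line candidates.
    for seq in product(*candidates):
        score = math.log(max(start_prob.get(seq[0], floor), floor))
        for prev, cur in zip(seq, seq[1:]):
            score += math.log(max(trans_prob.get((prev, cur), floor), floor))
        if score > best_score:
            best_seq, best_score = seq, score
    return list(best_seq)


def extract_by_type(lines, types, wanted):
    """Return the text content whose determined data type equals `wanted`."""
    return [text for text, t in zip(lines, types) if t == wanted]


# Made-up OCR lines and candidate data types for each line.
lines = ["Name", "ZHANG SAN", "ID No.", "X1234567"]
candidates = [
    ["name_key"],
    ["name_value", "id_value"],
    ["id_key"],
    ["id_value", "name_value"],
]
# Made-up Markov parameters.
start_prob = {"name_key": 0.9, "id_key": 0.1}
trans_prob = {
    ("name_key", "name_value"): 0.9,
    ("name_value", "id_key"): 0.8,
    ("id_key", "id_value"): 0.9,
}

best = best_type_sequence(candidates, start_prob, trans_prob)
print(best)  # ['name_key', 'name_value', 'id_key', 'id_value']
print(extract_by_type(lines, best, "id_value"))  # ['X1234567']
```

Exhaustive enumeration is used here only because it mirrors the claim language most directly. The number of combination sequences grows multiplicatively with the number of lines, so a dynamic-programming search such as the Viterbi algorithm would be a natural substitute for longer inputs.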