CN114187448A - Document image recognition method and device, electronic equipment and computer readable medium - Google Patents


Info

Publication number
CN114187448A
Authority
CN
China
Prior art keywords: recognition, text, identification, layout area, area corresponding
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111505415.0A
Other languages
Chinese (zh)
Inventor
李晨霞
杜宇宁
周军
郭若愚
杨烨华
赖宝华
刘其文
胡晓光
于佃海
马艳军
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111505415.0A
Publication of CN114187448A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Character Input (AREA)

Abstract

The disclosure provides a document image recognition method and apparatus, relating to the technical fields of image recognition, deep learning, and the like. The scheme is implemented as follows: a document image to be recognized is acquired; whether the document image has at least one recognition element is detected; in response to the document image having at least one recognition element, the document image is divided into at least one layout area; and, for each recognition element, the layout area corresponding to that element is recognized to obtain a recognition result for that layout area. The embodiment improves the efficiency of document image recognition.

Description

Document image recognition method and device, electronic equipment and computer readable medium
Technical Field
The present disclosure relates to the field of computer technology, in particular to the technical fields of image recognition and deep learning, and more particularly to a document image recognition method and apparatus, an electronic device, a computer-readable medium, and a computer program product.
Background
In everyday office scenarios, text documents containing tables frequently need to be processed. Examples include the financial industry extracting information from the various documents a user submits when applying for qualification, manufacturing tracking the production conditions of entities, and the need to store invoices and forms of all kinds electronically.
In the field of document analysis, researchers have proposed various layout analysis algorithms and table recognition algorithms, but each such algorithm solves only a single subtask and cannot process a text document as a whole.
Disclosure of Invention
A document image recognition method and apparatus, an electronic device, a computer-readable medium, and a computer program product are provided.
According to a first aspect, there is provided a document image recognition method, the method comprising: acquiring a document image to be identified; detecting whether a document image to be identified has at least one identification element; in response to at least one identification element in the document image to be identified, dividing the document image to be identified into at least one layout area; and identifying the layout area corresponding to each identification element to obtain an identification result of the layout area corresponding to the identification element.
According to a second aspect, there is also provided a document image recognition method, the method comprising: acquiring a document image to be identified; inputting a document image to be recognized into a layout recognition model which is trained in advance, so that the layout recognition model detects whether the document image to be recognized has at least one recognition element; responding to the document image to be identified with at least one identification element, and obtaining at least one layout area output by the layout identification model; acquiring recognition element models which correspond to the recognition elements and are trained in advance, wherein each recognition element model is used for recognizing one recognition element; and aiming at each identification element, identifying the layout area corresponding to the identification element by adopting the acquired identification element model to obtain an identification result of the layout area corresponding to the identification element.
According to a third aspect, there is provided a document image recognition apparatus, the apparatus comprising: an acquisition unit configured to acquire a document image to be recognized; a detecting unit configured to detect whether a document image to be recognized has at least one recognition element; the dividing unit is configured to respond to at least one identification element in the document image to be identified and divide the document image to be identified into at least one layout area; and the identification unit is configured to identify the layout area corresponding to each identification element and obtain the identification result of the layout area corresponding to the identification element.
According to a fourth aspect, there is provided a document image recognition apparatus comprising: an image acquisition unit configured to acquire a document image to be recognized; the input unit is configured to input the document image to be recognized into a pre-trained layout recognition model so that the layout recognition model detects whether the document image to be recognized has at least one recognition element; the obtaining unit is configured to respond to the document image to be recognized with at least one recognition element, and obtain at least one layout area output by the layout recognition model; a model acquisition unit configured to acquire recognition element models trained in advance corresponding to respective recognition elements, each recognition element model being used for recognizing one kind of recognition element; and the identifying unit is configured to identify the layout area corresponding to each identifying element by adopting the acquired identifying element model, and obtain the identification result of the layout area corresponding to the identifying element.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described in any one of the implementations of the first aspect or the second aspect.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method as described in any one of the implementations of the first or second aspect.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first or second aspect.
The document image recognition method and apparatus provided by the embodiments of the present disclosure proceed as follows: first, a document image to be recognized is acquired; second, whether the document image has at least one recognition element is detected; third, in response to the document image having at least one recognition element, the document image is divided into at least one layout area; finally, for each recognition element, the layout area corresponding to that element is recognized to obtain its recognition result. Thus, once the document image is determined to have recognition elements, its layout areas are divided according to those elements and recognized separately, yielding a recognition result for each layout area. This achieves integrated recognition of the document image, improves the efficiency of document image recognition in different scenarios, and improves the practical effect of document structure analysis in industry.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of one embodiment of a document image recognition method according to the present disclosure;
FIG. 2 is a schematic diagram illustrating a process for identifying layout areas corresponding to forms to obtain editable forms according to the present disclosure;
FIG. 3 is a flow diagram of another embodiment of a document image recognition method according to the present disclosure;
FIG. 4 is a schematic block diagram of one embodiment of an image recognition apparatus according to the present disclosure;
FIG. 5 is a schematic block diagram of another embodiment of an image recognition apparatus according to the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing a document image recognition method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 shows a flow 100 according to one embodiment of a document image recognition method of the present disclosure, comprising the steps of:
step 101, obtaining a document image to be identified.
In this embodiment, the document image to be recognized is an image carrying document information (e.g., text, tables, etc.). By recognizing it, an editable electronic document can be obtained; this editable electronic document (including spreadsheets and text documents) can then be edited, for example by adding text or modifying a table.
The execution body on which the document image recognition method operates may acquire the document image to be recognized through various ways, for example, acquiring the image to be recognized in real time from a terminal, or directly obtaining the document image to be recognized prestored in a database.
Step 102, detecting whether the document image to be identified has at least one identification element.
In this embodiment, a recognition element is a principal document factor in the document image; it may be at least one of text, a table, a picture, a title, and the like. Each recognition element corresponds to a detection method or a recognition element model, so the recognition elements within a layout area can be detected by different detection methods or recognition element models. The document image to be recognized is an image containing document information, which in this embodiment is information that a computer can use for text entry or line drawing.
In this embodiment, whether the image to be recognized has the recognition element can be detected in various ways, wherein the recognition element can be recognized in the image to be recognized by an image recognition technology because the recognition element is embodied in the image to be recognized in an image form.
Alternatively, the identification element in the image to be identified may also be identified through a model, for example, acquiring an identification element feature, inputting the image to be identified into a pre-trained layout identification model, detecting the feature of the image to be identified through the layout identification model, and determining that the image to be identified has the identification element in response to that the similarity between the feature of a certain region of the image to be identified output by the layout identification model and the identification element feature is greater than a similarity threshold.
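The similarity-threshold check described above can be sketched in minimal pure Python. This is an illustration, not the patent's implementation: the function names, the template features, and the fixed threshold of 0.8 are all assumptions, and a real layout recognition model would produce learned feature vectors rather than hand-made ones.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def detect_elements(region_features, element_templates, threshold=0.8):
    """For each candidate region feature, report which recognition-element
    template (text, table, title, ...) it best matches above the threshold."""
    detections = []
    for region_id, feat in region_features.items():
        best_name, best_sim = None, threshold
        for name, template in element_templates.items():
            sim = cosine_similarity(feat, template)
            if sim >= best_sim:
                best_name, best_sim = name, sim
        if best_name is not None:
            detections.append((region_id, best_name))
    return detections
```

A region whose feature clears the threshold for some template is reported as containing that recognition element; regions matching no template are ignored.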
Optionally, when it is detected that the document image to be recognized does not have any recognition element, the document image to be recognized is not processed.
Optionally, the execution subject on which the document image recognition method of this embodiment operates may also detect a non-recognition element and partition a layout area for the non-recognition element, where the non-recognition element is an element that has no corresponding detection method and is unrelated to the document, for example, a picture in the image to be recognized, which does not need to set a detection method independently, and the execution subject directly outputs the picture after detecting the picture.
Step 103, responding to at least one identification element in the document image to be identified, and dividing the document image to be identified into at least one layout area.
In this embodiment, dividing the document image to be recognized into at least one layout area means detecting and classifying the different areas of the image and determining the area corresponding to each recognition element. For example, a text image portion of the document image is assigned to the layout area corresponding to text, which may be a text paragraph or a text line.
Optionally, the layout area division may also be determined according to requirements, for example, detecting and classifying areas such as texts, tables, pictures, and titles in the document picture. The specific categories may be defined according to actual requirements, for example detecting the four categories of titles, pictures, texts, and tables, or detecting only tables.
Different layout area division methods can also be adopted for the layout areas corresponding to different identification elements. For example, for a text, a layout algorithm may be adopted to divide black and white communication fields in an image to be recognized into characters, text lines, text blocks and the like from top to bottom in sequence, so as to obtain a layout area of the text.
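The top-down grouping of characters into text lines and text blocks can be illustrated with a simplified sketch over word bounding boxes. The function names and the pixel tolerances are assumptions for illustration; a production layout algorithm would operate on connected components of the binarized image rather than on pre-extracted boxes.

```python
def group_into_lines(word_boxes, y_tol=5):
    """Group word bounding boxes (x0, y0, x1, y1) into text lines:
    boxes whose vertical centers lie within y_tol of a line's mean
    center are assigned to that line."""
    lines = []
    for box in sorted(word_boxes, key=lambda b: ((b[1] + b[3]) / 2, b[0])):
        cy = (box[1] + box[3]) / 2
        for line in lines:
            line_cy = sum((b[1] + b[3]) / 2 for b in line) / len(line)
            if abs(cy - line_cy) <= y_tol:
                line.append(box)
                break
        else:
            lines.append([box])
    return lines

def group_into_blocks(lines, gap_tol=12):
    """Merge consecutive text lines into text blocks when the vertical
    gap between a line and the previous block is small."""
    blocks = []
    for line in lines:
        top = min(b[1] for b in line)
        if blocks and top - blocks[-1]["bottom"] <= gap_tol:
            blocks[-1]["lines"].append(line)
        else:
            blocks.append({"lines": [line], "bottom": 0})
        blocks[-1]["bottom"] = max(b[3] for b in line)
    return blocks
```

Characters become lines, and lines become blocks, mirroring the character / text line / text block hierarchy described above.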
Optionally, dividing the text into layout areas further includes: detecting whether the recognition elements of the document image to be recognized include a table; in response to the recognition elements including a table, detecting whether the current text acquired from the document image is text inside the table (for example, whether the distance between the current text and the table falls within a preset range); when the current text is text inside the table, assigning the area where the current text is located to the layout area corresponding to the table; and when the current text is not text inside the table, assigning that area to the layout area corresponding to the text.
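The distance test used to decide whether a text region belongs to a table's layout area might look like the following sketch. The function names and the max_dist threshold are illustrative assumptions; boxes are (x0, y0, x1, y1) tuples.

```python
def box_distance(text_box, table_box):
    """Shortest axis-aligned distance between two boxes; 0 if they overlap."""
    tx0, ty0, tx1, ty1 = text_box
    bx0, by0, bx1, by1 = table_box
    dx = max(bx0 - tx1, tx0 - bx1, 0)  # horizontal gap, 0 when overlapping
    dy = max(by0 - ty1, ty0 - by1, 0)  # vertical gap, 0 when overlapping
    return max(dx, dy)

def assign_layout(text_box, table_boxes, max_dist=10):
    """Assign a text box to a table layout area when it lies inside or
    within max_dist of any detected table; otherwise to the plain-text
    layout area."""
    for table_box in table_boxes:
        if box_distance(text_box, table_box) <= max_dist:
            return "table"
    return "text"
```

Text fully inside a table box has distance 0 and is always assigned to that table's layout area.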
For a table recognition element, the table lines can be obtained through image-processing operations such as erosion and dilation, and the row and column regions divided to obtain an empty table; when the table contains text, the cells of the empty table and the text content can be combined and reconstructed into a table object.
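The effect of eroding then dilating with a wide 1 x k structuring element, which keeps long table rules while discarding short text strokes, can be shown in pure Python on a small binary image. In practice this would be done with an image-processing library's morphology routines; this sketch (with an assumed function name) only illustrates the idea for horizontal lines, and vertical lines follow by the same logic on columns.

```python
def extract_horizontal_lines(binary, min_len):
    """Keep only horizontal runs of foreground pixels (1s) at least
    min_len long. This is exactly the result of eroding then dilating
    the image with a 1 x min_len structuring element: short runs
    (text strokes) vanish, long runs (table rules) survive intact."""
    out = [[0] * len(row) for row in binary]
    for r, row in enumerate(binary):
        c = 0
        while c < len(row):
            if row[c] == 1:
                start = c
                while c < len(row) and row[c] == 1:
                    c += 1
                if c - start >= min_len:  # run long enough to be a rule
                    for k in range(start, c):
                        out[r][k] = 1
            else:
                c += 1
    return out
```

Intersecting the surviving horizontal and vertical lines then yields the row and column regions of the empty table.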
Optionally, when the identification element is a table, dividing the layout area corresponding to the table includes: determining the area of the table, acquiring the maximum area occupied by the area of the table in the document image to be identified based on the position of the table, and taking the maximum area as the layout area corresponding to the table.
And 104, identifying the layout area corresponding to each identification element to obtain an identification result of the layout area corresponding to the identification element.
In this embodiment, after the recognition elements and their corresponding layout areas are determined, the recognition algorithm or model corresponding to each recognition element may be acquired and used to recognize that element's layout area, yielding a recognition result for each layout area. For example, text recognition algorithms include Docstrum (document spectrum), Voronoi-diagram algorithms, and the like.
In this embodiment, different recognition elements yield different recognition results. For example, when the recognition element is text, the recognition result is the words and/or symbols in the text together with their position information in the layout area. When the recognition element is a table, the recognition result is an editable table, which may be a spreadsheet; a new spreadsheet can be obtained by deleting or adding cells in the editable table.
In some optional implementations of this embodiment, the identifying element includes: a text; aiming at each identification element, identifying the layout area corresponding to the identification element to obtain the identification result of the layout area corresponding to the identification element, and the method comprises the following steps: and performing text recognition on the layout area corresponding to the text to obtain the characters and the position information of the characters in the document image to be recognized.
In this optional implementation, when the recognition element is text, text recognition is performed on the layout area corresponding to the text to obtain the characters and their position information in the document image to be recognized, providing a reliable basis for converting the text in the document image into electronic text.
In some optional implementations of this embodiment, the recognition elements include: a table; for each recognition element, recognizing the layout area corresponding to the recognition element to obtain the recognition result of that layout area includes: performing table recognition on the layout area corresponding to the table to obtain an editable table, which may be an Excel table or a Word table.
In the optional implementation mode, when the identification element is a table, the table identification is performed on the layout area corresponding to the table to obtain an editable table, so that a reliable condition is provided for converting the table in the document image to be identified into the electronic table.
After determining that the recognition element is a table, it may be detected whether the table is an empty table or one containing text. In that case, performing table recognition on the layout area corresponding to the table to obtain an editable table includes: searching the layout area corresponding to the table for text and, if none is found, determining that the table is empty; recognizing the table structure in the layout area, the table structure comprising the cells in the table and the location of each cell; and obtaining an editable table based on the table structure.
In some optional implementation manners of this embodiment, the performing table identification on the layout area corresponding to the table to obtain an editable table includes: performing single-line text detection on the layout area corresponding to the table to obtain the position information of the single text line on the layout area corresponding to the table; performing text recognition on the single text line to obtain characters and positions of the characters on a layout area corresponding to the table; identifying a table structure in a layout area corresponding to the table, the table structure comprising: the cells in the table, the location of each cell; aggregating the single text lines based on the positions of the cells and the position information of the single text lines to obtain the position corresponding relation between the single text lines and the cells; splicing the texts in the same cell based on the position of the cell corresponding to the single text line and the positions of the characters to obtain the text content in each cell; and combining the table structure with the text content in each cell to obtain an editable table.
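The cell-coordinate aggregation and text-splicing steps above can be sketched as follows, assuming boxes are (x0, y0, x1, y1) tuples. Assigning a text line to the cell containing its center point is one common heuristic, not mandated by the patent, and the function names are illustrative.

```python
def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def contains(cell_box, point):
    x0, y0, x1, y1 = cell_box
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1

def build_table(cells, text_lines):
    """cells: {cell_id: box}; text_lines: list of (box, text) pairs.
    Aggregate each detected text line to the cell containing its center
    point, then splice the lines of each cell top-to-bottom and
    left-to-right into that cell's text content."""
    per_cell = {cid: [] for cid in cells}
    for box, text in text_lines:
        for cid, cbox in cells.items():
            if contains(cbox, center(box)):
                per_cell[cid].append((box, text))
                break
    return {
        cid: " ".join(
            t for _, t in sorted(lines, key=lambda bt: (bt[0][1], bt[0][0]))
        )
        for cid, lines in per_cell.items()
    }
```

Combining the resulting per-cell text with the recognized table structure then gives the editable table.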
In this optional implementation, the text and table structure detection may be implemented by a plurality of detection algorithms to obtain an editable table. Optionally, the text and table structure detection can be realized through a plurality of models, so as to obtain an editable table.
This alternative implementation is described below in conjunction with FIG. 2. As shown in FIG. 2, single-line text detection (1) is performed on each single line of text in the layout area corresponding to the table to obtain the four-point coordinates of that text line in the layout area, and text recognition (2) is then performed at the position of the text line to obtain the text result. Table structure recognition (3) is performed on the layout area to obtain the four-point coordinates of each cell of the table and the table structure information. The result of table structure recognition (3) is combined with the four-point coordinates of the text lines from single-line text detection (1) in cell coordinate aggregation (4), and the texts belonging to the same cell are spliced together by cell text aggregation (5). Finally, combining the table structure information, an editable table is produced by editable table export (6).
In this optional implementation, when the recognition element is a table containing text, table recognition recovers the complete table structure information and the text content of each cell from the table area produced by layout analysis, so that the table picture becomes an editable table, effectively ensuring the conversion of the table image.
In the optional implementation mode, when the identification element is a text, the text identification is performed on the layout area corresponding to the text to obtain the characters in the text and the position information of the characters in the document image to be identified, so that a reliable condition is provided for converting the text in the document image to be identified into the electronic text.
The document image recognition method provided by the embodiments of the present disclosure proceeds as follows: first, a document image to be recognized is acquired; second, whether the document image has at least one recognition element is detected; third, in response to the document image having at least one recognition element, the document image is divided into at least one layout area; finally, for each recognition element, the layout area corresponding to that element is recognized to obtain its recognition result. Thus, once the document image is determined to have recognition elements, its layout areas are divided according to those elements and recognized separately, yielding a recognition result for each layout area, achieving integrated recognition of the document image, and improving the efficiency of recognizing document images in different scenarios.
FIG. 3 illustrates a flow 300 according to another embodiment of a document image recognition method of the present disclosure, comprising the steps of:
step 301, obtaining a document image to be identified.
In this embodiment, the execution body on which the document image recognition method operates may obtain the document image to be recognized through a plurality of ways, for example, obtain the image to be recognized in real time from a terminal, or directly obtain the document image to be recognized prestored in a database.
Step 302, inputting the document image to be recognized into the layout recognition model which is trained in advance, so that the layout recognition model detects whether the document image to be recognized has at least one recognition element.
In this embodiment, the layout recognition model can detect and classify areas such as texts, tables, pictures, and titles in the document image to be recognized. The specific detection or recognition categories may be defined according to actual requirements: for example, the four categories of titles, pictures, texts, and tables may be detected, or only tables. The layout recognition model can be obtained by training a pre-constructed neural network model on collected samples of different layout types.
In this embodiment, the identification element is a main document factor in the document image, the identification element may be at least one item of text, table, picture, title, and the like, the identification element is also an element corresponding to a detection method or an identification element model, and the identification element in the region of the layout area may be detected by different detection methods or identification element models; the document image to be recognized is an image containing document information, which in this embodiment is information that can be used for text input or line drawing by a computer.
Step 303, in response to the document image to be recognized having at least one recognition element, obtaining at least one layout region output by the layout recognition model.
In this embodiment, at least one layout region may be a region corresponding to a different recognition element in the document image to be recognized, for example, a text image portion in the document image to be recognized is divided into layout regions corresponding to texts, and the layout region corresponding to the texts may be a text paragraph or a text line.
Optionally, the layout area division may also be determined according to requirements, for example, detecting and classifying areas such as texts, tables, pictures, and titles in the document picture. The specific categories may be defined according to actual requirements, for example detecting the four categories of titles, pictures, texts, and tables, or detecting only tables.
In this embodiment, the layout area identified by the layout identification model may or may not include the layout area corresponding to the identification element. The layout recognition model can output not only the layout region but also the type and information of the recognized recognition element.
In step 304, a pre-trained recognition element model corresponding to each recognition element is obtained.
In this embodiment, each recognition element model is used to recognize one kind of recognition element.
In this embodiment, after the layout identification model identifies the identification element, the identification element model corresponding to the identified identification element is obtained. Further, the identification element model is a model distinguished from the layout identification model, and the identification element model is configured to receive the layout information and determine information related to the identification element in the layout information, wherein the information related to the identification element may include: the content, type, location, etc. of the element is identified.
In this embodiment, the layout recognition model determines whether any recognition element exists, and whether a recognition element model needs to be obtained depends on that result: if the layout recognition model does not detect any recognition element, no recognition element model is obtained.
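This gating — models are only obtained for the element types the layout model actually reports — can be sketched in Python as follows. This is an illustrative sketch only; the loader functions are stubs standing in for loading real pre-trained models.

```python
def load_text_model():
    # Placeholder for loading a trained text recognition model.
    return "text-model"

def load_table_model():
    # Placeholder for loading a trained table recognition model.
    return "table-model"

MODEL_LOADERS = {"text": load_text_model, "table": load_table_model}

def acquire_models(detected_element_types):
    """Obtain one recognition model per distinct detected element type.

    If the layout model detected no recognition element, the loop body
    never runs and no recognition element model is obtained.
    """
    models = {}
    for elem_type in detected_element_types:
        if elem_type in MODEL_LOADERS and elem_type not in models:
            models[elem_type] = MODEL_LOADERS[elem_type]()
    return models
```

Duplicate detections of the same element type load the corresponding model only once.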
In step 305, the layout area corresponding to each identification element is identified by using the acquired identification element model, and the identification result of the layout area corresponding to the identification element is obtained.
In this embodiment, a recognition element model is a model constructed and trained in advance for one kind of recognition element, for example, a character recognition model constructed and trained in advance for text, a table recognition model constructed and trained in advance for tables, and a character recognition model constructed and trained in advance for titles.
In some optional implementations of this embodiment, the recognition element includes text, and the recognition element model includes a text recognition model. Recognizing the layout region corresponding to each recognition element by using the acquired recognition element model, to obtain the recognition result of the layout region corresponding to the recognition element, includes: inputting the layout region corresponding to the text into the text recognition model, to obtain the characters output by the text recognition model and the position information of the characters in the acquired document image.
In this embodiment, the text recognition model may adopt an OCR (Optical Character Recognition) module. The OCR module performs text detection and recognition on the text regions, such as titles and body text, detected by the layout recognition model, yielding the coordinates and text content of each text line. The OCR module may share the same OCR engine with the table recognition module (when a table contains text content, a text recognition model is needed to recognize it), or the two may use different OCR engines, for example models trained on different training data, or text recognition models trained with different OCR algorithms.
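Because the OCR module runs on a cropped layout region, region-local line coordinates must be offset back into full-image coordinates before they can be reported as positions in the acquired document image. A minimal sketch of that mapping follows; the `[x1, y1, x2, y2]` box format is an assumption for illustration, not taken from the disclosure.

```python
def to_image_coords(region_origin, local_boxes):
    """Shift text-line boxes from region-local to full-image coordinates.

    region_origin: (x, y) of the layout region's top-left corner in the image.
    local_boxes:   list of [x1, y1, x2, y2] boxes from the OCR module,
                   expressed relative to the cropped region.
    """
    ox, oy = region_origin
    return [[x1 + ox, y1 + oy, x2 + ox, y2 + oy]
            for x1, y1, x2, y2 in local_boxes]

# A region cropped at (100, 50); a line detected at (10, 5)-(90, 25) inside it
# lands at (110, 55)-(190, 75) in the original document image.
print(to_image_coords((100, 50), [[10, 5, 90, 25]]))  # → [[110, 55, 190, 75]]
```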
In this optional implementation, when the recognition element includes text, the recognition element model includes a text recognition model; the layout region corresponding to the text is input into the text recognition model to obtain the characters corresponding to the text and their position information, which improves the reliability of character recognition.
In some optional implementations of this embodiment, the recognition element includes a table, and the recognition element model includes a table recognition model. Recognizing the layout region corresponding to each recognition element by using the acquired recognition element model, to obtain the recognition result of the layout region corresponding to the recognition element, includes: inputting the layout region corresponding to the table into the table recognition model, to obtain an editable table output by the table recognition model.
In this optional implementation, when the recognition element includes a table, the recognition element model includes a table recognition model; the layout region corresponding to the table is input into the table recognition model to obtain an editable table corresponding to the table, which improves the reliability of table recognition.
In some optional implementations of this embodiment, the table recognition model includes a trained text detection submodel, a trained character recognition submodel, and a trained table structure recognition submodel. The text detection submodel is used to perform single-line text detection on the layout region corresponding to the table, to obtain the position information of each single text line in that layout region. The character recognition submodel is used to perform text recognition on each single text line, to obtain the characters and the positions of the characters in the layout region corresponding to the table. The table structure recognition submodel is used to recognize the table structure in the layout region corresponding to the table, the table structure including the cells in the table and the position of each cell. The single text lines are aggregated based on the positions of the cells and the position information of the single text lines, to obtain the positional correspondence between the single text lines and the cells; the texts in the same cell are spliced based on the position of the cell corresponding to each single text line and the positions of the characters, to obtain the text content of each cell; and the table structure is combined with the text content of each cell to obtain an editable table.
In this optional implementation, the text detection sub-model may adopt an OCR detection model, and perform single-line text detection on the table by using the OCR detection model to obtain coordinates of all text lines on the table region. The OCR detection model may be trained using detection algorithms such as EAST, DB, etc., without limitation.
The character recognition submodel may adopt an OCR recognition module: for the text lines detected by the text detection submodel, text recognition is performed using an OCR recognition model to obtain the coordinates and character content of all text lines in the table region. The OCR recognition model may be trained using a text recognition algorithm such as CRNN, without limitation.
In this embodiment, the table structure recognition submodel may include: a table structure recognition module, a cell coordinate aggregation module, a cell text aggregation module, and a spreadsheet export module.
The table structure recognition module identifies the structural information of the table using an attention-based table structure recognition model, including the composition relationships between the cells in the table (typically expressed as an HTML string) and the position coordinates of each cell.
The cell coordinate aggregation module mainly solves the problem of how to reassemble, within one cell, text that spans multiple lines. It aggregates single lines into multiple lines by computing the intersection-over-union (IoU) and the vertex distances between the text-box coordinates obtained by the OCR engine and the cell coordinates obtained by the table structure recognition module: the IoU determines which text lines belong to a given cell, and the vertex distances together with the IoU determine the ordering of those text lines.
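The IoU computation and the line-to-cell assignment can be sketched as follows. This is a minimal illustration assuming axis-aligned `[x1, y1, x2, y2]` boxes; the vertex-distance tiebreak used for ordering is omitted for brevity.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def assign_lines_to_cells(line_boxes, cell_boxes):
    """Map each text-line box to the index of the cell box with the highest IoU.

    Returns {cell_index: [line_index, ...]}; lines with zero overlap with
    every cell are left unassigned.
    """
    mapping = {}
    for i, line in enumerate(line_boxes):
        scores = [iou(line, cell) for cell in cell_boxes]
        best = max(range(len(cell_boxes)), key=lambda j: scores[j])
        if scores[best] > 0:
            mapping.setdefault(best, []).append(i)
    return mapping
```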
The cell text aggregation module: within each cell, according to the text-line list and text order obtained by the previous aggregation step, the text content recognized by the OCR engine is spliced from top to bottom and from left to right, so that the content of a cell containing multiple lines of text is joined into one complete string.
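A minimal sketch of this splicing step follows, assuming each text line carries an `[x1, y1, x2, y2]` box; sorting by the top edge and then the left edge approximates the top-to-bottom, left-to-right reading order described above, and direct concatenation (no separator) is an illustrative choice.

```python
def splice_cell_text(lines):
    """Concatenate the text of one cell's lines in reading order.

    lines: list of (box, text) pairs, where box = [x1, y1, x2, y2].
    Sorting by (top edge, left edge) approximates top-to-bottom,
    left-to-right order within the cell.
    """
    ordered = sorted(lines, key=lambda item: (item[0][1], item[0][0]))
    return "".join(text for _, text in ordered)
```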
The spreadsheet export module combines the HTML (HyperText Markup Language) result produced by the table structure recognition module with the text content of each cell obtained by the cell text aggregation module, and recovers the table as a spreadsheet for output.
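Combining the structure model's HTML skeleton with the aggregated cell texts can be sketched as a token-by-token fill. The `"<td></td>"` placeholder convention for empty cells is an assumption about how the structure model emits its HTML, made here only for illustration.

```python
def fill_html_table(structure_tokens, cell_texts):
    """Insert recovered cell text into an HTML table skeleton.

    structure_tokens: the table structure as a token sequence in which
                      every "<td></td>" stands for one (empty) cell.
    cell_texts:       aggregated text content, one entry per cell, in the
                      same order the structure model emits its cells.
    """
    cells = iter(cell_texts)
    out = []
    for tok in structure_tokens:
        if tok == "<td></td>":
            out.append("<td>%s</td>" % next(cells, ""))
        else:
            out.append(tok)
    return "".join(out)

skeleton = ["<table>", "<tr>", "<td></td>", "<td></td>", "</tr>", "</table>"]
print(fill_html_table(skeleton, ["Name", "Age"]))
```

A spreadsheet file could then be derived from the resulting HTML string.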
In this optional implementation, when the recognition element is a table and the table contains text, the table structure recognition submodel recognizes the complete table structure information, while the text detection submodel and the character recognition submodel recognize the text content in each cell of the table, so that the table picture is turned into an editable table, effectively ensuring the conversion of the table image.
The device corresponding to the document image recognition method provided by the embodiment can simultaneously complete layout analysis, OCR recognition and form recognition of the document, can conveniently complete document restoration in the document image based on the recognition result of each module, and greatly improves the usability of document structure analysis in the industry.
In the document image identification method provided by the embodiment, after the layout identification model determines that the document image to be identified has the identification elements, the layout area of the document image to be identified is divided according to the identification elements, and the layout areas are respectively identified through different identification element models to obtain the identification result of each layout area, so that the integrated identification of the document to be identified is realized, and the efficiency of identifying the document image in different scenes is improved.
In one example of the present disclosure, a document image is input to a layout recognition model, and the layout recognition model detects regions of text, tables, titles, and images in the document image; the title and the text area enter a text recognition model to carry out text detection recognition, and character coordinates and content are obtained; and the table area enters a table identification model, and table structure information is completely extracted in the table identification model, so that the table picture becomes an editable table file.
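The routing in this example — title and text regions to the text recognizer, table regions to the table recognizer, plain image regions passed over — can be sketched as follows. The recognizer functions are stubs standing in for the trained models, and the type labels are illustrative.

```python
def recognize_text(region):
    # Stub for the text recognition model: would return line
    # coordinates and character content for the region.
    return {"kind": "text"}

def recognize_table(region):
    # Stub for the table recognition model: would return an
    # editable table extracted from the region.
    return {"kind": "table"}

ROUTES = {
    "title": recognize_text,  # titles go through the text recognizer
    "text": recognize_text,
    "table": recognize_table,
    # plain image regions need no further recognition
}

def process_document(layout_regions):
    """layout_regions: list of (element_type, region) from the layout model."""
    results = []
    for elem_type, region in layout_regions:
        recognizer = ROUTES.get(elem_type)
        if recognizer is not None:
            results.append((elem_type, recognizer(region)))
    return results
```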
With further reference to FIG. 4, as an implementation of the methods illustrated in the above figures, the present disclosure provides one embodiment of a document image recognition apparatus, which corresponds to the method embodiment illustrated in FIG. 1.
As shown in fig. 4, the present embodiment provides a document image recognition apparatus 400 including: acquisition section 401, detection section 402, division section 403, and identification section 404. The acquiring unit 401 may be configured to acquire a document image to be identified. The above-mentioned detecting unit 402 may be configured to detect whether the document image to be recognized has at least one recognition element. The dividing unit 403 may be configured to divide the document image to be recognized into at least one layout region in response to at least one recognition element being present in the document image to be recognized. The identifying unit 404 may be configured to identify, for each identifying element, a layout region corresponding to the identifying element, and obtain an identification result of the layout region corresponding to the identifying element.
In the present embodiment, in the document image recognition apparatus 400: the specific processing and the technical effects of the obtaining unit 401, the detecting unit 402, the dividing unit 403, and the identifying unit 404 can refer to the related descriptions of step 101, step 102, step 103, and step 104 in the corresponding embodiment of fig. 1, which are not described herein again.
In some optional implementations of this embodiment, the identification element includes: a text; the recognition unit 404 includes: a text recognition module (not shown in the figures); the text recognition module can be configured to perform text recognition on a layout area corresponding to a text to obtain characters and position information of the characters in a document image to be recognized.
In some optional implementations of this embodiment, the identification element includes: a table; the recognition unit 404 includes: a table identification module (not shown). The form identification module can be configured to perform form identification on a layout area corresponding to a form to obtain an editable form.
In some optional implementations of this embodiment, the table identification module includes: a detection submodule (not shown), a recognition submodule (not shown), a structure submodule (not shown), an aggregation submodule (not shown), an obtaining submodule (not shown), and a combination submodule (not shown). The detection submodule may be configured to perform single-line text detection on the layout area corresponding to the table to obtain the position information of each single text line on the layout area corresponding to the table. The recognition submodule may be configured to perform text recognition on a single text line to obtain characters and the positions of the characters on the layout area corresponding to the table. The structure submodule may be configured to identify a table structure in the layout area corresponding to the table, where the table structure includes the cells in the table and the location of each cell. The aggregation submodule may be configured to aggregate the single text lines based on the position of each cell and the position information of the single text lines, to obtain the positional correspondence between the single text lines and the cells. The obtaining submodule may be configured to splice texts in the same cell based on the position of the cell corresponding to each single text line and the positions of the characters, to obtain the text content of each cell. The combination submodule may be configured to combine the table structure and the text content of each cell to obtain an editable table.
In the document image recognition apparatus provided by the embodiment of the present disclosure, first, the obtaining unit 401 obtains a document image to be recognized; next, the detection unit 402 detects whether the document image to be identified has at least one identification element; thirdly, the dividing unit 403 divides the document image to be identified into at least one layout area in response to at least one identification element in the document image to be identified; finally, the recognition section 404 recognizes the layout region corresponding to each recognition element, and obtains the recognition result of the layout region corresponding to the recognition element. Therefore, when the document image to be recognized is determined to have the recognition elements, the layout areas of the document image to be recognized are divided according to the recognition elements, the layout areas are recognized respectively, the recognition result of each layout area is obtained, the integrated recognition of the document image to be recognized is realized, and the recognition efficiency of the document image in different scenes is improved.
With further reference to FIG. 5, as an implementation of the methods illustrated in the above figures, the present disclosure provides another embodiment of a document image recognition apparatus, which corresponds to the method embodiment illustrated in FIG. 3.
As shown in fig. 5, the present embodiment provides a document image recognition apparatus 500 including: an image acquiring unit 501, an input unit 502, an obtaining unit 503, a model acquiring unit 504, and a recognition unit 505. The image acquiring unit 501 may be configured to acquire a document image to be recognized. The input unit 502 may be configured to input the document image to be recognized into a layout recognition model trained in advance, so that the layout recognition model detects whether the document image to be recognized has at least one recognition element. The obtaining unit 503 may be configured to obtain at least one layout region output by the layout recognition model in response to the document image to be recognized having at least one recognition element. The model acquiring unit 504 may be configured to acquire recognition element models trained in advance corresponding to the respective recognition elements, each recognition element model being used for recognizing one kind of recognition element. The recognition unit 505 may be configured to recognize the layout region corresponding to each recognition element by using the acquired recognition element model, and obtain a recognition result of the layout region corresponding to the recognition element.
In the present embodiment, in the document image recognition apparatus 500: the specific processing and technical effects of the image acquiring unit 501, the input unit 502, the obtaining unit 503, the model acquiring unit 504, and the recognition unit 505 may refer to the related descriptions of steps 301 to 305 in the corresponding embodiment of fig. 3, which are not described herein again.
In some optional implementations of this embodiment, the recognition element includes text; the recognition element model includes a text recognition model, and the recognition unit 505 includes a text recognition module (not shown in the figures). The text recognition module may be configured to input the layout region corresponding to the text into the text recognition model, to obtain the characters output by the text recognition model and the position information of the characters in the acquired document image.
In some optional implementations of this embodiment, the recognition element includes a table; the recognition element model includes a table recognition model, and the recognition unit 505 includes a table recognition module (not shown). The table recognition module may be configured to input the layout region corresponding to the table into the table recognition model, to obtain an editable table output by the table recognition model.
In some optional implementations of this embodiment, the table recognition model includes: a trained text detection submodel (not shown in the figure), a character recognition submodel (not shown in the figure), and a table structure recognition submodel (not shown in the figure). The text detection submodel is used to perform single-line text detection on the layout region corresponding to the table, to obtain the position information of each single text line in that layout region. The character recognition submodel is used to perform text recognition on each single text line, to obtain the characters and the positions of the characters in the layout region corresponding to the table. The table structure recognition submodel is used to recognize the table structure in the layout region corresponding to the table, the table structure including the cells in the table and the position of each cell; the single text lines are aggregated based on the positions of the cells and the position information of the single text lines, to obtain the positional correspondence between the single text lines and the cells; the texts in the same cell are spliced based on the position of the cell corresponding to each single text line and the positions of the characters, to obtain the text content of each cell; and the table structure is combined with the text content of each cell to obtain an editable table.
The document image recognition device provided by the embodiment determines that the document image to be recognized has the recognition elements through the layout recognition model, then performs layout area division on the document image to be recognized according to the recognition elements, and respectively recognizes each layout area through different recognition element models to obtain the recognition result of each layout area, so that the integrated recognition of the document to be recognized is realized, and the efficiency of document image recognition under different scenes is improved.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 601 executes the respective methods and processes described above, such as a document image recognition method. For example, in some embodiments, the document image recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When a computer program is loaded into the RAM603 and executed by the computing unit 601, one or more steps of the document image recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the document image recognition method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable document image recognition apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A document image recognition method, the method comprising:
acquiring a document image to be identified;
detecting whether the document image to be identified has at least one identification element;
in response to at least one identification element in the document image to be identified, dividing the document image to be identified into at least one layout area;
and identifying the layout area corresponding to each identification element to obtain an identification result of the layout area corresponding to the identification element.
2. The method of claim 1, wherein the identification element comprises: a text; and the identifying the layout area corresponding to each identification element to obtain an identification result of the layout area corresponding to the identification element comprises:
and performing text recognition on the layout area corresponding to the text to obtain characters and position information of the characters in the document image to be recognized.
3. The method of claim 1 or 2, wherein the identification element comprises: a table; and the identifying the layout area corresponding to each identification element to obtain an identification result of the layout area corresponding to the identification element comprises:
and performing form identification on the layout area corresponding to the form to obtain an editable form.
4. The method of claim 3, wherein performing table identification on the layout area corresponding to the table to obtain an editable table comprises:
performing single-line text detection on the layout area corresponding to the table to obtain position information of a single text line on the layout area corresponding to the table;
performing text recognition on the single text line to obtain characters and positions of the characters on a layout area corresponding to the table;
identifying a table structure in a layout area corresponding to the table, the table structure comprising: the cells in the table, the location of each cell;
aggregating the single text lines based on the positions of the cells and the position information of the single text lines to obtain the position corresponding relation between the single text lines and the cells;
splicing the texts in the same cell based on the position of the cell corresponding to the single text line and the positions of the characters to obtain the text content in each cell;
and combining the table structure with the text content in each cell to obtain an editable table.
5. A document image recognition method, the method comprising:
acquiring a document image to be recognized;
inputting the document image to be recognized into a pre-trained layout recognition model, so that the layout recognition model detects whether the document image to be recognized has at least one recognition element;
in response to the document image to be recognized having at least one recognition element, obtaining at least one layout area output by the layout recognition model;
acquiring pre-trained recognition element models corresponding to the respective recognition elements, each recognition element model being used for recognizing one kind of recognition element; and
for each recognition element, recognizing the layout area corresponding to the recognition element with the acquired recognition element model to obtain a recognition result of the layout area corresponding to the recognition element.
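The dispatch described in claim 5 can be illustrated as follows. The names (`LayoutArea`, `recognize`, `element_models`) are hypothetical; the patent does not specify an API. Each detected layout area carries an element type, and the matching recognition element model is looked up and applied to that region:

```python
from dataclasses import dataclass

@dataclass
class LayoutArea:
    element_type: str  # e.g. "text" or "table"
    box: tuple         # (x1, y1, x2, y2) in the document image
    crop: object       # pixel data of the region (placeholder here)

def recognize_document(layout_areas, element_models):
    """element_models maps an element type ("text", "table", ...) to a
    model exposing a .recognize(crop) method; one model per element kind."""
    results = []
    for area in layout_areas:
        model = element_models[area.element_type]
        results.append((area.element_type, area.box, model.recognize(area.crop)))
    return results
```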
6. The method of claim 5, wherein the recognition element comprises text, the recognition element model comprises a text recognition model, and recognizing, for each recognition element, the layout area corresponding to the recognition element with the acquired recognition element model to obtain the recognition result comprises:
inputting the layout area corresponding to the text into the text recognition model to obtain the characters output by the text recognition model and the position information of the characters in the acquired document image.
7. The method of claim 5 or 6, wherein the recognition element comprises a table, the recognition element model comprises a table recognition model, and recognizing the layout area corresponding to the recognition element with the acquired recognition element model to obtain the recognition result comprises:
inputting the layout area corresponding to the table into the table recognition model to obtain an editable table output by the table recognition model.
8. The method of claim 7, wherein the table recognition model comprises a trained text detection submodel, a trained character recognition submodel and a trained table structure recognition submodel;
the text detection submodel is used for performing single-line text detection on the layout area corresponding to the table to obtain position information of each single text line in the layout area;
the character recognition submodel is used for performing text recognition on each single text line to obtain the characters in the layout area corresponding to the table and the positions of the characters; and
the table structure recognition submodel is used for recognizing a table structure in the layout area corresponding to the table, the table structure comprising the cells in the table and the position of each cell; aggregating the single text lines based on the positions of the cells and the position information of the single text lines to obtain a positional correspondence between the single text lines and the cells; splicing the text lines belonging to the same cell, based on the cell corresponding to each single text line and the positions of the characters, to obtain the text content of each cell; and combining the table structure with the text content of each cell to obtain an editable table.
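The final combining step, merging the recognized table structure with the per-cell text, can be sketched as below. One assumption is made about representation: the structure is a sequence of HTML-like tags with one `<td></td>` placeholder per cell in reading order, a format common in table-recognition pipelines; the claims themselves do not fix any particular format.

```python
def combine_structure_and_text(structure_tokens, cell_texts):
    """structure_tokens: e.g. ["<table>", "<tr>", "<td></td>", ...];
    cell_texts: the text content of each cell, in reading order.
    Returns an editable HTML table string."""
    out, cell_idx = [], 0
    for tok in structure_tokens:
        if tok == "<td></td>":
            # fill the next placeholder cell; leave it empty if no text was found
            text = cell_texts[cell_idx] if cell_idx < len(cell_texts) else ""
            out.append(f"<td>{text}</td>")
            cell_idx += 1
        else:
            out.append(tok)
    return "".join(out)
```

The resulting HTML string is "editable" in the sense of the claims: it can be opened and modified in any spreadsheet or word-processing tool that imports HTML tables.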
9. An apparatus for document image recognition, the apparatus comprising:
an acquisition unit configured to acquire a document image to be recognized;
a detection unit configured to detect whether the document image to be recognized has at least one recognition element;
a division unit configured to divide the document image to be recognized into at least one layout area in response to the document image to be recognized having at least one recognition element; and
a recognition unit configured to recognize the layout area corresponding to each recognition element to obtain a recognition result of the layout area corresponding to the recognition element.
10. The apparatus of claim 9, wherein the recognition element comprises text and the recognition unit comprises a text recognition module;
the text recognition module is configured to perform text recognition on the layout area corresponding to the text to obtain the characters and the position information of the characters in the document image to be recognized.
11. The apparatus of claim 9 or 10, wherein the recognition element comprises a table and the recognition unit comprises a table recognition module;
the table recognition module is configured to perform table recognition on the layout area corresponding to the table to obtain an editable table.
12. The apparatus of claim 11, wherein the table recognition module comprises:
a detection submodule configured to perform single-line text detection on the layout area corresponding to the table to obtain position information of each single text line in the layout area;
a recognition submodule configured to perform text recognition on each single text line to obtain the characters in the layout area corresponding to the table and the positions of the characters;
a structure submodule configured to recognize a table structure in the layout area corresponding to the table, the table structure comprising the cells in the table and the position of each cell;
an aggregation submodule configured to aggregate the single text lines based on the positions of the cells and the position information of the single text lines to obtain a positional correspondence between the single text lines and the cells;
an obtaining submodule configured to splice the text lines belonging to the same cell, based on the cell corresponding to each single text line and the positions of the characters, to obtain the text content of each cell; and
a combination submodule configured to combine the table structure with the text content of each cell to obtain an editable table.
13. An apparatus for document image recognition, the apparatus comprising:
an image acquisition unit configured to acquire a document image to be recognized;
an input unit configured to input the document image to be recognized into a pre-trained layout recognition model, so that the layout recognition model detects whether the document image to be recognized has at least one recognition element;
an obtaining unit configured to obtain at least one layout area output by the layout recognition model in response to the document image to be recognized having at least one recognition element;
a model acquisition unit configured to acquire pre-trained recognition element models corresponding to the respective recognition elements, each recognition element model being used for recognizing one kind of recognition element; and
a recognition unit configured to recognize, for each recognition element, the layout area corresponding to the recognition element with the acquired recognition element model to obtain a recognition result of the layout area corresponding to the recognition element.
14. The apparatus of claim 13, wherein the recognition element comprises text, the recognition element model comprises a text recognition model, and the recognition unit comprises a text recognition module;
the text recognition module is configured to input the layout area corresponding to the text into the text recognition model to obtain the characters output by the text recognition model and the position information of the characters in the acquired document image.
15. The apparatus of claim 13 or 14, wherein the recognition element comprises a table, the recognition element model comprises a table recognition model, and the recognition unit comprises a table recognition module;
the table recognition module is configured to input the layout area corresponding to the table into the table recognition model to obtain an editable table output by the table recognition model.
16. The apparatus of claim 15, wherein the table recognition model comprises a trained text detection submodel, a trained character recognition submodel and a trained table structure recognition submodel;
the text detection submodel is used for performing single-line text detection on the layout area corresponding to the table to obtain position information of each single text line in the layout area;
the character recognition submodel is used for performing text recognition on each single text line to obtain the characters in the layout area corresponding to the table and the positions of the characters; and
the table structure recognition submodel is used for recognizing a table structure in the layout area corresponding to the table, the table structure comprising the cells in the table and the position of each cell; aggregating the single text lines based on the positions of the cells and the position information of the single text lines to obtain a positional correspondence between the single text lines and the cells; splicing the text lines belonging to the same cell, based on the cell corresponding to each single text line and the positions of the characters, to obtain the text content of each cell; and combining the table structure with the text content of each cell to obtain an editable table.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
18. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
CN202111505415.0A 2021-12-10 2021-12-10 Document image recognition method and device, electronic equipment and computer readable medium Pending CN114187448A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111505415.0A CN114187448A (en) 2021-12-10 2021-12-10 Document image recognition method and device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111505415.0A CN114187448A (en) 2021-12-10 2021-12-10 Document image recognition method and device, electronic equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN114187448A true CN114187448A (en) 2022-03-15

Family

ID=80604310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111505415.0A Pending CN114187448A (en) 2021-12-10 2021-12-10 Document image recognition method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN114187448A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115393198A (en) * 2022-10-27 2022-11-25 国泰新点软件股份有限公司 Method and device for processing pictures in file and storage medium


Similar Documents

Publication Publication Date Title
KR20160132842A (en) Detecting and extracting image document components to create flow document
US20210357710A1 (en) Text recognition method and device, and electronic device
CN113657274B (en) Table generation method and device, electronic equipment and storage medium
CN113780098B (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN113627439A (en) Text structuring method, processing device, electronic device and storage medium
CN110020312B (en) Method and device for extracting webpage text
CN113553428B (en) Document classification method and device and electronic equipment
CN113963197A (en) Image recognition method and device, electronic equipment and readable storage medium
US10963690B2 (en) Method for identifying main picture in web page
CN113837194A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN114187448A (en) Document image recognition method and device, electronic equipment and computer readable medium
CN112528610A (en) Data labeling method and device, electronic equipment and storage medium
CN116416640A (en) Method, device, equipment and storage medium for determining document element
CN113254578B (en) Method, apparatus, device, medium and product for data clustering
CN115238078A (en) Webpage information extraction method, device, equipment and storage medium
CN115116080A (en) Table analysis method and device, electronic equipment and storage medium
CN114842489A (en) Table analysis method and device
CN114724144A (en) Text recognition method, model training method, device, equipment and medium
CN113903071A (en) Face recognition method and device, electronic equipment and storage medium
CN113221566A (en) Entity relationship extraction method and device, electronic equipment and storage medium
CN113947195A (en) Model determination method and device, electronic equipment and memory
CN108009233B (en) Image restoration method and device, computer equipment and storage medium
CN113536751B (en) Processing method and device of form data, electronic equipment and storage medium
CN113360712B (en) Video representation generation method and device and electronic equipment
CN116884024A (en) Table identification method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination