CN114863408A - Document content classification method, system, device and computer readable storage medium

Info

Publication number: CN114863408A
Application number: CN202110648550.4A
Authority: CN
Inventors: 王明辉; 闾磊; 高阳; 黄甫毅; 樊淼淼
Original assignee: Sichuan Yishu Technology Co ltd
Current assignee: Sichuan Yishu Technology Co ltd
Priority date: 2021-06-10
Filing date: 2021-06-10
Publication date: 2022-08-05

Abstract

The application discloses a document content classification method, system, device and computer readable storage medium. The method comprises: converting a target document into a picture format to obtain a target picture corresponding to the target document; extracting content features from the target picture by using a preset document content classification model, and performing region division on the target picture according to the content features to obtain a plurality of segmentation regions to be sequenced; extracting the text type of each segmentation region by using a preset document layout analysis model, and sequencing according to the text type of each segmentation region to obtain a plurality of text regions with correct text sequence; and reordering the text regions to obtain the recombined document. Through image recognition, the method divides the document into a plurality of regions by category and typesets each region independently, which makes the typesetting more flexible and keeps an error inside one region from seriously affecting the whole; a final overall sort then yields the complete document.

Description

Document content classification method, system, device and computer readable storage medium

Technical Field

The present invention relates to the field of information retrieval, and in particular, to a method, a system, an apparatus, and a computer-readable storage medium for classifying document contents.

Background

Document content classification technology labels and classifies information content under a given classification system. It belongs to the research field of information retrieval, helps people manage and process text information more efficiently, and is widely used in document structuring, document organization, text filtering and similar fields. Traditional document content classification is based on statistical methods and rule-based methods. The statistical method is an uncertainty-based probabilistic reasoning method learned from a large-scale corpus; its drawback is that the corpus must have sufficiently wide coverage to achieve good results. The rule-based method formulates classification rules according to linguistic constraints and is a deterministic inference method. With the development of deep neural network technology, most document content classification tasks in recent years have been implemented as NLP tasks. The basic approach is to first segment the text into words and apply an Embedding operation to extract feature vectors of the text, then perform a series of convolution and pooling operations, and finally apply Softmax (softmax regression) to the output to obtain the classification result. In summary, the above text content classification methods presuppose a large amount of text that is semantically correct and in the correct text order, and they require preprocessing of the text data, such as word segmentation, word frequency cleaning, handling of special symbols and stop words, and construction of word vectors.

Correct order is a precondition for correct text semantics. Whether a result is obtained after document classification or from detection and recognition within each category, the returned result may be out of order, and feeding out-of-order results to downstream NLP (Natural Language Processing) tasks directly and seriously degrades their performance, so returning results in the correct order is very important. In the prior art, judgment errors occur easily while ordering the text, which leaves the document layout disordered.

Therefore, there is a need for a document content classification method that is more accurate, more flexible and more efficient.

Disclosure of Invention

In view of the above, the present invention provides a method, system, device and computer readable storage medium for classifying document contents, which are more flexible and efficient. The specific scheme is as follows:

a document content classification method, comprising:

acquiring a target document, converting the document into a picture format, and acquiring a target picture corresponding to the target document;

extracting content features from the target picture by using a preset document content classification model according to a preset classification standard, and performing region division on the target picture according to the content features to obtain a plurality of segmentation regions to be sequenced;

extracting the text type of each segmentation region by using a preset document layout analysis model, and sequencing the text sequence in each segmentation region according to a preset layout rule according to the text type of each segmentation region to obtain a plurality of text regions with correct text sequence;

reordering the text regions by using the content characteristics and the text sequence of the text regions to obtain a recombined document;

the document content classification model is obtained by performing segmentation training on a historical picture according to preset classification standards in advance; the document layout analysis model is obtained by performing layout training on historical pictures in advance according to preset layout rules.

Optionally, the document content classification model uses ResNet + FPN as a backbone network, and a Feature Map generated by each ResBlock structure in the ResNet network is fused with a channel attention model and then a spatial attention model, so that the Feature Map generated by the whole backbone network and fused with an attention mechanism is obtained.

Optionally, the classification criteria include: text, title, table body, table title, table annotation, list, image, annotation, header, and footer.

Optionally, the process of extracting the text type of each partition area by using a preset document layout analysis model, and sorting the text sequence in each partition area according to a preset layout rule according to the text type of each partition area to obtain a plurality of text regions with correct text sequences includes:

analyzing the text type of the segmentation area by using a document layout analysis model;

calculating a BoundingBox coordinate area corresponding to the segmentation area by using the text type of the segmentation area;

determining the longitudinal sorting order of the segmentation areas by using the width of the BoundingBox coordinate area and the width of the corresponding segmentation area;

and judging the text space in the segmentation area by using the height of the BoundingBox coordinate area.

The invention also discloses a document content classification system, which comprises:

the image conversion module is used for acquiring a target document, converting the document into an image format and obtaining a target image corresponding to the target document;

the region classification module is used for extracting content features from the target picture according to a preset classification standard by using a preset document content classification model, and performing region division on the target picture according to the content features to obtain a plurality of segmentation regions to be sequenced;

the document layout module is used for extracting the text type of each segmentation region by using a preset document layout analysis model, and sequencing the text sequence in each segmentation region according to a preset layout rule according to the text type of each segmentation region to obtain a plurality of text regions with correct text sequence;

the document recombination module is used for reordering the text regions by utilizing the content characteristics and the text sequence of the text regions to obtain a recombined document;

Optionally, the document layout module includes:

the text type analysis unit is used for analyzing the text type of the segmentation area by using a document layout analysis model;

the BoundingBox calculating unit is used for calculating a BoundingBox coordinate area corresponding to the segmentation area by using the text type of the segmentation area;

the longitudinal sorting unit is used for determining the longitudinal sorting order of the segmentation areas by utilizing the width of the BoundingBox coordinate area and the width of the corresponding segmentation area;

and the space sorting unit is used for judging the text space in the segmentation area by utilizing the height of the BoundingBox coordinate area.

The invention also discloses a document content classification device, which comprises:

a memory for storing a computer program;

a processor for executing the computer program to implement the document content classification method as described above.

The invention also discloses a computer readable storage medium, on which a computer program is stored, which, when executed by a processor, implements a document content classification method as described above.

In the invention, the document content classification method comprises the following steps: acquiring a target document, converting the document into a picture format, and acquiring a target picture corresponding to the target document; extracting content features from the target picture by using a preset document content classification model according to a preset classification standard, and performing region division on the target picture according to the content features to obtain a plurality of segmentation regions to be sequenced; extracting the text type of each segmentation region by using a preset document layout analysis model, and sequencing the text sequence in each segmentation region according to a preset layout rule according to the text type of each segmentation region to obtain a plurality of text regions with correct text sequence; reordering the text regions by using the content characteristics and the text sequence of the text regions to obtain a recombined document; the document content classification model is obtained by performing segmentation training on a historical picture according to preset classification standards in advance; the document layout analysis model is obtained by performing layout training on historical pictures in advance according to preset layout rules.

Through image recognition, the invention divides the document into a plurality of areas by category and typesets each area independently, so the typesetting is more flexible; a final overall sort then yields the complete document. Because each area is sorted on its own, even if the sorting inside an individual area is wrong, the impact on the layout of the whole document is reduced and the fault tolerance is higher.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flowchart illustrating a document content classification method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a document content classification system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses a document content classification method, which is shown in figure 1 and comprises the following steps:

S11: Acquiring a target document, converting the document into a picture format, and acquiring a target picture corresponding to the target document.

Specifically, in order to classify the document contents by using image recognition technology, a document in a non-picture format is converted into a picture format; a document that is already in a picture format is not converted again and can be used directly as the target picture.

S12: extracting content features from the target picture by using a preset document content classification model according to a preset classification standard, and performing region division on the target picture according to the content features to obtain a plurality of segmentation regions to be sequenced.

Specifically, the classification criteria may include: text, title, table body, table title, table annotation, list, image, annotation, header and footer. The picture is divided into regions according to the classification criteria to obtain the picture region corresponding to each criterion, such as a title region, a table region, a footer region and the like. In this process, the various contents in the picture are only identified and are not sorted, so each region is a segmentation region to be sorted.

S13: Extracting the text type of each segmentation region by using a preset document layout analysis model, and sequencing the text sequence in each segmentation region according to a preset layout rule according to the text type of each segmentation region to obtain a plurality of text regions with correct text sequence.

Specifically, the text type of each segmentation region is extracted; the text type may include the paragraph spacing and whether the document layout is one-column, two-column, three-column, mixed multi-column and so on. The text sequence in each segmentation region is then determined, for example by determining whether the text content of the region belongs to the paragraph above or below, whether it corresponds to a picture or a table, and the like.

S14: Reordering the text regions by using the content characteristics and the text sequence of the text regions to obtain the recombined document.

Specifically, the text regions correspond to the segmentation regions, and under the condition that the text sequence in each text region is normal, the text regions can be reordered by using the content features and the text sequence among the text regions, so that the recombined document can be obtained finally.

The document content classification model is obtained by performing segmentation training on a historical picture in advance according to a preset classification standard; the document layout analysis model is obtained by performing layout training on the historical pictures in advance according to preset layout rules.

Therefore, through image recognition the document is divided into a plurality of areas by category, and each area is typeset independently, so the typesetting is more flexible; a final overall sort then yields the complete document. Because each area is sorted on its own, even if the sorting inside an individual area is wrong, the impact on the layout of the whole document is reduced and the fault tolerance is higher.

The embodiment of the invention discloses a specific document content classification method, and compared with the previous embodiment, the embodiment further explains and optimizes the technical scheme. Specifically, the method comprises the following steps:

Furthermore, the document content classification model can adopt ResNet + FPN as a backbone network, and the Feature Map generated by each ResBlock structure in the ResNet network is fused first with a channel attention model and then with a spatial attention model, so that the Feature Map generated by the whole backbone network and fused with an attention mechanism is obtained.

Specifically, building the document content classification model may include a training data set construction stage, a feature extraction backbone network construction stage, and a model training stage, as follows.

In the training data set construction stage, model training uses an open source data set on the one hand and, on the other hand, merges in data that annotators have labeled with a specially developed labeling system. The current document categories mainly comprise 10 types: text (text), title (title), table body (table_body), table title (table_title), table annotation (table_annotation), list (list), image (figure), annotation (annotation), page header (page_header) and page footer (footer). The annotations use the COCO data set format.
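As an illustration of the labeling format mentioned above, the following is a minimal sketch of a COCO-style annotation file that carries the ten categories. The category names are taken from the text; the numeric ids, the helper names `build_coco_skeleton` and `add_annotation`, and the use of the box area as the `area` field are assumptions made for this example only.

```python
# Category names as listed in the text; the numeric ids are an assumption.
CATEGORY_NAMES = [
    "text", "title", "table_body", "table_title", "table_annotation",
    "list", "figure", "annotation", "page_header", "footer",
]


def build_coco_skeleton(names):
    """Return an empty COCO-style dataset dict with one category per name."""
    return {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for i, n in enumerate(names, start=1)],
    }


def add_annotation(dataset, image_id, category_name, bbox, polygon):
    """Append one instance annotation.

    bbox is [x, y, width, height] as in COCO; polygon is a flat
    [x1, y1, x2, y2, ...] outline of the region.
    """
    name_to_id = {c["name"]: c["id"] for c in dataset["categories"]}
    annotation = {
        "id": len(dataset["annotations"]) + 1,
        "image_id": image_id,
        "category_id": name_to_id[category_name],
        "bbox": list(bbox),
        # Simplification: the box area stands in for the polygon area.
        "area": bbox[2] * bbox[3],
        "segmentation": [list(polygon)],
        "iscrowd": 0,
    }
    dataset["annotations"].append(annotation)
    return annotation
```

A title region on image 1 could then be recorded with `add_annotation(ds, 1, "title", [10, 20, 100, 30], [10, 20, 110, 20, 110, 50, 10, 50])`.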

The feature extraction backbone network construction stage adopts the idea of instance segmentation. Compared with object detection models, an instance segmentation model performs segmentation on top of detection and achieves pixel-level recognition, so the identified bounding box coordinates are more accurate, which safeguards the accuracy of later text detection and recognition. At the same time, a model with strong generalization capability can be trained with only a small data set.

In order to better extract features, the embodiment of the present invention uses a Feature Pyramid Network (FPN), a multi-scale object detection method that fuses the features of each level of the hierarchy. On the other hand, some image pages may contain large blank regions, the layout of each category's regions varies between images, and certain categories have positional relations to one another; for example, the table categories include table titles, table bodies and table annotations. In other words, features of different categories are related in space, which may degrade the recognition performance of the model. To avoid these disadvantages, the features of the different categories are first fully mined, which, given the characteristics of deep neural network structures, can be done by increasing the depth of the network, increasing the number of channels of the feature map, adopting multi-scale feature fusion, and the like. Then, spatial position information is fused in during model training to suppress the features common to all categories, and the recognition accuracy of the model is improved by strengthening the representation of specific regions. Therefore, ResNet + FPN is used as the backbone network, and a channel attention model and a spatial attention model are fused into the Feature Map generated by each ResBlock structure in the ResNet network, so that the Feature Map generated by the whole backbone network incorporates an attention mechanism. The network automatically learns the importance of the different feature channels and of each spatial position, so the features extracted from the image carry both spatial weights and per-channel weights, which improves the feature extraction capability of the model.

Suppose the Feature Map generated from an original image in the backbone network has size c × w × h, where c is the number of channels, w the width and h the height of the Feature Map. Global max pooling and global average pooling are applied to the Feature Map separately over the spatial dimensions, and each output is passed through a fully connected layer and a softmax activation function. The two resulting feature vectors are then added to obtain the weight of the channel attention model, and this weight is multiplied element-wise with the input features to obtain the Feature Map output by the channel attention model.
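The channel attention computation described above can be written compactly as follows. This is only a sketch of one reading of the text: the symbols F, z, W and M_c are introduced here for illustration, and the text does not say whether the two fully connected layers share weights or include a bias term, so separate weight matrices without bias are assumed.

```latex
\begin{aligned}
\mathbf{z}_{\max} &= \operatorname{GMP}(F) \in \mathbb{R}^{c}, \qquad
\mathbf{z}_{\mathrm{avg}} = \operatorname{GAP}(F) \in \mathbb{R}^{c},
\qquad F \in \mathbb{R}^{c \times w \times h},\\
M_c &= \operatorname{softmax}\!\big(W_{\max}\,\mathbf{z}_{\max}\big)
     + \operatorname{softmax}\!\big(W_{\mathrm{avg}}\,\mathbf{z}_{\mathrm{avg}}\big)
     \in \mathbb{R}^{c},\\
F'_{k,i,j} &= (M_c)_k \, F_{k,i,j}, \qquad k = 1,\dots,c .
\end{aligned}
```

Here GMP and GAP denote global max pooling and global average pooling over the w × h spatial positions, and each channel of the input is scaled by its own weight.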

The Feature Map output by the channel attention model is then subjected to max pooling and average pooling separately along the channel dimension, and the two resulting feature maps are concatenated. A convolution layer followed by a softmax activation layer then yields the weight of the spatial attention model, of size 1 × w × h, and this weight is multiplied element-wise with the output features of the channel attention model.
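In the same spirit, the spatial attention step can be sketched as below. The symbols F', P and M_s are introduced here for illustration; the kernel size of the convolution and the axis over which the softmax is normalized are not stated in the text, so the formula leaves the convolution generic and assumes the softmax runs over the w × h spatial positions.

```latex
\begin{aligned}
P_{\max}(i,j) &= \max_{k} F'_{k,i,j}, \qquad
P_{\mathrm{avg}}(i,j) = \frac{1}{c}\sum_{k=1}^{c} F'_{k,i,j},
\qquad F' \in \mathbb{R}^{c \times w \times h},\\
M_s &= \operatorname{softmax}\!\Big(\operatorname{Conv}\big([\,P_{\max};\,P_{\mathrm{avg}}\,]\big)\Big)
     \in \mathbb{R}^{1 \times w \times h},\\
F''_{k,i,j} &= M_s(i,j)\, F'_{k,i,j} .
\end{aligned}
```

Here F' is the output of the channel attention model and [ · ; · ] denotes concatenation along the channel dimension, giving a 2 × w × h input to the convolution.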

In the model training stage, the RPN module generates a fixed number of candidate regions (ROIs), performs foreground/background classification, and regresses the positions and sizes of the detection boxes to obtain the filtered ROI regions. The RoIAlign module uses bilinear interpolation to map the feature map of each ROI region to the corresponding original image region; by eliminating the rounding operation, it removes the positional misalignment between feature map and original image that RoIPooling causes, which improves detection precision. The Mask module classifies the ROI regions, performs bounding box regression on the candidates, and generates a mask, completing the instance segmentation task.
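To make the role of bilinear interpolation in RoIAlign concrete, the sketch below samples a single-channel feature map at real-valued coordinates and pools an ROI without rounding its boundaries. It is a simplified illustration, not the model's implementation: the function names are hypothetical, and only one sample per output bin (the bin center) is taken, whereas practical RoIAlign implementations usually sample several points per bin and aggregate them.

```python
import math


def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map (list of rows) at real-valued (x, y)."""
    height, width = len(feature), len(feature[0])
    # Clamp to the valid range so the four neighbours always exist.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align_centers(feature, roi, out_h, out_w):
    """Pool an ROI into an out_h x out_w grid by sampling each bin center.

    roi is (x_min, y_min, x_max, y_max) in feature-map coordinates and may
    contain fractional values; no coordinate is rounded.
    """
    x_min, y_min, x_max, y_max = roi
    bin_w = (x_max - x_min) / out_w
    bin_h = (y_max - y_min) / out_h
    return [
        [
            bilinear_sample(feature, x_min + (j + 0.5) * bin_w, y_min + (i + 0.5) * bin_h)
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]
```

Because the ROI boundaries stay fractional, a box such as (0.3, 0.7, 5.9, 4.2) is pooled at its true position instead of being snapped to the integer grid, which is the misalignment that RoIPooling introduces.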

Further, step S13, in which the text type of each segmentation region is extracted by using a preset document layout analysis model and the text sequence in each segmentation region is sequenced according to a preset layout rule and the text type of each segmentation region to obtain a plurality of text regions with correct text sequence, may include steps S131 to S134, as follows.

S131: Analyzing the text type of the segmentation area by using a document layout analysis model.

S132: Calculating a BoundingBox coordinate area corresponding to the segmentation area by using the text type of the segmentation area.

Specifically, according to the text type of the image, the text type in each segmentation area and its corresponding BoundingBox coordinate area are calculated, and the returned BoundingBox coordinate areas are processed further: the intersection over union (IoU ∈ [0, 1]) between each pair of bounding boxes is calculated. Where bounding boxes intersect, if the IoU of two or more bounding boxes is greater than a fixed threshold (for example, 0.98), the bounding boxes are considered to coincide completely, and the coordinates and category of the bounding box that is completely contained are removed.
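The overlap check described above can be sketched as follows. Boxes are assumed to be given as (x_min, y_min, x_max, y_max) tuples paired with a category label; the function names and the tie-breaking choice (keep the larger box of each near-identical pair) are assumptions for this example, since the text only says that the contained box is removed.

```python
def box_area(box):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two boxes; the result lies in [0, 1]."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def drop_coincident(detections, threshold=0.98):
    """Remove near-duplicate detections.

    detections is a list of (box, category) pairs. Boxes are visited from
    largest to smallest, and a box is dropped when its IoU with an already
    kept box exceeds the threshold. The original order is preserved.
    """
    order = sorted(range(len(detections)), key=lambda i: -box_area(detections[i][0]))
    kept = []
    for i in order:
        if all(iou(detections[i][0], detections[k][0]) <= threshold for k in kept):
            kept.append(i)
    return [detections[i] for i in sorted(kept)]
```

With the example threshold of 0.98, only boxes that coincide almost exactly are merged; ordinary partial overlaps between neighbouring regions are left untouched.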

S133: Determining the longitudinal sorting order of the segmentation areas by using the width of the BoundingBox coordinate area and the width of the corresponding segmentation area.

Specifically, from the width of each bounding box returned in the previous step, the ratio of the bounding box width to the width of the entire picture is calculated. All BoundingBox coordinate areas whose ratio is greater than a fixed threshold (for example, 0.5, i.e. wider than half of the picture) are found, and these bounding boxes are sorted from top to bottom along the Y axis.
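A minimal sketch of this width test follows. It assumes boxes are (x_min, y_min, x_max, y_max) tuples in image coordinates with the origin at the top-left corner, so a smaller y value is higher on the page; the function name `split_wide_boxes` is hypothetical.

```python
def split_wide_boxes(boxes, page_width, ratio_threshold=0.5):
    """Separate page-spanning boxes from the rest.

    Returns (wide, rest): wide holds the boxes whose width divided by the
    page width exceeds ratio_threshold, sorted top to bottom by y_min;
    rest holds the remaining boxes in their original order.
    """
    wide, rest = [], []
    for box in boxes:
        width_ratio = (box[2] - box[0]) / page_width
        if width_ratio > ratio_threshold:
            wide.append(box)
        else:
            rest.append(box)
    wide.sort(key=lambda b: b[1])
    return wide, rest
```

The wide boxes act as horizontal separators for the page, which is why they are ordered along the Y axis before the narrower boxes are handled.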

S134: Judging the text space in the segmentation area by using the height of the BoundingBox coordinate area.

Specifically, the layout of the whole image is divided into several areas based on the BoundingBox coordinate areas calculated above, and the remaining BoundingBox coordinates are assigned to these areas. For each area, it is then determined how many columns the layout of the categories in that area has; all BoundingBoxes in the area are sorted with left-to-right priority and, within that, from top to bottom by coordinate value, and are then added in sequence to the corresponding layout list. Finally, the BoundingBox coordinate areas whose width ratio to the whole image is greater than the fixed threshold are inserted at their corresponding positions (sorted along the Y axis).
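One possible reading of this step is sketched below. The wide boxes split the page into horizontal bands, the narrow boxes are assigned to bands by their vertical center, and each band is read column by column. The text does not specify how columns are detected, so grouping boxes by horizontal overlap is an assumed heuristic, and the function names are hypothetical. Boxes are (x_min, y_min, x_max, y_max) with the origin at the top-left corner.

```python
def order_band(band):
    """Order the boxes of one band: columns left to right, each top to bottom."""
    columns = []  # each entry: {"right": rightmost x so far, "boxes": [...]}
    for box in sorted(band, key=lambda b: b[0]):
        if columns and box[0] < columns[-1]["right"]:
            # Horizontal overlap with the current column: same column.
            columns[-1]["boxes"].append(box)
            columns[-1]["right"] = max(columns[-1]["right"], box[2])
        else:
            columns.append({"right": box[2], "boxes": [box]})
    ordered = []
    for column in columns:
        ordered.extend(sorted(column["boxes"], key=lambda b: b[1]))
    return ordered


def reading_order(wide, rest):
    """Merge wide boxes (already sorted top to bottom) with the narrow ones."""
    def center_y(box):
        return (box[1] + box[3]) / 2

    # Band i holds the narrow boxes above wide[i]; the last band holds
    # those below every wide box.
    bands = [[] for _ in range(len(wide) + 1)]
    for box in rest:
        index = sum(1 for w in wide if center_y(box) > center_y(w))
        bands[index].append(box)
    ordered = []
    for i, band in enumerate(bands):
        ordered.extend(order_band(band))
        if i < len(wide):
            ordered.append(wide[i])
    return ordered
```

For a page with a full-width title, a two-column body and a full-width table below it, this returns the title, the left column from top to bottom, the right column from top to bottom, and then the table.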

Further, the layout between lines is adjusted. Based on the Bounding Boxes identified by the OCR module, all Bounding Boxes are first pre-sorted by Y-axis coordinate. It is then judged whether the difference between the center coordinate of the current Bounding Box and that of the next Bounding Box is greater than half of the height of the current Bounding Box; if so, the current Bounding Box is judged to be at a line break. Once the line break positions are found, the Bounding Boxes of each line are sorted along the X axis, which completes the line-level sorting rule within a paragraph. Finally, some details are handled. For example, a line may end with a hyphen ("-") connector; simply deleting the character and joining the text directly to the next line is too crude, since "50-60" would become "5060" and cause a semantic error. Such details are handled by a rule based on whether the adjacent characters are letters or digits, and can also be judged with the help of NLP subtasks.
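The line-level rule and the hyphen rule can be sketched as follows. Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples with the origin at the top-left corner, and the pre-sort uses the vertical center of each box, which is one possible reading of "Y-axis coordinate". The function names, and the choice to insert a space between lines that do not end in a hyphen, are assumptions for this example.

```python
def group_into_lines(boxes):
    """Group OCR boxes into text lines, each sorted left to right.

    A new line starts when the vertical center of the next box is more
    than half the current box's height below the current box's center.
    """
    if not boxes:
        return []

    def center_y(box):
        return (box[1] + box[3]) / 2

    by_y = sorted(boxes, key=center_y)
    lines = [[by_y[0]]]
    for current, following in zip(by_y, by_y[1:]):
        half_height = (current[3] - current[1]) / 2
        if center_y(following) - center_y(current) > half_height:
            lines.append([following])
        else:
            lines[-1].append(following)
    return [sorted(line, key=lambda b: b[0]) for line in lines]


def join_lines(first, second):
    """Join two consecutive lines of text, handling a trailing hyphen.

    The hyphen is dropped only when it sits between two letters (a word
    split across lines); between digits it is kept, so "50-" followed by
    "60" stays "50-60".
    """
    if first.endswith("-") and len(first) > 1 and second:
        if first[-2].isalpha() and second[0].isalpha():
            return first[:-1] + second
        return first + second
    return first + " " + second
```

The letter-or-digit test is deliberately simple; ambiguous cases such as hyphenated compound words are where the NLP subtasks mentioned above would come in.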

Correspondingly, the embodiment of the present invention further discloses a document content classification system, as shown in fig. 2, the system includes:

the image conversion module 11 is configured to acquire a target document, convert the document into an image format, and obtain a target image corresponding to the target document;

the region classification module 12 is configured to extract content features from the target picture according to a preset classification standard by using a preset document content classification model, and perform region division on the target picture according to the content features to obtain a plurality of segmentation regions to be ordered;

the document layout module 13 is configured to extract the text type of each partition area by using a preset document layout analysis model, and sort the text sequence in each partition area according to a preset layout rule according to the text type of each partition area to obtain a plurality of text regions with correct text sequence;

the document reorganization module 14 is configured to reorder the text regions by using the content features and the text sequence of the text regions to obtain a reorganized document;

Specifically, the document layout module 13 includes a text type analysis unit, a BoundingBox calculation unit, a longitudinal sorting unit, and a space sorting unit.

The document content classification model adopts ResNet + FPN as a backbone network, and the Feature Map generated by each ResBlock structure in the ResNet network is fused first with a channel attention model and then with a spatial attention model, so that the Feature Map generated by the whole backbone network and fused with an attention mechanism is obtained.

The classification criteria include: text, title, table body, table title, table annotation, list, image, annotation, header, and footer.

In addition, the embodiment of the invention also discloses a document content classification device, which comprises:

a memory for storing a computer program;

a processor for executing a computer program to implement the document content classification method as described above.

In addition, the embodiment of the invention also discloses a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when being executed by a processor, the computer program realizes the document content classification method.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The technical content provided by the present invention is described in detail above, and the principle and the implementation of the present invention are explained in this document by applying specific examples, and the above description of the examples is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method for classifying document contents, comprising:

2. The method for classifying the document contents according to claim 1, wherein the document content classification model adopts ResNet + FPN as a backbone network, and for the Feature Map generated by each ResBlock structure in the ResNet network, a channel attention model is fused first, and then a spatial attention model is fused, so that the Feature Map generated by the whole backbone network and fused with an attention mechanism is obtained.

3. The method of classifying document contents according to claim 2, wherein the classification criteria include: text, title, table body, table title, table annotation, list, image, annotation, header, and footer.

4. The method for classifying document contents according to claim 2, wherein the process of extracting the text type of each divided region by using a preset document layout analysis model, and sorting the text sequence in each divided region according to a preset layout rule according to the text type of each divided region to obtain a plurality of text regions with correct text sequence comprises:

calculating a BoundingBox coordinate area corresponding to the segmentation area by using the text type of the segmentation area;

5. A document content classification system, comprising:

the document layout module is used for extracting the text type of each partition area by using a preset document layout analysis model, and sequencing the text sequence in each partition area according to a preset layout rule according to the text type of each partition area to obtain a plurality of text areas with correct text sequence;

6. The document content classification system of claim 5, wherein the document layout module comprises:

7. A document content classification apparatus, comprising:

a memory for storing a computer program;

a processor for executing the computer program to implement the document content classification method of any one of claims 1 to 4.

8. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, realizes the document content classification method according to any one of claims 1 to 4.