CN117173730A - Document image intelligent analysis and processing method based on multi-mode information - Google Patents

Publication number
CN117173730A
CN202311078756.3A
Authority
CN
China
Prior art keywords
analysis
document
image
text
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311078756.3A
Other languages
Chinese (zh)
Inventor
杨彤
李雪
陈其宾
姜凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Science Research Institute Co Ltd
Original Assignee
Shandong Inspur Science Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Science Research Institute Co Ltd filed Critical Shandong Inspur Science Research Institute Co Ltd
Priority to CN202311078756.3A priority Critical patent/CN117173730A/en
Publication of CN117173730A publication Critical patent/CN117173730A/en
Pending legal-status Critical Current

Links

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image processing, and in particular to an intelligent document-image analysis and processing method based on multi-modal information, comprising the following steps: constructing a multi-modal document training data set; fine-tuning a pre-trained model; multi-modal information extraction; multi-modal information analysis (text analysis, image analysis and layout analysis); multi-modal information fusion; and document processing. Beneficial effects: by extracting and fusing multi-modal information, the method comprehensively uses text information, image features and layout structure, fully exploits the correlation between text and image information, improves the accuracy and efficiency of document processing, and reduces manual operation and human error.

Description

Document image intelligent analysis and processing method based on multi-mode information
Technical Field
The invention relates to the technical field of image processing, in particular to a document image intelligent analysis and processing method based on multi-mode information.
Background
Intelligent analysis and processing of document images converts the text in an image into computer-readable text, automatically recognizes and extracts textual information, and can complete many tasks that would otherwise require extensive manual recognition, thereby reducing labor cost and improving production efficiency.
With the deepening research and development of multi-modal large models, processing and analyzing data of multiple types and sources has become easier and more accurate. Large numbers of pdf, docx and ppt documents are naturally multi-modal: besides text and pictures, the data inside them carry layout and format. Using multi-modal information (such as text, vision and audio) to understand and classify the content of document images enables richer and more diverse application scenarios.
However, in real scenarios, document-based multi-modal annotation data is scarce and very difficult to annotate; the diversity of layouts and formats and the complexity of low-quality scans and template structures further increase the difficulty of document-understanding tasks.
Disclosure of Invention
The invention aims to provide an intelligent document-image analysis and processing method based on multi-modal information, so as to solve the problems set forth in the background art.
In order to achieve the above purpose, the present invention provides the following technical solution: a document image intelligent analysis and processing method based on multi-modal information, comprising the following steps:
S1: constructing a multi-modal document training data set;
S2: fine-tuning a pre-trained model;
S3: multi-modal information extraction;
S4: multi-modal information analysis: text analysis, image analysis and layout analysis;
S5: multi-modal information fusion;
S6: document processing.
Preferably, in step S1, scanned document images from government-affairs systems are collected, including but not limited to document types such as files, letters and notes;
the data set is labeled with the LabelImg labeling tool to obtain a labeled data set; labeling categories include, but are not limited to: text, seals, signatures, lists, graphics and tables.
Preferably, in step S2, starting from the pre-trained LayoutLMv3 model, the model is further trained with a task-specific training data set.
Preferably, in step S3, multiple kinds of modality information, including text information, image features and layout structure, are extracted from the document image.
Preferably, in step S4,
text analysis: performing semantic understanding, keyword extraction and entity recognition tasks on the text by using the BERT model;
image analysis: object detection, image recognition and image classification are completed with a YOLO model, and image features extracted by the CNN can be fed into a Transformer model for further semantic understanding and correlation analysis;
layout analysis: the layout structure information is used for analyzing the layout of the document, including paragraph division, title extraction and header and footer detection.
Preferably, in step S5, the results of the text analysis, the image analysis and the layout analysis are fused by a Transformer-based model to obtain comprehensive document information.
Preferably, in step S6, automated document processing is performed based on the fused document information, including text digest generation, text classification, and document structuring.
Compared with the prior art, the invention has the beneficial effects that:
the intelligent analysis and processing method for the document image based on the multi-mode information adopts the method of extracting and fusing the multi-mode information, comprehensively utilizes various information such as text information, image characteristics and layout structures, fully utilizes the relevance between the text and the image information, improves the accuracy and efficiency of document processing, and reduces manual operation and human errors.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a cascade group attention module according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are only some, not all, of the embodiments of the present invention; they are intended to be illustrative rather than limiting, and all other embodiments obtained by a person of ordinary skill in the art without inventive effort fall within the scope of the present invention.
Referring to fig. 1 to 2, the present invention provides a technical solution: a document image intelligent analysis and processing method based on multi-modal information, including the following steps:
S1: constructing a multi-modal document training data set;
S2: fine-tuning a pre-trained model;
S3: multi-modal information extraction;
S4: multi-modal information analysis: text analysis, image analysis and layout analysis;
S5: multi-modal information fusion;
S6: document processing.
In step S1, scanned document images from government-affairs systems are collected, including but not limited to document types such as files (contracts, announcements, reports), letters and notes;
to increase the diversity and size of the data set, data-augmentation techniques such as random rotation, translation, scaling and flipping are applied to the images.
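The augmentations named above (flipping, translation, and random combination) can be sketched in plain Python on a 2-D pixel grid; this is an illustrative toy version, not the patent's implementation, and the function names, threshold and shift range are assumptions:

```python
import random

def hflip(img):
    """Horizontally flip a 2-D image given as a list of rows."""
    return [row[::-1] for row in img]

def translate(img, dx, dy, fill=0):
    """Shift an image by (dx, dy) pixels, padding vacated cells with `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out

def random_augment(img, seed=0):
    """Apply a random flip and a small random translation, as in step S1."""
    rng = random.Random(seed)
    if rng.random() < 0.5:
        img = hflip(img)
    return translate(img, rng.randint(-2, 2), rng.randint(-2, 2))
```

In practice a library such as torchvision or albumentations would supply these transforms, together with matching updates to the labeled bounding boxes.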
The data set is labeled with the LabelImg labeling tool to obtain a labeled data set; labeling categories include, but are not limited to: text, seals, signatures, lists, graphics, tables, and the like;
specifically, a set of document images containing different layout structures is selected and annotated. The labeling process includes drawing bounding boxes around target areas such as paragraphs, titles, headers and footers in the document, and assigning each target area a corresponding category label.
Specifically, image preprocessing operations include tilt correction, edge cropping, image denoising, wrinkle removal, shadow removal, moiré removal, watermark removal, and the like;
in step S2, the model is additionally trained on the basis of the LayoutLMv3 model using the task-specific training dataset on the basis of the pre-training model to adapt it to the requirements of the specific task.
Specifically, performing fine tuning training by combining the constructed multi-modal document training data set of the government affair system;
through fine tuning training, the model can learn multi-modal information in the document image and provide more accurate feature representations for subsequent analysis and processing steps.
Specifically, given an input document image and its corresponding text, image, layout position information, using a transform modeling, the masked token is predicted using the layout information and the masked text, image context sequence, thereby modeling the correlation between the layout, text, image modalities.
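The masking step of this objective can be sketched as follows; a minimal illustration only, where the `[MASK]` symbol, the 15% probability and the seed are conventional assumptions rather than values stated in the patent:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with [MASK]; return the masked sequence and
    labels (the original token at masked positions, None elsewhere).
    The model is then trained to predict the labels from the unmasked context."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```

In the full model, the same masking idea is applied to text and image token sequences while layout embeddings are left visible, which is what ties the modalities together.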
In step S3, multiple kinds of modality information, including text information, image features and layout structure, are extracted from the document image;
Text: includes word embedding, one-dimensional position embedding and two-dimensional position embedding; the one-dimensional and two-dimensional position embeddings are added to the text sequence features to obtain the text features;
specifically, word embedding is based on the pre-trained RoBERTa model; the one-dimensional position embedding encodes the index of the tokenized text sequence;
the two-dimensional position embedding encodes the bounding-box coordinates (x, y, w, h) of the tokenized text sequence, and segment-level layout positions are adopted so that the words in one segment share an embedding;
specifically, for text-information extraction, PaddleOCR is used to recognize the document image and obtain the 2D position information, and this information is fed into the trained model for prediction.
Image: a ResNeXt-FPN network is used for image encoding, and the document image is split into a patch sequence;
specifically, extracting feature images of an original image, carrying out average pooling operation to obtain feature images with the size of W×H, dividing the image into a series of uniform P×P image blocks, expanding according to rows, linearly layering and mapping to D dimension, and tiling to obtain the image with the length of W×H/P 2 The vector sequence of (2) and the image sequence feature are embedded in a one-dimensional position to obtain the image feature;
layout: and normalizing the acquired character coordinate position recognized by OCR to [0,1000] and rounding, mapping to a corresponding vector, and finally connecting the vectors corresponding to the abscissa and the ordinate.
Specifically, embedding is based on the one-dimensional and two-dimensional positions; the visual features and text features are fused into a unified sequence, distinguished by category (modality-type) embeddings, and each is summed with its corresponding layout features.
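Building the unified sequence with per-modality category embeddings can be sketched as follows; a minimal illustration in which the type vectors are stand-ins for learned embeddings, and the layout summation shown earlier is left out for brevity:

```python
def fuse_modalities(text_feats, image_feats, text_type, image_type):
    """Concatenate text and image token features into one unified sequence,
    adding a per-modality type ("category") embedding to every token so the
    downstream model can tell the modalities apart."""
    seq = [[f + t for f, t in zip(tok, text_type)] for tok in text_feats]
    seq += [[f + t for f, t in zip(tok, image_type)] for tok in image_feats]
    return seq
```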
In step S4,
Text analysis: semantic understanding, keyword extraction and entity recognition are performed on the text with the BERT model.
Image analysis: object detection, image recognition and image classification are completed with the YOLO model. Image features extracted by the CNN can be fed into a Transformer model for further semantic understanding and correlation analysis.
Layout analysis: the layout structure information is used to analyze the layout of the document, including paragraph division, title extraction, header and footer detection, and the like.
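Paragraph division from layout information can be sketched as a simple vertical-gap heuristic; an illustrative baseline only, where the gap threshold and the (y0, y1, text) line-box format are assumptions, not the patent's method:

```python
def group_paragraphs(lines, gap=12):
    """Group OCR line boxes (y0, y1, text) into paragraphs: lines are read
    top-to-bottom and a new paragraph starts wherever the vertical gap
    between consecutive lines exceeds `gap` pixels."""
    paras, cur = [], []
    prev_y1 = None
    for y0, y1, text in sorted(lines):
        if prev_y1 is not None and y0 - prev_y1 > gap:
            paras.append(" ".join(t for _, _, t in cur))
            cur = []
        cur.append((y0, y1, text))
        prev_y1 = y1
    if cur:
        paras.append(" ".join(t for _, _, t in cur))
    return paras
```

A learned layout model would replace this heuristic, but the gap rule is a common starting point for header/footer and paragraph segmentation.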
In step S5, the results of the text analysis, the image analysis and the layout analysis are fused by a Transformer-based model to obtain comprehensive document information.
Specifically, the basic Transformer model is modified by introducing a Cascaded Group Attention (CGA) module into the multi-head self-attention (MHSA) mechanism to enhance the diversity of the features fed to the attention heads. CGA feeds each head a different split of the input and concatenates the output features across heads; model capacity is increased by increasing network depth.
The j-th head computes self-attention over X_ij, the j-th split of the input feature X_i, where X_i = [X_i1, X_i2, …, X_ih] and 1 ≤ j ≤ h; h is the total number of heads. W_ij^Q, W_ij^K and W_ij^V are projection layers that map the query Q, key K and value V of the split input into different subspaces, and W_i^P is a linear layer that projects the concatenated output features back to a dimension consistent with the input:
X̃_ij = Attn(X_ij W_ij^Q, X_ij W_ij^K, X_ij W_ij^V),
X̃_i = Concat[X̃_i1, …, X̃_ih] W_i^P.
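Cascaded group attention can be sketched in plain Python as below; a toy version in which the Q/K/V projections are identities, the final W_i^P projection is omitted, and the detail of adding each head's output to the next head's input follows the EfficientViT-style CGA design, an assumption about particulars the text leaves implicit:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(x):
    """Scaled dot-product self-attention over a list of d-dim token vectors,
    with identity projections standing in for W^Q, W^K, W^V."""
    d = len(x[0])
    out = []
    for q in x:
        scores = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x])
        out.append([sum(w * v[i] for w, v in zip(scores, x)) for i in range(d)])
    return out

def cascaded_group_attention(x, heads):
    """Head j attends over the j-th channel split X_ij; in the cascade, the
    previous head's output is added to the next head's input split, and all
    head outputs are concatenated per token."""
    d = len(x[0])
    assert d % heads == 0, "feature dim must divide evenly into heads"
    s = d // heads
    prev, outs = None, []
    for j in range(heads):
        split = [tok[j * s:(j + 1) * s] for tok in x]
        if prev is not None:  # cascade: reuse the previous head's output
            split = [[a + b for a, b in zip(t, p)] for t, p in zip(split, prev)]
        prev = attend(split)
        outs.append(prev)
    # concatenate the per-token outputs of all heads
    return [[c for head in outs for c in head[i]] for i in range(len(x))]
```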
in step S6, automated document processing is performed based on the fused document information, including text digest generation, text classification, document structuring, and the like.
Specifically, taking government documents as an example: judging text information, extracting information such as titles, appendices, tables and the like;
the image recognition direction is used for judging whether a seal exists, a personnel signature exists or not; extracting a seal area and identifying seal characters; and extracting a signature area and identifying signature information.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A document image intelligent analysis and processing method based on multi-modal information, characterized in that the method comprises the following steps:
S1: constructing a multi-modal document training data set;
S2: fine-tuning a pre-trained model;
S3: multi-modal information extraction;
S4: multi-modal information analysis: text analysis, image analysis and layout analysis;
S5: multi-modal information fusion;
S6: document processing.
2. The intelligent analysis and processing method for document images based on multi-modal information according to claim 1, wherein: in step S1, collecting scanned document images under government affairs system, including but not limited to document types of files, letters and notes;
labeling the data set by using a LabelImg labeling tool to obtain a labeled data set; labeling categories include, but are not limited to: text, seals, signatures, lists, graphics and tables.
3. The intelligent analysis and processing method for document images based on multi-modal information according to claim 1, wherein: in step S2, starting from the pre-trained LayoutLMv3 model, the model is further trained with a task-specific training data set.
4. The intelligent analysis and processing method for document images based on multi-modal information according to claim 1, wherein: in step S3, multiple kinds of modality information, including text information, image features and layout structure, are extracted from the document image.
5. The intelligent analysis and processing method for document images based on multi-modal information according to claim 1, wherein: in step S4,
text analysis: semantic understanding, keyword extraction and entity recognition are performed on the text with the BERT model;
image analysis: object detection, image recognition and image classification are completed with a YOLO model, and image features extracted by the CNN can be fed into a Transformer model for further semantic understanding and correlation analysis;
layout analysis: the layout structure information is used to analyze the layout of the document, including paragraph division, title extraction, and header and footer detection.
6. The intelligent analysis and processing method for document images based on multi-modal information according to claim 1, wherein: in step S5, the results of the text analysis, the image analysis and the layout analysis are fused by a Transformer-based model to obtain comprehensive document information.
7. The intelligent analysis and processing method for document images based on multi-modal information according to claim 1, wherein: in step S6, automated document processing is performed based on the fused document information, including text digest generation, text classification, and document structuring.
CN202311078756.3A 2023-08-25 2023-08-25 Document image intelligent analysis and processing method based on multi-mode information Pending CN117173730A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311078756.3A CN117173730A (en) 2023-08-25 2023-08-25 Document image intelligent analysis and processing method based on multi-mode information


Publications (1)

Publication Number Publication Date
CN117173730A true CN117173730A (en) 2023-12-05

Family

ID=88946180



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117573839A (en) * 2024-01-12 2024-02-20 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium
CN117573839B (en) * 2024-01-12 2024-04-19 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination