CN117173730A - Document image intelligent analysis and processing method based on multi-mode information - Google Patents

Publication number
CN117173730A
CN202311078756.3A
Authority
CN
China
Prior art keywords
analysis
document
image
text
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311078756.3A
Other languages
Chinese (zh)
Inventor
杨彤
李雪
陈其宾
姜凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Science Research Institute Co Ltd
Original Assignee
Shandong Inspur Science Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Science Research Institute Co Ltd filed Critical Shandong Inspur Science Research Institute Co Ltd
Priority to CN202311078756.3A priority Critical patent/CN117173730A/en
Publication of CN117173730A publication Critical patent/CN117173730A/en
Pending legal-status Critical Current

Links

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image processing, and in particular to an intelligent document-image analysis and processing method based on multi-modal information, comprising the following steps: constructing a multi-modal document training data set; fine-tuning a pre-trained model; multi-modal information extraction; multi-modal information analysis (text analysis, image analysis and layout analysis); multi-modal information fusion; and document processing. Beneficial effects: by extracting and fusing multi-modal information, the method comprehensively uses text information, image features and layout structure, fully exploits the correlation between text and image information, improves the accuracy and efficiency of document processing, and reduces manual operation and human error.

Description

Document image intelligent analysis and processing method based on multi-mode information
Technical Field
The invention relates to the technical field of image processing, in particular to a document image intelligent analysis and processing method based on multi-mode information.
Background
Intelligent analysis and processing of document images converts the text in an image into computer-readable text, automatically recognizes and extracts textual information, and can complete many tasks that would otherwise require extensive manual recognition, thereby reducing labor cost and improving production efficiency.
With the deepening research and development of multi-modal large models, processing and analyzing data of multiple types and sources has become easier and more accurate. Large numbers of pdf, docx and ppt documents are naturally multi-modal: besides text and pictures, the data inside them carry layout and format. Using multi-modal information (such as text, vision and audio) to understand and classify the content of document images enables richer and more diverse application scenarios.
However, in real scenarios, document-based multi-modal annotation data is scarce and very difficult to annotate; the diversity of layouts and formats and the complexity of low-quality scans and template structures further increase the difficulty of document-understanding tasks.
Disclosure of Invention
The invention aims to provide an intelligent document-image analysis and processing method based on multi-modal information, so as to solve the problems set forth in the background art.
In order to achieve the above purpose, the present invention provides the following technical solution: a document image intelligent analysis and processing method based on multi-modal information, comprising the following steps:
S1: constructing a multi-modal document training data set;
S2: fine-tuning a pre-trained model;
S3: multi-modal information extraction;
S4: multi-modal information analysis: text analysis, image analysis and layout analysis;
S5: multi-modal information fusion;
S6: document processing.
Preferably, in step S1, scanned document images from government-affairs systems are collected, including but not limited to document types such as files, letters and notes;
the data set is labeled with the LabelImg labeling tool to obtain a labeled data set; labeling categories include, but are not limited to: text, seals, signatures, lists, graphics and tables.
Preferably, in step S2, starting from the pre-trained LayoutLMv3 model, the model is further trained with a task-specific training data set.
Preferably, in step S3, multiple kinds of modality information, including text information, image features and layout structure, are extracted from the document image.
Preferably, in step S4,
text analysis: performing semantic understanding, keyword extraction and entity recognition tasks on the text by using the BERT model;
image analysis: object detection, image recognition and image classification are completed with a YOLO model, and image features extracted by the CNN can be fed into a Transformer model for further semantic understanding and correlation analysis;
layout analysis: the layout structure information is used for analyzing the layout of the document, including paragraph division, title extraction and header and footer detection.
Preferably, in step S5, the results of the text analysis, the image analysis and the layout analysis are fused by a Transformer-based model to obtain comprehensive document information.
Preferably, in step S6, automated document processing is performed based on the fused document information, including text digest generation, text classification, and document structuring.
Compared with the prior art, the invention has the beneficial effects that:
the intelligent analysis and processing method for the document image based on the multi-mode information adopts the method of extracting and fusing the multi-mode information, comprehensively utilizes various information such as text information, image characteristics and layout structures, fully utilizes the relevance between the text and the image information, improves the accuracy and efficiency of document processing, and reduces manual operation and human errors.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a cascade group attention module according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are only some, not all, of the embodiments of the present invention; they are intended to be illustrative rather than limiting, and all other embodiments obtained by a person of ordinary skill in the art without inventive effort fall within the scope of the present invention.
Referring to fig. 1 to 2, the present invention provides a technical solution: a document image intelligent analysis and processing method based on multi-modal information, including the following steps:
S1: constructing a multi-modal document training data set;
S2: fine-tuning a pre-trained model;
S3: multi-modal information extraction;
S4: multi-modal information analysis: text analysis, image analysis and layout analysis;
S5: multi-modal information fusion;
S6: document processing.
In step S1, scanned document images from government-affairs systems are collected, including but not limited to document types such as files (contracts, announcements, reports), letters and notes;
to increase the diversity and size of the data set, data-augmentation techniques such as random rotation, translation, scaling and flipping are applied to the images.
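The augmentations named above (flipping, translation, and random combination) can be sketched in plain Python on a 2-D pixel grid; this is an illustrative toy version, not the patent's implementation, and the function names, threshold and shift range are assumptions:

```python
import random

def hflip(img):
    """Horizontally flip a 2-D image given as a list of rows."""
    return [row[::-1] for row in img]

def translate(img, dx, dy, fill=0):
    """Shift an image by (dx, dy) pixels, padding vacated cells with `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out

def random_augment(img, seed=0):
    """Apply a random flip and a small random translation, as in step S1."""
    rng = random.Random(seed)
    if rng.random() < 0.5:
        img = hflip(img)
    return translate(img, rng.randint(-2, 2), rng.randint(-2, 2))
```

In practice a library such as torchvision or albumentations would supply these transforms, together with matching updates to the labeled bounding boxes.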
The data set is labeled with the LabelImg labeling tool to obtain a labeled data set; labeling categories include, but are not limited to: text, seals, signatures, lists, graphics, tables, and the like;
specifically, a set of document images containing different layout structures is selected and annotated. The labeling process includes drawing bounding boxes around target areas such as paragraphs, titles, headers and footers in the document, and assigning each target area a corresponding category label.
Specifically, image preprocessing operations include tilt correction, edge cropping, image denoising, wrinkle removal, shadow removal, moiré removal, watermark removal, and the like;
in step S2, the model is additionally trained on the basis of the LayoutLMv3 model using the task-specific training dataset on the basis of the pre-training model to adapt it to the requirements of the specific task.
Specifically, performing fine tuning training by combining the constructed multi-modal document training data set of the government affair system;
through fine tuning training, the model can learn multi-modal information in the document image and provide more accurate feature representations for subsequent analysis and processing steps.
Specifically, given an input document image and its corresponding text, image, layout position information, using a transform modeling, the masked token is predicted using the layout information and the masked text, image context sequence, thereby modeling the correlation between the layout, text, image modalities.
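The masking step of this objective can be sketched as follows; a minimal illustration only, where the `[MASK]` symbol, the 15% probability and the seed are conventional assumptions rather than values stated in the patent:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with [MASK]; return the masked sequence and
    labels (the original token at masked positions, None elsewhere).
    The model is then trained to predict the labels from the unmasked context."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels
```

In the full model, the same masking idea is applied to text and image token sequences while layout embeddings are left visible, which is what ties the modalities together.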
In step S3, multiple kinds of modality information, including text information, image features and layout structure, are extracted from the document image;
Text: includes word embedding, one-dimensional position embedding and two-dimensional position embedding; the one-dimensional and two-dimensional position embeddings are added to the text sequence features to obtain the text features;
specifically, word embedding is based on the pre-trained RoBERTa model; the one-dimensional position embedding encodes the index of the tokenized text sequence;
the two-dimensional position embedding encodes the bounding-box coordinates (x, y, w, h) of the tokenized text sequence, and segment-level layout positions are adopted so that the words in one segment share an embedding;
specifically, for text-information extraction, PaddleOCR is used to recognize the document image and obtain the 2D position information, and this information is fed into the trained model for prediction.
Image: a ResNeXt-FPN network is used for image encoding, and the document image is split into a patch sequence;
specifically, extracting feature images of an original image, carrying out average pooling operation to obtain feature images with the size of W×H, dividing the image into a series of uniform P×P image blocks, expanding according to rows, linearly layering and mapping to D dimension, and tiling to obtain the image with the length of W×H/P 2 The vector sequence of (2) and the image sequence feature are embedded in a one-dimensional position to obtain the image feature;
layout: and normalizing the acquired character coordinate position recognized by OCR to [0,1000] and rounding, mapping to a corresponding vector, and finally connecting the vectors corresponding to the abscissa and the ordinate.
Specifically, embedding is based on the one-dimensional and two-dimensional positions; the visual features and text features are fused into a unified sequence, distinguished by category (modality-type) embeddings, and each is summed with its corresponding layout features.
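Building the unified sequence with per-modality category embeddings can be sketched as follows; a minimal illustration in which the type vectors are stand-ins for learned embeddings, and the layout summation shown earlier is left out for brevity:

```python
def fuse_modalities(text_feats, image_feats, text_type, image_type):
    """Concatenate text and image token features into one unified sequence,
    adding a per-modality type ("category") embedding to every token so the
    downstream model can tell the modalities apart."""
    seq = [[f + t for f, t in zip(tok, text_type)] for tok in text_feats]
    seq += [[f + t for f, t in zip(tok, image_type)] for tok in image_feats]
    return seq
```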
In step S4,
Text analysis: semantic understanding, keyword extraction and entity recognition are performed on the text with the BERT model.
Image analysis: object detection, image recognition and image classification are completed with the YOLO model. Image features extracted by the CNN can be fed into a Transformer model for further semantic understanding and correlation analysis.
Layout analysis: the layout structure information is used to analyze the layout of the document, including paragraph division, title extraction, header and footer detection, and the like.
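Paragraph division from layout information can be sketched as a simple vertical-gap heuristic; an illustrative baseline only, where the gap threshold and the (y0, y1, text) line-box format are assumptions, not the patent's method:

```python
def group_paragraphs(lines, gap=12):
    """Group OCR line boxes (y0, y1, text) into paragraphs: lines are read
    top-to-bottom and a new paragraph starts wherever the vertical gap
    between consecutive lines exceeds `gap` pixels."""
    paras, cur = [], []
    prev_y1 = None
    for y0, y1, text in sorted(lines):
        if prev_y1 is not None and y0 - prev_y1 > gap:
            paras.append(" ".join(t for _, _, t in cur))
            cur = []
        cur.append((y0, y1, text))
        prev_y1 = y1
    if cur:
        paras.append(" ".join(t for _, _, t in cur))
    return paras
```

A learned layout model would replace this heuristic, but the gap rule is a common starting point for header/footer and paragraph segmentation.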
In step S5, the results of the text analysis, the image analysis and the layout analysis are fused by a Transformer-based model to obtain comprehensive document information.
Specifically, the basic Transformer model is modified by introducing a Cascaded Group Attention (CGA) module into the multi-head self-attention (MHSA) mechanism to enhance the diversity of the features fed to the attention heads. CGA feeds each head a different split of the input and concatenates the output features across heads; model capacity is increased by increasing network depth.
The j-th head computes self-attention over X_ij, the j-th split of the input feature X_i, where X_i = [X_i1, X_i2, …, X_ih] and 1 ≤ j ≤ h; h is the total number of heads. W_ij^Q, W_ij^K and W_ij^V are projection layers that map the query Q, key K and value V of the split input into different subspaces, and W_i^P is a linear layer that projects the concatenated output features back to a dimension consistent with the input:
X̃_ij = Attn(X_ij W_ij^Q, X_ij W_ij^K, X_ij W_ij^V),
X̃_i = Concat[X̃_i1, …, X̃_ih] W_i^P.
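Cascaded group attention can be sketched in plain Python as below; a toy version in which the Q/K/V projections are identities, the final W_i^P projection is omitted, and the detail of adding each head's output to the next head's input follows the EfficientViT-style CGA design, an assumption about particulars the text leaves implicit:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(x):
    """Scaled dot-product self-attention over a list of d-dim token vectors,
    with identity projections standing in for W^Q, W^K, W^V."""
    d = len(x[0])
    out = []
    for q in x:
        scores = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x])
        out.append([sum(w * v[i] for w, v in zip(scores, x)) for i in range(d)])
    return out

def cascaded_group_attention(x, heads):
    """Head j attends over the j-th channel split X_ij; in the cascade, the
    previous head's output is added to the next head's input split, and all
    head outputs are concatenated per token."""
    d = len(x[0])
    assert d % heads == 0, "feature dim must divide evenly into heads"
    s = d // heads
    prev, outs = None, []
    for j in range(heads):
        split = [tok[j * s:(j + 1) * s] for tok in x]
        if prev is not None:  # cascade: reuse the previous head's output
            split = [[a + b for a, b in zip(t, p)] for t, p in zip(split, prev)]
        prev = attend(split)
        outs.append(prev)
    # concatenate the per-token outputs of all heads
    return [[c for head in outs for c in head[i]] for i in range(len(x))]
```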
in step S6, automated document processing is performed based on the fused document information, including text digest generation, text classification, document structuring, and the like.
Specifically, taking government documents as an example: judging text information, extracting information such as titles, appendices, tables and the like;
the image recognition direction is used for judging whether a seal exists, a personnel signature exists or not; extracting a seal area and identifying seal characters; and extracting a signature area and identifying signature information.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A document image intelligent analysis and processing method based on multi-modal information, characterized in that the method comprises the following steps:
S1: constructing a multi-modal document training data set;
S2: fine-tuning a pre-trained model;
S3: multi-modal information extraction;
S4: multi-modal information analysis: text analysis, image analysis and layout analysis;
S5: multi-modal information fusion;
S6: document processing.
2. The intelligent analysis and processing method for document images based on multi-modal information according to claim 1, wherein: in step S1, collecting scanned document images under government affairs system, including but not limited to document types of files, letters and notes;
labeling the data set by using a LabelImg labeling tool to obtain a labeled data set; labeling categories include, but are not limited to: text, seals, signatures, lists, graphics and tables.
3. The intelligent analysis and processing method for document images based on multi-modal information according to claim 1, wherein: in step S2, starting from the pre-trained LayoutLMv3 model, the model is further trained with a task-specific training data set.
4. The intelligent analysis and processing method for document images based on multi-modal information according to claim 1, wherein: in step S3, multiple kinds of modality information, including text information, image features and layout structure, are extracted from the document image.
5. The intelligent analysis and processing method for document images based on multi-modal information according to claim 1, wherein: in step S4,
text analysis: semantic understanding, keyword extraction and entity recognition are performed on the text with the BERT model;
image analysis: object detection, image recognition and image classification are completed with a YOLO model, and image features extracted by the CNN can be fed into a Transformer model for further semantic understanding and correlation analysis;
layout analysis: the layout structure information is used to analyze the layout of the document, including paragraph division, title extraction, and header and footer detection.
6. The intelligent analysis and processing method for document images based on multi-modal information according to claim 1, wherein: in step S5, the results of the text analysis, the image analysis and the layout analysis are fused by a Transformer-based model to obtain comprehensive document information.
7. The intelligent analysis and processing method for document images based on multi-modal information according to claim 1, wherein: in step S6, automated document processing is performed based on the fused document information, including text digest generation, text classification, and document structuring.
CN202311078756.3A 2023-08-25 2023-08-25 Document image intelligent analysis and processing method based on multi-mode information Pending CN117173730A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311078756.3A CN117173730A (en) 2023-08-25 2023-08-25 Document image intelligent analysis and processing method based on multi-mode information


Publications (1)

Publication Number Publication Date
CN117173730A true CN117173730A (en) 2023-12-05

Family

ID=88946180



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117573839A (en) * 2024-01-12 2024-02-20 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium
CN117573839B (en) * 2024-01-12 2024-04-19 阿里云计算有限公司 Document retrieval method, man-machine interaction method, electronic device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination