CN116580411B - Instruction-based document image processing method and system

Info

Publication number: CN116580411B
Application number: CN202310843671.3A
Authority: CN
Inventors: 陈清财; 范玚; 吴湘平; 李恒
Original assignee: Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology
Current assignee: Harbin Institute Of Technology shenzhen Shenzhen Institute Of Science And Technology Innovation Harbin Institute Of Technology
Priority date: 2023-07-11
Filing date: 2023-07-11
Publication date: 2023-10-20
Anticipated expiration: 2043-07-11
Also published as: CN116580411A

Abstract

The invention discloses an instruction-based document image processing method and system. The method comprises the following steps: acquiring a document image and inputting it into a document image coding model to obtain visual features of the document image; acquiring a document processing operation instruction and inputting it into a document processing instruction analysis model to obtain a simple operation instruction sequence; inputting the simple operation instruction sequence into a document processing instruction coding model to obtain document instruction semantic features; inputting the visual features of the document image and the document instruction semantic features into a document multi-modal large model to obtain an image transformation operation sequence and modal output content; and acquiring a document processing revision instruction and completing the document image processing based on the document processing revision instruction. By deeply understanding the format and content of the document, the invention interacts effectively with the user, accurately completes customized document operations by analyzing the user's instruction, and iteratively revises the result according to user feedback.

Description

Instruction-based document image processing method and system

Technical Field

The invention relates to the technical field of document processing, in particular to a document image processing method and system based on instructions.

Background

Document processing is a technique that uses artificial intelligence to understand documents and to assist users in accomplishing a variety of tasks on electronic documents. Document processing tasks include, but are not limited to, text detection and recognition, document layout analysis, table detection and structure restoration, warp detection and correction, and stamp detection and elimination. In conventional document processing, different tasks differ in the input and output modalities of the model and in the model capabilities required to complete them, so a dedicated model has to be designed for each task, which fragments model capabilities.

In recent years, with the development of multi-modal technology, the parameter scale of artificial intelligence models has continued to grow and their performance has improved. This makes it feasible to build multi-modal complex document processing that is compatible with multi-modal input and output, completes multiple document processing tasks at once, and understands personalized user instructions so as to define a document operation sequence autonomously. Most existing document processing methods, however, target specific tasks and still cannot effectively understand the text of user instructions, so they cannot interact freely with users. For example, the patent with application number CN202211669026.6, "Document processing model training method, document processing method, device and equipment", constructs layout knowledge problems only from layout information and preset templates. That method is designed for layout-related tasks and does not cover broader document processing tasks, so the problems it can solve are limited, and it performs poorly on user instructions that fall outside the template styles.

Accordingly, there is a need for improvement and development in the art.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an instruction-based document image processing method and system that address the shortcomings of the prior art, namely that user instruction text cannot be effectively understood and that the processing performance for user instructions is poor.

The technical scheme adopted by the invention for solving the problems is as follows:

In a first aspect, an embodiment of the present invention provides an instruction-based document image processing method, the method including:

acquiring a document image, inputting the document image into a document image coding model, and acquiring document image visual characteristics corresponding to the document image, wherein the document image visual characteristics are used for representing image visual content information of the document image;

acquiring a document processing operation instruction, inputting the document processing operation instruction into a document processing instruction analysis model, and obtaining a simple operation instruction sequence, wherein the document processing operation instruction is a language description of processing operation to be performed on a document image by a user, and the simple operation instruction sequence is a document image processing operation list to be performed on the document image;

Inputting the simple operation instruction sequence into a document processing instruction coding model to obtain document instruction semantic features corresponding to the simple operation instruction sequence, wherein the document instruction semantic features are used for representing text content information corresponding to the simple operation instruction sequence;

inputting the visual features of the document image and the semantic features of the document instruction into a document multi-mode large model to obtain an image transformation operation sequence of the document image and mode output content corresponding to the document processing operation instruction;

and acquiring a document processing revision instruction; if the document processing revision instruction is not empty and is not in a confirmation state, acquiring a fused document processing operation instruction and, based on the fused document processing operation instruction, executing again the operation flow that starts from acquiring the document processing operation instruction and inputting it into the document processing instruction analysis model to obtain a simple operation instruction sequence; and if the document processing revision instruction is empty or is in the confirmation state, ending the processing flow of the document image.
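The revision step above amounts to a loop around the first four steps. The following is a minimal sketch of that control flow only, not the patented implementation. The names `revision_loop`, `run_pipeline`, `get_revision` and `fuse`, the `CONFIRM` marker used to represent the confirmation state, and the `max_rounds` safety limit are all assumptions introduced for illustration; the source does not specify them.

```python
from typing import Any, Callable, List, Optional, Tuple

# Hypothetical marker for "revision instruction is in the confirmation state".
CONFIRM = "confirm"


def revision_loop(
    image: Any,
    instruction: str,
    run_pipeline: Callable[[Any, str], Any],
    get_revision: Callable[[Any], Optional[str]],
    fuse: Callable[[str, str], str],
    max_rounds: int = 10,  # assumption: a safety bound, not stated in the source
) -> List[Tuple[str, Any]]:
    """Run the pipeline, then repeat it with a fused instruction until the
    revision instruction is empty or signals confirmation."""
    history: List[Tuple[str, Any]] = []
    for _ in range(max_rounds):
        # Stand-in for the parse -> encode -> multi-modal model steps.
        result = run_pipeline(image, instruction)
        history.append((instruction, result))

        revision = get_revision(result)
        is_empty = revision is None or revision.strip() == ""
        if is_empty or revision.strip().lower() == CONFIRM:
            break  # processing flow of the document image ends

        # Otherwise continue with the fused (history + revision) instruction.
        instruction = fuse(instruction, revision)
    return history
```

In this sketch the caller supplies the pipeline and the fusion function, so the loop itself stays independent of any particular model.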

In one implementation manner, the inputting the simple operation instruction sequence into a document processing instruction coding model to obtain the document instruction semantic feature corresponding to the simple operation instruction sequence includes:

Embedding the text of the simple operation instruction sequence to obtain a text embedding feature;

and inputting the text embedded features into the document processing instruction coding model to perform feature extraction to obtain document instruction semantic features corresponding to the simple operation instruction sequence.

In one implementation, the document multi-modal large model includes a document image understanding module, a document instruction semantic understanding module, a multi-modal fusion module, a visual thinking chain module and a decoding module, and the inputting the visual features of the document image and the document instruction semantic features into the document multi-modal large model to obtain an image transformation operation sequence of the document image and modal output content corresponding to the document processing operation instruction includes:

inputting the visual characteristics of the document image to the document image understanding module to obtain visual intermediate characteristics;

inputting the document instruction semantic features to the document instruction semantic understanding module to obtain text intermediate features;

inputting the visual intermediate features and the text intermediate features into the document multi-mode fusion module to obtain multi-mode intermediate features, wherein the multi-mode intermediate features fuse two kinds of mode information;

Inputting the multi-modal intermediate features to the visual thinking chain module to obtain an image transformation operation sequence of the document image;

and inputting the multi-mode intermediate features to the decoding module to obtain single-mode output content or multi-mode output content corresponding to the document processing operation instruction.
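The data flow through the five modules listed above can be sketched as follows. This is only an illustration of how the modules are wired together; each module is represented by a plain callable, and the class name `DocumentMultimodalModel`, its field names and the `forward` method are hypothetical. The internal architecture of each module is not given in the source.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class DocumentMultimodalModel:
    # Each field is a stand-in callable for one module named in the text.
    image_understanding: Callable[[Any], Any]        # visual features -> visual intermediate features
    instruction_understanding: Callable[[Any], Any]  # semantic features -> text intermediate features
    fusion: Callable[[Any, Any], Any]                # both -> multi-modal intermediate features
    visual_chain: Callable[[Any], List[str]]         # fused -> image transformation operation sequence
    decoder: Callable[[Any], Any]                    # fused -> modal output content

    def forward(self, visual_features: Any, instruction_features: Any) -> Tuple[List[str], Any]:
        visual_mid = self.image_understanding(visual_features)
        text_mid = self.instruction_understanding(instruction_features)
        fused = self.fusion(visual_mid, text_mid)
        # The visual thinking chain module and the decoding module both
        # consume the same fused multi-modal intermediate features.
        return self.visual_chain(fused), self.decoder(fused)
```

Note that the two outputs branch from one shared fused representation, which is the structure the steps above describe.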

In one implementation, the simple operation instruction sequence includes at least one document image processing operation, the document image processing operations comprising: document image preprocessing, text detection, text content identification, document layout analysis, table detection, table structure restoration, defect detection and correction, object detection, object extraction and object elimination.

In one implementation, the acquiring the fused document processing operation instruction includes:

and marking the document processing operation instruction as a document history operation, marking the document processing revision instruction as a document current operation, and sequentially splicing the document processing operation instruction and the document processing revision instruction to obtain a fused document processing operation instruction.
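The marking and splicing described above can be sketched as a simple string operation. The tag strings and the function name `fuse_instructions` are assumptions chosen for illustration; the source states only that the earlier instruction is marked as the document history operation, the revision as the document current operation, and that the two are spliced in order.

```python
# Hypothetical tag strings; the source does not specify the marker format.
HISTORY_TAG = "[document history operation]"
CURRENT_TAG = "[document current operation]"


def fuse_instructions(operation_instruction: str, revision_instruction: str) -> str:
    """Mark the earlier instruction as history and the revision as current,
    then splice them in that order."""
    return (
        f"{HISTORY_TAG} {operation_instruction.strip()} "
        f"{CURRENT_TAG} {revision_instruction.strip()}"
    )
```

Because the output is itself a single instruction string, it can be fused again with a later revision, which keeps the whole interaction history in order.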

In one implementation, the training process of the document multi-modal large model includes:

inputting the single-mode document image data and document text content data into the document multi-mode large model, and carrying out single-mode pre-training on the document multi-mode large model;

Inputting multimodal document image-text content pair data into the document multimodal big model, and performing multimodal pre-training on the document multimodal big model;

generating a document processing instruction fine-tuning data set based on document processing task data, wherein the document processing instruction fine-tuning data set comprises a preset document processing operation instruction, a document image, an image transformation operation visual thinking chain output label of the document image, and a document processing result label of the document image;

inputting the preset document processing operation instruction and the document image into the document multi-mode large model to generate an image transformation operation visual thinking chain prediction result and a document processing prediction result of the document image;

and performing instruction fine tuning training on the document multi-mode large model based on the image transformation operation visual thinking chain output label, the document processing result label, the image transformation operation visual thinking chain prediction result and the document processing prediction result.
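As an illustration of the instruction fine-tuning data described above, the sketch below shows one possible record layout and a toy objective that compares both predictions with both labels. Everything here is an assumption: the source does not state the loss function, so the position-wise error rate and the `chain_weight` mixing factor are stand-ins, and the names `InstructionTuningSample`, `sequence_error` and `finetune_objective` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class InstructionTuningSample:
    instruction: str         # preset document processing operation instruction
    image_path: str          # hypothetical reference to the document image
    chain_label: List[str]   # visual thinking chain output label
    result_label: str        # document processing result label


def sequence_error(pred: Sequence[str], label: Sequence[str]) -> float:
    """Fraction of positions that differ; a length mismatch counts as errors."""
    n = max(len(pred), len(label))
    if n == 0:
        return 0.0
    mismatches = sum(1 for p, l in zip(pred, label) if p != l)
    mismatches += abs(len(pred) - len(label))
    return mismatches / n


def finetune_objective(
    chain_pred: Sequence[str],
    result_pred: str,
    sample: InstructionTuningSample,
    chain_weight: float = 0.5,  # assumption: equal weighting of the two terms
) -> float:
    """Toy stand-in for a training loss over both prediction/label pairs."""
    chain_err = sequence_error(chain_pred, sample.chain_label)
    result_err = sequence_error(result_pred.split(), sample.result_label.split())
    return chain_weight * chain_err + (1.0 - chain_weight) * result_err
```

A real training setup would replace the error rates with differentiable losses; the point of the sketch is only that both the operation chain and the final result are supervised.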

In a second aspect, an embodiment of the present invention further provides an instruction-based document image processing system, where the system includes:

the document image visual characteristic acquisition module is used for acquiring a document image, inputting the document image into the document image coding model, and acquiring document image visual characteristics corresponding to the document image, wherein the document image visual characteristics are used for representing image visual content information of the document image;

The simple operation instruction sequence acquisition module is used for acquiring a document processing operation instruction, inputting the document processing operation instruction into the document processing instruction analysis model, and acquiring a simple operation instruction sequence, wherein the document processing operation instruction is a language description of a processing operation to be performed on the document image by a user, and the simple operation instruction sequence is a document image processing operation list to be performed on the document image;

the document instruction semantic feature acquisition module is used for inputting the simple operation instruction sequence into a document processing instruction coding model to acquire document instruction semantic features corresponding to the simple operation instruction sequence, wherein the document instruction semantic features are used for representing text content information corresponding to the simple operation instruction sequence;

the result output module is used for inputting the visual characteristics of the document image and the semantic characteristics of the document instruction into a document multi-mode large model to obtain an image transformation operation sequence of the document image and mode output content corresponding to the document processing operation instruction;

and the revision module is used for acquiring a document processing revision instruction; if the document processing revision instruction is not empty and is not in a confirmation state, acquiring a fused document processing operation instruction and, based on the fused document processing operation instruction, executing again the operation flow that starts from acquiring the document processing operation instruction and inputting it into the document processing instruction analysis model to obtain a simple operation instruction sequence; and if the document processing revision instruction is empty or is in the confirmation state, ending the processing flow of the document image.

In one implementation, the document instruction semantic feature acquisition module includes:

the embedded feature acquisition unit is used for embedding the text of the simple operation instruction sequence to obtain text embedded features;

the semantic feature acquisition unit is used for inputting the text embedded features into the document processing instruction coding model to perform feature extraction, so as to obtain document instruction semantic features corresponding to the simple operation instruction sequence.

In one implementation, the document multi-modal large model comprises a document image understanding module, a document instruction semantic understanding module, a multi-modal fusion module, a visual thinking chain module and a decoding module, wherein the document image understanding module is used for extracting visual intermediate features, the document instruction semantic understanding module is used for extracting text intermediate features, the multi-modal fusion module is used for extracting multi-modal intermediate features, the visual thinking chain module is used for generating the image transformation operations of the document image item by item, and the decoding module is used for generating modal output content that meets the requirements of the document processing operation instruction.

In one implementation, the result output module includes:

the visual intermediate feature acquisition unit is used for inputting the visual features of the document image to the document image understanding module to obtain the visual intermediate features;

The text intermediate feature acquisition unit is used for inputting the document instruction semantic features to the document instruction semantic understanding module to obtain text intermediate features;

the multi-mode intermediate feature acquisition unit is used for inputting the visual intermediate feature and the text intermediate feature into the multi-mode fusion module to obtain a multi-mode intermediate feature, and the multi-mode intermediate feature fuses two mode information;

the image transformation operation sequence acquisition unit is used for inputting the multi-mode intermediate features to the visual thinking chain module to obtain an image transformation operation sequence of the document image;

and the modal output content acquisition unit is used for inputting the multi-modal intermediate characteristics to the decoding module to obtain single-modal output content or multi-modal output content corresponding to the document processing operation instruction.

In one implementation, the revision module includes:

and the fusion document processing operation instruction acquisition unit is used for marking the document processing operation instruction as document history operation, marking the document processing revision instruction as document current operation, and sequentially splicing the document processing operation instruction and the document processing revision instruction to obtain the fusion document processing operation instruction.

In a third aspect, an embodiment of the present invention further provides a terminal device comprising a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for executing the instruction-based document image processing method according to any one of the above.

In a fourth aspect, embodiments of the present invention also provide a non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the instruction-based document image processing method according to any one of the above.

The invention has the following beneficial effects. Compared with the prior art, the invention provides an instruction-based document image processing method and system. First, a document image is acquired and input into a document image coding model to obtain the document image visual features corresponding to the document image, wherein the document image visual features represent the image visual content information of the document image. Then a document processing operation instruction is acquired and input into a document processing instruction analysis model to obtain a simple operation instruction sequence, wherein the document processing operation instruction is a language description of the processing operation that the user wants performed on the document image, and the simple operation instruction sequence is the list of document image processing operations to be performed on the document image. Next, the simple operation instruction sequence is input into a document processing instruction coding model to obtain the corresponding document instruction semantic features, which represent the text content information of the simple operation instruction sequence. The document image visual features and the document instruction semantic features are then input into a document multi-modal large model to obtain an image transformation operation sequence of the document image and modal output content corresponding to the document processing operation instruction. Finally, a document processing revision instruction is acquired: if it is not empty and is not in a confirmation state, a fused document processing operation instruction is acquired and, based on it, the operation flow that starts from acquiring the document processing operation instruction and inputting it into the document processing instruction analysis model is executed again; if it is empty or is in the confirmation state, the processing flow of the document image ends. By extracting document features, the invention understands the format and content of the document; by interacting with the user, it analyzes the document processing operation instruction that the user inputs, generates customized processing operations for the document, and revises the result over multiple iterations according to user feedback until the user's requirement is met.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below illustrate only some embodiments of the present invention; a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a method for instruction-based document image processing according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of a multimodal document processing model provided by an embodiment of the invention.

FIG. 3 is a functional block diagram of a system for instruction-based document image processing provided by an embodiment of the present invention.

FIG. 4 is a schematic block diagram of an internal structure of a terminal device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

It should be noted that, if directional indications (such as up, down, left, right, front, and rear) are included in the embodiments of the present invention, the directional indications are merely used to explain the relative positional relationship, movement conditions, etc. between the components in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indications are correspondingly changed.

In the prior art, a user cannot interact effectively with the system when a document is processed, and the user's instruction text cannot be effectively understood, so user instructions on documents are processed poorly.

In order to solve the problems in the prior art, the present embodiment provides an instruction-based document image processing method that improves document processing performance. In a specific implementation, a document image is first acquired and input into a document image coding model to obtain the document image visual features corresponding to the document image, wherein the document image visual features represent the image visual content information of the document image. Then a document processing operation instruction is acquired and input into a document processing instruction analysis model to obtain a simple operation instruction sequence, wherein the document processing operation instruction is a language description of the processing operation that the user wants performed on the document image, and the simple operation instruction sequence is the list of document image processing operations to be performed on the document image. Next, the simple operation instruction sequence is input into a document processing instruction coding model to obtain the corresponding document instruction semantic features, which represent the text content information of the simple operation instruction sequence. The document image visual features and the document instruction semantic features are then input into a document multi-modal large model to obtain an image transformation operation sequence of the document image and modal output content corresponding to the document processing operation instruction. Finally, a document processing revision instruction is acquired: if it is not empty and is not in a confirmation state, a fused document processing operation instruction is acquired and, based on it, the operation flow that starts from acquiring the document processing operation instruction and inputting it into the document processing instruction analysis model is executed again; if it is empty or is in the confirmation state, the processing flow of the document image ends. The invention thus achieves a deep understanding of the document by extracting its features, interacts with the user to acquire the user's operation instruction, generates customized processing operations for the document by analyzing that instruction, and revises the result over multiple iterations according to user feedback until the user's requirement is met.

For example, a user inputs a document image; a server acquires the document image and inputs it into a document image coding model for feature extraction, obtaining the document image visual features corresponding to the document image. The server then acquires the document processing operation instruction input by the user, that is, the processing operation the user wants performed on the document, and inputs it into the document processing instruction analysis model, which analyzes the user's instruction and yields a simple operation instruction sequence. Next, the simple operation instruction sequence is input into a document processing instruction coding model to obtain the corresponding document instruction semantic features, that is, the text content information of the simple operation instruction sequence. The document image visual features and the document instruction semantic features are then input into a document multi-modal large model to obtain an image transformation operation sequence of the document image and modal output content corresponding to the document processing operation instruction. Finally, a document processing revision instruction is acquired. If it is not empty and is not in a confirmation state, a fused document processing operation instruction is acquired and input into the document processing instruction analysis model to continue the document image processing. If it is empty or is in the confirmation state, the current output content meets the user's requirement and the processing flow of the document image ends.
The invention thus achieves a deep understanding of both the document and the user's instruction, generates customized document operations from the features the models extract, and iterates according to the document processing revision instructions fed back by the user until the user's requirement is met, which improves document processing performance.

Exemplary method

The embodiment of the invention provides an instruction-based document image processing method, which can be applied to terminal equipment. As shown in FIG. 1, the method includes:

Step S100, acquiring a document image, and inputting the document image into a document image coding model to obtain document image visual features corresponding to the document image, wherein the document image visual features are used for representing image visual content information of the document image.

In this embodiment, the document image is any image that contains text; it may contain optical characters, handwritten text, seals, tables, watermarks, artistic trademark text, and the like.

In the implementation, firstly, a document image is acquired, the document image is input into the document image coding model to extract the characteristics of the document image, and the visual characteristics of the document image are obtained. The visual features of the document image are used for representing the visual information of the document image, and deep understanding of the document image can be realized by extracting the features of the document image, so that the document image can be operated by better combining with a document processing operation instruction input by a user.

Step S200, acquiring a document processing operation instruction, and inputting the document processing operation instruction into a document processing instruction analysis model to obtain a simple operation instruction sequence, wherein the document processing operation instruction is a language description of a processing operation to be performed on the document image by a user, and the simple operation instruction sequence is a document image processing operation list to be performed on the document image.

In this embodiment, the document processing operation instruction is a text instruction that the user sends when interacting with the model. It contains information such as the operation the user expects to be performed on the document image to be processed, or the result the user expects to obtain. For example, if the document processing operation instruction input by the user is "watermark removal", the user wants the watermark removed from the document image to be processed and expects a watermark-free document image. The document processing operation instruction may be colloquial or written, and may be in Chinese or English; its form is not limited. The document processing instruction analysis model may take various forms, for example the large-scale pre-trained language model Pythia.

In a specific implementation, the document processing operation instruction sent by the user is first acquired and then input into the document processing instruction analysis model for instruction analysis, which yields a simple operation instruction sequence. The simple operation instruction sequence is a list of the document image processing operations to be performed on the document image. For example, when several operations are to be performed on the document image, the sequence may be "1. Document layout analysis 2. Table structure restoration 3. Object elimination". The simple operation instruction sequence makes it easy for the model to identify the operations to be carried out.
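To make the sequence format concrete, the following sketch splits a numbered sequence string of the kind shown above into an ordered list of operations. It is a hedged illustration only: the function name `parse_operation_sequence` is hypothetical, and it assumes items are introduced by "N." markers, which the source shows in its example but does not mandate.

```python
import re
from typing import List


def parse_operation_sequence(text: str) -> List[str]:
    """Split a string such as '1. A 2. B 3. C' into ['A', 'B', 'C'].

    Assumes items are introduced by an integer followed by a period.
    An operation name that itself contains such a pattern (for example
    a version number like '2.0') would be split incorrectly.
    """
    parts = re.split(r"\s*\d+\.\s*", text)
    return [part.strip() for part in parts if part.strip()]


sequence = parse_operation_sequence(
    "1. Document layout analysis 2. Table structure restoration 3. Object elimination"
)
```

The order of the returned list is the order in which the operations would be applied to the document image.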

In one implementation, the simple operation instruction sequence includes at least one document image processing operation, the document image processing operations comprising: document image preprocessing, text detection, text content identification, document layout analysis, table detection, table structure restoration, defect detection and correction, object detection, object extraction and object elimination. The kind and order of the document image processing operations correspond to the desired operations or desired results in the document processing operation instruction input by the user.
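One way to use the operation list above is as a closed vocabulary against which a parsed sequence is checked. The sketch below is an assumption about how such a check might look; the source lists the operation types but does not describe a validation step, and the names `ALLOWED_OPERATIONS` and `validate_sequence` are hypothetical.

```python
from typing import List, Sequence

# The ten operation types named in the text, lower-cased for comparison.
ALLOWED_OPERATIONS = (
    "document image preprocessing",
    "text detection",
    "text content identification",
    "document layout analysis",
    "table detection",
    "table structure restoration",
    "defect detection and correction",
    "object detection",
    "object extraction",
    "object elimination",
)


def validate_sequence(operations: Sequence[str]) -> List[str]:
    """Return the normalized sequence, preserving order, or raise ValueError."""
    if not operations:
        raise ValueError("the sequence must contain at least one operation")
    normalized = [op.strip().lower() for op in operations]
    unknown = [op for op in normalized if op not in ALLOWED_OPERATIONS]
    if unknown:
        raise ValueError(f"unknown operations: {unknown}")
    return normalized
```

Order is preserved because the text ties both the kind and the order of operations to the user's instruction.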

Step 300, inputting the simple operation instruction sequence into a document processing instruction coding model to obtain document instruction semantic features corresponding to the simple operation instruction sequence, wherein the document instruction semantic features are used for representing text content information corresponding to the simple operation instruction sequence.

In this embodiment, the document processing instruction coding model may take various forms, including a pre-trained language model. It may be a monolingual model for a particular language, or a multilingual model capable of processing several languages at once. If the input simple operation instruction sequence is known to be in a single language, the monolingual model can be called directly to improve processing speed; if the simple operation instruction sequence is mixed text in several languages, the multilingual model is called. In this way the processing speed of the document image is improved while the meaning of the simple operation instructions is still captured.
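The choice between the monolingual and the multilingual coding model can be sketched as a simple dispatch. The sketch below is only an illustration under a strong assumption: it uses the writing script of each letter (CJK versus Latin) as a coarse proxy for language, which the description does not specify. The names `scripts_in` and `choose_encoder` are hypothetical.

```python
import unicodedata


def scripts_in(text: str) -> set[str]:
    """Return coarse script labels for the letters in text (assumed heuristic)."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation and spaces carry no script information
        name = unicodedata.name(ch, "")
        if name.startswith("CJK"):
            scripts.add("cjk")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    return scripts


def choose_encoder(sequence_text: str) -> str:
    """Pick the monolingual model for single-script input, else the multilingual one."""
    return "monolingual" if len(scripts_in(sequence_text)) <= 1 else "multilingual"
```

A real system would more likely rely on a language identification step; the point here is only that the cheaper monolingual path is taken whenever the input is known to be in one language.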

In a specific implementation, text embedding is performed on the simple operation instruction sequence to obtain text embedding features; the text embedding features are then input into the document processing instruction coding model for feature extraction, so as to obtain the document instruction semantic features corresponding to the simple operation instruction sequence.

Step 400, inputting the document image visual features and the document instruction semantic features into a document multi-mode large model to obtain an image transformation operation sequence of the document image and mode output content corresponding to the document processing operation instruction.

In this embodiment, as shown in fig. 2, the document multi-modal large model includes a document image understanding module, a document instruction semantic understanding module, a multi-modal fusion module, a visual thinking chain module, and a decoding module. The document image is processed by these modules embedded in the document multi-modal large model, and their collaboration improves the document image processing capability.

In a specific implementation, the document image visual features are first input to the document image understanding module to obtain visual intermediate features. The document instruction semantic features are then input to the document instruction semantic understanding module to obtain text intermediate features. Next, the visual intermediate features and the text intermediate features are input to the multi-mode fusion module to obtain multi-mode intermediate features, which fuse the information of the two modalities. The multi-mode intermediate features are input to the visual thinking chain module to obtain the image transformation operation sequence of the document image. Finally, the multi-mode intermediate features are input to the decoding module to obtain the single-mode or multi-mode output content corresponding to the document processing operation instruction. Through the cooperation of the modules in the document multi-mode large model, the extracted document image visual features and document instruction semantic features are fused to generate the image transformation operation sequence and the mode output content. This multi-module collaboration can improve both the effect and the speed of document image processing.
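The data flow through the five modules can be sketched as follows. This is a wiring illustration only: each module is represented by an arbitrary callable, and the class name `DocumentMultimodalModel` and its field names are hypothetical stand-ins rather than the actual network definitions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DocumentMultimodalModel:
    """Wiring sketch of the five modules; each field is a placeholder callable."""

    image_understanding: Callable[[Any], Any]
    instruction_understanding: Callable[[Any], Any]
    fusion: Callable[[Any, Any], Any]
    visual_chain_of_thought: Callable[[Any], list]
    decoder: Callable[[Any], Any]

    def forward(self, visual_features: Any, instruction_features: Any):
        # Each modality first passes through its own understanding module.
        visual_mid = self.image_understanding(visual_features)
        text_mid = self.instruction_understanding(instruction_features)
        # The fusion module combines both into multi-mode intermediate features.
        fused = self.fusion(visual_mid, text_mid)
        # Both output heads read the same fused features.
        operations = self.visual_chain_of_thought(fused)
        output = self.decoder(fused)
        return operations, output
```

Note that, as in the description, the visual thinking chain module and the decoding module both take the multi-mode intermediate features as input, so the two outputs are produced from one shared fused representation.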

In one implementation, the document image understanding module may be implemented by a Vision Transformer (ViT) structure, the document instruction semantic understanding module may be implemented by a Transformer Encoder structure, the multi-modal fusion module may be implemented by a BEiT-3 structure, the visual thinking chain module may be implemented by a Transformer Decoder structure, and the decoding module may be implemented by SentencePiece Decoder and VQ-VAE Decoder structures.

In one implementation, the image transformation operation sequence of the document image is composed of image transformation operations applied to the document image. The image transformation operations include noise reduction, rotation detection, distortion correction, watermark removal, seal removal, image sharpening, image binarization, super resolution, and the like.
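How an image transformation operation sequence might be executed can be sketched with a small dispatch table. The sketch assumes a grayscale image stored as nested lists of 0 to 255 values, and it uses deliberately simple stand-ins (a three-tap horizontal median filter for noise reduction and a fixed threshold for binarization); the names `TRANSFORMS` and `apply_sequence` are hypothetical and the actual operations are not specified at this level in the description.

```python
from typing import Callable

Image = list[list[int]]  # grayscale pixels in the range 0-255


def reduce_noise(image: Image) -> Image:
    """Toy noise reduction: 3-tap horizontal median with replicated edges."""
    out = []
    for row in image:
        padded = [row[0]] + row + [row[-1]]
        out.append([sorted(padded[i:i + 3])[1] for i in range(len(row))])
    return out


def binarize(image: Image, threshold: int = 128) -> Image:
    """Toy binarization with a fixed threshold (an assumed default)."""
    return [[255 if px >= threshold else 0 for px in row] for row in image]


# Maps operation names from the sequence to their implementations.
TRANSFORMS: dict[str, Callable[[Image], Image]] = {
    "noise reduction": reduce_noise,
    "image binarization": binarize,
}


def apply_sequence(image: Image, operations: list[str]) -> Image:
    """Apply the named operations one after another, in sequence order."""
    for name in operations:
        image = TRANSFORMS[name](image)
    return image
```

Because each operation consumes the result of the previous one, the order of the sequence matters, which is why the model generates the operations item by item.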

In one implementation, the training process of the document multi-modal large model includes the following. First, single-mode document image data and document text content data are input into the document multi-modal large model to perform single-mode pre-training. Then, multi-mode document image-text content pair data are input into the document multi-modal large model to perform multi-mode pre-training. Next, a document processing instruction fine-tuning data set is generated based on document processing task data; the data set comprises a preset document processing operation instruction, the document image, an image transformation operation visual thinking chain output label of the document image, and a document processing result label. The preset document processing operation instruction and the document image are input into the document multi-modal large model to generate an image transformation operation visual thinking chain prediction result and a document processing prediction result for the document image. Finally, instruction fine-tuning training is performed on the document multi-modal large model based on the image transformation operation visual thinking chain output label, the document processing result label, the image transformation operation visual thinking chain prediction result and the document processing prediction result, which completes the training of the document multi-modal large model.
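The three training stages and the contents of one fine-tuning record can be outlined as below. This is a schematic sketch under stated assumptions: `train_step` is a hypothetical callback standing in for whatever optimization step is actually used, the stage names are illustrative labels, and no loss function is shown because the description does not define one.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class InstructionTuningSample:
    """One record of the instruction fine-tuning data set (field names assumed)."""

    instruction: str       # preset document processing operation instruction
    document_image: Any    # the document image
    vcot_label: list[str]  # labelled image transformation operation sequence
    result_label: Any      # labelled document processing result


# Stage order as given in the description.
STAGES = (
    "single_mode_pretraining",
    "multi_mode_pretraining",
    "instruction_fine_tuning",
)


def run_training(
    train_step: Callable[[str, Any], None],
    datasets: dict[str, Iterable[Any]],
) -> list[str]:
    """Run the stages in order and return the names of the stages executed."""
    executed = []
    for stage in STAGES:
        for batch in datasets[stage]:
            train_step(stage, batch)
        executed.append(stage)
    return executed
```

In the fine-tuning stage each batch would hold `InstructionTuningSample` records, so that the two predictions can be compared with the two labels.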

Step S500, acquiring a document processing revision instruction; if the document processing revision instruction is not empty and is not in a confirmation state, acquiring a fused document processing operation instruction and, based on the fused document processing operation instruction, continuing to execute the operation flow of acquiring the document processing operation instruction and inputting it into the document processing instruction analysis model to obtain the simple operation instruction sequence; and if the document processing revision instruction is empty or in the confirmation state, ending the processing flow of the document image.

In this embodiment, the document processing revision instruction is obtained. If the document processing revision instruction is not empty and is not in a confirmation state, the current output has not reached the user's expectation and further processing is required. First, the document processing operation instruction is marked as the document history operation, and the document processing revision instruction is marked as the document current operation. Then, the document processing operation instruction and the document processing revision instruction are spliced in sequence to obtain the fused document processing operation instruction, that is, the document processing operation instruction to be executed in the new round of document image processing. Finally, based on the fused document processing operation instruction, the operation flow of acquiring the document processing operation instruction and inputting it into the document processing instruction analysis model to obtain the simple operation instruction sequence is executed again, so that the document image is reprocessed according to the new revision instruction fed back by the user. Marking the document history operation and the document current operation makes clear which operation was performed in the previous round and failed to reach the expectation, and which update the current instruction applies to it, so that the user's expectation can be better met. If the document processing revision instruction is empty or in the confirmation state, the current output has reached the user's expectation and the document image processing ends.
The confirmation state means that the document processing revision instruction is a user reply with positive semantics, indicating that the mode output content meets the user's requirement and the user is satisfied with the current document processing result. For example, the document processing revision instruction contains a positive reply such as "satisfactory" or "no objection". Whether the obtained result meets the user's expectation is confirmed from the document processing revision instruction input by the user. If it does not, the document image is further processed based on the revision instruction fed back by the user, until the document processing revision instruction is empty or in the confirmation state, that is, until the user is satisfied with the obtained result. Such timely interaction with the user can improve the processing effect of the document image.
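The revision loop described above can be sketched as follows. Several details are assumptions made only for illustration: the confirmation check is a plain lookup of the two example replies rather than a semantic classifier, the `[history]` and `[current]` tags are a hypothetical way of marking the two parts before splicing, and `max_rounds` is an added safety limit that the description does not mention.

```python
from typing import Any, Callable

# Example positive replies quoted in the description; a real system would
# need a semantic check rather than exact matching.
CONFIRMATIONS = {"satisfactory", "no objection"}


def is_confirmation(revision: str) -> bool:
    """Stand-in for detecting a reply with positive semantics."""
    return revision.strip().lower() in CONFIRMATIONS


def fuse_instructions(operation: str, revision: str) -> str:
    """Mark history and current operations, then splice them in order."""
    return f"[history] {operation} [current] {revision}"


def process_with_revisions(
    instruction: str,
    run_pipeline: Callable[[str], Any],
    get_revision: Callable[[Any], str],
    max_rounds: int = 10,
) -> Any:
    """Re-run the pipeline until the revision is empty or a confirmation."""
    result = run_pipeline(instruction)
    for _ in range(max_rounds):
        revision = get_revision(result)
        if not revision or is_confirmation(revision):
            break  # empty or confirmed: processing ends
        # The fused instruction becomes the operation instruction of the next round.
        instruction = fuse_instructions(instruction, revision)
        result = run_pipeline(instruction)
    return result
```

Here `run_pipeline` stands for the whole flow from instruction analysis to mode output, and `get_revision` stands for asking the user for feedback on the current result.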

In one implementation, there may be zero, one, or more document processing revision instructions, each being text input by the user to request revision of the obtained document processing result. If the user is not satisfied with the obtained document image processing result, the user inputs a further document processing revision instruction, from which a new fused document processing operation instruction is generated for further processing of the document image.

The invention understands the format and content of the document, interacts effectively with the user to obtain the operation instruction the user expects, and analyzes that operation instruction. It then fuses the image features of the document with the text features of the analyzed operation instruction to generate intermediate features carrying multi-mode information, generates the image transformation operations of the document image item by item in a visual thinking chain (Visual Chain of Thought, VCoT) manner, and uses the decoding module to decode the output into the corresponding modality. It receives in time the document processing revision instruction fed back by the user to determine whether the output meets the expectation, and iterates until the user's requirement is met. Instruction-based document image processing is thereby realized: the document processing model can be flexibly combined with user input to generate customized document operations that meet the user's requirements, which improves document processing performance.

Exemplary System

Based on the above embodiment, the present invention further provides an instruction-based document image processing system. As shown in fig. 3, the system in this embodiment includes: a document image visual feature obtaining module S10, configured to obtain a document image and input the document image into a document image coding model to obtain document image visual features corresponding to the document image, where the document image visual features are used to characterize image visual content information of the document image; a simple operation instruction sequence obtaining module S20, configured to obtain a document processing operation instruction and input the document processing operation instruction into a document processing instruction analysis model to obtain a simple operation instruction sequence, where the document processing operation instruction is a language description of the processing operation the user wants performed on the document image, and the simple operation instruction sequence is a list of document image processing operations to be performed on the document image; a document instruction semantic feature obtaining module S30, configured to input the simple operation instruction sequence into a document processing instruction coding model to obtain document instruction semantic features corresponding to the simple operation instruction sequence, where the document instruction semantic features are used to represent text content information corresponding to the simple operation instruction sequence; a result output module S40, configured to input the document image visual features and the document instruction semantic features into a document multi-mode large model to obtain an image transformation operation sequence of the document image and mode output content corresponding to the document processing operation instruction; and a revision module S50, configured to acquire a document processing revision instruction, to acquire a fused document processing operation instruction if the document processing revision instruction is not empty and is not in a confirmation state and, based on the fused document processing operation instruction, continue to execute the operation flow of acquiring the document processing operation instruction and inputting it into the document processing instruction analysis model to obtain the simple operation instruction sequence, and to end the processing flow of the document image if the document processing revision instruction is empty or in the confirmation state.

In one implementation, the system includes a document instruction semantics acquisition module that includes:

In one implementation, the document multi-modal large model includes a document image understanding module, a document instruction semantic understanding module, a multi-modal fusion module, a visual thinking chain module and a decoding module, wherein the document image understanding module is used for extracting visual intermediate features, the document instruction semantic understanding module is used for extracting text intermediate features, the multi-modal fusion module is used for extracting multi-modal intermediate features, the visual thinking chain module is used for generating the image transformation operation sequence of the document image item by item, and the decoding module is used for generating modal output content that meets the requirement of the document processing operation instruction.

In one implementation, the system includes a result output module that includes:

In one implementation, the revision module includes:

Based on the above embodiment, the present invention further provides a terminal device, a schematic block diagram of which may be as shown in fig. 4. The terminal device is the upper computer in the above embodiment, such as an ATM. The terminal device may comprise one or more processors 100 (only one is shown in fig. 4), a memory 101, and a computer program 102 stored in the memory 101 and executable on the one or more processors 100, for example a program of the instruction-based document image processing method. When executing the computer program 102, the one or more processors 100 may implement the steps in the embodiments of the instruction-based document image processing method. Alternatively, when executing the computer program 102, the one or more processors 100 may implement the functions of the modules/units in the embodiments of the instruction-based document image processing system, which is not limited here.

In one embodiment, the processor 100 may be a central processing unit (Central Processing Unit, CPU), but may also be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

In one embodiment, the memory 101 may be an internal storage unit of the electronic device, such as a hard disk or a memory of the electronic device. The memory 101 may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card or the like provided on the electronic device. Further, the memory 101 may include both an internal storage unit and an external storage device of the electronic device. The memory 101 is used to store the computer program and other programs and data required by the terminal device. The memory 101 may also be used for temporarily storing data that has been output or is to be output.

It will be appreciated by persons skilled in the art that the functional block diagram shown in fig. 4 is merely a block diagram of some of the structures associated with the solution of the present invention and does not limit the terminal device to which the solution of the present invention is applied; a particular terminal device may include more or fewer components than shown, may combine some of the components, or may have a different arrangement of components.

Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program, which may be stored on a non-volatile computer readable storage medium and which, when executed, may comprise the steps of the above described method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments of the invention may include non-volatile and/or volatile memory. The non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

In summary, the invention discloses an instruction-based document image processing method and system, wherein the method comprises: obtaining a document image and inputting the document image into a document image coding model to obtain document image visual features corresponding to the document image, wherein the document image visual features are used for representing image visual content information of the document image; acquiring a document processing operation instruction and inputting the document processing operation instruction into a document processing instruction analysis model to obtain a simple operation instruction sequence, wherein the document processing operation instruction is a language description of the processing operation the user wants performed on the document image, and the simple operation instruction sequence is a list of document image processing operations to be performed on the document image; inputting the simple operation instruction sequence into a document processing instruction coding model to obtain document instruction semantic features corresponding to the simple operation instruction sequence, wherein the document instruction semantic features are used for representing text content information corresponding to the simple operation instruction sequence; inputting the document image visual features and the document instruction semantic features into a document multi-mode large model to obtain an image transformation operation sequence of the document image and mode output content corresponding to the document processing operation instruction; and acquiring a document processing revision instruction, and if the document processing revision instruction is not empty and is not in a confirmation state, acquiring a fused document processing operation instruction and, based on the fused document processing operation instruction, continuing to execute the operation flow of acquiring the document processing operation instruction and inputting it into the document processing instruction analysis model to obtain the simple operation instruction sequence, and if the document processing revision instruction is empty or in the confirmation state, ending the processing flow of the document image. By extracting the features of the document to achieve a deep understanding of it, interacting effectively with the user, analyzing the document processing operation instruction input by the user to generate the processing operations for the document, and revising iteratively according to user feedback until the user's requirement is met, the invention can improve the speed and capability of document processing.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for processing a document image based on instructions, the method comprising:

acquiring a document processing revision instruction; if the document processing revision instruction is not empty and is not in a confirmation state, acquiring a fused document processing operation instruction and, based on the fused document processing operation instruction, continuing to execute the operation flow of acquiring the document processing operation instruction and inputting the document processing operation instruction into a document processing instruction analysis model to obtain a simple operation instruction sequence; and if the document processing revision instruction is empty or is in the confirmation state, ending the processing flow of the document image;

the document multi-mode large model comprises a document image understanding module, a document instruction semantic understanding module, a multi-mode fusion module, a visual thinking chain module and a decoding module, and the inputting of the document image visual characteristics and the document instruction semantic characteristics into the document multi-mode large model to obtain an image transformation operation sequence of the document image and mode output content corresponding to the document processing operation instruction comprises:

inputting the visual intermediate features and the text intermediate features into the multi-modal fusion module to obtain multi-modal intermediate features, wherein the multi-modal intermediate features fuse two modal information;

2. The method of claim 1, wherein inputting the simple sequence of operation instructions into a document processing instruction encoding model results in document instruction semantic features corresponding to the simple sequence of operation instructions, comprising:

3. The method of claim 1, wherein the sequence of simple operation instructions comprises at least one document image processing operation comprising: document image preprocessing, text detection, text content identification, document layout analysis, table detection, table structure restoration, defect detection and correction, object detection, object extraction and object elimination.

4. The method of claim 1, wherein the obtaining fused document processing operation instructions comprises:

and marking the document processing operation instruction as a document history operation, marking the document processing revision instruction as a document current operation, and sequentially splicing the document processing operation instruction and the document processing revision instruction to obtain the fused document processing operation instruction.

5. The method of claim 1, wherein the training process of the document multi-modal large model comprises:

6. An instruction-based document image processing system, the system comprising:

the revising module is used for acquiring a document processing revising instruction; if the document processing revising instruction is not empty and is not in a confirmation state, acquiring a fused document processing operation instruction and, based on the fused document processing operation instruction, continuing to execute the operation flow of acquiring the document processing operation instruction and inputting the document processing operation instruction into a document processing instruction analysis model to obtain a simple operation instruction sequence; and if the document processing revising instruction is empty or in the confirmation state, ending the document image processing;

The document multi-mode large model comprises a document image understanding module, a document instruction semantic understanding module, a multi-mode fusion module, a visual thinking chain module and a decoding module;

the result output module is specifically configured to:

7. A terminal device comprising a memory, a processor and a program of an instruction-based document image processing method stored in the memory and executable on the processor, the processor implementing the steps of the instruction-based document image processing method according to any one of claims 1 to 5 when executing the program of the instruction-based document image processing method.

8. A computer-readable storage medium, on which a program of a method of instruction-based document image processing is stored, which program of instruction-based document image processing method, when executed by a processor, implements the steps of the instruction-based document image processing method according to any one of claims 1 to 5.