CN110020646B - File archiving method and device, electronic equipment and storage medium

Info

Publication number: CN110020646B
Application number: CN201910304382.XA
Authority: CN
Inventors: 赵岩; 黄业博; 李�杰
Original assignee: Hundsun Technologies Inc
Current assignee: Hundsun Technologies Inc
Priority date: 2019-04-16
Filing date: 2019-04-16
Publication date: 2021-07-27
Anticipated expiration: 2039-04-16
Also published as: CN110020646A

Abstract

The application provides a file archiving method and device, electronic equipment, and a storage medium. The file archiving method comprises: splitting a file image to obtain a text region block set and a linear structure set of the file image, where the text region block set comprises a first-line text region block and a last-line text region block; matching the text in the text region block set and the linear structures in the linear structure set, respectively, in an archive task library to obtain a matching archive task, where the matching archive task is an archive task in the archive task library that matches the text in the text region block set of the file image and/or an archive task that matches a linear structure in the linear structure set of the file image; and entering the file image into the matching archive task. Entering file images in this way improves the accuracy and the degree of automation of the entry system.

Description

File archiving method and device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of data processing, in particular to a file archiving method, a file archiving device, electronic equipment and a storage medium.

Background

Many paper documents need to be archived today. Traditional paper-based storage can no longer meet current needs, and it gives rise to a series of storage problems.

With the development of science and technology, archiving is now mainly done electronically through an entry system. For various reasons, however, entry systems have not achieved fully paperless, data-based entry and storage of files. During entry the degree of automation is too low: an operator must photograph or scan each file to be entered, and a file can be entered only after the operator has manually checked it, one file at a time.

Because the operating steps of the whole entry system are complicated and involve cumbersome human-computer interaction, the acquisition process has a high labor and time cost and a low fault tolerance. To reduce that cost and improve fault tolerance, the degree of automation and the accuracy of the entry system urgently need to be improved.

Disclosure of Invention

In view of this, the invention provides a file archiving method, device, electronic device, and storage medium that improve the degree of automation and the accuracy of the entry system by combining text feature matching with linear feature matching.

In order to achieve the above object, the embodiment of the present invention provides the following technical solutions:

A first aspect of the invention discloses a file archiving method, comprising:

splitting a file image to obtain a text region block set of the file image and a linear structure set of the file image, where the text region block set of the file image comprises a first-line text region block and a last-line text region block;

matching the text in the text region block set of the file image and the linear structures in the linear structure set of the file image, respectively, in an archive task library to obtain a matching archive task, where the matching archive task is an archive task in the archive task library that matches the text in the text region block set of the file image and/or an archive task that matches a linear structure in the linear structure set of the file image;

and entering the file image into the matching archive task.

Optionally, in the file archiving method, splitting the file image includes:

carrying out binarization processing on the file image to obtain a linear structure set of the file image and a binarized file image;

carrying out an image processing operation on the binarized file image to obtain a processed file image, where the image processing operation comprises at least one of a dilation operation and an erosion operation;

and intercepting the first line of characters of the processed file image as a first line of character area block of the file image, and intercepting the last line of characters of the processed file image as a last line of character area block of the file image.

Optionally, in the above file archiving method, after the image processing operation is performed on the binarized file image to obtain the processed file image, the method further includes:

if no first-line text region block is intercepted from the processed file image, intercepting the text closest to the first-line region of the processed file image as the first-line text region block;

if no last-line text region block is intercepted from the processed file image, intercepting the text closest to the last-line region of the processed file image as the last-line text region block.

Optionally, in the file archiving method, matching the text in the text region block set of the file image in the archive task library includes:

performing first-line matching on the first-line text region block and last-line matching on the last-line text region block of the text region block set of the file image in the archive task library;

where the matching archive task matching the text in the text region block set of the file image means that the matching archive task matches the text in both the first-line text region block and the last-line text region block.

Optionally, in the file archiving method, after the first-line matching and the last-line matching are performed on the first-line text region block and the last-line text region block of the file image in the archive task library, the method further includes:

if a specific archive task is matched in the archive task library, performing last-line matching on the last-line text region block of the file image against the full text of the specific archive task, where the specific archive task is an archive task that matches the first-line text region block of the file image but does not match its last-line text region block;

and if the full text of the specific archive task matches the last-line text region block of the file image, entering the file image into the specific archive task.

Optionally, in the above file archiving method, after the file image is entered into the matching archive task, the method further includes:

if a linear structure in the linear structure set of the file image matches the matching archive task but the text in the text region block set of the file image cannot be matched with the matching archive task, extracting the text region block set of the file image;

recognizing the text in the text region block set to obtain a text recognition result for the text region block set;

correcting the text recognition result using the text of the matching archive task in the archive task library that matches the linear structure of the file image;

and updating the recognized-text library in the archive task library with the corrected text recognition result.

Optionally, in the above file archiving method, matching a linear structure in the linear structure set of the file image in the archive task library includes:

extracting linear features from the linear structure of the file image;

recognizing the linear features of the file image to obtain a network model of the linear features;

matching the network model of the linear features in the archive task library;

where the matching archive task matching a linear structure in the linear structure set of the file image means that the matching archive task matches the network model of the linear features.

Optionally, in the file archiving method, extracting linear features from the linear structure of the file image includes:

extracting linear features from the linear structure of the file image using the Hough transform to obtain Hough-transform linear features of the file image;

selecting feature points from the Hough-transform linear features to obtain a feature point set of the file image, and taking the feature point set of the file image as the linear features extracted from the linear structure of the file image, where the feature point set of the file image comprises a set of midpoints of vertical line segments, a set of midpoints of horizontal line segments, and a set of line segment intersection points.

A second aspect of the present invention discloses a file archiving apparatus, comprising:

a splitting unit, configured to split a file image to obtain a text region block set of the file image and a linear structure set of the file image, where the text region block set of the file image comprises a first-line text region block and a last-line text region block;

a matching unit, configured to match the text in the text region block set of the file image and the linear structures in the linear structure set of the file image, respectively, in an archive task library to obtain a matching archive task, where the matching archive task is an archive task in the archive task library that matches the text in the text region block set of the file image and/or an archive task that matches a linear structure in the linear structure set of the file image;

and a first entry unit, configured to enter the file image into the matching archive task.

Optionally, in the file archiving apparatus, the splitting unit includes:

a binarization processing unit, configured to perform binarization processing on the file image to obtain a linear structure set of the file image and a binarized file image;

an image processing operation unit, configured to perform an image processing operation on the binarized file image to obtain a processed file image, where the image processing operation comprises at least one of a dilation operation and an erosion operation;

and a first intercepting unit, configured to intercept the first line of text of the processed file image as the first-line text region block of the file image and to intercept the last line of text of the processed file image as the last-line text region block of the file image.

Optionally, the file archiving apparatus further includes:

a second intercepting unit, configured to intercept the text closest to the first-line region of the processed file image as the first-line text region block if no first-line text region block is intercepted from the processed file image;

and a third intercepting unit, configured to intercept the text closest to the last-line region of the processed file image as the last-line text region block if no last-line text region block is intercepted from the processed file image.

Optionally, in the above file archiving apparatus, for matching the text in the text region block set of the file image in the archive task library, the matching unit includes:

a first matching subunit, configured to perform first-line matching on the first-line text region block and last-line matching on the last-line text region block of the text region block set of the file image in the archive task library;

where the matching archive task matching the text in the text region block set of the file image means that the matching archive task matches the text in both the first-line text region block and the last-line text region block.

Optionally, the file archiving apparatus further includes:

a specific matching unit, configured to perform last-line matching on the last-line text region block of the text region block set of the file image against the full text of a specific archive task, so as to obtain a specific archive task that matches the last-line text region block of the file image;

and a second entry unit, configured to enter the file image into the specific archive task.

Optionally, the file archiving apparatus further includes:

a first extracting unit, configured to extract the text region block set of the file image if a linear structure in the linear structure set of the file image matches the matching archive task but the text in the text region block set of the file image cannot be matched with the matching archive task;

a first recognition unit, configured to recognize the text in the text region block set to obtain a text recognition result for the text region block set;

a correction unit, configured to correct the text recognition result using the text of the matching archive task in the archive task library that matches the linear structure of the file image;

and an updating unit, configured to update the recognized-text library in the archive task library with the corrected text recognition result.

Optionally, in the above file archiving apparatus, for matching a linear structure in the linear structure set of the file image in the archive task library, the matching unit includes:

a second extracting unit, configured to extract linear features from the linear structure of the file image;

a second recognition unit, configured to recognize the linear features of the file image to obtain a network model of the linear features;

and a second matching subunit, configured to match the network model of the linear features in the archive task library to obtain the matching archive task that matches the network model of the linear features.

Optionally, in the file archiving apparatus, the second extracting unit includes:

a third extracting unit, configured to extract linear features from the linear structure of the file image using the Hough transform to obtain Hough-transform linear features of the file image;

and a selecting unit, configured to select feature points from the Hough-transform linear features to obtain a feature point set of the file image, and to take the feature point set of the file image as the linear features extracted from the linear structure of the file image;

where the feature point set of the file image comprises a set of midpoints of vertical line segments, a set of midpoints of horizontal line segments, and a set of line segment intersection points.

A third aspect of the invention discloses an electronic device comprising a processor and a memory; wherein:

the memory is to store computer instructions;

the processor is configured to execute the computer instructions stored in the memory, and in particular, to execute the file archiving method according to any one of the above items.

A fourth aspect of the present invention discloses a storage medium storing a program for implementing the file archiving method according to any one of the above-described embodiments when the program is executed.

According to the above scheme, a file image to be archived is split to obtain its text region block set and its linear structure set; the text in the text region block set and the linear structures in the linear structure set are each matched in an archive task library to obtain a matching archive task that matches the text region block set and/or the linear structure set of the file image; and the file image is then entered into the matching archive task. Because the matching archive task is obtained by matching both the text and the linear structures of the file image in the archive task library, the accuracy of the entry system is improved, the time cost of manual checking is reduced, and a high degree of automation of the entry system is achieved.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below are merely embodiments of the present invention, and a person skilled in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a file archiving method disclosed in an embodiment of the present application;

Fig. 2 is a flowchart of splitting a file image in the file archiving method disclosed in an embodiment of the present application;

Fig. 3 is a schematic diagram of splitting a file image in the file archiving method disclosed in an embodiment of the present application;

Fig. 4 is a schematic diagram of Hough-transform rotation in the file archiving method disclosed in an embodiment of the present application;

Fig. 5 is a flowchart of file image matching in the file archiving method disclosed in an embodiment of the present application;

Fig. 6 is a flowchart of text correction in the file archiving method disclosed in an embodiment of the present application;

Fig. 7 is a schematic diagram of text correction in the file archiving method disclosed in an embodiment of the present application;

Fig. 8 is a flowchart of linear structure set matching in the file archiving method disclosed in an embodiment of the present application;

Fig. 9 is a schematic diagram of the network model obtained during linear structure set matching in the file archiving method disclosed in an embodiment of the present application;

Fig. 10 is a flowchart of linear feature extraction from a file image in the file archiving method disclosed in an embodiment of the present application;

Fig. 11 is a schematic diagram of feature point selection from the linear features of a file image in the file archiving method disclosed in an embodiment of the present application;

Fig. 12 is a schematic structural diagram of a file archiving apparatus disclosed in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An embodiment of the present application discloses a file archiving method which, referring to Fig. 1, includes:

S101, splitting the file image to obtain a text region block set of the file image and a linear structure set of the file image.

The text region block set of the file image comprises a first-line text region block and a last-line text region block.

Note that the file image in step S101 is a picture-format copy of a paper file that is to be entered and archived. Before step S101, the file to be archived must be collected and its file image acquired by the entry system. Files may be entered in any order and need not be classified at entry time. The image may be acquired by scanning, by photographing, or by any other means that captures the file content.

Once the picture-format file image of the paper file has been obtained, it is split to obtain the text region block set and the linear structure set of the file image. The text region block set consists of all the text regions in the file image and therefore contains all of its text content. It includes a first-line text region block, formed by the text on the first line of the text region block set, and a last-line text region block, formed by the text on the last line. The linear structure set consists of all the linear regions in the file image and contains all of its linear structures.

S102, matching the text in the text region block set of the file image and the linear structures in the linear structure set of the file image, respectively, in an archive task library to obtain a matching archive task.

The matching archive task is an archive task in the archive task library that matches the text in the text region block set of the file image, and/or an archive task that matches a linear structure in the linear structure set of the file image.

Note that the archive task library in step S102 contains multiple archive tasks. An archive task is built before any file image matching takes place: the entry system enters already-archived file images as required and extracts from each one its first-line text region block, its last-line text region block, and other image features such as its linear structure, as the features of the type of file that needs to be entered. Among the extracted features, the first-line and last-line text region blocks belonging to the same type of file image form an archive task, and the linear structures of file images of the same type form an archive task.

Note further that when file images belong to the same type, the archive task formed from the first-line and last-line text region blocks and the archive task formed from the linear structure are one and the same archive task.
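As a rough illustration of how such a library could be organized, the sketch below groups features by file type. The class `ArchiveTask`, the function `build_library`, and the sample dictionary keys are all hypothetical names chosen for this example; the source does not specify a data layout.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveTask:
    """One archive task per file type (hypothetical layout, not from the source)."""
    doc_type: str
    first_lines: set = field(default_factory=set)      # known first-line texts
    last_lines: set = field(default_factory=set)       # known last-line texts
    line_features: list = field(default_factory=list)  # linear-structure features
    images: list = field(default_factory=list)         # file images entered so far


def build_library(samples):
    """Group features extracted from already-archived samples by file type.

    Each sample is assumed to be a dict with the keys 'doc_type',
    'first_line', 'last_line' and 'line_feature'.
    """
    library = {}
    for sample in samples:
        doc_type = sample["doc_type"]
        if doc_type not in library:
            library[doc_type] = ArchiveTask(doc_type)
        task = library[doc_type]
        # Text features and linear features of the same type share one task.
        task.first_lines.add(sample["first_line"])
        task.last_lines.add(sample["last_line"])
        task.line_features.append(sample["line_feature"])
    return library
```

Keeping the text features and the linear features in the same record reflects the statement above that both belong to the same archive task when the file type is the same.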

S103, entering the file image into the matching archive task.

In this embodiment, the text in the text region block set of the file image and the linear structures in the linear structure set of the file image are each matched in the archive task library to obtain a matching archive task that matches the text and/or a matching archive task that matches the linear structures. This improves the accuracy of the entry system, reduces the time cost of manual checking, and achieves a high degree of automation of the entry system.

Optionally, in another embodiment of the present application, an implementation of step S101, as shown in Fig. 2, includes:

S201, performing binarization processing on the file image to obtain a linear structure set of the file image and a binarized file image.

The binarized file image is the image obtained by binarizing the file image. Binarization sets the gray value of each pixel in the file image to either 0 or 255, so that the whole image becomes purely black and white. All text regions in the binarized image are then located and filled with the background color, which yields a complete linear structure feature map of the file image; this map is taken as the linear structure set.
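The two operations described above can be sketched in plain Python on a grayscale image stored as a list of rows. The fixed threshold of 128, the white (255) background, and the function names `binarize` and `erase_regions` are assumptions made for illustration; the source does not say how the threshold is chosen or how text regions are located, so the boxes are simply passed in.

```python
def binarize(gray, threshold=128):
    """Set every pixel to 0 or 255 using a fixed global threshold (assumed)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


def erase_regions(binary, boxes, background=255):
    """Fill each (top, left, bottom, right) box with the background color.

    Bottom and right are exclusive. The input image is left unmodified.
    With all text boxes erased, what remains approximates the linear
    structure feature map.
    """
    out = [row[:] for row in binary]
    for top, left, bottom, right in boxes:
        for y in range(top, bottom):
            for x in range(left, right):
                out[y][x] = background
    return out
```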

S202, performing an image processing operation on the binarized file image to obtain a processed file image, where the image processing operation comprises at least one of a dilation operation and an erosion operation.

The image processing operation performed on the binarized image comprises dilating and eroding the binarized image under a specific threshold. After dilation and erosion under a specific threshold, the linear structures in the file image are eliminated, leaving a binarized image that contains only the text regions. The thresholds for dilation and erosion can be set differently for different file images.
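To make the effect concrete, the following sketch implements dilation and erosion with a square k×k structuring element on a 0/1 foreground mask (1 = ink) and combines them as erosion followed by dilation. That ordering, the square element, and the names `dilate`, `erode`, and `remove_thin_lines` are assumptions for illustration; the source only states that dilation and erosion under a suitable threshold remove the linear structures.

```python
def _morph(mask, k, combine):
    """Apply `combine` (max or min) over a k x k window around each pixel.

    Pixels outside the image are treated as background (0).
    """
    h, w = len(mask), len(mask[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                mask[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0
                for yy in range(y - r, y + r + 1)
                for xx in range(x - r, x + r + 1)
            ]
            out[y][x] = combine(window)
    return out


def dilate(mask, k=3):
    """A pixel becomes foreground if any pixel in its window is foreground."""
    return _morph(mask, k, max)


def erode(mask, k=3):
    """A pixel stays foreground only if its whole window is foreground."""
    return _morph(mask, k, min)


def remove_thin_lines(mask, k=3):
    """Erode then dilate: strokes thinner than k vanish, thicker blobs survive."""
    return dilate(erode(mask, k), k)
```

On a mask holding a one-pixel-wide ruling line and a 3×3 blob, `remove_thin_lines` drops the line and restores the blob to its original size, which is the behavior the paragraph above relies on.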

S203, intercepting the first line of characters of the processed file image as a first line of character area block of the file image, and intercepting the last line of characters of the processed file image as a last line of character area block of the file image.

Using a coordinate-based method, the first-line header region or the first line of text of the file image is intercepted as the first-line text region block, and the last-line header region or the last line of text of the file image is intercepted as the last-line text region block.

With reference to Fig. 3, the interception of the first-line text region block is described below for a specific case:

First, the file image is binarized to obtain a binarized image 301.

The binarized image 301 is dilated to obtain a first dilated binarized image 302; the dilation threshold can be set as required.

Multiple dilation and erosion operations are performed on the first dilated binarized image 302 to obtain a binarized image 303; the number of dilation and erosion operations and the erosion threshold can both be set as required.

The first-line header of the binarized image 303 is intercepted as the first-line text region block 304 of the file image.
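One simple way to realize the interception step is a horizontal projection: consecutive rows that contain ink form a band, and the first and last bands are cropped. This is only an assumed reading of the "coordinate-based method"; the function names `row_bands` and `first_and_last_line` are hypothetical, and the input is a 0/1 mask from which the linear structures have already been removed.

```python
def row_bands(mask):
    """Return (top, bottom) pairs of consecutive rows that contain ink.

    `bottom` is exclusive. The mask is a list of rows of 0/1 values.
    """
    bands = []
    start = None
    for y, row in enumerate(mask):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new band begins
        elif not has_ink and start is not None:
            bands.append((start, y))       # the current band ends
            start = None
    if start is not None:                  # band running to the last row
        bands.append((start, len(mask)))
    return bands


def first_and_last_line(mask):
    """Crop the topmost and bottommost bands as the first/last-line blocks."""
    bands = row_bands(mask)
    if not bands:
        return None, None
    (top0, bottom0), (top1, bottom1) = bands[0], bands[-1]
    return mask[top0:bottom0], mask[top1:bottom1]
```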

Optionally, to improve the accuracy of file image entry, the file image may be preprocessed using the Hough transform before it is split, in order to correct any skew angle introduced during acquisition.

Processing the file image with the Hough transform mainly comprises the following steps:

The Hough transform is used to check whether the file image is horizontal. If it is horizontal, no Hough-transform correction is needed.

If the file image is not horizontal, the image is rotated according to the skew angle of the maximum-confidence interval obtained from the Hough transform, which corrects the skew.

The correction of a specific file image is shown in Fig. 4.

Reference numeral 401 denotes the file image before rotation, 402 denotes the straight lines detected by the Hough transform, and 403 denotes the rotated file image obtained using the Hough transform.
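The skew estimate can be illustrated with a minimal Hough voting scheme over foreground points, using the normal form rho = x·cos(theta) + y·sin(theta). The accumulator resolution (1 degree, 1 pixel), taking the single strongest bin as the "maximum confidence" angle, and the names `dominant_line_angle` and `skew_angle` are assumptions; the rotation itself is omitted, and the sign of the angle depends on the image coordinate convention.

```python
import math
from collections import Counter


def dominant_line_angle(points, angle_step=1.0):
    """Return theta (degrees, in [0, 180)) of the strongest Hough bin.

    Each (x, y) point votes for every (theta, rho) pair it could lie on;
    rho is rounded to the nearest integer.
    """
    votes = Counter()
    steps = int(round(180 / angle_step))
    for i in range(steps):
        theta = math.radians(i * angle_step)
        c, s = math.cos(theta), math.sin(theta)
        for x, y in points:
            votes[(i, round(x * c + y * s))] += 1
    (best_i, _rho), _count = max(votes.items(), key=lambda kv: kv[1])
    return best_i * angle_step


def skew_angle(points):
    """Angle of the dominant line relative to horizontal, in degrees.

    A horizontal line has theta = 90 in normal form, so the skew is
    theta - 90. A result of 0 means no correction is needed.
    """
    return dominant_line_angle(points) - 90.0
```

For example, points sampled from a horizontal ruling line give a skew of 0, so the image would be left as it is, while points along the line y = x give 45.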

Optionally, in another embodiment of the present application, the file archiving method further includes:

If no first-line text region block is intercepted from the processed file image, the text closest to the first-line region is intercepted from the processed file image as the first-line text region block.

The first-line text region block counts as not intercepted when the first line of text intercepted from the file image contains no Chinese characters. In that case, the text closest to the first-line region is intercepted from the processed file image as the first-line text region block.

Note that whether the first line of text of the file image contains Chinese characters may be determined with a character recognition technique or with any other detection technique that has a character recognition function.

If no last-line text region block is intercepted from the processed file image, the text closest to the last-line region is intercepted from the processed file image as the last-line text region block.

The last-line text region block counts as not intercepted when the last line of text intercepted from the file image contains no Chinese characters. In that case, the text closest to the last-line region is intercepted from the processed file image as the last-line text region block.

Note that whether the last line of text of the file image contains Chinese characters may likewise be determined with a character recognition technique or with any other detection technique that has a character recognition function.

Note further that the character recognition mentioned in this embodiment may use OCR or another detection technique with character recognition capability.
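The fallback rule can be sketched on already-recognized text. The example assumes the OCR step returns the lines of the page as strings ordered top to bottom, and it tests for Chinese characters using the CJK Unified Ideographs block (U+4E00 to U+9FFF) only; both are simplifying assumptions, and the function names are hypothetical.

```python
def contains_chinese(text):
    """True if the text has at least one character in U+4E00..U+9FFF."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def pick_first_line(lines):
    """Return the topmost line containing Chinese characters, or None.

    If the very first line qualifies it is returned; otherwise the
    nearest qualifying line below it is used as the fallback.
    """
    for line in lines:
        if contains_chinese(line):
            return line
    return None


def pick_last_line(lines):
    """Return the bottommost line containing Chinese characters, or None."""
    for line in reversed(lines):
        if contains_chinese(line):
            return line
    return None
```

With the lines `["2019-04-16", "开户申请表", "Name: ___", "客户签名", "No. 12345"]`, the first line holds no Chinese characters, so the second line is taken as the first-line text, and the fourth line is taken as the last-line text.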

Optionally, in another embodiment of the present application, in step S102, an implementation manner of matching the text in the text area block set of the document image in the archive task library is as shown in fig. 5, and includes:

s501, performing first line matching on first line character area blocks in the character area block set of the file image in an archive task library.

The first line matching refers to matching of characters in the first line character area blocks in the character area block set of the file image.

S502, tail line matching is carried out on tail line text area blocks in the text area block set of the file image in the archive task library.

The tail line matching refers to matching of characters in tail line character area blocks in a character area block set of the file image.

After first line matching is performed in the archive task library on the characters of the first line character area block in the character area block set of the file image, and last line matching is performed in the archive task library on the characters of the last line character area block, step S503 is executed if the file image satisfies both the first line matching and the last line matching.

S503, obtaining matched file tasks which are matched with the first line character area block and the last line character area block in the file task library.

The matching archive task in step S503 is obtained after steps S501 and S502 have both been executed. It should be noted that the execution sequence of step S501 and step S502 does not affect the implementation of step S503.

After first line matching is performed in the archive task library on the characters in the first line character area block in the character area block set of the file image, and last line matching is performed in the archive task library on the characters in the last line character area block, steps S504 to S506 are executed if the file image satisfies only the first line matching and not the last line matching.

S504, obtaining the file task which is matched with the first line character area block and is not matched with the last line character area block in the file task library.

The step S504 is implemented regardless of the execution sequence of the steps S501 and S502.

An archive task in the archive task library that matches the first line text area block but does not match the last line text area block of the file image is referred to as the specific archive task; in this case the file image is matched with the specific archive task.

S505, carrying out tail line matching on tail line text area blocks in the text area block set of the file image in the text universe of the specific file task.

The specific archive task matches a first line of text region blocks in the set of text region blocks of the document image and does not match a last line of text region blocks in the set of text region blocks of the document image.

S506, if the character universe of the specific archive task is matched with the last line character area block in the character area block set of the file image, the file image is recorded into the specific archive task.

It should be further noted that the present embodiment describes how matching proceeds in the archive task library when the file image belongs to a multi-page document. If, during matching in the archive task library, the first line and last line text area blocks in the text area block set of the file image satisfy only the first line matching and not the last line matching, the file image is regarded as not belonging to a single-page archive task. The specific archive task is then matched for the file image, and the last line text area block in the text area block set of the file image is matched against the text universe of that specific archive task. The aim is to file, in order, the file images that do not belong to a single-page archive task in the archive task library, entering in sequence the file images that belong to the same matching archive task but not to a single-page archive task.

In this embodiment, steps S501 and S502 need not be executed in any particular order and may also be performed simultaneously; whichever of step S501 or step S502 is executed first, the specific implementation of this embodiment is not affected.

It should be further noted that this embodiment describes two cases of first line matching and last line matching on the file image: the case in which the characters in the first line character region block satisfy the first line matching and the characters in the last line character region block satisfy the last line matching, and the case in which the characters in the first line character region block satisfy the first line matching but the characters in the last line character region block do not satisfy the last line matching. In implementing the present invention, however, the purpose of the invention can also be achieved when only one of the two, the characters in the first line character region block or the characters in the last line character region block, is matched to a corresponding matching archive task. When only one matching manner is used, the matching accuracy is lower, and the entry accuracy of the file image is reduced. The specific implementation in which only the first line matching or only the last line matching is satisfied does not differ much from this embodiment, and a detailed description is therefore omitted here.
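The flow of steps S501 to S506 can be sketched as below. The `ArchiveTask` type, the `match_text` function, the use of exact string equality for first line and last line matching, and the substring test against the text universe are all assumptions made for illustration; the embodiment does not prescribe a particular comparison.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ArchiveTask:
    name: str
    first_line: str      # expected first line text of the task
    last_line: str       # expected last line text of the task
    full_text: str = ""  # "text universe": all text recorded for the task


def match_text(first_text: str, last_text: str,
               tasks: List[ArchiveTask]) -> Tuple[Optional[ArchiveTask], str]:
    """Return the matched task and a label saying how it was matched."""
    # S501: first line matching in the archive task library.
    first_hits = [t for t in tasks if t.first_line == first_text]
    # S502/S503: a task that also matches the last line is the matching archive task.
    for task in first_hits:
        if task.last_line == last_text:
            return task, "first+last"
    # S504: the remaining hits match the first line only ("specific" tasks).
    # S505/S506: look for the last line text inside the task's text universe.
    for task in first_hits:
        if last_text and last_text in task.full_text:
            return task, "first+universe"
    return None, "none"
```

Under these assumptions, a page of a multi-page form whose last line is not the final line of the form still lands in the right task through the text universe check.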

Optionally, in another embodiment of the present application, after the step S103 of entering the file image into the matching archive task, as shown in fig. 6, the file archiving method may further include:

S601, if the linear structure in the linear structure set of the file image is matched with the matched archive task and the characters in the character area block set of the file image cannot be matched with the matched archive task, extracting the character area block set of the file image.

S602, identifying characters in the character area block set of the file image to obtain a character identification result of the character area block set.

The characters in the text region block set of the document image can be recognized by adopting an OCR recognition technology to obtain the character recognition result of the text region block set.

S603, correcting the character recognition result by using characters in the matched file task matched with the linear structure of the file image in the file task library.

The characters in the matched file task matched with the linear structure of the document image comprise all characters of the universe of the document image.

S604, updating the recognition character library in the file task library by using the corrected character recognition result.

The identification word stock in the archive task library is used for recording characters of file images to be recorded and archived.

In the embodiment, a character training step is added, and the self-training in the system is completed by character training by adopting the principle of summarizing experience, so that the matching process is effective and reliable in a real application scene, and the vertical expansion of the word stock language in a specific scene is completed.

This embodiment is explained below by way of an embodiment, and the specific process is shown in fig. 7.

The training unit 701 in fig. 7 may be understood to be used for training a recognition corpus. Specifically, the training process includes:

S7011, extracting characters in the character area block set of the file image.

S7012, identifying the characters in the character area block set to obtain a character identification result of the character area block set.

S7013, correcting the character recognition result by using the title characters in the matched file task matched with the linear structure of the file image in the file task library.

S7014, updating the recognition character library in the file task library by using the corrected character recognition result.

The specific manner of S7011 to S7014 executed by the training unit 701 in fig. 7 corresponds to steps S601 to S604 in the foregoing embodiment, which can be referred to above, and is not described herein again.

The training example 702 in fig. 7 illustrates, by way of example, the above proposed training process for a recognition corpus. Examples of this include:

S7021, the result of the successful linear matching in the matching archive task matched with the linear structure in the linear structure set in the file image is obtained.

S7022, the header content of the matching file task of the successful result of linear matching is a financing coupon application form.

S7023, the original recognition result is the result of recognizing the characters in the first row character area block of the document image.

The original result is obviously different from the identification result recorded in the file task library.

S7024, the result of character recognition on the first line character region block of the document image in the original recognition result is: the financing voucher is listed.

Once the difference between the original recognition result and the successful result of linear matching is known, the original recognition result can be corrected according to that difference.

S7025, entering the original recognition result, as corrected according to the successful result of linear matching, into the recognition character library, so as to improve the recognition rate the next time characters in the first line character area are recognized.

The main purpose of adding the character training step is that under the influence of some external factors, the recognition result of the file image is not accurate, so that the characters in the first row character region block of the file image need to be taken out under the condition that the linear structure in the file image is successfully matched, the recognition result is corrected by utilizing the information in the matching file task successfully matched by the linear structure in the file image, and then the recognition character library is updated and optimized according to the corrected result, so that the system can circularly update and optimize the recognition character library in a specific environment, and the recognition rate is pertinently improved in the Chinese character language field in a specific application scene. Besides the characters in the first row character region block of the file image, the characters in the whole file image can be taken out, and compared and corrected with the information in the matched file task successfully matched with the linear structure.
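The correction and library update described in steps S601 to S604 can be sketched as follows. This is a minimal illustration under stated assumptions: `correct_with_title` and `update_library` are hypothetical helpers, the standard-library `difflib.SequenceMatcher` stands in for whatever comparison the real system uses, the similarity threshold `min_ratio` is arbitrary, and the recognition character library is modeled as a counter of observed (wrong, right) substitutions.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List, Tuple


def correct_with_title(ocr_text: str, task_title: str,
                       min_ratio: float = 0.6) -> Tuple[str, List[Tuple[str, str]]]:
    """Correct an OCR result using the title of the line-matched archive task.

    Returns the corrected text and the (wrong, right) fragments that differed.
    If the two strings are too dissimilar, the OCR text is left unchanged.
    """
    matcher = SequenceMatcher(None, ocr_text, task_title)
    if matcher.ratio() < min_ratio:
        return ocr_text, []
    subs = [(ocr_text[i1:i2], task_title[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]
    return task_title, subs


def update_library(library: Counter, subs: List[Tuple[str, str]]) -> Counter:
    """Record each observed substitution so later recognition can be biased by it."""
    library.update(subs)
    return library
```

Counting substitutions rather than storing whole corrected strings is one possible design; it lets frequently confused fragments accumulate weight over repeated entries.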

Optionally, in another embodiment of the present application, in step S102, the linear structures in the linear structure set of the document image are matched in the archive task library to obtain an implementation manner of the matched archive task, as shown in fig. 8, including:

S801, extracting the linear features of the linear structure of the file image to obtain the linear features of the linear structure of the file image.

It should be noted that the linear structure features of the file image may be extracted in a hough transform manner, and then the extracted features are subjected to principal component analysis to obtain the linear features of the linear structure of the file image.

S802, identifying the linear characteristics of the file image to obtain a network model of the linear characteristics.

It should be noted that the linear features of the document image are identified, specifically, a comprehensive identification mode is adopted to perform classification identification on the hierarchical features of the linear structure of the obtained document image one by one, so as to obtain a network model of the linear features.

The network models of the linear characteristics corresponding to the file images belonging to different archive tasks are different. Therefore, the network model with linear characteristics can be used as a matching condition to match the corresponding matched file task.

S803, matching the network model of the linear characteristics in the archive task library.

Wherein the matching archive task matching the linear structure in the linear structure set of the file image includes: the matching archive task matches the network model of the linear features.

In the embodiment, the file image is recorded in the mode of establishing the network model for matching by using the linear features in the linear structure set, so that the accuracy of the recording system is improved, the time cost for manually participating in verification and checking is reduced, and the high automation of the recording system is realized.

The following explains the construction of the network model in step S802 by an embodiment, and the specific process refers to fig. 9.

S901, performing noise reduction processing on linear features in a linear structure set of the file image to obtain the number of straight lines, line segment set features and intersection point set features in the linear structure set of the file image.

It should be noted that the number of straight lines is the number of all straight lines in the linear structure set, the line segment set features are the features of all line segments in the linear structure set, and the intersection set features are the features of all intersections in the linear structure set.

S902, classifying according to quantity values in the quantity of the straight lines of the file image to construct a straight line network model.

Specifically, the linear network model may be constructed according to the magnitude of the number value of the number of the lines, or the interval of the number value of the number of the lines. For example, the number of straight lines is classified into a first class when the number of straight lines is in a range of 10 to 20, and into a second class when the number of straight lines is in other ranges. By analogy, different intervals can be set, and different linear network models can be constructed.

S903, classifying according to the line segment set characteristics in the line segment characteristic set of the file image to construct a line segment set network model.

Specifically, a segment set network model is constructed in a classified manner according to the magnitude of the Euclidean distance between the segment features in the segment feature set.

S904, classifying according to the intersection set characteristics in the intersection set characteristic set of the file image to construct an intersection set network model.

Specifically, the intersection point set network model is constructed in a classified manner according to the magnitude of the numerical value of the Euclidean distance between the intersection point features in the intersection point feature set.

Through the steps S901 to S904, a network model of the linear features of the file image can be constructed, and the linear features of the file image can be matched by using the constructed network model. By adopting the mode to record and file the file image, the accuracy of the recording system is improved, the time cost of manual participation in verification and check is reduced, and the high automation of the recording system is realized.

It should be noted that, in this embodiment, steps S902 to S904 need not be executed in any particular order and may also be performed simultaneously; whichever of step S902, step S903, or step S904 is executed first, the specific implementation of this embodiment is not affected.
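Steps S902 to S904 can be sketched as a small classifier that turns the three kinds of features into a tuple of class labels. Only the 10 to 20 interval comes from the text above; the function names, the distance thresholds, and the use of the mean pairwise Euclidean distance as the quantity being classified are assumptions made for the example.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, float]


def line_count_class(n: int, intervals: Sequence[Tuple[int, int]] = ((10, 20),)) -> int:
    """S902: class 0 for the first interval, 1 for the next, and so on.

    Counts outside every interval fall into the last class ("other ranges").
    """
    for idx, (lo, hi) in enumerate(intervals):
        if lo <= n <= hi:
            return idx
    return len(intervals)


def mean_pairwise_distance(points: Sequence[Point]) -> float:
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return sum(dists) / len(dists) if dists else 0.0


def distance_class(points: Sequence[Point],
                   thresholds: Sequence[float] = (100.0, 300.0)) -> int:
    """S903/S904: class by how many (assumed) thresholds the mean distance exceeds."""
    d = mean_pairwise_distance(points)
    return sum(d > t for t in thresholds)


def network_model(n_lines: int, segment_points: Sequence[Point],
                  intersections: Sequence[Point]) -> Tuple[int, int, int]:
    """Combine the three class labels into one key for lookup in the task library."""
    return (line_count_class(n_lines),
            distance_class(segment_points),
            distance_class(intersections))
```

Because the three labels are computed independently, they can be evaluated in any order or in parallel, consistent with the note above on steps S902 to S904.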

Optionally, in another embodiment of the present application, in step S801, the linear feature of the linear structure of the document image is extracted, and a specific implementation manner of obtaining the linear feature of the linear structure of the document image is as follows:

S1001, extracting linear features of the linear structure of the file image by using Hough transform to obtain Hough transform linear features of the file image.

Specifically, features are extracted from the linear structure feature map of the file image by Hough transform, and the linear structure feature map is converted into vector data information to obtain the Hough transform linear features of the file image.

It should be noted that the linear structure feature map of the document image is obtained by finding all the text regions in the text image and filling the text regions with background color by using a binarization processing method. The Hough transform linear features cover all linear features in the file image obtained by means of Hough transform.
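For readers unfamiliar with the Hough transform, the following is a minimal pure-Python sketch of the voting step for straight lines in the (rho, theta) parameterization. It is illustrative only: the function name `hough_lines`, the angular resolution `n_theta`, and the vote threshold `min_votes` are assumptions, and a production system would more likely use an optimized image-processing library.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def hough_lines(points: Iterable[Tuple[int, int]], n_theta: int = 180,
                min_votes: int = 5) -> List[Tuple[int, float, int]]:
    """Vote in (rho, theta) space for each foreground pixel.

    Returns (rho, theta in degrees, votes) for every accumulator cell with at
    least min_votes votes, strongest first. A line satisfies
    rho = x*cos(theta) + y*sin(theta).
    """
    acc = Counter()
    for x, y in points:
        for k in range(n_theta):
            theta = math.pi * k / n_theta
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            acc[(rho, k)] += 1
    lines = [(rho, k * 180 / n_theta, votes)
             for (rho, k), votes in acc.items() if votes >= min_votes]
    return sorted(lines, key=lambda item: -item[2])
```

Each returned (rho, theta) pair is already a vector description of a line, which is the kind of data the later feature point selection can work on.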

S1002, selecting characteristic points for the Hough transform linear characteristics to obtain a characteristic point set of the file image, and taking the file characteristic point set as the linear characteristics of the linear structure of the extracted file image.

It should be noted that, a principal component analysis method may be adopted, and a feature point which can best reflect the linear feature of the file image in the linear features after hough transform is selected as the feature point set of the file image.

It should be further noted that the number of feature points selected to best reflect the linear features of the file image is not limited and can be set according to the user's requirements. The more feature points are selected for the file image, the higher the matching degree of the file image and, naturally, the higher the accuracy of entry and archiving, but the more complicated the analysis and calculation also become.

The linear characteristics of the linear structure of the file image obtained by the Hough transform mode and the principal component analysis method are subjected to parameter calculation, and the calculation result directly reflects the accuracy of matching the linear structure in the linear structure set of the file image in the archive task library to obtain the matched archive task.

Firstly, obtaining linear features of a linear structure of a file image by adopting a Hough transform mode, then selecting a feature point set which can best reflect the linear features of the file image from the linear features of the file image as linear features of the extracted file image by adopting a principal component analysis method, converting the extracted linear features into vector data information, and using the converted vector data information as an important matching reference coefficient.

From the mathematical relationship between the linear characteristics obtained by the above method, the following formula can be summarized:

S = {α·n + β·a + γ·b + δ·c}

n in the formula represents the total number of effective straight lines obtained by detection, a refers to a longitudinal line segment midpoint characteristic point set, b refers to a transverse line segment midpoint characteristic point set, and c refers to a line segment intersection point characteristic point set.

The longitudinal line segment midpoint feature point set a is selected by principal component analysis according to the Euclidean distance relationship of the midpoints of all longitudinal line segments in the linear features of the linear structure of the file image.

The transverse line segment midpoint feature point set b is selected by principal component analysis according to the Euclidean distance relationship of the midpoints of all transverse line segments in the linear features of the linear structure of the file image.

The line segment intersection feature point set c is selected by principal component analysis according to the Euclidean distance relationship of the line segment intersection points in the linear features of the linear structure of the file image. The line segment intersection points are the feature parameters that best reflect the text format of the file image, with the number of intersection points and their coordinate relationship being the important parameters.

α, β, γ, δ in the formula represent weight coefficients. Specifically, α represents the weight coefficient of the total number of valid straight lines, β represents the weight coefficient of the longitudinal line segment midpoint feature point set a, γ represents the weight coefficient of the transverse line segment midpoint feature point set b, and δ represents the weight coefficient of the line segment intersection feature point set c.

It should be further noted that, in the above calculation process of the linear feature, in addition to the total number n of valid straight lines, other feature parameters are represented by line segment features, where the line segment features of the representative points are analyzed by using euclidean distance.
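As a worked illustration of the formula S = {α·n + β·a + γ·b + δ·c}, the sketch below reduces each feature point set to a single number before applying the weights. The text does not specify how a set enters the sum, so the reduction used here (mean pairwise Euclidean distance), the function names, and the default weights are all assumptions.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, float]


def set_feature(points: Sequence[Point]) -> float:
    """Assumed scalar summary of a feature point set: mean pairwise Euclidean distance."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return sum(dists) / len(dists) if dists else 0.0


def linear_score(n: int, a: Sequence[Point], b: Sequence[Point], c: Sequence[Point],
                 alpha: float = 1.0, beta: float = 1.0,
                 gamma: float = 1.0, delta: float = 1.0) -> float:
    """Weighted combination of line count n and the point sets a, b, c."""
    return (alpha * n
            + beta * set_feature(a)
            + gamma * set_feature(b)
            + delta * set_feature(c))
```

With n = 4, one pair of points per set at distances 10, 10 and 5, and weights (1, 0.5, 0.5, 2), the score is 4 + 5 + 5 + 10 = 24.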

For the line segment feature point sets represented by a, b, and c in the above formula S = {α·n + β·a + γ·b + δ·c}, a matrix representation may be adopted, where the matrix representation is as follows:

In this embodiment, a method for obtaining the linear features of a file image is described in detail. Computing Euclidean distances only on the feature points selected from the linear features of the file image reduces the number of calculations, while the manner of selecting the feature points still ensures the accuracy of linear matching of the file image, so that unnecessary computational cost is avoided. This embodiment is mainly intended for the case of a large number of file image samples: the feature points are selected according to the number of file image samples, and matching on the selected feature points improves the entry efficiency of the system.

The specific implementation of the embodiment is described below by way of a specific implementation:

Feature points are selected from the linear features of the linear structure of the file image obtained in step S1001. The selection may use a "Hui" character (回) feature selection method: the line segment intersection points at the upper left, lower left, upper right and lower right of the outermost side of the linear structure of the file image are selected as feature point set I, and the corresponding four points of the second outermost side are then selected as feature point set II. The result of selecting the feature points is shown in fig. 11.

Referring to fig. 11, when the outermost points P1(X1, Y1), P2(X2, Y2), P3(X3, Y3), and P4(X4, Y4) and the second outermost points Q1(X1, Y1), Q2(X2, Y2), Q3(X3, Y3), and Q4(X4, Y4) together constitute the set of feature points, the Euclidean distances of the 8 point coordinates are calculated and then matched against the archive task library.
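The selection of the eight feature points and the distance calculation can be sketched as follows. The sketch assumes an axis-aligned form whose intersection points lie on a grid, with image coordinates in which y grows downward; `hui_feature_points` and `distance_signature` are hypothetical names, and representing the result as a sorted list of pairwise distances is only one possible way to compare against the archive task library.

```python
import math
from itertools import combinations
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def hui_feature_points(intersections: Iterable[Point]) -> Tuple[List[Point], List[Point]]:
    """Return (set I, set II): corner intersections of the outermost and second outermost rings.

    Corners are ordered upper left, lower left, upper right, lower right.
    """
    pts = set(intersections)
    xs = sorted({x for x, _ in pts})
    ys = sorted({y for _, y in pts})
    if len(xs) < 4 or len(ys) < 4:
        raise ValueError("need at least 4 distinct x and y coordinates")

    def ring(i: int) -> List[Point]:
        corners = [(xs[i], ys[i]), (xs[i], ys[-1 - i]),
                   (xs[-1 - i], ys[i]), (xs[-1 - i], ys[-1 - i])]
        # Keep only corners that are real intersections in the input.
        return [p for p in corners if p in pts]

    return ring(0), ring(1)


def distance_signature(points: List[Point]) -> List[float]:
    """Sorted pairwise Euclidean distances; for 8 points this has 28 entries."""
    return sorted(round(math.dist(p, q), 3) for p, q in combinations(points, 2))
```

Two forms with the same outer frame and inner frame geometry produce the same signature under this sketch, regardless of where the form sits on the page.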

If the number of file image samples is huge, for example when thousands or tens of thousands of file form styles need to be entered, the number of feature point sets can be increased. When the number is large, the PCA algorithm, namely principal component analysis, can be used to calculate feature vectors on the two-dimensional plane, so as to improve the entry accuracy of the file image, reduce the time cost of manual participation in verification and checking, and achieve a high degree of automation of the entry system.
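Principal component analysis on a two-dimensional point set has a closed form, sketched below without external libraries. The function name `pca_2d` and its return shape are assumptions for illustration; how the resulting axes are used to select or compare feature points is left open by the text.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def pca_2d(points: Sequence[Point]) -> Tuple[Point, List[Tuple[float, Point]]]:
    """Return the mean and [(eigenvalue, unit eigenvector)] sorted by descending variance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Eigenvalues from the trace and determinant.
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc
    # Orientation of the principal axis of a symmetric 2x2 matrix.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    v1 = (math.cos(angle), math.sin(angle))
    v2 = (-math.sin(angle), math.cos(angle))
    return (mx, my), [(l1, v1), (l2, v2)]
```

For points lying on the line y = 2x, the first eigenvector points along that line and the second eigenvalue is zero, which shows how the first component captures the dominant layout direction.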

In summary, the document image filing method provided by the present invention matches the document image by using the way of matching the characters in the character region block set and the way of matching the linear features, and then files and inputs the matched archive task, and the matching accuracy formula obtained by using the above two ways is as follows:

wherein S^T represents the feature value of the header in OCR character recognition, and ω represents the respective weight coefficient. In this way, the accuracy of matching classification can be greatly improved, time cost is saved, and a high degree of automation of the entry system is fundamentally achieved.

By the method, the characters in the character area block set of the file image and the linear structures in the linear structure set of the file image are respectively matched in the archive task library to obtain the matched archive tasks matched with the characters in the character area block set of the file image and/or the matched archive tasks matched with the linear structures in the linear structure set of the file image, so that the accuracy of the entry system is improved, the time cost of manual participation in checking is reduced, and the high automation of the entry system is realized.

Another embodiment of the present invention further discloses a file filing apparatus, as shown in fig. 12, including:

The splitting unit 1201 is configured to split the file image to obtain a text region block set of the file image and a linear structure set of the file image.

The matching unit 1202 is configured to match the text in the text area block set of the file image with the linear structure in the linear structure set of the file image in the archive task library, respectively, to obtain a matched archive task.

A first entering unit 1203 is used for entering the file image into the matching archive task.

In the file filing apparatus disclosed in this embodiment, the matching unit 1202 matches the characters in the character area block set of the file image and the linear structures in the linear structure set of the file image, both obtained by the splitting unit 1201, in the archive task library, to obtain a matching archive task matching the characters in the character area block set of the file image and/or a matching archive task matching the linear structures in the linear structure set of the file image, and the first entering unit 1203 enters the file image into the matching archive task. The accuracy of the recording system is improved, the time cost of manual participation in verification and check is reduced, and the high automation of the recording system is realized.

For the specific working process of each unit disclosed in this embodiment, reference may be made to the content of the method embodiment corresponding to fig. 1, which is not described herein again.

Optionally, in another embodiment of the present application, the splitting unit includes:

The binarization processing unit is used for carrying out binarization processing on the file image to obtain a linear structure set of the file image and a binarized file image.

And the image processing operation unit is used for carrying out image processing operation on the binary file image to obtain a processed file image.

Wherein the image processing operation comprises: at least one of a dilation (expansion) operation and an erosion (corrosion) operation.

For the specific working process of each unit disclosed in this embodiment, reference may be made to the content of the corresponding method embodiment, and details are not described here.

Optionally, in another embodiment of the present application, the file archiving apparatus further includes:

The second intercepting unit is used for intercepting the characters closest to the first line area from the processed file image as the first line character area block if the first line character area block is not intercepted from the processed file image.

And the third intercepting unit is used for intercepting the characters closest to the tail line area from the processed file image as the tail line character area block if the tail line character area block is not intercepted from the processed file image.

Optionally, in another embodiment of the present application, the matching unit, when matching the text in the text region block set of the document image in the archive task library, includes:

The first matching subunit is used for respectively performing first line matching and last line matching on the first line character area blocks and the last line character area blocks in the character area block set of the file image in the file task library.

Wherein the matching archive task matching the characters in the text region block set of the file image includes: the matching archive task matches the characters in both the first line character area block and the last line character area block.

The specific matching unit is used for performing tail line matching on the tail line text region block in the text region block set of the file image in the text universe of the specific archive task, to obtain a specific matching archive task matched with the tail line text region block in the text region block set of the file image.

Wherein the specific archive task matches with a first line of text region blocks in the set of text region blocks of the document image and does not match with a last line of text region blocks in the set of text region blocks of the document image.

The first extraction unit is used for extracting the text area block set of the file image if the linear structure in the linear structure set of the file image is matched with the matched archive task and the text in the text area block set of the file image cannot be matched with the matched archive task.

And the first identification unit is used for identifying the characters in the character area block set to obtain a character identification result of the character area block set.

And the correction unit is used for correcting the character recognition result by utilizing the characters in the matched file task matched with the linear structure of the file image in the file task library.

Optionally, in another embodiment of the present application, the extracting, by the matching unit, the linear feature of the linear structure of the document image to obtain the linear feature of the linear structure of the document image includes:

The second extraction unit is used for extracting the linear features of the linear structure of the file image to obtain the linear features of the linear structure of the file image.

And the second identification unit is used for identifying the linear characteristics of the file image to obtain a network model of the linear characteristics.

And the second matching subunit is used for matching the network model of the linear characteristic in the file task library to obtain a matched file task matched with the network model of the linear characteristic.

Optionally, in another embodiment of the present application, the second extracting unit includes:

The third extraction unit is used for extracting the linear features of the linear structure of the file image by using Hough transform to obtain the Hough transform linear features of the file image.

And the selecting unit is used for selecting the characteristic points of the Hough transform linear characteristics to obtain a characteristic point set of the file image, and taking the file characteristic point set as the linear characteristics of the linear structure of the extracted file image.

The embodiment of the application discloses a storage medium, which comprises stored instructions, wherein when the instructions are executed, a device where the storage medium is located is controlled to execute the file archiving method shown in the embodiment. The specific implementation process of the file archiving method is consistent with the implementation principle and the file archiving method shown in the foregoing embodiment, and reference may be made to the content of the corresponding method embodiment, which is not described herein again.

An embodiment of the application discloses an electronic device comprising a memory and one or more instructions, wherein the one or more instructions are stored in the memory and are configured to be executed by one or more processors to perform the file archiving method shown in the foregoing embodiments.

In particular implementations, the electronic device may include, but is not limited to, a cell phone, a tablet computer, other Universal Serial Bus (USB) interface devices, and the like.

The embodiments in the present specification are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, reference may be made to the corresponding descriptions of the method embodiments. The system embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Those of skill in the art would further appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of archiving files, comprising:

entering the file image into the matching archive task;

wherein matching the linear structures in the linear structure set of the file image in the archive task library comprises:

selecting feature points from the Hough transform linear features to obtain a feature point set of the file image, and taking the feature point set of the file image as the extracted linear features of the linear structure of the file image, wherein the feature point set of the file image comprises: a set of vertical line segment midpoint feature points, a set of horizontal line segment midpoint feature points, and a set of line segment intersection feature points;

matching the network model of the linear features in the archive task library, wherein the matching archive task matching the linear structures in the linear structure set of the file image comprises: the matching archive task matching the network model of the linear features;

wherein, the construction process of the network model of the linear characteristics comprises the following steps:

performing noise reduction on the linear features of the linear structures in the linear structure set of the file image to obtain the number of straight lines, the line segment set features, and the intersection point set features of the linear structure set of the file image;

constructing a straight-line network model according to the number of straight lines of the file image, or according to the interval in which the number of straight lines falls;

classifying and constructing a line segment set network model according to the Euclidean distances between the line segment features in the line segment feature set of the file image;

and classifying and constructing an intersection point set network model according to the Euclidean distances between the intersection point features in the intersection point feature set of the file image.

2. The file archiving method according to claim 1, wherein said splitting the file image comprises:

3. The file archiving method according to claim 2, wherein said performing an image processing operation on said binarized file image to obtain a processed file image further comprises:

4. The file archiving method according to claim 1, wherein said matching the text in the text region block set of the file image in the archive task library comprises:

5. The file archiving method according to claim 4, wherein said performing first-line matching and last-line matching respectively on the first-line text region block and the last-line text region block in the text region block set of the file image further comprises:

6. The file archiving method according to claim 1, wherein said entering the file image into the matching archive task further comprises:

7. A file archiving device, comprising:

a first entry unit, configured to enter the file image into the matching archive task;

wherein, the matching unit specifically comprises:

wherein the matching archive task matching the linear structures in the linear structure set of the file image comprises: the matching archive task matching the network model of the linear features;

the second extraction unit includes:

a selecting unit, configured to select feature points from the Hough transform linear features to obtain a feature point set of the file image, and to take the feature point set of the file image as the extracted linear features of the linear structure of the file image;

wherein the feature point set of the file image comprises: a set of vertical line segment midpoint feature points, a set of horizontal line segment midpoint feature points, and a set of line segment intersection feature points;

wherein the construction process of the network model of the linear features obtained by the second identification unit comprises:

8. The file archiving device according to claim 7, wherein the splitting unit comprises:

9. The file archiving device according to claim 8, further comprising:

10. The file archiving device according to claim 7, wherein the matching unit, when matching the text in the text region block set of the file image in the archive task library, comprises:

11. The file archiving device according to claim 10, further comprising:

a specific matching unit, configured to match the last-line text region block in the text region block set of the file image against the entire text region of the specific archive task, to obtain a specific matching archive task that matches the last-line text region block in the text region block set of the file image;

12. The file archiving device according to claim 7, further comprising:

13. An electronic device comprising a processor and a memory; wherein:

the memory is configured to store computer instructions;

the processor is configured to execute the computer instructions stored in the memory, and in particular, to perform the file archiving method according to any one of claims 1 to 6.

14. A storage medium storing a program which, when executed, implements the file archiving method according to any one of claims 1 to 6.