CN114529932A - Credit investigation report identification method


Info

Publication number: CN114529932A
Authority: CN (China)
Prior art keywords: text, line, matching, data, extraction
Legal status: Pending (an assumption, not a legal conclusion; no legal analysis has been performed)
Application number: CN202210145731.XA
Other languages: Chinese (zh)
Inventors: 何倩倩, 饶顶锋, 陶坚坚, 刘伟
Current assignee: Beijing Yitu Zhixun Technology Co ltd
Original assignee: Beijing Yitu Zhixun Technology Co ltd
Application filed by Beijing Yitu Zhixun Technology Co ltd
Priority to CN202210145731.XA

Landscapes

  • Character Input (AREA)

Abstract

The invention relates to a credit investigation report identification method comprising the steps of: acquiring image data of a credit investigation report to be identified; preprocessing the image data; performing full-text recognition on the preprocessed image data to obtain text line content and frame line information; analyzing the layout structure of the image data and judging whether it contains single or double pages; matching the text line content against template information by information type; extracting the content of the matched text lines according to the information type; checking and summarizing the extraction results; and outputting the checked and summarized results to an xml file in a specific format. The method efficiently extracts the relevant information from the image, performs structured extraction and field verification on the matched field types, classifies and integrates all information using a group structure, and finally restores the content layout of the whole credit investigation report. It offers high identification accuracy, complete identification results, support for recognition in a variety of scenes, and strong adaptability.

Description

Credit investigation report identification method
Technical Field
The invention belongs to the technical field of data processing, and relates to a credit investigation report identification method.
Background
A credit investigation report is a record of personal credit information issued by the credit investigation center of the People's Bank of China. It is divided into personal credit reports and enterprise credit reports and is used to query the social credit of an individual or an enterprise.
A personal credit report contains basic personal information, credit transaction information and other information. In practice, the credit reports of individuals who make frequent use of consumer credit are long and complicated, and manual data entry consumes a great deal of manpower and time. An automated approach would greatly improve operational efficiency.
The output of plain OCR recognition is a sequence of lines, which is not intuitive and makes it difficult to analyze a person's credit status. Traditional OCR cannot reliably support recognition outside a single fixed scene, whereas deep-learning-based OCR has clear advantages here. In principle, a deep learning method can take as input any recognition object that can be converted into an image, including PDFs, scanned documents, etc. By applying preprocessing such as image direction judgment, tilt correction, and watermark detection and filtering to the information obtained from text line detection and recognition, characters in complex scenes can be recognized accurately. Subsequent processing requires table analysis, but tables differ in format and internal structure, and the row/column dividing lines that are the key cue for this analysis are often partially or completely absent. Because of these difficulties, very few vendors on the market offer credit investigation report identification; existing products support only complete scanned PDF files, handle only a single scene, process reports containing little credit transaction information, and achieve a low recognition rate on full reports.
Disclosure of Invention
Aiming at the problems in the prior art, the invention discloses a credit investigation report identification method. By matching text line content against template information, the method efficiently extracts text information from an image, performs structured extraction and field verification on the matched text field types, and classifies and integrates all information using a group structure, finally restoring the content layout of the whole credit investigation report. It offers high identification accuracy, complete identification results, support for recognition in multiple scenes, and strong adaptability, solving the problems that prior identification methods place high demands on image quality and cannot identify accurately under complex conditions.
To solve this technical problem, the invention provides a credit investigation report identification method comprising the following steps:
S1, acquiring image data of the credit investigation report to be recognized: the credit investigation report image data may be in jpg, bmp, png, pdf or tiff file format; multi-page pdf files are first split into single pages, and each split single page is converted into image data;
S2, preprocessing the image data to obtain preprocessed image data;
S3, performing full-text recognition on the preprocessed image data to obtain text line content and frame line information;
S4, analyzing the layout structure of the image data and judging whether it contains single or double pages;
S5, performing information type matching between the text line content and the template information: according to the single/double page judgment result of S4 and the identified frame line information, the text line content is segmented, clustered and classified, and the sorted text line content is matched against the template information; the template information includes the text of each keyword, the position of the keyword, whether the keyword has a multi-line attribute, whether the result value has a multi-line attribute, the type value of the keyword, whether the keyword is a main column, and the type value of the line in which the keyword is located;
S6, extracting the content of the matched text lines according to the information type, where the information types comprise group name extraction, general line extraction, general table extraction, repayment record extraction, subgroup name extraction and single line extraction;
S7, checking and summarizing the extraction results: data verification is performed on the extraction results according to the matched data type, where the data types comprise amount, date, proportion and repayment record, and the extraction results are filtered with regular expressions chosen according to the matched data type; the checked extraction results are integrated according to the group structure, and information in the current image data that does not form a group structure is retained;
S8, outputting the checked and summarized extraction results to an xml file in a specific format.
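The regular-expression filtering described in step S7 can be sketched as follows. The patent does not disclose its actual patterns, so the patterns, the type names and the `check_field` helper below are illustrative assumptions only:

```python
import re

# Hypothetical validation patterns for the data types named in step S7
# (amount, date, proportion, repayment record); the real system's
# expressions are not published.
PATTERNS = {
    "amount": re.compile(r"^-?\d{1,3}(,\d{3})*(\.\d{1,2})?$"),  # e.g. 12,345.67
    "date": re.compile(r"^\d{4}[.\-/]\d{1,2}[.\-/]\d{1,2}$"),   # e.g. 2022.02.17
    "proportion": re.compile(r"^\d{1,3}(\.\d+)?%$"),            # e.g. 35.5%
    # Single repayment-status symbol per cell (assumed symbol set).
    "repayment": re.compile(r"^[N*#CGDZ1-7/]$"),
}

def check_field(value: str, dtype: str) -> bool:
    """Return True when an extracted value matches its expected data type."""
    pattern = PATTERNS.get(dtype)
    return bool(pattern and pattern.fullmatch(value.strip()))
```

Values that fail their type check would be discarded or flagged before the results are integrated into the group structure.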
Further, the preprocessing process in S2 includes:
S21, image direction judgment: text lines of the text content on the image are detected with a deep learning model, and the current image direction is judged with OCR (optical character recognition);
S22, image tilt correction: text lines of the text content on the image are detected with a deep learning model, the tilt angle of the current image is calculated with OCR recognition, and the image is rotated by the corresponding angle for correction;
S23, image watermark detection and filtering: watermark position detection and watermark removal are performed on the image with a deep learning model:
S231, image preprocessing, namely normalizing the image while keeping its aspect ratio unchanged;
S232, down-sampling the preprocessed image with a biomedical image segmentation technique combined with a convolutional network model, i.e. obtaining feature maps and feature values at different scales through convolution and pooling, then up-sampling and deconvolving; the up-sampling part up-samples the feature values back to match the ground truth to complete pixel-level classification, finally producing an image of the same size as the preprocessed one, and the classification result is used to complete segmentation of the image regions containing watermark information;
S233, the whole watermark in the image is fully captured by adjusting the size of the receptive field, and the watermark is obtained with a mean-square-error regression loss function: the loss from the input watermarked image to the output watermark image is minimized;
S234, training samples are generated from the direction, size and angle of the watermark for training, and the trained deep learning model is then used to remove the watermark from the image.
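The tilt estimation in step S22 can be illustrated with a minimal sketch: take the skew angle as the median angle of the detected text-line baselines. The input format and the `estimate_skew_deg` helper are hypothetical; the patent only states that the angle is computed from deep-learning line detection and OCR:

```python
import math
from statistics import median

def estimate_skew_deg(line_boxes):
    """Estimate page skew (degrees) from detected text-line baselines.

    line_boxes: [(x1, y1, x2, y2), ...] giving the two baseline endpoints
    of each detected text line (an assumed output format for the detector).
    """
    angles = [
        math.degrees(math.atan2(y2 - y1, x2 - x1))
        for x1, y1, x2, y2 in line_boxes
        if x2 != x1
    ]
    # The median is robust to a few badly detected lines.
    return median(angles) if angles else 0.0
```

The image would then be rotated by the negative of this angle to complete the correction.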
Further, the specific method for analyzing the layout structure of the image data in S4 is as follows:
s41, identifying continuous pages in the image data: judging whether continuous pages exist or not by reading the pages obtained by splitting in the image data and judging an end mark in the image data;
s42, judging whether the page of the single-page image data is a single page or multiple pages, wherein the judging method aiming at the single-page image data comprises a deep learning classification method and/or a template matching method;
the deep learning classification method comprises the following specific steps:
preprocessing image data, and normalizing the image data on the basis of ensuring that the length-width ratio of the image is not changed;
obtaining feature maps of different sizes with a visual geometry group (VGG) network, and constructing a plurality of default boxes of different sizes from each point in the feature maps;
combining the default boxes generated by the different feature maps, continuously matching them against the ground truth with non-maximum suppression, and filtering out overlapping or incorrect default boxes;
generating a training sample through the size, the position and the angle of a bounding box in the image data for training;
judging the page structure by detecting the size and the position of a bounding box in the image data: if the sizes of the bounding boxes in the detected image data are similar and distributed left and right, judging that the image data is double pages, otherwise, judging that the image data is single page.
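The final judgment rule of step S42 — "similar sizes, distributed left and right means double page" — can be sketched as follows. The box format, the midline split and the 20% size tolerance are assumptions for illustration; the patent gives no concrete thresholds:

```python
def classify_page(boxes, page_width):
    """Classify a page as 'double' when detected content bounding boxes of
    similar size sit side by side in the left and right halves; otherwise
    'single'. boxes: [(x, y, w, h), ...]."""
    mid = page_width / 2
    left = [b for b in boxes if b[0] + b[2] <= mid]   # fully in left half
    right = [b for b in boxes if b[0] >= mid]          # fully in right half
    if left and right:
        mean_area = lambda bs: sum(w * h for _, _, w, h in bs) / len(bs)
        a_l, a_r = mean_area(left), mean_area(right)
        # "similar size": areas within 20% of each other (assumed tolerance)
        if abs(a_l - a_r) <= 0.2 * max(a_l, a_r):
            return "double"
    return "single"
```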
The template matching method comprises the following specific steps:
judging whether text blocks belong to the same line by calculating their overlap ratio in the vertical direction, aggregating the text blocks into lines, and correcting the line aggregation results with the detected frame line information;
traversing the whole template content with the content of the aggregated text lines, and judging whether matching succeeds by calculating a specific threshold; this threshold depends on the number of text lines, and in general matching is considered successful when more than half of the text lines match;
and judging whether the image is a single page or a double page according to the distribution condition of the successfully matched text lines in the template.
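The "more than half" matching rule used throughout the method can be expressed as a short sketch. The substring-containment test below is a simplification (the real system also uses keyword positions), and `match_template` is a hypothetical name:

```python
def match_template(text_lines, template_keywords):
    """Template matching per the patent's rule of thumb: succeed when more
    than half of the template keywords are found among the text lines.
    Containment is a simplified stand-in for the full position-aware match.
    """
    hits = sum(
        any(kw in line for line in text_lines) for kw in template_keywords
    )
    return hits > len(template_keywords) / 2
```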
Further, segmenting the text line content in S5 includes left-right segmentation and/or up-down segmentation;
for left-right segmentation of the text line content, the image data must first be judged to be a double page, and the corresponding distribution is derived from the obtained text content: the denser the characters, the larger the calculated distribution value, and the sparser the characters, the smaller it is; the position of the central blank area is found from the distribution values of the text content, and the text content is divided into a left part and a right part at that blank area;
for up-down segmentation of the text line content, the text blocks are first aggregated into lines, and then whether the aggregated text lines need to be split up and down is judged from the identified frame line information and from whether the keywords or result values in the template information have multi-line attributes.
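The left-right segmentation step above amounts to finding the emptiest vertical strip near the page center. A minimal sketch using a character x-coordinate histogram follows; the bin count, the "middle third" search window and the function name are assumptions:

```python
def split_column_x(char_xs, page_width, bins=64):
    """Locate the central blank gutter of a double page from character
    x-coordinates: dense text gives high histogram values, the gutter a
    near-zero valley. Returns the x position of the split."""
    hist = [0] * bins
    for x in char_xs:
        hist[min(int(x * bins / page_width), bins - 1)] += 1
    # Search only the middle third of the page for the emptiest bin.
    lo, hi = bins // 3, 2 * bins // 3
    valley = min(range(lo, hi), key=lambda i: hist[i])
    return (valley + 0.5) * page_width / bins
```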
Further, the specific method in S6 for extracting text content whose matching type is group name extraction is as follows: the sorted text lines are matched against the template information by calculating a specific threshold, which depends on the number of text lines; in general, matching is considered successful when more than half of the text lines match. If matching succeeds, a new group is generated; if an end word of the credit investigation report is matched, the end flag identified in step S41 is set to true and the extraction result is output in the specific format.
Further, the specific method in S6 for extracting text content whose matching type is general line extraction is as follows: the sorted text lines are matched against the template information by calculating a specific threshold, which depends on the number of text lines; matching succeeds when more than half of the text lines match, and if matching succeeds, a new line of data is generated;
s621, reading the sorted text lines sequentially, adding marks to the tail text line and the continuous page text line of each page, and converting the position of the continuous page text line, namely calculating the distance between the horizontal position and the vertical position of the continuous page text line and the last tail text line, and adding the distance to the position of the continuous page text line to ensure that all the text lines are read from top to bottom, the content cannot be lost or repeated, and the subsequent position calculation is facilitated;
s622, the size and the position of each cell in the credit investigation report are fixed, after the template matching is successful, in order to calculate the left and right boundaries of the field region needing to be extracted, the left and right boundary distance is obtained by multiplying the number of the left and right cells configured in the template information by the height of the single character matched to the text line, namely the distance is subtracted from the left position of the current line region to obtain a left boundary, and the distance is added from the right position of the current text line region to obtain a right boundary; obtaining the upper and lower boundaries of the extracted field area by continuously judging whether the next text line is a result value, namely if the next text line is the result value, extracting the upper boundary of the field area as the lower boundary position of the current line, if not, continuously judging whether the next text line is the result value until the result value is found, then judging whether to continuously extract the subsequent text line by utilizing whether the result value set in the template information is multi-line attribute or not, and extracting the lower boundary of the field area as the upper boundary position of the newly matched text line until the template matching is successful again;
s623, extracting texts by using the current text line region position obtained through calculation, and if frame line information is detected in the current region, extracting texts by combining the frame line position to finally obtain a result value corresponding to each keyword;
and S624, storing the extraction result in the line data by a minimum identification unit, after the matching extraction of the current text line is completed, if the previous text line is empty, inserting the current text line data into the result of the previous text line, otherwise, adding a new line of data, and then performing the matching extraction of the next text line.
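The boundary arithmetic of step S622 is simple enough to state directly: the horizontal offset is the configured cell count multiplied by the matched single-character height. The helper name and parameter names below are illustrative:

```python
def field_boundaries(line_left, line_right, char_height,
                     cells_left, cells_right):
    """Compute the left/right extraction boundaries of a field region per
    step S622: offset = configured cell count x single-character height,
    subtracted on the left and added on the right of the line region."""
    left_bound = line_left - cells_left * char_height
    right_bound = line_right + cells_right * char_height
    return left_bound, right_bound
```

For example, a line region spanning x = 100..300 with character height 20, two cells configured on the left and three on the right, yields the region x = 60..360.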
Further, the specific method in S6 for extracting the matched text content when the matching type is general table extraction is as follows: the sorted text lines are matched against the template information by calculating a specific threshold, which depends on the number of text lines; matching succeeds when more than half of the text lines match, and if matching succeeds, a new list datum is generated;
s631, reading the sorted text lines in sequence, adding marks to the tail and the continuation page text lines of each page, and converting the positions of the continuation page text lines, namely, calculating the distance between the horizontal position and the vertical position of the continuation page text line and the last tail text line, and adding the distance to the position of the continuation page text line to ensure that all the text lines are read from top to bottom, the content cannot be lost or repeated, and the calculation of the later position is facilitated;
s632, the size and the position of each cell in the credit investigation report are fixed, after the template matching is successful, in order to calculate the left and right boundaries of the field region needing to be extracted, the left and right boundary distance is obtained by multiplying the number of the left and right cells configured in the template information by the height of the single character matched to the text line, namely the distance is subtracted from the left position of the current line region to obtain the left boundary, and the distance is added from the right position of the current text line region to obtain the right boundary; obtaining all result values of the table area by continuously judging whether the next text row is the result value, namely if the next text row is the result value, storing the data of the row where the result value is located until the template matching is successful again, and then extracting the number and the position of the main columns according to whether the keywords set in the template information are the main column attributes;
s633, dividing the rows by using the extracted main column number and position to obtain the upper and lower boundaries of each row result value, extracting by using the left and right boundaries calculated by each keyword, and finally obtaining the result value corresponding to each keyword by using the position of a frame line if the frame line is detected in the area;
s634, the extraction result is stored in the line data by the minimum identification unit, after the matching extraction of the current text line is completed, if the previous text line is empty, the current text line data is inserted into the result of the previous text line, otherwise, a line of data needs to be added, the line data is stored in the list data, and then the matching extraction of the next text line is performed.
Further, the specific method in S6 for extracting text content whose matching type is repayment record extraction is as follows: the sorted text lines are matched against the template information by calculating a specific threshold, which depends on the number of text lines; matching succeeds when more than half of the text lines match, and if matching succeeds, a new list datum is generated. The repayment record is a table structure with keywords above and result values below; repayment record extraction covers the year data on the left, the repayment record data distributed in upper and lower columns, and the amount data,
s641, sequentially reading the sorted text lines, adding marks to the tail text line and the continuous page text line of each page, and converting the positions of the continuous page text lines, namely, reading all the text lines from top to bottom by calculating the distance between the horizontal position and the vertical position of the continuous page text line and the last tail text line and adding the distance to the position of the continuous page text line, wherein the content cannot be lost or repeated, and the subsequent position calculation is facilitated;
s642, finding the right boundary position of the year data by calculating the position of the first keyword, calculating to obtain a candidate item of the year by using the right boundary position, and filtering by the attribute of the year to obtain the positions and the number of all the years;
s643, calculating upper and lower boundaries of repayment record data and money amount data by using position information between years and keywords, calculating left and right boundaries of the repayment record data and the money amount data by using the position information between the keywords, and judging a boundary of a current text line region by combining the position of frame line information if the frame line information is detected in a current region; obtaining candidate items of the repayment record data and the amount data through the position information, and filtering by using attributes of the repayment record data and the amount data to obtain the repayment record data and the amount data corresponding to each keyword and the year data;
s644, storing the extraction result into line data by a minimum identification unit, after matching extraction of the current text line is completed, if the previous text line is empty, inserting the current text line data into the result of the previous text line, otherwise, adding a new line of data, storing the line data into list data, and then performing matching extraction of the next text line.
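The year filtering of step S642 ("filtering by the attribute of the year") can be sketched with a single pattern. The accepted year range (1900–2099) and the function name are assumptions, since the patent does not define the year attribute precisely:

```python
import re

# Plausible four-digit years for a credit report (assumed range).
YEAR_RE = re.compile(r"^(19|20)\d{2}$")

def filter_years(candidates):
    """Keep only tokens that pass the year-attribute check used when
    locating the year column of a repayment record."""
    return [c for c in candidates if YEAR_RE.fullmatch(c)]
```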
Further, the specific method in S6 for extracting text line content whose matching type is subgroup name extraction is as follows: the sorted text lines are matched against the template information by calculating a specific threshold, which depends on the number of text lines; matching succeeds when more than half of the text lines match, and if matching succeeds, a new subgroup is generated. When the matched subgroup name contains an account but the account number is not identified, the account number is derived by counting the current subgroups.
Further, the specific method in S6 for extracting text line content whose matching type is single line extraction is as follows: the sorted text lines are matched against the template information by calculating a specific threshold, which depends on the number of text lines; matching succeeds when more than half of the text lines match, and if matching succeeds, a new line of data is generated. Single line extraction refers to a structure in which the whole line is extracted: several known fixed keywords are used to extract the text structure that must be extracted as a single line, the extraction result is stored in the line data as the minimum identification unit, and after matching extraction of the current text line is completed, matching extraction of the next text line is performed.
In addition, the invention also discloses a credit investigation report recognition system, which adopts the credit investigation report recognition method to realize credit investigation report recognition, and the system comprises:
the image data acquisition module is used for acquiring the image data of the credit investigation report to be identified;
the image preprocessing module is used for carrying out direction judgment, inclination correction, watermark detection and filtering on the acquired image data of the credit investigation report;
the OCR recognition module is used for carrying out full-text OCR recognition on the image data of the credit investigation report;
the format analysis module is used for carrying out format analysis on the image data of the credit investigation report;
the template matching module is used for matching according to the content of the text line and the template information;
the data extraction module is used for extracting data according to the matched information type;
the checking and summarizing module is used for checking and summarizing the data of the extraction result;
and the structured output module is used for outputting the extraction result to the xml file in a structured manner.
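The structured output module's xml serialization (step S8) might look like the following sketch using the standard library. The patent specifies an xml file but not its schema, so the tag and attribute names here are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

def to_xml(groups):
    """Serialize grouped extraction results to an XML string, one element
    per group with its key/value fields (tag names are assumed)."""
    root = ET.Element("CreditReport")
    for name, fields in groups.items():
        g = ET.SubElement(root, "Group", name=name)
        for key, value in fields.items():
            item = ET.SubElement(g, "Item", key=key)
            item.text = value
    return ET.tostring(root, encoding="unicode")
```

A real implementation would write the string to a file and also carry the ungrouped information retained in step S7.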
Compared with the prior art, the invention has the following advantages:
1) In the recognition process, the credit investigation report identification method of the invention preprocesses the credit investigation report image data, automatically judges and rotates the image data for recognition, and detects and filters the watermark position of the credit investigation report with deep learning. It can recognize image data in different directions or with missing pages, eliminates factors that interfere with recognition as far as possible, improves recognition accuracy, supports photographed images well, and supports recognition in multiple scenes such as single page, double page and missing page, solving the limitation of current products that only accept complete pdf scans as input.
2) The credit investigation report identification method can identify the layout of the credit investigation report image data and recognize the report format with deep learning. By segmenting the text according to the distribution of the text content and the frame line information, better matches are obtained in the template matching stage, improving the accuracy of template matching and thus of identification.
3) To better support scenes in which frame lines are partially or completely missing, the credit investigation report identification method efficiently extracts the relevant information from the image data by matching text line content against template information, and performs structured extraction and field verification on the matched field types, so that field-level identification accuracy and the recognition rate of the whole report are higher. In the extraction stage, matched fields are stored by type in a group structure, and in the result summarizing stage, information in the image data that does not form a group structure is retained, which facilitates extraction across continuation pages and improves the richness and integrity of the identification result.
Drawings
Fig. 1 is a flowchart illustrating a credit investigation report identification method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a credit investigation report identification system according to an embodiment of the present invention;
3a-3c are schematic structural diagrams of different information types in the credit report according to the embodiment of the invention;
fig. 4 is an effect diagram of removing a credit investigation report watermark by using a credit investigation report identification method according to an embodiment of the present invention, where the left side of the diagram is an original diagram containing a watermark, and the right side of the diagram is an effect diagram after removing the watermark.
Reference numbers in the drawings indicate: 0, group name extraction; 1, general line extraction; 2, general table extraction; 3, repayment record extraction; 4, subgroup name extraction; 5, single line extraction.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The detailed description is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention fall within the scope of protection of the present invention.
Example:
as shown in fig. 1, this embodiment discloses a method for identifying a credit investigation report, which includes the following steps:
S1, acquiring image data of the credit investigation report to be recognized: the credit investigation report image data may be in jpg, bmp, png, pdf or tiff file format; multi-page pdf files are first split into single pages, and each split single page is converted into image data. By splitting image data of different formats, this step improves identification accuracy and avoids errors from multi-page recognition.
s2, preprocessing the image data to obtain preprocessed image data;
for the image preprocessing process in step S2, the present embodiment performs direction determination, tilt correction, watermark detection and filtering on the image data, and includes the following specific steps:
s21, image direction judgment: detecting text lines of text contents on the image by using a deep learning model, and judging the current image direction by using an OCR (optical character recognition) technology;
s22, image inclination correction: text line detection is carried out on text content on the image through a deep learning model, the inclination angle of the current image is calculated by utilizing an OCR recognition technology, and the image is rotated by a corresponding angle for correction;
s23, image watermark detection and filtering: watermark position detection and watermark removal are carried out on the image through a deep learning model:
s231, image preprocessing, namely performing normalization processing on the image under the condition of ensuring that the aspect ratio of the image is not changed;
s232, down-sampling is performed on the preprocessed image by using a biomedical image segmentation technique combined with a convolutional network model, namely, feature maps and feature values at different scales are obtained through convolution and pooling; up-sampling and deconvolution are then performed, the up-sampling part including up-sampling the feature values back to match the ground truth to complete pixel-level classification, finally obtaining an image of the same size as the preprocessed image, and the classification result is used to complete segmentation of the image regions containing watermark information;
s233, the receptive field size is adjusted so that the whole watermark in the image is fully presented, and the network is trained with a mean-square-error regression loss, i.e., the loss between the input watermarked image and the output watermark-removed image is minimized;
and S234, generating a training sample through the direction, the size and the angle of the watermark for training, and then removing the watermark of the image by using the trained deep learning model, as shown in FIG. 4.
In the actual image recognition process, image sources are varied (e.g., scanning or photographing), and the image content may appear at different angles, in different orientations, or with missing pages. In the image preprocessing stage, line detection and OCR recognition are applied to the text content on the image through deep learning to determine and correct the image orientation and tilt angle, so that better matching can be performed in the template matching stage.
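The tilt-correction idea above — estimate the page skew from detected text-line baselines, then rotate by that angle — can be sketched as follows. The baseline-angle rule, the median aggregation, and the `line_boxes` input format are illustrative assumptions; the patent does not disclose its exact deep learning model or angle computation.

```python
import math

def estimate_skew(line_boxes):
    """Estimate page skew (degrees) as the median baseline angle of detected
    text lines. Each box is ((x1, y1), (x2, y2)): the left and right ends of
    one text line's baseline, as produced by a line detector (assumed format)."""
    angles = sorted(
        math.degrees(math.atan2(y2 - y1, x2 - x1))
        for (x1, y1), (x2, y2) in line_boxes
    )
    return angles[len(angles) // 2]  # the median is robust to a few bad detections
```

The correction step would then rotate the image by the negative of the returned angle.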
S3, carrying out full-text recognition on the preprocessed image data to acquire text content and frame line information;
s4, analyzing the layout structure in the image data, and judging single and double pages in the image data;
in this embodiment, a specific method for analyzing the layout structure of the image data in step S4 is as follows:
s41, identifying continuous pages in the image data: judging whether continuous pages exist or not by reading the pages obtained by splitting in the image data and judging an end mark in the image data;
s42, judging whether the page of the single-page image data is a single page or multiple pages, wherein the judging method aiming at the single-page image data comprises a deep learning classification method and/or a template matching method;
the deep learning classification method comprises the following specific steps:
preprocessing image data, and normalizing the image data on the basis of ensuring that the length-width ratio of the image is not changed;
obtaining feature maps with different sizes by using a visual geometry group network, and constructing a plurality of default boxes with different sizes by using each point in the feature maps;
combining the default boxes generated by different feature maps, continuously matching them against the ground truth by a non-maximum suppression method, and filtering out overlapped or incorrect default boxes;
generating a training sample through the size, the position and the angle of a bounding box in the image data for training;
judging the page structure by detecting the size and the position of a bounding box in the image data: if the sizes of the bounding boxes in the detected image data are similar and distributed left and right, judging that the image data is double pages, otherwise, judging that the image data is single page.
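The bounding-box rule in the last step — similar-sized boxes distributed on the left and right imply a double page — reduces to a simple check. The `(x, y, w, h)` box format and the size-tolerance parameter are assumptions for illustration:

```python
def classify_page(boxes, page_width, size_tol=0.2):
    """boxes: detected page bounding boxes as (x, y, w, h).
    Double page if boxes of similar area sit on both sides of the page centre;
    otherwise single page (per the rule described in the text)."""
    mid = page_width / 2
    left = [b for b in boxes if b[0] + b[2] <= mid]
    right = [b for b in boxes if b[0] >= mid]
    if not left or not right:
        return "single"
    avg = lambda bs: sum(w * h for (_, _, w, h) in bs) / len(bs)
    a_l, a_r = avg(left), avg(right)
    # "similar size" interpreted as relative area difference within size_tol
    return "double" if abs(a_l - a_r) / max(a_l, a_r) <= size_tol else "single"
```

In practice the boxes would come from the default-box detector described above.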
The template matching method comprises the following specific steps:
judging whether the text blocks belong to the same line or not by calculating the contact ratio of the text blocks in the vertical direction, carrying out line aggregation on the text blocks, and correcting line aggregation results by using the detected frame line information;
traversing the whole template content by using the content of the text lines after line aggregation, and judging whether the matching is successful by calculating a specific threshold; in this embodiment, the specific threshold is related to the number of text lines, and generally the matching is considered successful if more than half of the total text lines are matched. As shown by number 1 in fig. 3, the text includes 8 text lines, so the matching is successful as long as more than 4 lines match.
And judging whether the image is a single page or a double page according to the distribution condition of the successfully matched text lines in the template.
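The majority-vote matching rule described above ("more than half of the text lines match") can be sketched as:

```python
def match_succeeds(line_results):
    """line_results: one boolean per sorted text line, True if that line
    matched a template keyword. The rule from the text: the template match
    succeeds when strictly more than half of the lines matched."""
    return sum(line_results) > len(line_results) / 2
```

With the 8-line example from fig. 3, at least 5 matched lines are required.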
By matching the text content against the template, the method can efficiently extract relevant information from the image and perform structured extraction and field verification on the matched field types, so that field recognition accuracy is higher and the overall recognition rate is improved.
S5, performing information type matching on the text line content and the template information: segmenting, clustering and classifying the content of the text lines according to the single-page or double-page judgment result of the image data and the identified frame line information in the S4, and performing information type matching according to the content of the text lines obtained by sorting and template information, wherein the template information comprises the text information of keywords, the position information of the keywords, whether the keywords have multi-line attributes, whether the result values have the multi-line attributes, the type values of the keywords, whether the keywords are main columns or not and the type values of the lines where the keywords are located;
In this embodiment, the keyword text information attribute is set so that text lines obtained in combination with deep learning can be matched by calculating a specific threshold. The keyword position information attribute is set so that the relevant positions can be calculated in the type-based extraction stage of matched fields. The keyword multi-line attribute is set so that the text line sorting stage can judge whether up-down segmentation and line aggregation are needed. The result-value multi-line attribute is set so that the type-based extraction stage can judge whether the next line needs to be extracted. The keyword type attribute is set so that verification can be performed by type in the result extraction and verification stage. The keyword main-column attribute is set so that, during general table extraction of matched fields, it can be judged whether the column can serve as the basis for row division. The line type value attribute of the keyword is set so that fields of different formats can be extracted and the credit report content can finally be fully preserved in a group structure form.
Further, the segmenting the content of the text line in S5 includes left-right segmenting and/or up-down segmenting;
when the text line content is cut left and right, image data needs to be judged to be double pages firstly, and the corresponding distribution condition is obtained by utilizing the obtained text content: the more dense the characters are, the larger the calculated distribution value is, and the more sparse the characters are, the smaller the calculated distribution value is; finding a middle blank area position according to the distribution value of the text content, and classifying the text content into a left part and a right part by using the blank area position;
when the text line content is segmented up and down, line aggregation is first performed on the text blocks, and then whether the text lines obtained by line aggregation need up-down segmentation is judged using the recognized frame line information and the multi-line attributes of the keywords or result values in the template information. This segmentation makes it possible to recognize the layout within a single-page image: for example, a photographed single image may contain two or more pages, and to improve recognition accuracy it must be judged whether multiple pages exist in the image so that recognition can be performed in a targeted manner, improving both accuracy and completeness.
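The left-right segmentation rule above — dense text gives large distribution values, and the emptiest region near the page middle marks the split — might be sketched as a projection histogram. The bin count and the middle-third search window are illustrative assumptions:

```python
def split_point(char_x_centres, page_width, bins=50):
    """Histogram the x-centres of recognised characters; the emptiest bin
    near the page middle marks the blank gutter between the two halves of a
    double page. Returns the x-coordinate of the split."""
    counts = [0] * bins
    for x in char_x_centres:
        counts[min(int(x / page_width * bins), bins - 1)] += 1
    centre = range(bins // 3, 2 * bins // 3)  # only search the middle third
    gap = min(centre, key=lambda i: counts[i])  # lowest density = blank area
    return (gap + 0.5) * page_width / bins
```

Text content would then be classified into left and right parts around the returned coordinate.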
S6, extracting the matched text line content according to the information type, wherein the information type comprises group name extraction 0, general line extraction 1, general table extraction 2, repayment record extraction 3, subgroup name extraction 4 and single line extraction 5;
specifically, as shown in fig. 3a to 3c, the method for extracting different information types includes:
Further, the specific method in S6 for extracting text content whose matching type is group name extraction 0 is as follows: the sorted text lines are matched against the template information by calculating a specific threshold; in this embodiment the threshold is related to the number of text lines, and generally the matching is considered successful when more than half of the total lines match. If the matching succeeds, a new group is generated; if an end word of the credit investigation report is matched, the end mark identified in step S41 is set to true and the extraction result is output in the specified format.
The specific method for extracting text content whose matching type is general line extraction 1 is as follows: the sorted text lines are matched against the template information by calculating a specific threshold; in this embodiment the threshold is related to the number of text lines, and generally the matching is considered successful when more than half of the total lines match; if the matching succeeds, a new line of data is generated;
s621, the sorted text lines are read in sequence, marks are added to the last text line of each page and to continuation-page text lines, and the positions of continuation-page text lines are converted: the horizontal and vertical distances between a continuation-page text line and the previous page's last text line are calculated and added to the continuation-page text line's position, ensuring that all text lines are read from top to bottom with no content lost or repeated, which facilitates subsequent position calculation;
s622, since the size and position of each cell of the credit investigation report are fixed, after template matching succeeds, the left and right boundaries of the field region to be extracted are calculated as follows: the number of left and right cells configured in the template information is multiplied by the height of a single character of the matched text line to obtain an offset distance; this distance is subtracted from the left position of the current line region to obtain the left boundary, and added to the right position of the current text line region to obtain the right boundary. The upper and lower boundaries of the extracted field region are obtained by repeatedly judging whether the next text line is a result value: if it is, the upper boundary of the extracted field region is the lower boundary position of the current text line; if not, the judgment continues with the following text lines until a result value is found. The result-value multi-line attribute set in the template information is then used to judge whether the next text line should also be extracted, until template matching succeeds again, at which point the lower boundary of the extracted field region is the upper boundary position of the current text line;
s623, extracting texts by using the current text line region position obtained through calculation, and if outline information is detected in the current region, extracting texts by combining the outline position to finally obtain a result value corresponding to each keyword;
and S624, storing the extraction result into the line data by a minimum identification unit, wherein the minimum identification unit comprises keywords, result values, position information of the keywords and the result values and the like which are in one-to-one correspondence with each cell in the image data, and after the matching extraction of the current text line is completed, if the previous text line is empty, the current text line data is inserted into the result of the previous text line, otherwise, a new line of data needs to be added, and then the matching extraction of the next text line is performed.
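The left/right boundary rule in S622 (offset = configured cell count × matched single-character height) reduces to a small calculation; the parameter names are illustrative:

```python
def field_bounds(line_left, line_right, char_h, cells_left, cells_right):
    """Left and right boundaries of the field region to extract, per the rule
    in S622: the offset is the number of cells configured in the template
    multiplied by the height of a single matched character."""
    return (line_left - cells_left * char_h,
            line_right + cells_right * char_h)
```

For example, a line region spanning x = 100..500, a 20-pixel character height, and a template configured with 2 cells on the left and 3 on the right yields boundaries (60, 560).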
Further, the specific method for extracting text content whose matching type is general table extraction 2 is as follows: the sorted text lines are matched against the template information by calculating a specific threshold; in this embodiment the threshold is related to the number of text lines, and generally the matching is considered successful when more than half of the total lines match; if the matching succeeds, new list data is generated;
s631, the sorted text lines are read in sequence, marks are added to the last text line of each page and to continuation-page text lines, and the positions of continuation-page text lines are converted: the horizontal and vertical distances between a continuation-page text line and the previous page's last text line are calculated and added to the continuation-page text line's position, ensuring that all text lines are read from top to bottom with no content lost or repeated, which facilitates subsequent position calculation;
s632, multiplying the number of the left and right cells set in the template information by the height of the single character matched with the text line to obtain the left and right boundaries of the current text line area, namely subtracting the distance from the left position of the current text line area to obtain the left boundary, and adding the distance from the right position of the current text line area to obtain the right boundary; obtaining all result values of the table area by continuously judging whether the next text row is the result value, namely if the next text row is the result value, storing the data of the row where the result value is located until the template matching is successful again, and then extracting the number and the position of the main columns according to whether the keywords set in the template information are the main column attributes;
s633, dividing the rows by using the extracted main column number and position to obtain the upper and lower boundaries of each row result value, extracting by using the left and right boundaries calculated by each keyword, and finally obtaining the result value corresponding to each keyword by using the position of a frame line if the frame line is detected in the area;
s634, storing the extraction result in line data by a minimum identification unit, wherein the minimum identification unit comprises keywords, result values, position information of the keywords and the result values and the like which are in one-to-one correspondence with each cell in the image data, after the matching extraction of the current text line is completed, if the previous text line is empty, the current text line data is inserted into the result of the previous text line, otherwise, a line of data needs to be added, storing the line data in list data, and then performing the matching extraction of the next text line.
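Row division by the main column (S633) might look like the following sketch, in which each extracted result value is assigned to the nearest main-column row anchor; the input shapes and the row-height tolerance are assumptions:

```python
def split_rows(values, main_col_ys, row_h):
    """Assign extracted cells to table rows using the y-positions of the main
    column, which the method uses as row anchors.
    values: (y, keyword, text) triples; main_col_ys: y-position of each row's
    main-column cell; row_h: maximum vertical distance to the row anchor."""
    rows = [[] for _ in main_col_ys]
    for y, key, text in values:
        idx = min(range(len(main_col_ys)), key=lambda i: abs(main_col_ys[i] - y))
        if abs(main_col_ys[idx] - y) <= row_h:  # discard cells far from every row
            rows[idx].append((key, text))
    return rows
```

Each row list then becomes one line-data entry in the list data described in S634.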
Further, the specific method for extracting text content whose matching type is repayment record extraction 3 is as follows: the sorted text lines are matched against the template information by calculating a specific threshold; in this embodiment the threshold is related to the number of text lines, and generally the matching is considered successful when more than half of the total lines match; if the matching succeeds, new list data is generated. Repayment record extraction applies to a table structure in which keywords and result values are arranged vertically; the extracted content includes year data on the left, repayment record data distributed in upper and lower columns, and amount data.
s641, the sorted text lines are read in sequence, marks are added to the last text line of each page and to continuation-page text lines, and the positions of continuation-page text lines are converted: the horizontal and vertical distances between a continuation-page text line and the previous page's last text line are calculated and added to the continuation-page text line's position, ensuring that all text lines are read from top to bottom with no content lost or repeated, which facilitates subsequent position calculation;
s642, finding the right boundary position of the year data by calculating the position of the first keyword, calculating to obtain a candidate item of the year by using the right boundary position, and filtering by the attribute of the year to obtain the positions and the number of all the years;
s643, calculating upper and lower boundaries of repayment record data and money amount data by using position information between years and keywords, calculating left and right boundaries of the repayment record data and the money amount data by using the position information between the keywords, and judging a boundary of a current text line region by combining the position of frame line information if the frame line information is detected in a current region; obtaining candidate items of the repayment record data and the amount data through the position information, and filtering by using attributes of the repayment record data and the amount data to obtain the repayment record data and the amount data corresponding to each keyword and the year data;
s644, storing the extraction result in line data by a minimum identification unit, wherein the minimum identification unit comprises keywords, result values, position information of the keywords and the result values and the like which are in one-to-one correspondence with each cell in the image data, after the matching extraction of the current text line is completed, if the previous text line is empty, the current text line data is inserted into the result of the previous text line, otherwise, a line of data needs to be added, storing the line data into list data, and then performing the matching extraction of the next text line.
The specific method in S6 for extracting text line content whose matching type is subgroup name extraction 4 is as follows: the sorted text lines are matched against the template information by calculating a specific threshold; in this embodiment the threshold is related to the number of text lines, and generally the matching is considered successful when more than half of the total lines match; a new subgroup is generated if the matching succeeds. When the matched subgroup name contains an account but the account number is not recognized, the account number is derived by counting the current subgroups.
The specific method for extracting text line content whose matching type is single-line extraction 5 is as follows: the sorted text lines are matched against the template information by calculating a specific threshold; in this embodiment the threshold is related to the number of text lines, and generally the matching is considered successful when more than half of the total lines match; if the matching succeeds, a new line of data is generated. Single-line extraction applies to structures where an entire line must be extracted: a text structure requiring single-line extraction is located using several known fixed keywords, such as "time to end" and "repayment record" shown by the oval box in fig. 3 a. For left-right text line structures such as the "report number" and "report time" shown by the oval box in fig. 3c, the text to the right or left of the fixed keywords is found and extracted. The extraction result is then stored in line data by minimum identification unit, where the minimum identification unit includes the keywords, result values, and the position information of keywords and result values corresponding one-to-one with each cell in the image data; after matching extraction of the current text line is completed, matching extraction of the next text line is performed.
By performing image identification and extraction according to type, relevant information can be efficiently extracted from the image, and structured extraction is performed for each matched field type. Field recognition accuracy is therefore higher, the recognition rate of the whole credit report is improved, data accuracy is increased, and the format structure of the credit investigation report is preserved in the final output, making the recognition result rich and clear.
S7, check and summarize the extraction results: data verification is performed on the extraction results according to the matched data types, which include amount, date, proportion and repayment record. During verification, the extraction results can be filtered with regular expressions according to the matched data type, including numeric filtering, English filtering, Chinese filtering and symbol filtering. The checked extraction results are integrated according to the group structure, and information that does not form a group structure in the current image data is retained;
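The regular-expression verification step can be illustrated with plausible patterns for the four data types; the actual patterns used by the method are not disclosed, so the expressions below (and the status codes accepted for repayment records) are assumptions:

```python
import re

# Hedged sketch: one illustrative pattern per data type named in S7.
PATTERNS = {
    "amount":     re.compile(r"-?\d{1,3}(,\d{3})*(\.\d+)?|-?\d+(\.\d+)?"),
    "date":       re.compile(r"\d{4}[.\-/]\d{1,2}([.\-/]\d{1,2})?"),
    "proportion": re.compile(r"\d+(\.\d+)?%"),
    # single-character repayment statuses (N, 1-7, C, G, *, #, /) are typical
    # of credit reports but assumed here
    "repayment":  re.compile(r"[N1-7CG*#/]"),
}

def check(value, dtype):
    """True if the extracted value fully matches the pattern for its type."""
    return bool(PATTERNS[dtype].fullmatch(value))
```

Values that fail the check would be filtered out before the group-structure integration step.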
s8, output the checked and summarized extraction result to an xml file in the specified format.
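A minimal sketch of the structured xml output using Python's standard library; the element names and the shape of the input group structure are assumed, since the specific output schema is not published:

```python
import xml.etree.ElementTree as ET

def to_xml(groups):
    """Serialise the summarised group structure to an xml string.
    groups: {group name: [ {keyword: result value} row dicts ]} (assumed shape)."""
    root = ET.Element("report")
    for name, rows in groups.items():
        g = ET.SubElement(root, "group", name=name)
        for row in rows:
            r = ET.SubElement(g, "row")
            for key, val in row.items():
                ET.SubElement(r, "field", key=key).text = val
    return ET.tostring(root, encoding="unicode")
```

The returned string would then be written to the output xml file.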
In addition, the present invention also discloses a credit investigation report recognition system, which adopts the credit investigation report recognition method to realize credit investigation report recognition, as shown in fig. 2, the system comprises:
an image data acquisition module 100, configured to acquire image data of a credit investigation report to be identified;
the image preprocessing module 200 is used for performing direction judgment, inclination correction, watermark detection and filtering on the acquired image data of the credit investigation report;
the OCR recognition module 300 is used for performing full-text OCR recognition on the image data of the credit investigation report;
the format analysis module 400 is used for carrying out format analysis on the image data of the credit investigation report;
the template matching module 500 is used for matching according to the content of the text line and the template information;
a data extraction module 600, configured to perform data extraction according to the matched information type;
the verification and summary module 700 is configured to perform data verification and summary on the extraction result;
and the structured output module 800 is used for outputting the extraction result to the xml file in a structured manner.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A credit investigation report identification method is characterized by comprising the following steps:
s1, acquiring image data of the credit investigation report to be recognized: the credit investigation report image data may be in jpg, bmp, png, pdf or tiff file format; a multi-page pdf file is first split into single pages, and each split page is then converted into image data;
s2, preprocessing the image data to obtain preprocessed image data;
s3, carrying out full-text recognition on the preprocessed image data to acquire text content and frame line information;
s4, analyzing the layout structure in the image data, and judging single and double pages in the image data;
s5, performing information type matching on the text line content and the template information: segmenting, clustering and classifying the content of the text lines according to the single-page or double-page judgment result of the image data and the identified frame line information in the S4, and performing information type matching according to the content of the text lines obtained by sorting and template information, wherein the template information comprises the text information of keywords, the position information of the keywords, whether the keywords have multi-line attributes, whether the result values have the multi-line attributes, the type values of the keywords, whether the keywords are main columns or not and the type values of the lines where the keywords are located;
s6, extracting the matched text line content according to the information type, wherein the information type comprises group name extraction, general line extraction, general table extraction, repayment record extraction, subgroup name extraction and single line extraction;
s7 checks and summarizes the extracted results: performing data verification on the extraction result according to the data types obtained by matching, and filtering the extraction result in a regular expression mode according to the different matched data types; integrating the checked extraction results according to the group structure, and keeping the information which does not form the group structure in the current image data;
s8, outputting the checked and summarized extraction result to an xml file in the specified format.
2. The credit investigation report recognition method according to claim 1, characterized in that: the preprocessing process in the S2 includes:
s21, image direction judgment: detecting text lines of text contents on the image by using a deep learning model, and judging the current image direction by using an OCR (optical character recognition) technology;
s22, image inclination correction: text line detection is carried out on text content on the image through a deep learning model, the inclination angle of the current image is calculated by utilizing an OCR recognition technology, and the image is rotated by a corresponding angle for correction;
s23, image watermark detection and filtering: watermark position detection and watermark removal are carried out on the image through a deep learning model,
s231, image preprocessing, namely performing normalization processing on the image under the condition of ensuring that the aspect ratio of the image is not changed;
s232, down-sampling is performed on the preprocessed image by using a biomedical image segmentation technique combined with a convolutional network model, namely, feature maps and feature values at different scales are obtained through convolution and pooling; up-sampling and deconvolution are then performed, the up-sampling part including up-sampling the feature values back to match the ground truth to complete pixel-level classification, finally obtaining an image of the same size as the preprocessed image, and the classification result is used to complete segmentation of the image regions containing watermark information;
s233, the receptive field size is adjusted so that the whole watermark in the image is fully presented, and the network is trained with a mean-square-error regression loss, i.e., the loss between the input watermarked image and the output watermark-removed image is minimized;
and S234, generating a training sample through the direction, the size and the angle of the watermark for training, and then removing the watermark of the image by using the trained deep learning model.
3. The credit investigation report recognition method according to claim 1, characterized in that: the specific method for analyzing the layout structure of the image data in S4 is as follows:
s41, identifying continuous pages in the image data: judging whether continuous pages exist or not by reading the pages obtained by splitting in the image data and judging an end mark in the image data;
s42, judging whether the page of the single-page image data is a single page or multiple pages, wherein the judging method aiming at the single-page image data comprises a deep learning classification method and a template matching method;
the deep learning classification method comprises the following specific steps:
preprocessing image data, and normalizing the image data on the basis of ensuring that the length-width ratio of the image is not changed;
obtaining feature maps with different sizes by using a visual geometry group network, and constructing a plurality of default boxes with different sizes by using each point in the feature maps;
combining the default boxes generated by different feature maps, continuously matching them against the ground truth by a non-maximum suppression method, and filtering out overlapped or incorrect default boxes;
generating a training sample through the size, the position and the angle of a bounding box in the image data for training;
judging the page structure by detecting the size and the position of a bounding box in the image data: if the sizes of the bounding boxes in the detected image data are similar and distributed left and right, judging that the image data is double pages, otherwise, judging that the image data is single page;
the template matching method comprises the following specific steps:
judging whether the text blocks belong to the same line or not by calculating the contact ratio of the text blocks in the vertical direction, carrying out line aggregation on the text blocks, and correcting line aggregation results by using the detected frame line information;
traversing the whole template content by using the content of the text lines after line aggregation, and judging whether the matching is successful or not by calculating a specific threshold value;
and judging whether the image is a single page or a double page according to the distribution condition of the successfully matched text lines in the template.
4. The credit investigation report recognition method according to claim 1, characterized in that: the content segmentation of the text lines in the S5 comprises left-right segmentation and/or up-down segmentation;
when the text line content is cut left and right, image data needs to be judged to be double pages firstly, and the corresponding distribution condition is obtained by utilizing the obtained text content: the more dense the characters are, the larger the calculated distribution value is, the more sparse the characters are, and the smaller the calculated distribution value is; finding a middle blank area position according to the distribution value of the text content, and classifying the text content into a left part and a right part by using the blank area position;
when the text line content is cut up and down, firstly, the text blocks are gathered, and then whether the text lines obtained by gathering the lines need to be cut up and down is judged by utilizing whether the frame line information obtained by identification and the keywords or the result values in the template information are in multi-line attributes or not.
5. The credit investigation report recognition method according to claim 1, characterized in that: the specific method for extracting the text content extracted when the matching type is the group name in S6 is as follows: and matching the sorted text lines with the template information by calculating a specific threshold, generating a new group if the matching is successful, setting the end mark identified in the step S41 as true if the end mark is matched with the end word of the credit investigation report, and outputting an extraction result according to a specific format.
6. The credit investigation report recognition method according to claim 1, characterized in that: the specific method for extracting the text content of which the matching type is the common line extraction in S6 is as follows: matching the sorted text line with the template information by calculating a specific threshold, and generating a new line of data if the matching is successful;
s621, reading the sorted text lines in sequence, adding marks to the tail text line and the continuous page text line of each page, and converting the positions of the continuous page text lines, namely calculating the horizontal and vertical positions of the continuous page text line and the last tail text line, and adding the distance to the positions of the continuous page text lines to enable all the text lines to be read sequentially from top to bottom, and enabling text contents not to be lost or repeated, so that the position calculation of subsequent texts is facilitated;
s622, the size and the position of each cell in the credit investigation report are fixed, after the template matching is successful, in order to calculate the left and right boundaries of the field region needing to be extracted, the left and right boundary distance is obtained by multiplying the number of the left and right cells configured in the template information by the height of the single character matched to the text line, namely the distance is subtracted from the left position of the current line region to obtain a left boundary, and the distance is added from the right position of the current text line region to obtain a right boundary; obtaining the upper and lower boundaries of the extracted field area by continuously judging whether the next text line is a result value, namely if the next text line is the result value, extracting the upper boundary of the field area as the lower boundary position of the current line, if not, continuously judging whether the next text line is the result value until the result value is found, then judging whether to continuously extract the subsequent text line by utilizing whether the result value set in the template information is multi-line attribute or not, and extracting the lower boundary of the field area as the upper boundary position of the newly matched text line until the template matching is successful again;
s623, extracting texts by using the current text line region position obtained through calculation, and if outline information is detected in the current region, extracting texts by combining the outline position to finally obtain a result value corresponding to each keyword;
and S624, storing the extraction result in the line data by a minimum identification unit, after the matching extraction of the current text line is completed, if the previous text line is empty, inserting the current text line data into the result of the previous text line, otherwise, adding a new line of data, and then performing the matching extraction of the next text line.
7. The credit investigation report recognition method according to claim 1, characterized in that: the specific method for extracting the text content of which the matching type is the general table extraction in S6 is as follows: matching the sorted text line with template information by calculating a specific threshold, and generating new list data if matching is successful;
s631, reading the sorted text lines in sequence, adding marks to the tail and the continuation page text lines of each page, and converting the positions of the continuation page text lines, namely calculating the distance between the horizontal position and the vertical position of the continuation page text line and the last page tail text line, and adding the distance to the position of the continuation page text line to ensure that all the text lines are sequentially read from top to bottom, the text content cannot be lost or repeated, and the position calculation of the subsequent text is facilitated;
s632, the size and the position of each cell in the credit investigation report are fixed, after the template matching is successful, in order to calculate the left and right boundaries of the field region needing to be extracted, the left and right boundary distance is obtained by multiplying the number of the left and right cells configured in the template information by the height of the single character matched to the text line, namely the distance is subtracted from the left position of the current line region to obtain the left boundary, and the distance is added from the right position of the current text line region to obtain the right boundary; obtaining all result values of the table area by continuously judging whether the next text row is the result value, namely if the next text row is the result value, storing the data of the row where the result value is located until the template matching is successful again, and then extracting the number and the position of the main columns according to whether the keywords set in the template information are the main column attributes;
s633, dividing the rows by using the extracted main column number and position to obtain the upper and lower boundaries of each row result value, extracting by using the left and right boundaries calculated by each keyword, and finally obtaining the result value corresponding to each keyword by using the position of a frame line if the frame line is detected in the area;
s634, the extraction result is stored in line data by a minimum identification unit, after matching extraction of the current text line is completed, if the previous text line is empty, the current text line data is inserted into the result of the previous text line, otherwise, a line of data needs to be added newly, the line data is stored in list data, and then matching extraction of the next text line is performed.
8. The credit investigation report recognition method according to claim 1, characterized in that: the specific method for extracting the text content of which the matching type is the extraction of the payment record in S6 is as follows: matching the sorted text line with template information by calculating a specific threshold, and generating new list data if matching is successful; wherein the repayment record extraction is a table structure meeting upper and lower keywords and result values, when the repayment record extraction is carried out, the repayment record extraction comprises left year data, repayment record data distributed in an upper column and a lower column and money amount data,
s641, sequentially reading the sorted text lines, adding marks to the tail text line and the continuous page text line of each page, and converting the positions of the continuous page text lines, namely, calculating the horizontal and vertical positions of the continuous page text line and the last tail text line, and adding the distance to the positions of the continuous page text lines to ensure that all the text lines are sequentially read from top to bottom, and the text content is not lost or repeated, so that the position calculation of the subsequent text is facilitated;
s642, finding the right boundary position of the year data by calculating the position of the first keyword, calculating to obtain a candidate item of the year by using the right boundary position, and filtering by the attribute of the year to obtain the positions and the number of all the years;
s643, calculating upper and lower boundaries of repayment record data and money amount data by using position information between years and keywords, calculating left and right boundaries of the repayment record data and the money amount data by using the position information between the keywords, and judging a boundary of a current text line region by combining the position of frame line information if the frame line information is detected in a current region; obtaining candidate items of the repayment record data and the amount data through the position information, and filtering by using attributes of the repayment record data and the amount data to obtain the repayment record data and the amount data corresponding to each keyword and the year data;
s644, storing the extraction result in line data by a minimum identification unit, after the matching extraction of the current text line is completed, if the previous text line is empty, inserting the current text line data into the result of the previous text line, otherwise, adding a new line of data, storing the line data into list data, and then performing the matching extraction of the next text line.
9. The credit investigation report recognition method according to claim 1, characterized in that: the specific method for extracting the text line content whose matching type is the sub-group name extraction in S6 is as follows: matching the sorted text lines with template information by calculating a specific threshold, and generating a new subgroup if matching is successful; and when the matched subgroup name contains the account and the account number is not identified, calculating the number of the current subgroup to calculate the account number.
10. The credit investigation report recognition method according to claim 1, characterized in that: the specific method for extracting the text line content of which the matching type is extracted in a single line in S6 is as follows: matching the sorted text line with the template information by calculating a specific threshold, and generating a new line of data if the matching is successful; the text structure needing single-line extraction is extracted by using the known fixed keywords, the extraction result is stored in line data by a minimum identification unit, and the next text line matching extraction is carried out after the current text line matching extraction is completed.
CN202210145731.XA 2022-02-17 2022-02-17 Credit investigation report identification method Pending CN114529932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210145731.XA CN114529932A (en) 2022-02-17 2022-02-17 Credit investigation report identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210145731.XA CN114529932A (en) 2022-02-17 2022-02-17 Credit investigation report identification method

Publications (1)

Publication Number Publication Date
CN114529932A true CN114529932A (en) 2022-05-24

Family

ID=81622322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210145731.XA Pending CN114529932A (en) 2022-02-17 2022-02-17 Credit investigation report identification method

Country Status (1)

Country Link
CN (1) CN114529932A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237971A (en) * 2023-11-10 2023-12-15 长威信息科技发展股份有限公司 Food quality inspection report data extraction method based on multi-mode information extraction
CN117373030A (en) * 2023-06-19 2024-01-09 上海简答数据科技有限公司 OCR-based user material identification method, system, device and medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117373030A (en) * 2023-06-19 2024-01-09 上海简答数据科技有限公司 OCR-based user material identification method, system, device and medium
CN117237971A (en) * 2023-11-10 2023-12-15 长威信息科技发展股份有限公司 Food quality inspection report data extraction method based on multi-mode information extraction
CN117237971B (en) * 2023-11-10 2024-01-30 长威信息科技发展股份有限公司 Food quality inspection report data extraction method based on multi-mode information extraction

Similar Documents

Publication Publication Date Title
CN110569832B (en) Text real-time positioning and identifying method based on deep learning attention mechanism
CN111814722B (en) Method and device for identifying table in image, electronic equipment and storage medium
CN109657665B (en) Invoice batch automatic identification system based on deep learning
US11804056B2 (en) Document spatial layout feature extraction to simplify template classification
Kleber et al. Cvl-database: An off-line database for writer retrieval, writer identification and word spotting
US5923792A (en) Screen display methods for computer-aided data entry
CN101719142B (en) Method for detecting picture characters by sparse representation based on classifying dictionary
US11232300B2 (en) System and method for automatic detection and verification of optical character recognition data
CN114529932A (en) Credit investigation report identification method
CN112818785B (en) Rapid digitization method and system for meteorological paper form document
CN112016481B (en) OCR-based financial statement information detection and recognition method
CN112861865B (en) Auxiliary auditing method based on OCR technology
Kišš et al. Brno mobile OCR dataset
CN114998905A (en) Method, device and equipment for verifying complex structured document content
CN114463767A (en) Credit card identification method, device, computer equipment and storage medium
KR100655916B1 (en) Document image processing and verification system for digitalizing a large volume of data and method thereof
CN111008635A (en) OCR-based multi-bill automatic identification method and system
CN110728240A (en) Method and device for automatically identifying title of electronic file
Shweka et al. Automatic extraction of catalog data from digital images of historical manuscripts
Dulla A dataset of warped historical arabic documents
CN111612045B (en) Universal method for acquiring target detection data set
CN115050025A (en) Knowledge point extraction method and device based on formula recognition
AU2010291063A1 (en) Method and apparatus for segmenting images
CN112560849A (en) Neural network algorithm-based grammar segmentation method and system
CN113887484B (en) Card type file image identification method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination