CN104966051B

CN104966051B - A layout recognition method for document images

Info

Publication number: CN104966051B
Application number: CN201510297257.2A
Authority: CN
Inventors: 时金桥; 范晓鹏; 陈小军; 郭莉; 蒲以国; 文新; 邹亚劼; 王洋
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2015-06-03
Filing date: 2015-06-03
Publication date: 2018-07-17
Anticipated expiration: 2035-06-03
Also published as: CN104966051A

Abstract

The invention discloses a layout recognition method for document images. A format-entry function is first designed: layout content is saved in a library, together with a format sequence number generated from the relative character height and the alignment of that content. When an unknown image is analyzed and the resulting format sequence number equals one stored in the library, the layout information of the unknown image is extracted using the hints stored in the library. The invention recognizes document images through an efficient and accurate layout analysis method and is particularly suited to layout recognition of Chinese official-document images.

Description

A layout recognition method for document images

Technical field

The invention belongs to the field of pattern recognition and proposes a layout recognition method for scanned document images.

Background technology

In recent years, with China's rapid economic development, government departments have guided and formulated more and more policies. National and local policies are issued in the form of official documents and, as technology has developed, more and more official documents and similar files are stored as images. Faced with an enormous number of official documents in different formats, it is desirable to distinguish their formats automatically, without manual effort.

Official documents here are the official documents of Party and government organs. The type of an official document is referred to, for short, as its document genre. The "Tentative Measures for Handling Official Documents of State Administrative Organs" issued by the General Office of the State Council groups the official documents of state administrative organs into 13 types in nine categories, including orders, decisions, bulletins, notices, circulars, proposals, reports, requests for instructions, written replies, opinions, letters and meeting minutes. An official document contains attributes such as the copy number, the classification level and confidentiality period, the urgency level, the issuing-authority mark, the document number, the signatory, the separator line in the header, the title, the main recipient authorities and the body text. In practice, an official document does not necessarily contain all of these attributes. As the number of official documents grows and electronic devices such as scanners become widespread, official documents are stored as scanned images, so effective layout recognition of such images is much needed.

How to detect particular document images among a large number of images, and how to extract the corresponding information from them correctly, still has no good solution. At present, layout analysis has developed to the point where different techniques are used for different documents. Ma Zhuan, Zhao Guoquan, Ren Zhanpeng et al. proposed research on an automatic exam-marking system based on OCR. This is a top-down analysis method: it starts from the page as a whole, emphasizes global information, divides the whole image into several regions, and then continues to subdivide the main regions according to the hierarchical structure of the text image. Wu Yukun, in research on an OCR-based business card system, used a bottom-up layout analysis method: starting from image pixels and emphasizing local information, small image regions are gradually merged into larger ones (character, word, text line, paragraph, and so on) until the entire image is covered. These methods all target layouts in which font sizes are similar. The algorithms they use are template matching, connected-component analysis and the like, whose drawbacks are heavy computation and low speed. Existing text-line and character segmentation methods cannot segment accurately when Chinese characters and digits are mixed or when characters of different font sizes are mixed, yet in an official-document recognition system the document number, the issuing department, the title and so on all differ in font size. An efficient and accurate layout analysis method is therefore needed to recognize document images.

Summary of the invention

In view of the above problems, the object of the present invention is to provide a layout recognition method for document images that recognizes document images through an efficient and accurate layout analysis method and is particularly suited to layout recognition of Chinese official-document images.

To achieve the above object, the present invention adopts the following technical solution:

A layout recognition method for document images, comprising the following steps:

1) Generate a format feature database from the layout images of different document samples.

Further, the format feature database stores the layout content of the different document samples, together with the format sequence number generated from the relative character height and the alignment of that content.

To extract layout information more accurately, the present invention provides a format-entry function. Through a user interface, the user first draws rectangles on an input layout image to mark which block is the title, which is the issuing department, which is the document number, and so on. The image is then entered into the library, which stores the layout content together with a format sequence number generated from the relative character height and the alignment of that content. The format sequence number is very important for layout information extraction. It is built from two digit sequences: one derived from the size ranking of the blocks and one derived from their alignment. Suppose a layout has three blocks and the first sequence, generated from the ranking result, is 001221. The first 0 denotes the first block and the second 0 indicates that the first block is the largest; the 1 denotes the second block and the following 2 indicates that the second block is the third largest; and so on. The second sequence, generated from the alignment, is 212, where 2 denotes center alignment and 1 denotes right alignment. The complete sequence number is then 001221212.
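The construction of the format sequence number described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the `Block` class, the function name and the code 0 for left alignment are hypothetical (the text gives only 1 for right alignment and 2 for center alignment), and block indices are assumed to be single digits, so the sketch covers layouts with fewer than ten blocks.

```python
from dataclasses import dataclass

# 1 = right and 2 = center come from the text; 0 = left is an assumption.
ALIGN_CODES = {"left": 0, "right": 1, "center": 2}


@dataclass
class Block:
    char_height: int  # character height of the block in pixels
    alignment: str    # "left", "right" or "center"


def format_sequence_number(blocks):
    """Concatenate (block index, size rank) pairs, then the alignment codes."""
    # Order block indices from the largest character height to the smallest.
    by_size = sorted(range(len(blocks)), key=lambda i: -blocks[i].char_height)
    rank = {block_index: r for r, block_index in enumerate(by_size)}
    rank_part = "".join(f"{i}{rank[i]}" for i in range(len(blocks)))
    align_part = "".join(str(ALIGN_CODES[b.alignment]) for b in blocks)
    return rank_part + align_part


blocks = [Block(40, "center"), Block(12, "right"), Block(20, "center")]
print(format_sequence_number(blocks))
```

With three blocks whose character heights rank first, third and second, and whose alignments are center, right and center, the sketch yields the example value 001221212 given in the text.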

In the layout analysis stage there is only one sequence number per image. If an unknown image is analyzed and the resulting format sequence number equals one stored in the library, the layout information of the unknown image is extracted using the hints stored in the library. The format feature database generated in this way improves the accuracy of layout information extraction.

2) Scan the document to be recognized to obtain a scanned image.

This step may also include preprocessing the scanned image. The preprocessing includes denoising (removing ink smears and seals), skew correction, and so on.

Some documents may pick up ink smears during printing, and other noise, especially salt-and-pepper noise, may be introduced during scanning. Second, some document images are stamped with seals, which interfere with the normal layout regions and cause the subsequent OCR (Optical Character Recognition) step to return garbled results. Third, skew in the document image interferes with text-line segmentation. The system therefore needs to provide image denoising to improve the robustness and accuracy of the invention.

3) Divide the scanned image into regions and determine the body text of the document to be recognized.

Text-line segmentation is performed on the scanned image according to projection information, with cut positions determined mainly by the texture features of the black and white pixels. The minimum font size among the text lines is found; the body end line is then searched for from the bottom up, and the body start line that matches the end line is searched for from the top down. If the body start line or the body end line cannot be found, the body start line is marked as 0 and the body end line is marked as the last text line. The lines between the body start line and the body end line are the body text of the document.

4) Divide the part above the body text of the document to be recognized into regions and obtain the layout information of each region.

In the part above the body text, lines with the same character height, line spacing and alignment are placed in the same region. If, within one region, there are multiple text lines on the left but only one text line on the right, the region is divided again, with the single text line on the right taken as a sub-region of that region.

The divided regions produce a format sequence number, which is generated from the alignment and the relative character height.

The layout information includes the font size within a region, the ordering, and the alignment of the region relative to the whole scanned image.

5) Match the layout information obtained in step 4) against the layout information in the format feature database. If it matches, extract the corresponding layout information from the format feature database; if it does not, match the layout information of each region against preset format word sets (for official documents, these comprise a title word set, a department word set and a document-number word set) to obtain the layout recognition result.

Specifically, the layout information obtained in step 4) concerns the document image to be recognized and consists mainly of the format sequence number and the OCR result of each region. The layout information in the format feature database is mainly the rule associated with each stored image, namely: 1) the format sequence number; 2) the information labels (the region number to which each piece of information belongs), for example which block is the title, which is the issuing department and which is the document number. If an image to be processed matches a sequence number, information is extracted from it according to the information labels; for example, a title label of 1 indicates that the first region is the title.
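The lookup in step 5) can be illustrated with a small Python sketch. It assumes the format feature database can be modelled as an in-memory dictionary that maps a format sequence number to its information labels; the dictionary contents, field names and function name are hypothetical. Region numbers are 1-based, following the example in which a title label of 1 denotes the first region.

```python
# Hypothetical stand-in for the format feature database:
# format sequence number -> information labels (field name -> region number).
FORMAT_DB = {
    "001221212": {"title": 1, "issuing_department": 2, "document_number": 3},
}


def extract_layout(sequence_number, region_texts, format_db=FORMAT_DB):
    """Return {field: OCR text} when the sequence number is stored, else None."""
    labels = format_db.get(sequence_number)
    if labels is None:
        return None  # no hit: the caller falls back to word-set matching
    return {
        field: region_texts[number - 1]
        for field, number in labels.items()
        if 1 <= number <= len(region_texts)
    }


ocr_results = ["<title text>", "<department text>", "<document number text>"]
print(extract_layout("001221212", ocr_results))
print(extract_layout("999", ocr_results))
```

A return value of `None` signals that no stored format was hit, which corresponds to the fallback branch that matches regions against the preset word sets.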

Through the above steps, the analysis of the image layout is completed and the corresponding layout information is finally extracted correctly. Finding the body text of the document image and determining the layout regions above the body text are the core of the invention.

The beneficial effects of the present invention are as follows:

Compared with the prior art, the layout recognition method provided by the invention achieves higher recognition accuracy, precision and efficiency, and has considerable practical and application value.

Description of the drawings

Fig. 1 is the overall flowchart of the layout recognition method of the present invention.

Fig. 2 is a schematic diagram of the official document in Embodiment 1 of the present invention.

Fig. 3 is a schematic diagram of the layout information extracted in Embodiment 1 of the present invention.

Fig. 4 is a schematic diagram of the official document in Embodiment 2 of the present invention.

Fig. 5 is a schematic diagram of the layout information extracted in Embodiment 2 of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below with reference to the drawings, taking Chinese official documents as an example.

The overall flow of the layout recognition method of the present invention is shown in Fig. 1 and comprises five steps:

1. Preprocess the scanned official-document image. Operations such as resizing, deblurring and skew correction are applied to the image to facilitate layout recognition. The specific procedure is as follows:

(1) To remove salt-and-pepper noise, following the switching-filter idea, the present invention uses the max-min operator as the salt-and-pepper noise detector. An adaptive neighborhood window scans the image row by row from left to right while the pixel at the window center is tested for noise. If the gray value of the pixel lies strictly between the maximum and the minimum of the window, the pixel is considered not to be polluted by noise. If the gray value equals an extreme value, the pixel may be polluted by salt-and-pepper noise; an improved method is then used for a further decision, and its result is taken as the replacement value of the pixel.
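The switching-filter idea in (1) can be sketched as follows. The text does not specify the adaptive window or the "improved method", so this sketch substitutes a fixed square window and a plain median replacement; both are assumptions made only for illustration, as are the function and parameter names.

```python
from statistics import median_low


def switching_median_filter(img, radius=1):
    """img: list of rows of gray values (0-255). Returns a filtered copy.

    Max-min detector: a pixel strictly between the window minimum and
    maximum is treated as clean; a pixel equal to an extreme is treated as
    a noise candidate and replaced by the window median (an assumed
    stand-in for the unspecified "improved method").
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            window = [
                img[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            if min(window) < img[y][x] < max(window):
                continue  # strictly inside the range: keep the pixel
            out[y][x] = median_low(window)
    return out


noisy = [[100] * 5 for _ in range(5)]
noisy[2][2] = 255  # a single "salt" pixel
cleaned = switching_median_filter(noisy)
print(cleaned[2][2])
```

Because only pixels flagged by the detector are replaced, pixels that lie strictly inside the local gray range pass through unchanged, which is the point of a switching filter compared with filtering every pixel.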

(2) To remove a seal that partly covers the title, Canny edge detection is used to find contours. Based on values learned from a number of samples, when the area enclosed by an edge contour exceeds a certain threshold, the contour is very likely a seal and can be removed.

(3) Skew correction uses projection histograms that accumulate the number of black pixels in the image along the row and column directions, giving the horizontal and vertical projections. For a skewed image, the variance of the projection taken along the text direction is the largest. The document image is therefore rotated through a certain angular range at a fixed angular step, the projection of each rotated image is computed, and the rotation angle that maximizes the mean square error of the projection is taken as the skew angle.
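A projection-profile skew estimate of the kind described in (3) might look like the sketch below. Instead of rotating the image itself, it rotates the coordinates of the black pixels and scores each candidate angle by the variance of the resulting row histogram; this substitution, the angular range of 5 degrees, the step of 0.5 degrees and all names are assumptions for illustration. The returned angle is the slope of the text lines in image coordinates (x to the right, y downward).

```python
import math
from collections import Counter


def estimate_skew(black_pixels, max_angle=5.0, step=0.5):
    """black_pixels: iterable of (x, y). Returns the skew angle in degrees."""
    pts = list(black_pixels)
    radius = max(math.hypot(x, y) for x, y in pts)
    n_bins = 2 * math.ceil(radius) + 1  # same bin count for every angle
    mean = len(pts) / n_bins
    best_angle, best_score = 0.0, -1.0
    steps = int(round(max_angle / step))
    for k in range(-steps, steps + 1):
        angle = k * step
        t = math.radians(angle)
        s, c = math.sin(t), math.cos(t)
        # Row index of each black pixel after undoing a skew of `angle`.
        hist = Counter(round(y * c - x * s) for x, y in pts)
        variance = sum(v * v for v in hist.values()) / n_bins - mean * mean
        if variance > best_score:
            best_angle, best_score = angle, variance
    return best_angle


# Three synthetic text lines drawn with a slope of 2 degrees.
slope = math.tan(math.radians(2.0))
pixels = [(x, round(y0 + x * slope)) for y0 in (20, 60, 100) for x in range(200)]
print(estimate_skew(pixels))
```

When the candidate angle equals the true skew, the pixels of each text line collapse into very few histogram rows, so the histogram is at its most uneven and its variance peaks.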

2. Perform text-line segmentation on the official document according to projection information. To determine the text area, the number of black pixels in each row is counted. The first row from which three consecutive rows each contain more than 3 black pixels is found and marked as the start row of a text line. Starting from the start row, the average number of black pixels over the first eight rows is computed and the black pixels in each column of these eight rows are counted: the first column with at least 5 black pixels is taken as the start column of the text and the last column with at least 5 black pixels as the end column. The span between the start column and the end column is divided evenly into 5 regions. If the number of black pixels in two of the regions is less than 3, the current row is marked as the end row of the text line; otherwise scanning continues with the next row. The rows between the start row and the end row form the result of text-line segmentation.
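A simplified version of this row-projection segmentation is sketched below. The start-row rule (three consecutive rows with more than 3 black pixels) follows the text; the end-row rule is deliberately simplified to "the row count drops to 3 or fewer", replacing the five-region column test, and the function name and input format are assumptions.

```python
def segment_text_lines(binary, min_black=3, min_run=3):
    """binary: list of rows, 1 = black pixel, 0 = white pixel.

    Returns a list of (start_row, end_row) pairs, both inclusive.
    """
    counts = [sum(row) for row in binary]  # horizontal projection
    lines, y, h = [], 0, len(counts)
    while y < h:
        run = counts[y:y + min_run]
        if len(run) == min_run and all(c > min_black for c in run):
            start = y
            # Simplified end rule: stop at the first sparse row.
            while y < h and counts[y] > min_black:
                y += 1
            lines.append((start, y - 1))
        else:
            y += 1
    return lines


page = [[0] * 10 for _ in range(20)]
for r in range(2, 7):
    page[r][:5] = [1] * 5      # first text line, rows 2-6
page[8][:5] = [1] * 5          # isolated row, too short to count as a line
for r in range(10, 15):
    page[r][:6] = [1] * 6      # second text line, rows 10-14
print(segment_text_lines(page))
```

Requiring several consecutive dense rows before opening a line keeps thin horizontal rules and isolated specks from being reported as text lines.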

3. Compute the height of each line and, from the alignment and line-height information of the text, determine the lines that make up the body text. The minimum font size among the text lines is found and the text lines are scanned from the bottom up. The text line that satisfies the following conditions is taken as the body end line: its font size differs from the minimum font size by no more than two pixels; it is justified or left-aligned; and its spacing after differs by no more than two pixels from the spacing after of the text line that has the minimum font size. The text lines are then scanned from the top down, and the text line that satisfies the following conditions is taken as the body start line: its font size differs from the minimum font size by no more than two pixels; it is justified or right-aligned; and its spacing after differs by no more than two pixels from that of the text line that has the minimum font size. If the body start line or the body end line cannot be found, the body start line is marked as 0 and the body end line is marked as the last text line. The part above the body text is then the area on which layout recognition is carried out.
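The search for the body start and end lines can be sketched as below, assuming each text line has already been measured. The `TextLine` record, its field names and the alignment labels are hypothetical; the two-pixel tolerances, the alignment conditions and the fallback come from the step above.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    height: int         # character height ("font size") in pixels
    spacing_after: int  # gap to the next line in pixels
    alignment: str      # "left", "right", "center" or "justified"


def find_body_range(lines, tol=2):
    """Return (start_index, end_index) of the body text, both inclusive."""
    ref = min(lines, key=lambda ln: ln.height)  # line with the minimum font size

    def similar(ln):
        return (abs(ln.height - ref.height) <= tol
                and abs(ln.spacing_after - ref.spacing_after) <= tol)

    # Bottom-up scan for the end line, top-down scan for the start line.
    end = next((i for i in range(len(lines) - 1, -1, -1)
                if similar(lines[i]) and lines[i].alignment in ("justified", "left")),
               None)
    start = next((i for i, ln in enumerate(lines)
                  if similar(ln) and ln.alignment in ("justified", "right")),
                 None)
    if start is None or end is None:
        return 0, len(lines) - 1  # fallback described in the text
    return start, end


page_lines = [
    TextLine(40, 30, "center"),    # large header line
    TextLine(20, 25, "center"),    # title
    TextLine(16, 8, "right"),      # first body line
    TextLine(16, 8, "justified"),
    TextLine(16, 8, "left"),       # last body line
    TextLine(16, 30, "right"),     # trailing line with different spacing
]
print(find_body_range(page_lines))
```

Everything before the returned start index is the part "above the body text" that the later region-division step works on.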

4. Determine each region according to connectivity information, perform text-line segmentation within each region, and save information such as the number of lines in the region, the region's start position, the text-line height, and the region's alignment relative to the whole scanned image. The specific steps are as follows:

(1) Perform horizontal projection on the area above the body text to form text lines and pre-divide the regions.

a) Denoise the horizontal projection to remove the influence of straight lines and isolated points. (Runs of consecutive projection rows no longer than 7 rows are filtered out; runs longer than 7 but no longer than 10 rows whose mean horizontal projection value is less than or equal to 20 are also filtered out.)

b) Merge projected text lines into regions. (The horizontal projection result is scanned from top to bottom. (1) If two consecutive projected text lines have the same font size (the criterion is that the absolute difference is less than or equal to 2) and their line spacing is less than or equal to 2 times the font size, the two projected lines are merged into one region. (2) If two consecutive lines have similar font sizes (the criterion is that the absolute difference is greater than 2 and less than or equal to 4) and their line spacing is less than or equal to 1 times the font size, the two projected lines are merged into one region. (3) If the lower line has a larger font size than the line above it, the difference is less than or equal to 10, the line spacing is less than or equal to 1 times the font size, and the spacing between the third and second lines and the font sizes of the third and first lines satisfy the preceding two rules, the lines are likewise merged.)
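The projection denoising and merge rules (1) and (2) above can be sketched as follows. Rule (3) is omitted, and the choice of the smaller of the two line heights as "the font size" in the spacing test is an assumption, since the text does not say which line's font size is meant. All function names are hypothetical.

```python
def keep_projection_run(n_rows, mean_projection):
    """Denoising rule for one run of consecutive projected rows."""
    if n_rows <= 7:
        return False  # too thin: a ruled line or isolated dots
    if n_rows <= 10 and mean_projection <= 20:
        return False  # short and sparse
    return True


def should_merge(prev, cur):
    """prev, cur: (top_row, bottom_row) of two consecutive projected lines."""
    h_prev = prev[1] - prev[0] + 1
    h_cur = cur[1] - cur[0] + 1
    gap = cur[0] - prev[1] - 1
    size = min(h_prev, h_cur)  # assumed reference font size
    diff = abs(h_prev - h_cur)
    if diff <= 2:              # rule (1): same font size
        return gap <= 2 * size
    if diff <= 4:              # rule (2): similar font size
        return gap <= size
    return False


def merge_into_regions(lines):
    """Group consecutive projected lines into regions, top to bottom."""
    regions = []
    for line in lines:
        if regions and should_merge(regions[-1][-1], line):
            regions[-1].append(line)
        else:
            regions.append([line])
    return regions


projected = [(0, 39), (60, 99), (200, 215), (225, 240), (300, 329)]
for region in merge_into_regions(projected):
    print(region)
```

In this example the two 40-pixel lines form one region, the two 16-pixel lines form a second, and the 30-pixel line stays on its own because its height differs too much from its neighbor.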

(2) Refine the division of each pre-divided region.

a) Perform vertical projection on the region, denoise the projection result, and store the region's start column, end column and width.

b) Segment the text lines in the region and record the text information. (Perform horizontal projection on the region, denoise the projection result, re-determine the text-line information, and record the details of the text lines in the region.)

c) Determine whether the vertical projection contains a large blank (a large blank means that the number of consecutive white points is greater than or equal to 10 times the line height of the region). If it does, go to d); if it does not, go to e).

d) Divide the region into several regions according to the large blanks.

i. Determine the start and end positions of the rows and columns, and the height and width, of each region after division.

ii. Perform horizontal projection on each region after division, denoise the projection result, re-determine the text-line information, and record the details of the text lines in the region.

e) Examine the text lines in the region to determine whether the region is a case of multiple text lines corresponding to a single text line.

iii. Pre-divide the region into three subspaces (left, middle and right), defined as follows. Left subspace: from the start position on the left of the region to 1/3 of the region length. Middle subspace: from 1/3 to 2/3. Right subspace: from 2/3 to the end position of the region.

iv. Perform horizontal projection on each of the three subspaces and denoise the projection results.

v. Record the text-line information of each subspace (number of text lines, start and end positions, line height, line spacing).

vi. Determine the relationship between the text lines of the three subspaces and the whole region. The case that requires special handling is the following: the right subspace contains one text line, while at least one of the left and middle subspaces contains two or more text lines; and the text line of the right subspace either occupies the height of the whole region (95% or more) or lies in the middle of the region's horizontal projection. In this case go to f); otherwise stop.

f) Handle the case of multiple text lines corresponding to a single text line.

i. The part containing multiple text lines is taken as the region, and the remaining part containing the single text line is taken as a subsidiary sub-region of that region. Determine the current region and the subsidiary sub-region (according to the vertical projection).

ii. Check whether the current region can be merged with the preceding region; the merging principle is similar to b) in (1). If it can, merge; if it cannot, continue.

iii. Check whether the current region can be merged with the following region; the merging principle is similar to b) in (1). If it can, merge; if it cannot, continue.

iv. Determine the start and end positions of the rows and columns, and the height and width, of the region after merging or checking.

v. Perform horizontal projection on the region, denoise the projection result, re-determine the text-line information, and record the details of the text lines in the region.
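Steps c) and d) above, which split a region wherever its vertical projection contains a large blank, can be sketched as follows. The input is assumed to be the per-column black-pixel counts of one region; the function and parameter names are hypothetical, while the threshold of 10 times the line height comes from step c).

```python
def split_by_blank_columns(column_counts, line_height, factor=10):
    """Split a region at blanks of at least factor * line_height white columns.

    column_counts: black-pixel count per column (the vertical projection).
    Returns a list of (start_col, end_col) pairs, both inclusive.
    """
    min_gap = factor * line_height
    # Collect maximal runs of columns that contain ink.
    runs, start = [], None
    for x, count in enumerate(column_counts):
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(column_counts) - 1))
    # Join runs whose separating blank is narrower than the threshold.
    parts = []
    for run in runs:
        if parts and run[0] - parts[-1][1] - 1 < min_gap:
            parts[-1] = (parts[-1][0], run[1])
        else:
            parts.append(run)
    return parts


projection = [0] * 5 + [3] * 20 + [0] * 4 + [2] * 10 + [0] * 50 + [1] * 15
print(split_by_blank_columns(projection, line_height=4))
```

Narrow blanks, such as the spaces between characters, are bridged, so only blanks wide enough to separate independent blocks produce a split.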

Once the regions of the current official document have been determined, each region is traversed to obtain the layout information: the font size within the region, the ordering, and the alignment of the region are extracted as layout information.

5. Match the information retained above against the rules in the format feature database (including position matching and keyword matching). If a match is found, layout information is extracted according to the format feature database. If no format sequence number is matched, the recognized regions are matched against preset word sets (a title word set, a department word set and a document-number word set) to obtain the layout recognition result.
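The fallback branch of step 5 can be illustrated with a simple keyword count. The word sets below are made-up examples, not the ones used by the invention, and the scoring rule (the field with the most keyword hits wins, with each field assigned at most once) is an assumption; the sample OCR strings are likewise invented for the demonstration.

```python
# Hypothetical word sets; real sets would be curated for the document type.
WORD_SETS = {
    "title": ["通知", "决定", "报告", "意见"],
    "issuing_department": ["厅", "局", "委员会", "办公室"],
    "document_number": ["号", "〔", "〕"],
}


def classify_regions(region_texts, word_sets=WORD_SETS):
    """Map each field to the 1-based number of the best-matching region."""
    result = {}
    for number, text in enumerate(region_texts, start=1):
        scores = {field: sum(word in text for word in words)
                  for field, words in word_sets.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0 and best not in result:
            result[best] = number
    return result


# Invented OCR strings, used only to exercise the sketch.
regions = ["安徽省环境保护厅", "皖环函〔2015〕1号", "关于某事项的通知"]
print(classify_regions(regions))
```

The output has the same shape as the information labels stored in the format feature database, so both branches of step 5 can feed the same extraction code.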

Embodiment 1

An official document of the Anhui Provincial Environmental Protection Department is shown in Fig. 2, and the layout information obtained by layout detection and extraction is shown in Fig. 3.

The image is first divided into regions, which yields the sequence number and the OCR result of each region. Following the method described above, these are matched against the format library. The match hits the first sample image in the library (hit id=0 in Fig. 3), and information is extracted according to the rule of the hit format.

Embodiment 2

An official document of the National Audit Office is shown in Fig. 4, and the layout information obtained by layout detection and extraction is shown in Fig. 5.

Claims

1. A layout recognition method for document images, comprising the following steps:

1) generating a format feature database from the layout images of different document samples;

2) scanning the document to be recognized to obtain a scanned image;

3) performing text-line segmentation on the scanned image and determining the body text of the document to be recognized;

4) dividing the part above the body text of the document to be recognized into regions and obtaining the layout information of each region;

5) matching the layout information obtained in step 4) against the layout information in the format feature database; if it matches, extracting the corresponding layout information from the format feature database; if it does not match, matching the layout information of each region against preset format word sets to obtain the layout recognition result.

2. The layout recognition method for document images according to claim 1, characterized in that the format feature database stores the layout content of different document samples and the format sequence number generated from the relative character height and the alignment of that content.

3. The layout recognition method for document images according to claim 1, characterized in that step 2) further comprises preprocessing the scanned image.

4. The layout recognition method for document images according to claim 3, characterized in that the preprocessing comprises denoising and skew correction.

5. The layout recognition method for document images according to claim 4, characterized in that the denoising comprises removing ink smears and removing seals.

6. The layout recognition method for document images according to claim 1, characterized in that in step 3) text-line segmentation is performed on the scanned image according to projection information, and cut positions are determined by the texture features of the black and white pixels.

7. The layout recognition method for document images according to claim 6, characterized in that the body end line is searched for from the bottom up, and the body start line that matches the end line is then searched for from the top down; if the body start line or the body end line cannot be found, the body start line is marked as 0 and the body end line is marked as the last text line; the lines between the start line and the end line are the result of text-line segmentation.

8. The layout recognition method for document images according to claim 1, characterized in that in step 4) lines with the same character height, line spacing and alignment are placed in the same region, and if, within one region, there are multiple text lines on the left but only one text line on the right, the region is divided again, with the single text line on the right taken as a sub-region of that region.

9. The layout recognition method for document images according to claim 1, characterized in that in step 4) the divided regions produce a format sequence number, which is generated from the alignment and the relative character height.

10. The layout recognition method for document images according to claim 1, characterized in that in step 4) the layout information includes the font size within a region, the ordering, and the alignment of the region relative to the whole scanned image.