CN107358208A

CN107358208A - Method and device for extracting structured information from PDF documents

Info

Publication number: CN107358208A
Application number: CN201710576556.9A
Authority: CN
Inventors: 徐龙; 李德彦; 杨宇
Original assignee: China Science And Technology (beijing) Co Ltd; Beijing Shenzhou Taiyue Software Co Ltd
Current assignee: Dingfu Intelligent Technology Co., Ltd
Priority date: 2017-07-14
Filing date: 2017-07-14
Publication date: 2017-11-17
Anticipated expiration: 2037-07-14
Also published as: CN107358208B

Abstract

Embodiments of the present application disclose a method for extracting structured information from a PDF document. The method includes: obtaining the original pages of the PDF document; extracting from the original pages at least one actual page containing text content or titles; extracting the titles at each level, and the text content subordinate to each title, from the actual pages; and storing each title and its subordinate text content in a structured manner. The method extracts the titles at each level of a PDF document together with their subordinate text content and stores them in structured form, thereby obtaining structured information. Structured information extraction from PDF documents can thus be automated, avoiding manual reprocessing, which is convenient and efficient.

Description

Method and device for extracting structured information from PDF documents

Technical field

The present application relates to the field of PDF document information extraction, and in particular to a method for extracting structured information from PDF documents. The application further relates to a device for extracting structured information from PDF documents.

Background

PDF (Portable Document Format) is a file format developed by Adobe Systems for exchanging files independently of application software, operating system and hardware; it is a fixed-layout document format. PDF pages are relatively independent of one another and faithfully reproduce every character, color and image of the original. However, PDF is an unstructured storage format: it does not record the logical structure of the document and has no logical elements such as paragraphs or tables.

Information in PDF documents is usually extracted with OCR (Optical Character Recognition) technology. However, the information extracted this way comes from content rendered in vector form, with no logical relationship (such as adjacency or ordering) between characters. The text formed by the extracted characters is merely rendered from a matrix of x, y and z coordinates plus a rotation. The format and position of such text are largely arbitrary, and manual reprocessing is still needed to obtain structured information with a clear hierarchy.

Therefore, when information is extracted from PDF documents with existing methods, the format and position of the extracted text are arbitrary and structured information cannot be obtained conveniently. This is a problem that those skilled in the art urgently need to solve.

Summary of the invention

The present application provides a method and a device for extracting structured information from PDF documents, to solve the problem that the prior art cannot conveniently obtain structured information from PDF documents.

In a first aspect, the present application provides a method for extracting structured information from PDF documents, the method including:

obtaining the original pages of a PDF document;

extracting from the original pages at least one actual page containing text content or titles;

extracting the titles at each level, and the text content subordinate to each title, from the actual pages; and

storing each title and its subordinate text content in a structured manner.

With reference to the first aspect, in a first possible implementation of the first aspect, the step of extracting from the original pages at least one actual page containing text content or titles includes:

determining whether the original pages contain a table-of-contents page, a header and a footer, respectively;

deleting the table-of-contents page, header or footer from the original pages to obtain at least one actual page.

With reference to the first aspect and the above possible implementation, in a second possible implementation of the first aspect, the step of extracting the titles at each level and their subordinate text content from the actual pages includes:

extracting the first-level titles in each actual page;

extracting the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extracting the content after the current first-level title in that actual page as the content corresponding to the current first-level title;

taking each first-level title, together with the content corresponding to it, as one first-level logical page.

If no next-level title exists in the first-level logical page, the step of storing each title and its subordinate text content in a structured manner includes:

storing, in a structured manner, each first-level title and the text content subordinate to it, wherein the text content subordinate to a first-level title is the content corresponding to that first-level title.

With reference to the first aspect and the above possible implementations, in a third possible implementation of the first aspect, before the step of taking each first-level title, together with the content corresponding to it, as one first-level logical page, the method further includes the following steps:

if the current actual page contains no first-level title, merging all the content of the current actual page into the content corresponding to the previous first-level title;

if the first first-level title in the current actual page is not on the first line of that page, merging the content that precedes the first first-level title in the current actual page into the content corresponding to the previous first-level title.

With reference to the first aspect and the above possible implementations, in a fourth possible implementation of the first aspect, the step of extracting the titles at each level and their subordinate text content from the actual pages further includes the following step:

extracting, from each level-N logical page, the level-(N+1) titles and the text content subordinate to the level-(N+1) titles, where N is an integer >= 1.

With reference to the first aspect and the above possible implementations, in a fifth possible implementation of the first aspect, the step of extracting, from each level-N logical page, the level-(N+1) titles and the text content subordinate to them includes:

extracting the level-(N+1) titles in each level-N logical page;

extracting the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, extracting the content after the current level-(N+1) title in that level-N logical page as the content corresponding to the current level-(N+1) title;

taking each level-(N+1) title, together with the content corresponding to it, as one level-(N+1) logical page.

The step of storing each title and its subordinate text content in a structured manner includes:

storing, in a structured manner, the titles of levels 1 to N+1 and the text content subordinate to each of them, wherein the text content subordinate to a level-(N+1) title is the content corresponding to that title, and the text content subordinate to a level-i title is the content corresponding to that title excluding the level-(i+1) logical pages, i = 1, 2, ..., N.

With reference to the first aspect and the above possible implementations, in a sixth possible implementation of the first aspect, the step of extracting, from each level-N logical page, the level-(N+1) titles and the text content subordinate to them includes:

determining whether each level-N logical page contains a table and, if so, cutting the table into table blocks, and then extracting the level-(N+1) titles and the text content subordinate to them.

With reference to the first aspect and the above possible implementations, in a seventh possible implementation of the first aspect, the step of extracting the first-level titles in each actual page includes:

obtaining the title lines in the actual page and the Y-axis coordinate of each title line in the actual page;

if, within the same actual page, the Y-axis coordinates of the current title line and the next title line differ by less than 3 Y-axis units, merging the next title line with the current title line;

taking the line of text that lies above the title line and nearest to it as a first-level title in the actual page.

In a second aspect, the present application further provides a device for extracting structured information from PDF documents, including:

an acquiring unit, for obtaining the original pages of a PDF document;

a first extraction unit, for extracting from the original pages at least one actual page containing text content or titles;

a second extraction unit, for extracting the titles at each level, and the text content subordinate to each title, from the actual pages;

a storage unit, for storing each title and its subordinate text content in a structured manner.

With reference to the second aspect, in a first possible implementation of the second aspect, the first extraction unit includes:

a judging unit, for determining whether the original pages contain a table-of-contents page, a header and a footer, respectively;

a deleting unit, for deleting the table-of-contents page, header or footer from the original pages to obtain at least one actual page.

Compared with the prior art, the method first removes from the original pages of the PDF document the parts that may interfere with structured information extraction, such as table-of-contents pages, headers and footers, to generate the actual pages, which completes the step of extracting actual pages from the original pages. It then extracts the titles at each level and their subordinate text content from the actual pages and stores them in a structured manner, thereby obtaining structured information. Structured information extraction from PDF documents can thus be automated, avoiding manual processing, which is convenient and efficient.

Brief description of the drawings

To explain the technical solution of the present application more clearly, the drawings used in the embodiments are briefly introduced below. Evidently, a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.

Figs. 1 to 7 are flowcharts of an embodiment of the PDF document structured information extraction method of the present application;

Figs. 8 to 19 are schematic diagrams of the effects of sub-steps in an embodiment of the PDF document structured information extraction method of the present application;

Fig. 20 is a structural schematic diagram of an embodiment of the PDF document structured information extraction device of the present application.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and the drawings.

Referring to Fig. 1, in a specific embodiment, the PDF document structured information extraction method includes:

S100: obtain the original pages of the PDF document.

S200: extract from the original pages at least one actual page containing text content or titles.

S300: extract the titles at each level, and the text content subordinate to each title, from the actual pages.

S400: store each title and its subordinate text content in a structured manner.

Structured information is information that, after analysis, can be decomposed into multiple interrelated components with a clear hierarchy among them. In the present application, the structured information of a PDF document is the text extracted from the document, in which the titles at each level and the text content subordinate to them have a clear hierarchy. The structured information can subsequently be presented as files in various formats such as HTML, Word or TXT.

Structured storage means saving content that would otherwise require multiple files into a single file, organized by tree structure and level. In the present application, storing each title and its subordinate text content in a structured manner means storing the titles at each level, and the content subordinate to them, according to a tree structure and level, thereby obtaining the structured information of the PDF document.
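The tree-shaped storage described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the `Section` class, its field names and the JSON serialization are hypothetical and do not appear in the application, which does not prescribe a storage format.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """One title plus the text that belongs to it directly (hypothetical layout)."""
    title: str
    level: int
    text: List[str] = field(default_factory=list)            # text subordinate to this title only
    children: List["Section"] = field(default_factory=list)  # next-level titles

    def to_dict(self) -> dict:
        # Recursively convert the tree into plain dicts so it fits in one file.
        return {
            "title": self.title,
            "level": self.level,
            "text": self.text,
            "children": [child.to_dict() for child in self.children],
        }


chapter = Section("Chapter 1", 1, ["Intro paragraph."])
chapter.children.append(Section("1.1 Scope", 2, ["Scope text."]))
serialized = json.dumps(chapter.to_dict(), ensure_ascii=False)
```

A single serialized tree like this could later be rendered to HTML, Word or TXT, which is the kind of downstream use the text mentions.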

The above method first removes from the original pages of the PDF document the parts that may interfere with structured information extraction, such as table-of-contents pages, headers and footers, to generate the actual pages, which completes the step of extracting actual pages from the original pages. It then extracts the titles at each level and their subordinate text content from the actual pages and stores them in a structured manner, thereby obtaining structured information. Structured information extraction from PDF documents can thus be automated, avoiding manual processing, which is convenient and efficient.

Steps S100 to S400 are described in detail below.

In step S100, the original pages of the PDF document may be obtained from user input or from a storage medium.

Referring to Fig. 2, step S200 may specifically include steps S210 and S220.

S210: determine whether the original pages contain a table-of-contents page, a header and a footer.

Step S210 includes the following steps:

S211: obtain the page number of the current original page, the characters of the current original page, and the total number of lines of characters;

S212: match the page number and characters of the current original page against a first preset rule to determine whether the current original page is a table-of-contents page.

In step S211, the page number, the characters and the total number of lines of characters of the current original page can be obtained directly with tools such as PDFBox or iText. PDFBox is an open-source Java library for working with PDF documents; anyone can program on top of it to create PDF documents, manipulate existing documents and extract their text. iText is likewise an open-source Java library for generating PDF documents; besides generating PDF or RTF documents, it can convert XML and HTML files to PDF.

In step S212, the first preset rule may be set in advance by a developer or user. For example, under the first preset rule, the current original page is determined to be a table-of-contents page in any of the following cases: the current original page is the first or second page, and the lines occupied by title sequence numbers exceed 40% of the total number of lines of characters on the page; or the current original page is the first or second page, and the lines occupied by character strings of the form "Chinese characters, consecutive non-Chinese symbols, number" exceed 40% of the total number of lines of characters on the page; or the current original page is the first or second page and its characters contain a preset keyword.

For example, if the current original page is the first or second page, and the lines occupied by title sequence numbers such as "1.1", "1.1.1", "1," or "2," exceed 40% of the total number of lines of characters on the page, the current original page is determined to be a table-of-contents page. Alternatively, if the current original page is the first or second page, and the lines occupied by character strings of a form such as "If you request surrender within 10 days of signing for this contract (the cooling-off period), the company deducts only the cost fee ...... 1.4" or "Chapter 1 ...... 15" exceed 40% of the total number of lines of characters on the page, the current original page is determined to be a table-of-contents page. As a further alternative, referring to Fig. 8, if the current original page is the first or second page and contains preset keywords such as "Chapter 1", "first", "Co., Ltd" or "catalogue", the current original page is determined to be a table-of-contents page.
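The three table-of-contents rules above can be sketched as follows. This is a minimal illustration, not the application's implementation: the regular expressions, the keyword list and the function name `is_toc_page` are assumptions chosen for the example, and a page is represented simply as a list of text lines.

```python
import re

# Assumed pattern for a line that starts with a title sequence number ("1.1", "1.1.1", "2,").
HEADING_NUMBER = re.compile(r"^\s*\d+(?:\.\d+)*[、,.]?")
# Assumed pattern for "Chinese text, run of non-Chinese symbols, number" (a dotted TOC entry).
TOC_ENTRY = re.compile(r"[\u4e00-\u9fff]+[^\u4e00-\u9fff\d]{3,}\d+(?:\.\d+)*\s*$")


def is_toc_page(page_number, lines, keywords=("目录", "Contents"), ratio=0.4):
    """Apply the three rules: only page 1 or 2 qualifies, then any one rule suffices."""
    if page_number not in (1, 2) or not lines:
        return False
    threshold = ratio * len(lines)
    numbered = sum(1 for line in lines if HEADING_NUMBER.match(line))
    entries = sum(1 for line in lines if TOC_ENTRY.search(line))
    has_keyword = any(key in line for line in lines for key in keywords)
    return numbered > threshold or entries > threshold or has_keyword


toc = ["目录", "第一章 总则......1", "第二章 保险责任......3", "第三章 释义......15"]
body = ["本合同自签发之日起生效。", "投保人应当如实告知。", "保险期间为一年。"]
```

The 40% threshold and the page-number restriction come from the text; everything else, including what counts as a "non-Chinese symbol", would need tuning on real documents.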

As another example, the first preset rule in step S212 may include a rule for determining whether the original pages contain a header: if the first line of characters is identical across 3 to 5 consecutive original pages, the original pages are determined to contain a header. Similarly, a rule for determining whether the original pages contain a footer may be: if the last line of characters is identical across 3 to 5 consecutive original pages, the original pages are determined to contain a footer.
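A minimal sketch of the header and footer rule, assuming each page is a list of text lines and using the lower bound of 3 consecutive pages from the text. The function names are hypothetical, and removing the repeated line only from pages where it actually appears is a design choice of this sketch, not something the application specifies.

```python
def repeated_line(pages, index, run=3):
    """Return the line at `index` (0 = first, -1 = last) if it is identical on
    `run` consecutive pages, otherwise None."""
    streak, previous = 0, None
    for page in pages:
        current = page[index] if page else None
        if current is not None and current == previous:
            streak += 1
        elif current is not None:
            streak = 1
        else:
            streak = 0
        previous = current
        if streak >= run:
            return current
    return None


def strip_header_footer(pages, run=3):
    """Drop the detected header and footer lines from every page that carries them."""
    header = repeated_line(pages, 0, run)
    footer = repeated_line(pages, -1, run)
    cleaned = []
    for page in pages:
        lines = list(page)
        if header is not None and lines and lines[0] == header:
            lines = lines[1:]
        if footer is not None and lines and lines[-1] == footer:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned
```

Footers that embed a changing page number would not be caught by an exact comparison like this one; the text only describes the identical-line case.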

S220: delete the table-of-contents page, header or footer from the original pages to obtain at least one actual page.

Specifically, if the original pages contain a table-of-contents page, that entire page is deleted; if the original pages contain a header, the header is deleted; if the original pages contain a footer, the footer is deleted. This removes the original pages, or parts of original pages, that may interfere with extracting the structured text of the PDF document, and yields at least one actual page.

Before step S300, the characters located on the same line of an actual page may first be merged to form line text, as shown in Fig. 9. To merge the characters of a line, the coordinate information of each character on each actual page, including its X-axis and Y-axis coordinates, can be obtained in advance with a tool such as PDFBox; characters whose Y-axis coordinates are identical, or differ by no more than a preset range, are merged into one line of text. The titles at each level and their subordinate text content are then extracted by traversing the content line by line. For example, the line text in the actual pages is traversed to extract the first-level titles and their subordinate text content, and each first-level logical page is traversed to extract the second-level titles in it and their subordinate text content.
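The line-merging step can be sketched as below. The `Glyph` class, the tolerance value and the assumption that Y grows downward from the top of the page are all hypothetical; a real extraction tool would supply the character coordinates in its own coordinate system.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """A single character with its page coordinates (hypothetical structure)."""
    char: str
    x: float
    y: float


def merge_into_lines(glyphs, tolerance=1.0):
    """Group characters whose Y coordinates are equal or within `tolerance`,
    then order each group by X to form one line of text."""
    rows = []  # each entry: [reference_y, [glyphs on that line]]
    for glyph in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if rows and abs(glyph.y - rows[-1][0]) <= tolerance:
            rows[-1][1].append(glyph)
        else:
            rows.append([glyph.y, [glyph]])
    return ["".join(g.char for g in sorted(row, key=lambda g: g.x)) for _, row in rows]


glyphs = [Glyph("b", 20, 10.2), Glyph("a", 10, 10.0), Glyph("d", 20, 30.0), Glyph("c", 10, 30.4)]
merged = merge_into_lines(glyphs)
```

The tolerance plays the role of the "preset range" in the text; its value here is arbitrary.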

Step S300 and the corresponding step S400 cover two cases: in one, the first-level logical page contains no next-level title; in the other, the first-level logical page does contain next-level titles.

Refer to Fig. 3, Fig. 4 and Figs. 10 to 14. Fig. 3 is a flowchart of S300-S400 in the first embodiment, and Fig. 4 is a flowchart of step S311 in the first embodiment. Fig. 10 is a schematic diagram of the effect of step S311 in the first embodiment; Fig. 11, of step S312; Fig. 12, of step S313; Fig. 13, of step S314; and Fig. 14, of step S410. In the first embodiment, step S300 includes:

S311: extract the first-level titles in each actual page;

S312: extract the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extract the content after the current first-level title in that actual page as the content corresponding to the current first-level title;

S313: if the current actual page contains no first-level title, merge all the content of the current actual page into the content corresponding to the previous first-level title; if the first first-level title in the current actual page is not on the first line of that page, merge the content that precedes the first first-level title in the current actual page into the content corresponding to the previous first-level title;

S314: take each first-level title, together with the content corresponding to it, as one first-level logical page.

If no next-level title exists in the first-level logical page, the corresponding step S400 includes:

S410: store, in a structured manner, each first-level title and the text content subordinate to it, wherein the text content subordinate to a first-level title is the content corresponding to that first-level title.

In step S311, the first-level titles in an actual page can be extracted according to font size, font style, text content, title lines and the like in the actual page; the font size, font style, text content and title lines can all be obtained with tools such as PDFBox or iText.

To extract first-level titles by font size, the font sizes of the lines of text are compared; if the current line of text has the largest font, it is determined to be a first-level title. To extract first-level titles by font style, the font style of each line of text is matched against a preset font style to determine whether the current line of text is a first-level title. The font size of a line of text may be taken as the size of its first character, or as the size shared by several same-sized characters in the line; likewise, the font style of a line of text may be taken as the style of its first character, or as the style shared by several same-styled characters in the line. To extract first-level titles by text content, the text is matched against preset keywords; if the text contains a preset keyword such as "Chapter 1", "Chapter 2", "first" or "Part I", the current line of text is determined to be a first-level title.
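The font-size and keyword criteria can be combined in a short sketch. The `TextLine` class, the keyword pattern and the decision to treat the two criteria as alternatives joined by "or" are assumptions of this example; the text presents them as separate options and does not say how they should be combined.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class TextLine:
    """One merged line of text and its font size (hypothetical structure)."""
    text: str
    font_size: float


# Assumed keyword pattern: "Chapter 1", "Part 2", or Chinese forms such as 第一章.
TITLE_KEYWORD = re.compile(r"^\s*(Chapter\s+\d+|Part\s+\d+|第[一二三四五六七八九十百]+[章条部])")


def first_level_titles(lines: List[TextLine]) -> List[TextLine]:
    """Return lines that have the largest font on the page or start with a title keyword."""
    if not lines:
        return []
    sizes = [line.font_size for line in lines]
    largest = max(sizes)
    size_is_informative = largest > min(sizes)  # ignore size if every line is the same
    titles = []
    for line in lines:
        by_size = size_is_informative and line.font_size == largest
        by_keyword = TITLE_KEYWORD.match(line.text) is not None
        if by_size or by_keyword:
            titles.append(line)
    return titles


sample = [
    TextLine("Chapter 1 General", 16),
    TextLine("body text", 10),
    TextLine("第二章 保险责任", 10),
    TextLine("more body", 10),
]
```

Matching on font style would follow the same shape, comparing a style attribute against a preset value instead of comparing sizes.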

For PDF documents in which first-level titles are marked by a title line beneath them, the first-level titles can also be extracted by means of the title lines in the actual page. Referring to Fig. 4, this specifically includes:

S3111: obtain the title lines in the actual page and the Y-axis coordinate of each title line in the actual page;

S3112: if, within the same actual page, the Y-axis coordinates of the current title line and the next title line differ by less than 3 Y-axis units, merge the next title line with the current title line;

S3113: take the line of text that lies above the title line and nearest to it as a first-level title in the actual page.

In step S3111, the Y-axis coordinates of the title lines in the actual page can be obtained with tools such as PDFBox or iText.

In step S3113, the line nearest to the title line can be determined by comparing the distance between the Y-axis coordinate of each line of text and the Y-axis coordinate of the current title line; the text of that line is taken as a first-level title in the actual page.
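Steps S3111 to S3113 might look like the following sketch. It assumes Y coordinates grow downward, so "above the title line" means a smaller or equal Y value; that convention, the `(y, text)` tuple layout and the function names are assumptions, and a given PDF library may use the opposite orientation.

```python
def merge_title_lines(line_ys, min_gap=3.0):
    """Collapse title lines that lie less than `min_gap` Y units below the last kept line."""
    merged = []
    for y in sorted(line_ys):
        if merged and y - merged[-1] < min_gap:
            continue  # treated as part of the previously kept title line
        merged.append(y)
    return merged


def titles_from_title_lines(text_lines, line_ys, min_gap=3.0):
    """For each merged title line, pick the nearest line of text above it.

    `text_lines` is a list of (y, text) tuples.
    """
    titles = []
    for line_y in merge_title_lines(line_ys, min_gap):
        above = [(line_y - y, text) for y, text in text_lines if y <= line_y]
        if above:
            titles.append(min(above)[1])  # smallest distance wins
    return titles


text_lines = [(100.0, "Chapter 1"), (130.0, "Body A"), (300.0, "Chapter 2"), (330.0, "Body B")]
title_line_ys = [105.0, 106.5, 305.0]
```

The threshold of 3 units comes from the text; a double rule drawn as two strokes 1.5 units apart is the kind of case the merge is meant to absorb.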

When first-level logical pages are extracted from actual pages, extraction proceeds page by page in the original order of the actual pages, so content that belongs under one first-level title may be split across two consecutive actual pages. Through step S313, that content is merged into the content corresponding to the previous first-level title, which ensures that every first-level logical page contains complete content and overcomes the inability of common PDF information extraction methods to reassemble content split by page breaks.
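The cross-page merging of steps S312 to S314 can be sketched in a few lines. Here each actual page is a list of text lines and `is_title` is a caller-supplied predicate; both are assumptions for illustration, as is the dictionary layout of a logical page.

```python
def build_first_level_pages(actual_pages, is_title):
    """Walk the actual pages in order and attach every non-title line to the most
    recent first-level title, regardless of which actual page the line sits on."""
    logical_pages = []  # each entry: {"title": str, "content": [str, ...]}
    for page in actual_pages:
        for line in page:
            if is_title(line):
                logical_pages.append({"title": line, "content": []})
            elif logical_pages:
                # Covers S313: content before the first title on a page, or a page
                # with no title at all, flows into the previous title's content.
                logical_pages[-1]["content"].append(line)
            # Lines before the very first title have no owner and are skipped here.
    return logical_pages


pages = [["Chapter 1", "a", "b"], ["c", "Chapter 2", "d"], ["e"]]
result = build_first_level_pages(pages, lambda s: s.startswith("Chapter"))
```

Because the walk carries the current title across page boundaries, no separate merge pass is needed in this formulation; the application describes the merge as an explicit step.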

Refer to Fig. 5, Fig. 6 and Figs. 15 to 18. Fig. 5 is a flowchart of S300-S400 in the second embodiment, and Fig. 6 is a flowchart of step S320 in the second embodiment. Fig. 15 is a schematic diagram of the effect of step S321 in the second embodiment; Fig. 16, of step S322; Fig. 17, of step S323; and Fig. 18, of the part of step S420 that concerns the text content subordinate to a level-i title. In the second embodiment, step S300 includes:

S311: extract the first-level titles in each actual page;

S314: take each first-level title, together with the content corresponding to it, as one first-level logical page;

If next-level titles exist in the first-level logical page, the method further includes step S320: extract, from each level-N logical page, the level-(N+1) titles and the text content subordinate to them, where N is an integer >= 1. This step can be applied recursively until a level-N logical page contains no level-(N+1) title. It specifically includes:

S321: extract the level-(N+1) titles in each level-N logical page, where N is an integer >= 1;

S322: extract the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, extract the content after the current level-(N+1) title in that level-N logical page as the content corresponding to the current level-(N+1) title;

S323: take each level-(N+1) title, together with the content corresponding to it, as one level-(N+1) logical page.

Correspondingly, step S400 includes:

S420: store, in a structured manner, the titles of levels 1 to N+1 and the text content subordinate to each of them, wherein the text content subordinate to a level-(N+1) title is the content corresponding to that title, and the text content subordinate to a level-i title is the content corresponding to that title excluding the level-(i+1) logical pages, i = 1, 2, ..., N.

Note that when a level-N logical page contains level-(N+1) titles, the content corresponding to the level-N title comprises both the text content subordinate to the level-N title and the level-(N+1) logical pages. When a level-N logical page contains no level-(N+1) title, the content corresponding to the level-N title is exactly the text content subordinate to the level-N title. In other words, in the present application, the content corresponding to a level-N title contains the text content subordinate to that title.
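The recursion of step S320 and the distinction between "corresponding content" and "subordinate text" can be sketched as follows. The `heading_level` rule, which reads the level from a dotted number such as "1.1", is a hypothetical stand-in for whatever title detection is actually used, and the dictionary layout is likewise an assumption.

```python
import re


def heading_level(line):
    """Hypothetical rule: '1 Scope' is level 1, '1.1 Terms' is level 2, and so on.
    Returns 0 for body text."""
    match = re.match(r"^(\d+(?:\.\d+)*)\s+\S", line)
    return match.group(1).count(".") + 1 if match else 0


def split_level(lines, level):
    """Split `lines` at titles of the given level and recurse one level deeper.

    Returns (own_text, nodes): `own_text` is the text before the first title of
    this level, i.e. the text subordinate to the enclosing title itself.
    """
    own_text, pages = [], []
    for line in lines:
        if heading_level(line) == level:
            pages.append({"title": line, "content": []})  # a new logical page starts
        elif pages:
            pages[-1]["content"].append(line)
        else:
            own_text.append(line)
    nodes = []
    for page in pages:
        text, children = split_level(page["content"], level + 1)
        nodes.append({"title": page["title"], "text": text, "children": children})
    return own_text, nodes


doc = ["1 Scope", "intro", "1.1 Terms", "t1", "1.1.1 Detail", "d1",
       "1.2 Limits", "l1", "2 Claims", "c1"]
own, tree = split_level(doc, 1)
```

The recursion stops on its own when a logical page yields no titles at the next level, which matches the stopping condition described for S320.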

Note that several first-level logical pages may be extracted from the actual pages, some of which contain no next-level title while others do. For the first-level logical pages with no next-level title, structured storage follows the storage step of the first embodiment; for the first-level logical pages that do contain next-level titles, structured storage follows the storage step of the second embodiment. The structured information finally obtained for the PDF document then contains the structured storage results of both embodiments.

Some level-N logical pages contain a table, and the table contains titles, as in the PDF document shown in Fig. 19. For such documents, refer to Fig. 7 and Fig. 19: Fig. 7 is a flowchart of S300-S400 in the third embodiment, and Fig. 19 is a schematic diagram of the table cutting in S320a in the third embodiment. In the third embodiment, step S320 of the foregoing PDF document structured information extraction method includes:

S320a: determine whether each level-N logical page contains a table and, if so, cut the table into table blocks, and then extract the level-(N+1) titles and the text content subordinate to them.

Specifically, in step S320a, the operation "determine whether each level-N logical page contains a table and, if so, cut the table into table blocks" may include:

S320a1: determine, according to a second preset rule, whether the level-N logical page contains a table. The second preset rule includes: if, in the content corresponding to the level-N title, a line contains at least two consecutive spaces, and the spaces are at the same position in at least three consecutive lines, the current level-N logical page is determined to contain a table; the first line in which at least two consecutive spaces appear is taken as the starting row of the table, and the last line in which at least two consecutive spaces appear is taken as the ending row of the table;

S320a2: take the positions of the runs of at least two consecutive spaces in the table as the vertical cutting lines of the table and the blank lines in the table as the horizontal cutting lines, and cut the table into table blocks;

S320a3: obtain the content of the table blocks in order from left to right and from top to bottom, and take it, together with the content of the current level-N logical page outside the table, as the content corresponding to the level-N title of the current level-N logical page.
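Steps S320a1 to S320a3 can be approximated with the sketch below. It makes several simplifying assumptions that are not in the application: the logical page is a list of plain-text lines, "same position" is interpreted as the column where each run of spaces ends (so cells of different widths still align), and every non-blank table line is treated as one row rather than grouping rows between blank lines.

```python
import re

GAP = re.compile(r" {2,}")  # a run of at least two consecutive spaces


def gap_ends(line):
    """Columns at which each run of two or more spaces ends (assumed alignment key)."""
    return tuple(m.end() for m in GAP.finditer(line.rstrip()))


def find_table(lines, min_rows=3):
    """Return (start, end) line indices of the table, or None if no table is detected."""
    ends = [gap_ends(line) for line in lines]
    has_table = any(
        ends[i] and all(ends[j] == ends[i] for j in range(i, i + min_rows))
        for i in range(len(lines) - min_rows + 1)
    )
    if not has_table:
        return None
    rows = [i for i, e in enumerate(ends) if e]
    return rows[0], rows[-1]  # first and last line containing a run of spaces


def cut_table(lines):
    """Replace the table with its cells, read left to right and top to bottom."""
    span = find_table(lines)
    if span is None:
        return list(lines)
    start, end = span
    cells = []
    for line in lines[start:end + 1]:
        if not line.strip():
            continue  # blank line: a horizontal cut, contributes no cell
        cells.extend(cell for cell in GAP.split(line.strip()) if cell)
    return lines[:start] + cells + lines[end + 1:]


section = ["2.1 Benefits", "Item    Amount", "Death   100", "Injury  50", "End of section"]
```

After this replacement, each cell sits on its own line, so the ordinary level-(N+1) title extraction can run over the updated logical page.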

In step S320a, the table in the level-N logical page is cut and its content is obtained to replace the original table. This updates the content corresponding to the level-N title in the original level-N logical page and forms a new level-N logical page that replaces the original one. In the subsequent steps, namely steps S321 to S323 of extracting the level-(N+1) titles and their subordinate text content from the level-N logical page, the level-N logical page refers to the new level-N logical page.

Note that when a PDF document is processed, some level-N logical pages may contain tables while others do not. For a level-N logical page without a table, step S320 of the second embodiment is used to extract the level-(N+1) titles and the text content subordinate to them; for a level-N logical page with a table that contains titles, step S320a of the third embodiment is used to extract the level-(N+1) titles and the text content subordinate to them.

Referring to Figure 20, another embodiment provides a PDF document structured information extraction device, including:

an acquiring unit 1, for obtaining the original pages of the PDF document;

a first extraction unit 2, for extracting from the original pages at least one actual page containing text content or a title;

a second extraction unit 3, for extracting from the actual pages the titles at every level and the text content belonging to each title;

a storage unit 4, for storing each title and the text content belonging to it in structured form.

Optionally, the first extraction unit 2 includes:

a judging unit 21, for judging whether the original pages include a table-of-contents page, a header, and a footer, respectively;

a deleting unit 22, for deleting the table-of-contents page, header, or footer from the original pages to obtain at least one actual page.

The above PDF document structured information extraction device can extract the structured information of a PDF document automatically, avoiding manual processing, and is convenient and efficient. The first extraction unit 2 deletes the table-of-contents pages, headers, and footers that would interfere with structured information extraction, which further ensures the accuracy of the extraction.

Optionally, the second extraction unit 3 includes:

a first-level title extraction unit, for extracting the first-level titles in each actual page;

a first-level content extraction unit, for extracting the content between the current first-level title and the next first-level title in an actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, the content after the current first-level title in that actual page is extracted as the content corresponding to the current first-level title;

a level-1 logical page generation unit, for taking each first-level title, together with the content corresponding to it, as one level-1 logical page.

The storage unit 4 includes a first-level storage unit, which, when no next-level title exists in the level-1 logical page, stores each first-level title and the text content belonging to it in structured form, where the text content belonging to a first-level title is the content corresponding to that first-level title.

Optionally, the second extraction unit 3 also includes a merging unit, connected to the first-level content extraction unit and to the level-1 logical page generation unit respectively. If the current actual page contains no first-level title, the merging unit merges all content of the current actual page into the content corresponding to the previous first-level title; or, if the first first-level title in the current actual page is not on the first line of that page, the merging unit merges the content preceding that title into the content corresponding to the previous first-level title.

When level-1 logical pages are extracted from the actual pages, extraction proceeds page by page in the original order of the actual pages, so content that should belong to a single first-level title may be split across two consecutive actual pages. The merging unit merges such content into the content corresponding to the previous first-level title, so that each level-1 logical page contains complete content. This overcomes the inability of common PDF document information acquisition methods to re-aggregate content split by pagination.
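The combined effect of the first-level content extraction unit and the merging unit can be sketched in Python as follows. This is only an illustration under stated assumptions: each actual page is assumed to be a list of text lines, `is_title` is a hypothetical predicate standing in for whatever first-level title detection is used, and lines that precede the very first title of the document are simply dropped, which the description does not specify.

```python
import re
from typing import Callable, List


def build_level1_pages(pages: List[List[str]],
                       is_title: Callable[[str], bool]) -> List[list]:
    """Walk the actual pages in order and group lines under first-level titles.

    Each result entry is [title, content_lines]. Because the current logical
    page stays open across page boundaries, a page with no title, or the lines
    before the first title on a page, are merged into the previous title's content.
    """
    logical: List[list] = []
    for page in pages:
        for line in page:
            if is_title(line):
                logical.append([line, []])      # start a new level-1 logical page
            elif logical:
                logical[-1][1].append(line)     # belongs to the previous title
            # Assumption: content before the first title of the document is ignored.
    return logical


# Hypothetical title rule for the example: a number followed by a space.
def numbered_title(line: str) -> bool:
    return re.match(r"\d+ ", line) is not None
```

For example, `build_level1_pages([["1 Scope", "a"], ["b", "2 Terms", "c"], ["d"]], numbered_title)` groups "b" under "1 Scope" and "d" under "2 Terms", even though both lines sit on a later actual page than their titles.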

Optionally, the second extraction unit 3 also includes a level-N extraction unit, for extracting from each level-N logical page the level-(N+1) titles and the text content belonging to them, where N is an integer ≥ 1. The level-N extraction unit runs only when level-(N+1) titles exist in the level-N logical page; when none exist, the level-N extraction unit stops running.

Optionally, the level-N extraction unit includes:

a level-(N+1) title extraction unit, for extracting the level-(N+1) titles in each level-N logical page;

a level-(N+1) content extraction unit, for extracting the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, the content after the current level-(N+1) title in that level-N logical page is extracted as the content corresponding to the current level-(N+1) title;

a level-(N+1) logical page generation unit, for taking each level-(N+1) title, together with the content corresponding to it, as one level-(N+1) logical page.

The storage unit 4 also includes a level-N storage unit, for storing in structured form the level-1 to level-(N+1) titles and the text content belonging to each of them, where the text content belonging to a level-(N+1) title is the content corresponding to that title, and the text content belonging to a level-i title is the content corresponding to that title excluding its level-(i+1) logical pages, i = 1, 2, ..., N. The level-N storage unit runs only when a next-level title exists in the level-1 logical page; if no next-level title exists in the level-1 logical page, the first-level storage unit runs instead.
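The relationship between level-N extraction and structured storage can be sketched as a recursive tree build in Python. The sketch is illustrative only: the nested-dictionary layout, the function name `build_node`, and the numbering-based title rule `numbered_title_at` are all assumptions introduced here; the description does not prescribe a storage format or a title-detection rule for deeper levels.

```python
import re
from typing import Callable, Dict, List


def build_node(title: str, content: List[str], level: int,
               is_title_at: Callable[[int, str], bool]) -> Dict:
    """Store one level-`level` logical page as a nested dictionary.

    "text" holds only the content that belongs to this title itself, that is,
    the content outside every level-(level+1) logical page. If no next-level
    title exists, all content is kept as "text" and "children" is empty.
    """
    own: List[str] = []
    children: List[tuple] = []
    current = None
    for line in content:
        if is_title_at(level + 1, line):
            current = (line, [])          # open a new level-(level+1) logical page
            children.append(current)
        elif current is None:
            own.append(line)              # before the first next-level title
        else:
            current[1].append(line)       # inside a next-level logical page
    return {
        "title": title,
        "text": own,
        "children": [build_node(t, c, level + 1, is_title_at) for t, c in children],
    }


# Hypothetical rule for the example: level k titles look like "1", "1.1", "1.1.1", ...
def numbered_title_at(level: int, line: str) -> bool:
    return re.match(r"\d+(\.\d+){%d} " % (level - 1), line) is not None
```

The resulting dictionary can be serialized as is, for example with `json.dumps`. The recursion stops on its own once a logical page contains no next-level title, which mirrors the rule that the level-N extraction unit stops running in that case.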

It should be noted that, if the second extraction unit extracts multiple level-1 logical pages from the actual pages, some of which contain no next-level title while others do, then the level-1 logical pages without a next-level title are stored by the first-level storage unit, and those with a next-level title are stored by the level-N storage unit. When a PDF document is processed, both storage units may be used, or only one of them.

Optionally, the second extraction unit 3 also includes a table cutting and acquiring unit, for determining whether a table exists in each level-N logical page and, if so, cutting the table into table blocks and then extracting the level-(N+1) titles and the text content belonging to them. When the content corresponding to the level-N title in a level-N logical page includes a table and the level-(N+1) titles are inside that table, the table cutting and acquiring unit can cut the table directly and then extract the level-(N+1) titles and the text content belonging to them. The table cutting and acquiring unit can sometimes be used on its own in place of the level-N extraction unit, and sometimes needs to be used together with it.

Optionally, the first-level title extraction unit may include:

a title line acquiring unit, for obtaining the title lines in an actual page and the Y-axis coordinate of each title line within that page;

a title line merging unit, for merging the next title line with the current title line when, within the same actual page, the difference between their Y-axis coordinates is less than 3 Y-axis units;

a first-level title acquiring unit, for taking the line of text nearest above a title line as a first-level title in the actual page.
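The three sub-units above can be sketched in Python as follows. This is a hedged illustration rather than the embodiment's implementation: the input shapes (a list of title line Y coordinates and a list of `(y, text)` pairs for text lines) and the function names are hypothetical, and the sketch assumes that a larger Y value means a position higher on the page, so "above a title line" means a larger Y. If the extraction tool reports Y growing downward, the comparisons need to be reversed.

```python
from typing import List, Tuple


def merge_title_lines(ys: List[float], min_gap: float = 3.0) -> List[float]:
    """Merge title lines whose Y coordinates differ by less than `min_gap`.

    Lines are visited from the top of the page downward; a line closer than
    `min_gap` to the current (last kept) title line is merged into it.
    """
    merged: List[float] = []
    for y in sorted(ys, reverse=True):
        if merged and merged[-1] - y < min_gap:
            continue                      # same visual rule, keep only the first
        merged.append(y)
    return merged


def first_level_titles(text_lines: List[Tuple[float, str]],
                       title_line_ys: List[float]) -> List[str]:
    """Take the text line nearest above each merged title line as a title."""
    titles: List[str] = []
    for rule_y in merge_title_lines(title_line_ys):
        above = [(y - rule_y, text) for y, text in text_lines if y > rule_y]
        if above:
            titles.append(min(above)[1])  # smallest distance above the title line
    return titles
```

With text lines at Y = 700, 680, 500, 480 and title lines at Y = 695, 693.5, 495, the two upper title lines are merged (their difference is 1.5, below 3), and the titles are the text lines at Y = 700 and Y = 500.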

Those skilled in the art can clearly understand that the techniques in the embodiments of the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments, or in certain parts of the embodiments, of the present invention.

For identical or similar parts, the embodiments in this specification may be referred to one another. The embodiments of the invention described above do not limit the scope of protection of the present invention.

Claims

1. A PDF document structured information extraction method, characterized in that the method includes:

obtaining the original pages of a PDF document;

2. The PDF document structured information extraction method according to claim 1, characterized in that the step of extracting from the original pages at least one actual page containing text content or a title includes:

3. The PDF document structured information extraction method according to claim 1, characterized in that the step of extracting from the actual pages the titles at every level and the text content belonging to each title includes:

extracting the first-level titles in each actual page;

extracting the content between the current first-level title and the next first-level title in an actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extracting the content after the current first-level title in that actual page as the content corresponding to the current first-level title;

if no next-level title exists in the level-1 logical page, the step of storing each title and the text content belonging to it in structured form includes:

storing each first-level title and the text content belonging to it in structured form, where the text content belonging to a first-level title is the content corresponding to that first-level title.

4. The PDF document structured information extraction method according to claim 3, characterized in that, before the step of taking each first-level title, together with the content corresponding to it, as one level-1 logical page, the method further includes the following steps:

if the current actual page contains no first-level title, merging all content of the current actual page into the content corresponding to the previous first-level title;

if the first first-level title in the current actual page is not on the first line of that page, merging the content preceding that title in the current actual page into the content corresponding to the previous first-level title.

5. The PDF document structured information extraction method according to claim 3, characterized in that the step of extracting from the actual pages the titles at every level and the text content belonging to each title further includes the following step:

extracting from each level-N logical page the level-(N+1) titles and the text content belonging to them, where N is an integer ≥ 1.

6. The PDF document structured information extraction method according to claim 5, characterized in that the step of extracting from each level-N logical page the level-(N+1) titles and the text content belonging to them includes:

extracting the level-(N+1) titles in each level-N logical page;

extracting the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, extracting the content after the current level-(N+1) title in that level-N logical page as the content corresponding to the current level-(N+1) title;

storing in structured form the level-1 to level-(N+1) titles and the text content belonging to each of them, where the text content belonging to a level-(N+1) title is the content corresponding to that title, and the text content belonging to a level-i title is the content corresponding to that title excluding its level-(i+1) logical pages, i = 1, 2, ..., N.

7. The PDF document structured information extraction method according to claim 5, characterized in that the step of extracting from each level-N logical page the level-(N+1) titles and the text content belonging to them includes:

determining whether a table exists in each level-N logical page; if a table exists, cutting the table into table blocks, then extracting the level-(N+1) titles and the text content belonging to them.

8. The PDF document structured information extraction method according to any one of claims 3 to 7, characterized in that the step of extracting the first-level titles in each actual page includes:

if, within the same actual page, the difference between the Y-axis coordinates of the current title line and the next title line is less than 3 Y-axis units, merging the next title line with the current title line;

9. A PDF document structured information extraction device, characterized by including:

an acquiring unit, for obtaining the original pages of a PDF document;

a first extraction unit, for extracting from the original pages at least one actual page containing text content or a title;

a second extraction unit, for extracting from the actual pages the titles at every level and the text content belonging to each title;

10. The PDF document structured information extraction device according to claim 9, characterized in that the first extraction unit includes: