Summary of the Invention
The application provides a method and a device for extracting structured information from a PDF document, to solve the problem that the prior art cannot conveniently obtain structured information from a PDF document.
In a first aspect, the application provides a method for extracting structured information from a PDF document, the method comprising:
Obtaining the original pages of the PDF document;
Extracting, from the original pages, at least one actual page containing text content or titles;
Extracting, from the actual pages, the titles of each level and the text content subordinate to each title;
Storing each title and the text content subordinate to it in a structured manner.
With reference to the first aspect, in a first possible implementation of the first aspect, the step of extracting, from the original pages, at least one actual page containing text content or titles comprises:
Determining whether the original pages contain a catalogue page, a header, and a footer, respectively;
Deleting the catalogue page, header, or footer from the original pages to obtain at least one actual page.
With reference to the first aspect and the above possible implementation, in a second possible implementation of the first aspect, the step of extracting, from the actual pages, the titles of each level and the text content subordinate to each title comprises:
Extracting the first-level titles in each actual page;
Extracting the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extracting the content after the current first-level title in the actual page as the content corresponding to the current first-level title;
Taking each first-level title, together with the content corresponding to it, as one first-level logical page;
If no next-level title exists in the first-level logical pages, the step of storing each title and the text content subordinate to it in a structured manner comprises:
Storing each first-level title and the text content subordinate to it in a structured manner, wherein the text content subordinate to a first-level title is the content corresponding to that first-level title.
With reference to the first aspect and the above possible implementations, in a third possible implementation of the first aspect, before the step of taking each first-level title, together with the content corresponding to it, as one first-level logical page, the method further comprises:
If the current actual page contains no first-level title, merging all content of the current actual page into the content corresponding to the previous first-level title;
If the first first-level title in the current actual page is not on the first line of that page, merging the content that precedes that title in the current actual page into the content corresponding to the previous first-level title.
With reference to the first aspect and the above possible implementations, in a fourth possible implementation of the first aspect, the step of extracting, from the actual pages, the titles of each level and the text content subordinate to each title further comprises:
Extracting, from each N-level logical page, the (N+1)-level titles and the text content subordinate to each (N+1)-level title, where N is an integer greater than or equal to 1.
With reference to the first aspect and the above possible implementations, in a fifth possible implementation of the first aspect, the step of extracting, from each N-level logical page, the (N+1)-level titles and the text content subordinate to each (N+1)-level title comprises:
Extracting the (N+1)-level titles in each N-level logical page;
Extracting the content between the current (N+1)-level title and the next (N+1)-level title as the content corresponding to the current (N+1)-level title; if the current (N+1)-level title is the last (N+1)-level title in the N-level logical page, extracting the content after the current (N+1)-level title in the N-level logical page as the content corresponding to the current (N+1)-level title;
Taking each (N+1)-level title, together with the content corresponding to it, as one (N+1)-level logical page;
The step of storing each title and the text content subordinate to it in a structured manner comprises:
Storing, in a structured manner, the titles of levels 1 to N+1 and the text content subordinate to each of them, wherein the text content subordinate to an (N+1)-level title is the content corresponding to that title, and the text content subordinate to an i-level title is the content corresponding to that title excluding its (i+1)-level logical pages, i = 1, 2, ..., N.
With reference to the first aspect and the above possible implementations, in a sixth possible implementation of the first aspect, the step of extracting, from each N-level logical page, the (N+1)-level titles and the text content subordinate to each (N+1)-level title comprises:
Determining whether a table exists in each N-level logical page and, if a table exists, cutting the table into table blocks, then extracting the (N+1)-level titles and the text content subordinate to each (N+1)-level title.
With reference to the first aspect and the above possible implementations, in a seventh possible implementation of the first aspect, the step of extracting the first-level titles in each actual page comprises:
Obtaining the title lines in the actual page and the Y-axis coordinate of each title line in the actual page;
If, within the same actual page, the difference between the Y-axis coordinates of the current title line and the next title line is less than 3 Y-axis units, merging the next title line with the current title line;
Obtaining the text of the line that lies above the title line and nearest to it as a first-level title in the actual page.
In a second aspect, the application also provides a device for extracting structured information from a PDF document, comprising:
An acquisition unit, configured to obtain the original pages of the PDF document;
A first extraction unit, configured to extract, from the original pages, at least one actual page containing text content or titles;
A second extraction unit, configured to extract, from the actual pages, the titles of each level and the text content subordinate to each title;
A storage unit, configured to store each title and the text content subordinate to it in a structured manner.
With reference to the second aspect, in a first possible implementation of the second aspect, the first extraction unit comprises:
A judging unit, configured to determine whether the original pages contain a catalogue page, a header, and a footer, respectively;
A deletion unit, configured to delete the catalogue page, header, or footer from the original pages to obtain at least one actual page.
Compared with the prior art, the method first removes from the original pages of the PDF document the parts that may interfere with the extraction of structured information, such as catalogue pages, headers, and footers, and thereby generates the actual pages; this completes the step of extracting actual pages from the original pages. The titles of each level and the text content subordinate to them are then extracted from the actual pages and stored in a structured manner to obtain the structured information. The extraction of structured information from a PDF document can thus be automated, which avoids manual processing and is convenient and efficient.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and the accompanying drawings.
Referring to Fig. 1, in a specific embodiment, the method for extracting structured information from a PDF document comprises:
S100: Obtain the original pages of the PDF document.
S200: Extract, from the original pages, at least one actual page containing text content or titles.
S300: Extract, from the actual pages, the titles of each level and the text content subordinate to each title.
S400: Store each title and the text content subordinate to it in a structured manner.
Structured information is information that, after analysis, can be decomposed into multiple interrelated components with a clear hierarchy among them. In this application, the structured information of a PDF document means text extracted from the PDF document in which the titles of each level and the text content subordinate to them form a clear hierarchy. The structured information can subsequently be presented through files in various formats such as HTML, Word, or TXT.
Structured storage means saving content that would otherwise require multiple files into a single file, organized by tree structure and level. In this application, storing each title and the text content subordinate to it in a structured manner means storing the titles of each level, and the content subordinate to them, according to a tree structure and their levels, thereby obtaining the structured information of the PDF document.
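The tree-shaped storage described above can be pictured with the following minimal Python sketch. The `TitleNode` class, its field names, and the JSON serialization are illustrative assumptions; the application does not prescribe a concrete data structure or file format.

```python
from dataclasses import dataclass, field
from typing import List
import json


@dataclass
class TitleNode:
    """One title plus the text content subordinate to it (hypothetical structure)."""
    level: int
    title: str
    # Text owned by this title only, excluding the text of lower-level titles.
    text: List[str] = field(default_factory=list)
    # Titles of the next level found under this title.
    children: List["TitleNode"] = field(default_factory=list)

    def to_dict(self) -> dict:
        """Convert the subtree rooted at this node into plain dictionaries."""
        return {
            "level": self.level,
            "title": self.title,
            "text": self.text,
            "children": [child.to_dict() for child in self.children],
        }


# Build a two-level tree and save it as a single JSON string.
root = TitleNode(1, "Chapter 1", ["Intro line"])
root.children.append(TitleNode(2, "1.1", ["Detail line"]))
serialized = json.dumps(root.to_dict(), ensure_ascii=False)
```

Keeping each title's own text separate from its children mirrors the distinction the application draws between the text content subordinate to a title and the lower-level logical pages nested under it.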
The above method first removes from the original pages of the PDF document the parts that may interfere with the extraction of structured information, such as catalogue pages, headers, and footers, and thereby generates the actual pages; this completes the step of extracting actual pages from the original pages. The titles of each level and the text content subordinate to them are then extracted from the actual pages and stored in a structured manner to obtain the structured information. The extraction of structured information from a PDF document can thus be automated, which avoids manual processing and is convenient and efficient.
Steps S100 to S400 are described in detail below.
In step S100, the original pages of the PDF document may be supplied by user input or obtained from a storage medium.
Referring to Fig. 2, step S200 may specifically include steps S210 and S220.
S210: Determine whether the original pages contain a catalogue page, a header, and a footer.
Step S210 comprises the following steps:
S211: Obtain the page number of the current original page, the characters of the current original page, and its total number of character lines;
S212: Match the page number and characters of the current original page against a first preset rule to determine whether the current original page is a catalogue page.
In step S211, the page number, the characters, and the total number of character lines of the current original page can be obtained directly with tools such as PDFBox or iText. PDFBox is an open-source Java library for working with PDF documents; anyone can build on it to create PDF documents, manipulate existing documents, and extract their text. iText is likewise an open-source Java library for generating PDF documents; with iText one can not only generate PDF or RTF documents but also convert XML and HTML files into PDF files.
In step S212, the first preset rule may be set in advance by a developer or user. For example, within the first preset rule, the rules for identifying a catalogue (table-of-contents) page include: the current original page is the first or second page, and the lines occupied by title sequence numbers on it exceed 40% of its total character lines, in which case the current original page is a catalogue page; or the current original page is the first or second page, and among its characters the lines occupied by strings of the form "Chinese text, run of consecutive non-Chinese symbols, number" exceed 40% of its total character lines, in which case the current original page is a catalogue page; or the current original page is the first or second page and its characters contain a preset keyword, in which case the current original page is a catalogue page.
For example, if the current original page is the first or second page, and the lines occupied by title sequence numbers in it, such as "1.1", "1.1.1", "1,", "2,", exceed 40% of its total character lines, the current original page is determined to be a catalogue page. Or, if the current original page is the first or second page, and among its characters the lines occupied by strings of forms such as "If you request surrender within 10 days of signing this contract (the cooling-off period), our company deducts only the cost fee ...... 1.4" or "Chapter 1 ...... 15" exceed 40% of its total character lines, the current original page is determined to be a catalogue page. Or, referring to Fig. 8, if the current original page is the first or second page and contains preset keywords such as "Chapter 1", "first", "Co., Ltd", or "catalogue", the current original page is determined to be a catalogue page.
As a further example of the first preset rule in step S212, the rule for determining whether the original pages contain a header includes: if the first line of characters is identical across 3 to 5 consecutive original pages, the original pages are determined to contain a header. Likewise, the rule for determining whether the original pages contain a footer includes: if the last line of characters is identical across 3 to 5 consecutive original pages, the original pages are determined to contain a footer.
S220: Delete the catalogue page, header, or footer from the original pages to obtain at least one actual page.
Specifically, if the original pages include a catalogue page, the entire catalogue page is deleted; if the original pages contain a header, the header is deleted from them; if the original pages contain a footer, the footer is deleted from them. This removes the original pages, or the parts of original pages, that may interfere with extracting the structured text of the PDF document, yielding at least one actual page.
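The header and footer rule, together with the deletion in step S220, can be sketched as below. The function names and the representation of a page as a list of line strings are assumptions made for illustration; the sketch uses the lower bound of 3 consecutive pages and compares lines for exact equality, as the rule states.

```python
def has_repeated_line(pages, index, run=3):
    """Return True if `run` consecutive pages share the same line at `index`.

    `index` is 0 for the first line (header) and -1 for the last line (footer).
    """
    streak = 1
    for prev, cur in zip(pages, pages[1:]):
        if prev and cur and prev[index] == cur[index]:
            streak += 1
            if streak >= run:
                return True
        else:
            streak = 1
    return False


def strip_header_footer(pages, run=3):
    """Drop the first and/or last line of every page when a header/footer is detected."""
    drop_header = has_repeated_line(pages, 0, run)
    drop_footer = has_repeated_line(pages, -1, run)
    result = []
    for lines in pages:
        start = 1 if drop_header and lines else 0
        end = len(lines) - 1 if drop_footer and len(lines) > start else len(lines)
        result.append(lines[start:end])
    return result
```

A footer that carries a changing page number would not be caught by exact comparison; handling that case would need an additional normalization step that the application does not describe.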
Before step S300 is carried out, the characters on the same line of an actual page may first be merged to form line text, as shown in Fig. 9. To merge the characters of a line, the coordinate information of each character on each actual page, including its X-axis and Y-axis coordinates, can be obtained in advance with tools such as PDFBox; characters whose Y-axis coordinates are identical, or differ by no more than a preset range, are merged to obtain a line text. The steps of extracting the titles of each level and the text content subordinate to them then traverse the content in units of line text. For example, the line texts in the actual pages are traversed to extract the first-level titles and the text content subordinate to them; the first-level logical pages are traversed to extract the second-level titles in them and the text content subordinate to those titles.
Step S300 and the corresponding step S400 cover two cases: in one, no next-level title exists in the first-level logical pages; in the other, next-level titles do exist in the first-level logical pages.
Refer to Fig. 3, Fig. 4, and Figs. 10 to 14. Fig. 3 is a flowchart of S300-S400 in the first embodiment, and Fig. 4 is a flowchart of step S311 in the first embodiment. Fig. 10 illustrates the effect of step S311 in the first embodiment; Fig. 11 illustrates the effect of step S312; Fig. 12 illustrates the effect of step S313; Fig. 13 illustrates the effect of step S314; Fig. 14 illustrates the effect of step S410. In the first embodiment, step S300 comprises:
S311: Extract the first-level titles in each actual page;
S312: Extract the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extract the content after the current first-level title in the actual page as the content corresponding to the current first-level title;
S313: If the current actual page contains no first-level title, merge all content of the current actual page into the content corresponding to the previous first-level title; if the first first-level title in the current actual page is not on the first line of that page, merge the content that precedes that title in the current actual page into the content corresponding to the previous first-level title;
S314: Take each first-level title, together with the content corresponding to it, as one first-level logical page.
If no next-level title exists in the first-level logical pages, the corresponding step S400 comprises:
S410: Store each first-level title and the text content subordinate to it in a structured manner, wherein the text content subordinate to a first-level title is the content corresponding to that first-level title.
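Steps S312 to S314, including the cross-page merging of S313, can be sketched in a single pass over the line texts. The `is_title` predicate and the dictionary layout are hypothetical; how a first-level title is actually recognised is covered separately in step S311.

```python
def build_logical_pages(actual_pages, is_title):
    """Split actual pages into first-level logical pages.

    actual_pages: list of pages, each a list of line texts.
    is_title: hypothetical predicate deciding whether a line is a first-level title.
    """
    logical_pages = []  # each entry: {"title": str, "content": [str, ...]}
    for lines in actual_pages:
        for line in lines:
            if is_title(line):
                logical_pages.append({"title": line, "content": []})
            elif logical_pages:
                # S312 and S313: text up to the next title, including text that
                # continues on a later actual page, belongs to the most recent title.
                logical_pages[-1]["content"].append(line)
            # Lines before the very first title have no owner and are skipped here.
    return logical_pages
```

Because the most recent title is carried across page boundaries, a page with no title, or with text before its first title, is folded into the previous logical page without a separate merging pass.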
In step S311, the first-level titles in an actual page can be extracted according to the font size, font style, text content, or title lines in the actual page; the font size, font style, text content, and title lines can all be obtained with tools such as PDFBox or iText.
The first-level titles in an actual page can be extracted by font size: for example, the font sizes of the line texts are compared, and if the font of the current line text is the largest, the current line text is determined to be a first-level title. They can be extracted by font style: for example, the font style of a line text is matched against a preset font style to determine whether the current line text is a first-level title. As the font size of a line text, one may use the size of the first character in the line text, or the size shared by multiple characters of identical size in the line text; as the font style of a line text, one may use the style of the first character in the line text, or the style shared by multiple characters of identical style in the line text. The first-level titles can also be extracted by text content: for example, the text is matched against preset keywords, and if it contains a preset keyword such as "Chapter 1", "Chapter 2", "first", or "Part I", the current line text is determined to be a first-level title.
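The font-size and keyword criteria can be sketched together as follows. The keyword pattern, the input layout of `(text, per-character font sizes)`, and the choice to accept a line when either criterion holds are assumptions for illustration; the application lists the criteria as alternatives without fixing how they are combined.

```python
import re
from collections import Counter

# Hypothetical preset keywords, written as a pattern anchored at the line start.
KEYWORD = re.compile(r"^(Chapter \d+|Part [IVX]+)\b")


def line_font_size(sizes):
    """Font size of a line text: the size shared by the most characters in it."""
    return Counter(sizes).most_common(1)[0][0]


def first_level_titles(lines):
    """Return the line texts judged to be first-level titles.

    lines: list of (text, [font size of each character]); every size list is
    assumed to be non-empty.
    """
    largest = max(line_font_size(sizes) for _, sizes in lines)
    return [text for text, sizes in lines
            if line_font_size(sizes) == largest or KEYWORD.match(text)]
```

Using the most common size in a line, rather than the first character's size, makes the result less sensitive to a single oversized initial or symbol.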
For PDF documents in which some first-level titles have a title line beneath them, the first-level titles can also be extracted by means of the title lines in the actual page. Referring to Fig. 4, this specifically includes:
S3111: Obtain the title lines in the actual page and the Y-axis coordinate of each title line in the actual page;
S3112: If, within the same actual page, the difference between the Y-axis coordinates of the current title line and the next title line is less than 3 Y-axis units, merge the next title line with the current title line;
S3113: Obtain the text of the line that lies above the title line and nearest to it as a first-level title in the actual page.
In step S3111, the Y-axis coordinate of a title line in the actual page can be obtained with tools such as PDFBox or iText.
In step S3113, the line nearest to the title line can be determined by comparing the distance between the Y-axis coordinate of each line text and the Y-axis coordinate of the current title line; the text of that line is taken as a first-level title in the actual page.
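Steps S3112 and S3113 can be sketched as below. The function names are hypothetical, and the sketch assumes a coordinate system in which Y grows downwards, so that "above the title line" means a smaller Y value; with the opposite convention the comparison would be reversed.

```python
def merge_title_lines(rule_ys, min_gap=3.0):
    """S3112: drop a title line lying less than min_gap Y units after the previous kept one."""
    merged = []
    for y in sorted(rule_ys):
        if not merged or y - merged[-1] >= min_gap:
            merged.append(y)
    return merged


def titles_above_lines(text_lines, rule_ys, min_gap=3.0):
    """S3113: for each title line, take the closest line text located above it.

    text_lines: list of (y, text) pairs for the line texts of one actual page.
    """
    titles = []
    for rule_y in merge_title_lines(rule_ys, min_gap):
        above = [(rule_y - y, text) for y, text in text_lines if y < rule_y]
        if above:
            titles.append(min(above)[1])
    return titles
```

Merging nearby title lines first prevents a double rule under one title from yielding the same title twice.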
When first-level logical pages are extracted from the actual pages, extraction proceeds page by page in the original order of the actual pages, so the following situation can arise: content that ought to correspond to a single first-level title is split across two consecutive actual pages. Through step S313, this part of the content in an actual page can be merged into the content corresponding to the previous first-level title, ensuring that each first-level logical page contains complete content and overcoming the inability of common PDF document information acquisition methods to reassemble content split by pagination.
Refer to Fig. 5, Fig. 6, and Figs. 15 to 18. Fig. 5 is a flowchart of S300-S400 in the second embodiment, and Fig. 6 is a flowchart of step S320 in the second embodiment. Fig. 15 illustrates the effect of step S321 in the second embodiment; Fig. 16 illustrates the effect of step S322; Fig. 17 illustrates the effect of step S323; Fig. 18 illustrates the effect of the part of step S420 that concerns the text content subordinate to an i-level title. In the second embodiment, step S300 comprises:
S311: Extract the first-level titles in each actual page;
S312: Extract the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title; if the current first-level title is the last first-level title in the actual page, extract the content after the current first-level title in the actual page as the content corresponding to the current first-level title;
S313: If the current actual page contains no first-level title, merge all content of the current actual page into the content corresponding to the previous first-level title; if the first first-level title in the current actual page is not on the first line of that page, merge the content that precedes that title in the current actual page into the content corresponding to the previous first-level title;
S314: Take each first-level title, together with the content corresponding to it, as one first-level logical page;
If next-level titles exist in the first-level logical pages, the method further comprises the following step. S320: Extract, from each N-level logical page, the (N+1)-level titles and the text content subordinate to each (N+1)-level title, where N is an integer greater than or equal to 1. This step can be applied recursively until an N-level logical page contains no (N+1)-level title. It specifically includes:
S321: Extract the (N+1)-level titles in each N-level logical page, where N is an integer greater than or equal to 1;
S322: Extract the content between the current (N+1)-level title and the next (N+1)-level title as the content corresponding to the current (N+1)-level title; if the current (N+1)-level title is the last (N+1)-level title in the N-level logical page, extract the content after the current (N+1)-level title in the N-level logical page as the content corresponding to the current (N+1)-level title;
S323: Take each (N+1)-level title, together with the content corresponding to it, as one (N+1)-level logical page.
The corresponding step S400 comprises:
S420: Store, in a structured manner, the titles of levels 1 to N+1 and the text content subordinate to each of them, wherein the text content subordinate to an (N+1)-level title is the content corresponding to that title, and the text content subordinate to an i-level title is the content corresponding to that title excluding its (i+1)-level logical pages, i = 1, 2, ..., N.
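The recursion of S320 and the storage rule of S420 can be sketched as follows. The `detectors` list of per-level predicates, the regular expressions in the usage lines, and the dictionary layout are all hypothetical; the sketch only shows how each title keeps the text that precedes its first lower-level title while the rest is handed down to the lower-level logical pages.

```python
import re


def split_by_titles(lines, is_title):
    """Split lines into (own_text, [(title, content), ...]) for one title level."""
    own, sections = [], []
    for line in lines:
        if is_title(line):
            sections.append((line, []))
        elif sections:
            sections[-1][1].append(line)
        else:
            own.append(line)  # text directly under the parent title
    return own, sections


def build_tree(title, content, level, detectors):
    """Recursively turn a logical page into a tree node.

    detectors[k] is a hypothetical predicate recognising titles of level k + 2.
    """
    node = {"level": level, "title": title, "text": content, "children": []}
    if level - 1 < len(detectors):
        own, sections = split_by_titles(content, detectors[level - 1])
        if sections:  # the recursion stops when no (N+1)-level title is found
            node["text"] = own
            node["children"] = [build_tree(t, c, level + 1, detectors)
                                for t, c in sections]
    return node


# Usage with made-up numbering patterns for second- and third-level titles.
detectors = [
    lambda s: re.match(r"^\d+\.\d+ ", s) is not None,
    lambda s: re.match(r"^\d+\.\d+\.\d+ ", s) is not None,
]
content = ["intro", "1.1 A", "a-text", "1.1.1 B", "b-text", "1.2 C", "c-text"]
tree = build_tree("Chapter 1", content, 1, detectors)
```

When a logical page contains no lower-level title, its node keeps the whole content as its own text, which corresponds to the first-embodiment storage in S410.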
It should be noted here that when an N-level logical page contains (N+1)-level titles, the content corresponding to the N-level title comprises both the text content subordinate to the N-level title and the (N+1)-level logical pages. When no (N+1)-level title exists in the N-level logical page, the content corresponding to the N-level title is exactly the text content subordinate to the N-level title. That is, in this application, the content corresponding to an N-level title contains the text content subordinate to that N-level title.
It should be noted that, if multiple first-level logical pages are extracted from the actual pages, and some of them contain no next-level title while others do, then for the first-level logical pages without a next-level title the structured storage step is that of the first embodiment, and for the first-level logical pages with next-level titles the structured storage step is that of the second embodiment. The structured information of the PDF document finally obtained contains the structured storage results of both embodiments.
For PDF documents in which some N-level logical pages include a table and the table contains titles, such as the PDF document shown in Fig. 19, refer to Fig. 7 and Fig. 19. Fig. 7 is a flowchart of S300-S400 in the third embodiment, and Fig. 19 is a schematic diagram of the table cutting in S320a in the third embodiment. In the third embodiment, step S320 of the foregoing method for extracting structured information from a PDF document comprises:
S320a: Determine whether a table exists in each N-level logical page and, if a table exists, cut the table into table blocks, then extract the (N+1)-level titles and the text content subordinate to each (N+1)-level title.
Specifically, in step S320a, the step of "determining whether a table exists in each N-level logical page and, if a table exists, cutting the table into table blocks" may include:
S320a1: Determine, according to a second preset rule, whether the N-level logical page includes a table. The second preset rule includes: if, in the content corresponding to the N-level title, a line contains at least two consecutive spaces, and the positions of those spaces are identical in at least three consecutive lines, a table is determined to exist in the current N-level logical page; the first line in which at least two consecutive spaces appear is taken as the starting row of the table, and the last line in which at least two consecutive spaces appear is taken as the ending row of the table;
S320a2: Take the positions of the runs of at least two consecutive spaces in the table as the vertical cutting lines of the table and the blank lines in the table as the horizontal cutting lines, and cut the table into table blocks;
S320a3: Obtain the content of the table blocks in order, from left to right and from top to bottom, and take it, together with the content of the current N-level logical page other than the table, as the content corresponding to the N-level title in the current N-level logical page.
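Steps S320a1 to S320a3 can be sketched as below, under several simplifying assumptions: line texts are plain strings in which column gaps appear as runs of spaces, the "position" of a gap is taken to be the column where the run of spaces ends (the start of the next cell), and the vertical cuts are read from the first table row only. The function names are hypothetical.

```python
import re

GAP = re.compile(r" {2,}")  # a run of at least two consecutive spaces


def gap_ends(line):
    """Column positions where a run of at least two spaces ends."""
    return tuple(m.end() for m in GAP.finditer(line))


def find_table(lines, min_rows=3):
    """S320a1: return (start, end) row indices of the table, or None if there is none."""
    gaps = [gap_ends(line) for line in lines]
    found = any(
        gaps[i] and all(gaps[i + k] == gaps[i] for k in range(min_rows))
        for i in range(len(lines) - min_rows + 1)
    )
    if not found:
        return None
    rows = [i for i, g in enumerate(gaps) if g]
    return rows[0], rows[-1]


def table_blocks(table_lines):
    """S320a2 and S320a3: cut the table into blocks and read them in order.

    Gap positions give the vertical cuts and blank lines give the horizontal cuts;
    blocks are returned left to right within a band, bands top to bottom.
    """
    bounds = [0, *gap_ends(table_lines[0]), None]
    blocks, band = [], []
    for line in [*table_lines, ""]:  # the trailing "" flushes the last band
        if line.strip():
            band.append([line[a:b].strip() for a, b in zip(bounds, bounds[1:])])
        elif band:
            for col in range(len(bounds) - 1):
                text = " ".join(row[col] for row in band if row[col])
                if text:
                    blocks.append(text)
            band = []
    return blocks
```

The blocks returned by `table_blocks` would then replace the table rows in the logical page before the (N+1)-level titles are extracted.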
In step S320a, the table in an N-level logical page is cut and its content is obtained to replace the original table. This updates the content corresponding to the N-level title in the original N-level logical page and forms a new N-level logical page that replaces the original one. In the subsequent steps, that is, steps S321-S323 of extracting the (N+1)-level titles and the text content subordinate to each (N+1)-level title from the N-level logical page, the N-level logical page referred to is the new N-level logical page.
It should be noted that, when a PDF document is processed, some N-level logical pages may contain a table while others do not. In that case, for the N-level logical pages without a table, step S320 of the second embodiment is used to extract the (N+1)-level titles and the text content subordinate to each (N+1)-level title; for the N-level logical pages that contain a table with titles, step S320a of the third embodiment is used to extract the (N+1)-level titles and the text content subordinate to each (N+1)-level title.
Referring to Fig. 20, another embodiment provides a device for extracting structured information from a PDF document, comprising:
An acquisition unit 1, configured to obtain the original pages of the PDF document;
A first extraction unit 2, configured to extract, from the original pages, at least one actual page containing text content or titles;
A second extraction unit 3, configured to extract, from the actual pages, the titles of each level and the text content subordinate to each title;
A storage unit 4, configured to store each title and the text content subordinate to it in a structured manner.
Optionally, the first extraction unit 2 comprises:
A judging unit 21, configured to determine whether the original pages contain a catalogue page, a header, and a footer, respectively;
A deletion unit 22, configured to delete the catalogue page, header, or footer from the original pages to obtain at least one actual page.
The above device can extract the structured information of a PDF document automatically, which avoids manual processing and is convenient and efficient. The first extraction unit 2 deletes the catalogue pages, headers, and footers that affect the extraction of structured information from the PDF document, which further ensures the accuracy of the extraction.
Optionally, the second extraction unit 3 comprises:
A first-level title extraction unit, configured to extract the first-level titles in each actual page;
A first-level content extraction unit, configured to extract the content between the current first-level title and the next first-level title in the actual page as the content corresponding to the current first-level title and, if the current first-level title is the last first-level title in the actual page, to extract the content after the current first-level title in the actual page as the content corresponding to the current first-level title;
A first-level logical page generation unit, configured to take each first-level title, together with the content corresponding to it, as one first-level logical page.
The storage unit 4 includes a first-level storage unit, configured to store, when no next-level title exists in the first-level logical pages, each first-level title and the text content subordinate to it in a structured manner, wherein the text content subordinate to a first-level title is the content corresponding to that first-level title.
Optionally, the second extraction unit 3 further includes a combining unit, which is connected to the first-level content extraction unit and to the first-level logical page generation unit. If the current actual page contains no first-level title, the combining unit merges all content of the current actual page into the content corresponding to the previous first-level title; or, if the first first-level title in the current actual page is not on the first line of that page, the combining unit merges the content preceding that title into the content corresponding to the previous first-level title.
Because first-level logical pages are extracted page by page, following the original order of the actual pages, one situation may arise: content that should belong to a single first-level title is split across two consecutive actual pages. The combining unit merges the part that falls on the later actual page into the content corresponding to the previous first-level title, so that each first-level logical page contains its complete content. This overcomes the inability of common PDF document information acquisition methods to reassemble content that pagination has split.
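The merging behaviour can be illustrated with a short sketch. It assumes that each actual page is a list of text lines and that first-level titles match a hypothetical pattern such as "1. Title"; the function name `build_level1_pages` is likewise an assumption made for illustration.

```python
import re

# Assumption: a first-level title looks like "1. Title", "2. Title", ...
LEVEL1 = re.compile(r"^\d+\.\s+\S")


def build_level1_pages(actual_pages):
    """actual_pages: list of pages in original order, each a list of lines.

    Returns a list of [title, content_lines] pairs. A content line always
    joins the most recent first-level title, even when that title sits on
    an earlier page. This covers both cases handled by the combining unit:
    a page with no first-level title, and lines that precede the first
    title of a page."""
    logical = []
    for lines in actual_pages:
        for line in lines:
            if LEVEL1.match(line):
                logical.append([line, []])
            elif logical:
                logical[-1][1].append(line)
            # Lines before the very first title of the document are skipped.
    return logical
```

With the pages `[["1. A", "x"], ["y", "2. B", "z"], ["w"]]`, line `y` is merged into "1. A" and the title-less third page is merged into "2. B".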
Optionally, the second extraction unit 3 further includes a level-N extraction unit, configured to extract, from each level-N logical page, the level-(N+1) titles and the text content subordinate to each level-(N+1) title, where N is an integer greater than or equal to 1. The level-N extraction unit runs only when a level-N logical page contains level-(N+1) titles; when a level-N logical page contains no level-(N+1) title, the level-N extraction unit stops running.
Optionally, the level-N extraction unit includes:
A level-(N+1) title extraction unit, configured to extract the level-(N+1) titles in each level-N logical page;
A level-(N+1) content extraction unit, configured to extract the content between the current level-(N+1) title and the next level-(N+1) title as the content corresponding to the current level-(N+1) title; if the current level-(N+1) title is the last level-(N+1) title in the level-N logical page, the content after the current level-(N+1) title in that level-N logical page is extracted as the content corresponding to the current level-(N+1) title;
A level-(N+1) logical page generation unit, configured to take each level-(N+1) title, together with the content corresponding to that title, as one level-(N+1) logical page.
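The level-by-level extraction lends itself to a recursive sketch. The numbering scheme below ("1." for level 1, "1.1" for level 2, "1.1.1" for level 3) is an assumption chosen for illustration, as are the names `heading_pattern` and `extract`; the specification does not prescribe how titles of a given level are recognized.

```python
import re


def heading_pattern(level):
    """Assumed numbering: level 1 is '1.', level 2 is '1.1', level 3 is '1.1.1'."""
    if level == 1:
        return re.compile(r"^\d+\.\s+\S")
    return re.compile(r"^\d+(\.\d+){%d}\s+\S" % (level - 1))


def extract(lines, level=1):
    """Split `lines` at titles of the given level.

    Returns (own_text, sections). own_text holds the lines that precede the
    first title of this level. Each section is a dict with 'title', 'text'
    (lines subordinate to the title itself) and 'children' (the next-level
    logical pages)."""
    pattern = heading_pattern(level)
    own_text, sections, current = [], [], None
    for line in lines:
        if pattern.match(line):
            current = {"title": line, "content": []}
            sections.append(current)
        elif current is None:
            own_text.append(line)
        else:
            current["content"].append(line)
    for sec in sections:
        # The recursion ends on its own once no next-level title is found.
        sec["text"], sec["children"] = extract(sec.pop("content"), level + 1)
    return own_text, sections
```

In this sketch the stopping condition of the level-N extraction unit is implicit: a logical page without next-level titles returns an empty `children` list and is not processed further.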
The storage unit 4 further includes a level-N storage unit, configured to store, in a structured manner, the titles of levels 1 through N+1 and the text content subordinate to each of those titles. The text content subordinate to a level-(N+1) title is the content corresponding to that title; the text content subordinate to a level-i title is the content corresponding to that title excluding its level-(i+1) logical pages, where i = 1, 2, ..., N. The level-N storage unit runs only when a first-level logical page contains next-level titles; if a first-level logical page contains no next-level title, the first-level storage unit runs instead.
It should be noted that the second extraction unit may extract multiple first-level logical pages from the actual pages, some of which contain no next-level title while others do. A first-level logical page without next-level titles is stored by the first-level storage unit, and a first-level logical page with next-level titles is stored by the level-N storage unit. When a PDF document is processed, both storage units may be used, or only one of them.
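One possible shape for the structured storage is a flat list of records that each point to their parent title. The sketch below is an assumption, not the storage format of the specification: it takes a title tree whose nodes are dicts with hypothetical keys `title`, `text`, and `children`, and it folds the first-level and level-N storage cases into a single routine for brevity.

```python
import json


def to_records(sections, parent=None, level=1, out=None):
    """Flatten a title tree into rows with id, parent, level, title, text."""
    if out is None:
        out = []
    for sec in sections:
        row = {
            "id": len(out) + 1,
            "parent": parent,
            "level": level,
            "title": sec["title"],
            # Text subordinate to this title only; child pages are stored
            # as their own rows.
            "text": "\n".join(sec["text"]),
        }
        out.append(row)
        # A logical page without next-level titles has no children, so the
        # same routine covers both storage cases.
        to_records(sec.get("children", []), row["id"], level + 1, out)
    return out


def dump(rows):
    """Serialize the rows as JSON text."""
    return json.dumps(rows, ensure_ascii=False, indent=2)
```

The rows could equally be written to a relational table; the parent reference is what preserves the title hierarchy.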
Optionally, the second extraction unit 3 further includes a table-cutting acquiring unit, configured to determine whether each level-N logical page contains a table and, if so, to cut the table into table blocks and extract from them the level-(N+1) titles and the text content subordinate to those titles. When the content corresponding to a level-N title in a level-N logical page includes a table and the level-(N+1) titles are located inside that table, the table-cutting acquiring unit can cut the table directly and then extract the level-(N+1) titles and their subordinate text content. In some cases the table-cutting acquiring unit is used alone, in place of the level-N extraction unit; in other cases it is used together with the level-N extraction unit.
Optionally, the first-level title extraction unit may include:
A title line acquiring unit, configured to obtain the title lines in an actual page and the Y-axis coordinate of each title line within that page;
A title line merging unit, configured to merge the next title line with the current title line when the difference between their Y-axis coordinates in the same actual page is less than 3 Y-axis units;
A first-level title acquiring unit, configured to obtain the line of text that lies above a title line and nearest to it, as a first-level title in the actual page.
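The title line merging and the nearest-text lookup can be sketched as below. Two points are assumptions made for illustration: the Y coordinate is taken to increase downward, so text above a title line has a smaller Y value, and a line is merged when it lies within the threshold of the title line currently kept. The function names are hypothetical.

```python
def merge_title_lines(ys, threshold=3.0):
    """Merge title lines whose Y coordinates differ by less than `threshold`."""
    merged = []
    for y in sorted(ys):
        if merged and y - merged[-1] < threshold:
            continue  # the next title line is merged into the current one
        merged.append(y)
    return merged


def first_level_titles(title_line_ys, text_lines, threshold=3.0):
    """text_lines: list of (y, text) pairs for one actual page.

    Returns, for each merged title line, the text nearest above it
    (assumption: 'above' means a smaller Y value)."""
    titles = []
    for line_y in merge_title_lines(title_line_ys, threshold):
        above = [(line_y - y, text) for y, text in text_lines if y < line_y]
        if above:
            titles.append(min(above)[1])
    return titles
```

If the page coordinates instead increase upward, as in the native PDF coordinate system, the comparison `y < line_y` and the distance would need to be reversed.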
Those skilled in the art can clearly understand that the techniques in the embodiments of the present invention can be implemented by software together with a general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention or in certain parts of those embodiments.
For identical or similar parts among the embodiments in this specification, reference may be made from one embodiment to another. The embodiments of the invention described above do not limit the scope of protection of the present invention.