CN102306294A

CN102306294A - Method and system for extracting image from portable document format (PDF) file page

Info

Publication number: CN102306294A
Application number: CN201110243119A
Authority: CN
Inventors: 晏检平
Original assignee: Shenzhen Wondershare Software Co Ltd
Current assignee: Shenzhen Wondershare Software Co Ltd
Priority date: 2011-08-23
Filing date: 2011-08-23
Publication date: 2012-01-04
Also published as: WO2013026245A1

Abstract

The invention discloses a method for extracting an image from a portable document format (PDF) file page. The method comprises the following steps of: acquiring position information of each image element in the PDF file page; dividing all image elements in the page into different sets according to the position information; and taking all image elements in each set as a whole for image extraction. The invention also discloses a system for extracting the image from the PDF file page. By adopting the method or the system disclosed by the invention, the extracted image can be easily edited, and high extraction efficiency is realized.

Description

Method and system for extracting images from a PDF file page

Technical field

The present invention relates to the field of document processing, and in particular to a method and system for extracting images from a PDF file page.

Background art

PDF, short for Portable Document Format, is an electronic document format. Owing to its outstanding characteristics, PDF has become the preferred file format for distributing electronic documents and disseminating formatted information on the Internet. At present, most technical documents published on the Internet are provided as PDF. However, a PDF file focuses on describing the printed layout of a document rather than the data structure of the original document, and it is not easy to edit. Converting a PDF file into a file of another format is therefore relatively difficult, and handling the images in the PDF file is the hardest problem in such a conversion.

In the prior art, when a PDF file is converted into a file of another format, images are extracted in one of two main ways.

The first way extracts every image element in the PDF file unchanged (a single picture may be composed of a large number of image elements). The image elements extracted in this way often number in the thousands. Because what is extracted is a large number of image elements, with no indication of which image elements make up which picture, only the individual image elements can be edited; the picture as a whole cannot.

The second way extracts an entire page of the PDF file directly as one picture. An image extracted in this way is likewise difficult to edit.

Summary of the invention

The purpose of the present invention is to provide a method and system for extracting images from a PDF file page, such that the extracted images are easy to edit and the extraction is efficient.

To achieve the above purpose, the present invention provides the following solution:

A method for extracting images from a PDF file page, comprising:

obtaining position information of each image element in the PDF file page;

dividing all image elements in the page into different sets according to the position information; and

extracting all image elements in each set as a whole as an image.

Preferably, obtaining the position information of each image element in the PDF file page comprises:

obtaining the coordinates of the top-left vertex of each image element in the PDF file page, and recording those coordinates as the reference point of that image element.

Preferably, dividing all image elements in the page into different sets according to the position information comprises:

dividing the image elements in the horizontal direction to obtain one or more row sets; and

dividing the image elements in each row set in the vertical direction to obtain row-column sets.

Preferably, dividing the image elements in the horizontal direction to obtain one or more row sets comprises:

A. sorting all image elements according to the ordinate of the reference point of each image element;

B. according to the result of sorting by ordinate, assigning the first image element to the first row set;

C. determining whether the ordinate range of the next image element intersects the ordinate range of the image element that was just assigned;

D. if so, assigning the next image element to the row set containing the image element that was just assigned; otherwise, assigning the next image element to a new row set; then returning to step C.

Preferably, dividing the image elements in each row set in the vertical direction to obtain the row-column sets comprises:

E. for each row set, sorting the image elements in the row set according to the abscissa of the reference point of each image element;

F. according to the result of sorting by abscissa, assigning the first image element in the row set to the first column set, each such column set being a row-column set with respect to the whole page;

G. determining whether the next image element intersects the image element that was just assigned in the abscissa direction;

H. if so, assigning the next image element to the column set containing the image element that was just assigned; otherwise, assigning the next image element to a new column set; then returning to step G.

Preferably, extracting all image elements in each row-column set as a whole as an image comprises:

obtaining the outer contour of each row-column set; and

extracting all image elements in the row-column set as one picture according to the outer contour.

Preferably, obtaining the outer contour of each row-column set and extracting all image elements in the row-column set as one picture according to the outer contour comprises:

obtaining the bounding rectangle of each row-column set; and

capturing all image elements in the row-column set as a whole, according to the bounding rectangle, as a screenshot.

A system for extracting images from a PDF file page, comprising:

a position information acquisition module, configured to obtain position information of each image element in the PDF file page;

a set division module, configured to divide all image elements in the page into different sets according to the position information; and

an extraction module, configured to extract all image elements in each set as a whole as an image.

Preferably, the position information acquisition module comprises:

a coordinate information acquisition unit, configured to obtain the coordinates of the top-left vertex of each image element in the PDF file page and to record those coordinates as the reference point of that image element.

Preferably, the set division module comprises:

a row set division unit, configured to divide the image elements in the horizontal direction to obtain one or more row sets; and

a row-column set division unit, configured to divide the image elements in each row set in the vertical direction to obtain row-column sets.

According to the specific embodiments provided herein, the present invention achieves the following technical effect:

The disclosed method divides the image elements of a file page into rows and columns according to their position information and extracts each resulting row-column set as a whole, so that the extracted images are easy to edit and the extraction is efficient.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of the method for extracting images from a PDF file page according to an embodiment of the present invention;

Fig. 2 is a structural diagram of the system for extracting images from a PDF file page according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

The purpose of the present invention is to provide a method and system for extracting images from a PDF file page that divide the image elements into a small number of meaningful sets and extract them according to the original image information in the PDF file.

To make the above purposes, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 is a flowchart of the method for extracting images from a PDF file page according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:

S101: Obtain the position information of each image element in the PDF file page.

Image elements can be of various types. Specifically, the position information of each image element can be recorded as coordinates. Different image elements occupy regions of different sizes. In the present invention, the position of an element can be recorded using its plane coordinates (x, y), where x is the abscissa and y is the ordinate. The larger the region an element occupies, the larger the coordinate range it covers.

Step S101 can therefore comprise:

obtaining the coordinate information of each image element in the PDF file page.

Specifically, the coordinates of the top-left vertex of each image element in the PDF file page can be obtained and recorded as the reference point of that image element.
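As an illustration only, the position information described in step S101 could be held in a small record such as the one below. The class name `ImageElement`, its fields, and the downward-pointing y axis are assumptions made for this sketch; the patent does not prescribe a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageElement:
    """Hypothetical record for one image element on a page."""
    x: float       # abscissa of the top-left vertex
    y: float       # ordinate of the top-left vertex
    width: float   # horizontal extent of the element
    height: float  # vertical extent of the element

    @property
    def reference_point(self):
        # The top-left vertex serves as the reference point.
        return (self.x, self.y)

    @property
    def x_range(self):
        # Abscissa range covered by the element.
        return (self.x, self.x + self.width)

    @property
    def y_range(self):
        # Ordinate range covered by the element
        # (assumes the y axis points downward).
        return (self.y, self.y + self.height)


elem = ImageElement(x=10, y=20, width=90, height=30)
```

A larger element simply yields wider `x_range` and `y_range` intervals, which matches the remark that a larger region covers a larger coordinate range.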

S102: Divide all image elements in the page into different sets according to the position information.

In general, a single picture may contain multiple image elements (for example, pixels). Because these image elements belong to the same picture, they lie close together. The aim of step S102 is to follow the way the original pictures are composed in the PDF file page and, as far as possible, to place the image elements that belong to one picture in the same set so that they can be extracted as a whole.

In practice, step S102 can comprise:

dividing the image elements in the horizontal direction to obtain one or more row sets; and

dividing the image elements in each row set in the vertical direction to obtain row-column sets.

Specifically, when the position information of each image element is expressed as coordinates, the division into row sets can comprise the following steps:

A. Sort all image elements according to the ordinate of the reference point of each image element.

The image elements must be sorted by the coordinates of a point at the same relative position in each element. Specifically, all image elements can be sorted by the ordinate of the top-left point of each image element; alternatively, the ordinate of the top-right, bottom-left, or bottom-right point can be used. Any of these points can serve as the reference point of an image element.

The purpose of sorting is to place image elements whose horizontal levels are close, that is, elements lying at roughly the same height on the page, in the same row set. Accordingly, if the ordinate axis of the coordinate system points from top to bottom, an element near the top of the page has a smaller ordinate than an element near the bottom, and the elements can be sorted in ascending order of ordinate; if the ordinate axis points from bottom to top, an element near the top of the page has a larger ordinate than an element near the bottom, and the elements can be sorted in descending order of ordinate.
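The choice between ascending and descending order can be sketched as follows. The function name `sort_by_ordinate` and the flag `y_axis_points_down` are hypothetical; reference points are assumed to be plain `(x, y)` tuples.

```python
def sort_by_ordinate(ref_points, y_axis_points_down=True):
    """Order (x, y) reference points from the top of the page to the bottom.

    With a downward-pointing y axis the top of the page has the smallest
    ordinate, so ascending order is used; with an upward-pointing y axis
    the top of the page has the largest ordinate, so descending order is used.
    """
    return sorted(ref_points, key=lambda p: p[1],
                  reverse=not y_axis_points_down)


points = [(0, 50), (0, 10), (0, 30)]
top_down = sort_by_ordinate(points)                              # axis points down
bottom_up_axis = sort_by_ordinate(points, y_axis_points_down=False)  # axis points up
```

Either way the result lists elements in page order from top to bottom, which is what the subsequent row division relies on.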

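B. According to the result of sorting by ordinate, assign the first image element to the first row set.

C. Determine whether the ordinate range of the next image element intersects the ordinate range of the image element that was just assigned.

D. If so, assign the next image element to the row set containing the image element that was just assigned; otherwise, assign the next image element to a new row set. Then return to step C.

For example, suppose the ordinate range of the image element that was just assigned is 10-100 and the ordinate range of the next image element is 20-50. The two ranges clearly intersect. The next image element is then assigned to the row set containing the image element that was just assigned; that is, the two are considered to lie essentially in the same row.

If the ordinate range of the image element that was just assigned is 10-100 and the ordinate range of the next image element is 200-260, the two ranges do not intersect. The next image element is then assigned to a new row set; that is, the two are considered not to belong to the same row. Steps C and D are repeated until all image elements have been assigned.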
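The row division of steps A to D can be sketched as below, using the ordinate ranges from the example above. The box representation `(x0, y0, x1, y1)`, the helper names, the downward-pointing y axis, and the treatment of touching ranges as intersecting are all assumptions of this sketch rather than details given in the text.

```python
def ranges_intersect(a, b):
    """True when the closed intervals a = (lo, hi) and b = (lo, hi) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


def split_into_rows(boxes):
    """Group boxes (x0, y0, x1, y1) into row sets."""
    ordered = sorted(boxes, key=lambda b: b[1])  # step A: sort by ordinate
    rows = []
    previous = None                              # the element just assigned
    for box in ordered:
        # Step C: compare ordinate ranges with the element just assigned.
        if previous is not None and ranges_intersect(
                (previous[1], previous[3]), (box[1], box[3])):
            rows[-1].append(box)                 # step D, "yes": same row set
        else:
            rows.append([box])                   # step B, or step D "no": new row set
        previous = box
    return rows


boxes = [
    (0, 10, 50, 100),    # ordinate range 10-100
    (60, 20, 90, 50),    # ordinate range 20-50, intersects 10-100
    (0, 200, 40, 260),   # ordinate range 200-260, intersects neither
]
rows = split_into_rows(boxes)
```

Note that, as in the text, each element is compared only with the element assigned immediately before it, not with the full extent of the current row set.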

Dividing the image elements in each row set in the vertical direction to obtain the row-column sets can specifically comprise the following steps:

E. For each row set, sort the image elements in the row set according to the abscissa of the reference point of each image element.

The purpose of sorting is to place image elements that line up vertically in the same column set. Accordingly, if the abscissa axis of the coordinate system points from left to right, an element on the left of the page has a smaller abscissa than an element on the right, and the elements can be sorted in ascending order of abscissa; if the abscissa axis points from right to left, an element on the left of the page has a larger abscissa than an element on the right, and the elements can be sorted in descending order of abscissa.

F. According to the result of sorting by abscissa, assign the first image element in the row set to the first column set.

G. Determine whether the next image element intersects the image element that was just assigned in the abscissa direction.

H. If so, assign the next image element to the column set containing the image element that was just assigned; otherwise, assign the next image element to a new column set. Then return to step G.

For example, suppose the abscissa range of the image element that was just assigned is 10-100 and the abscissa range of the next image element is 20-150. The two ranges clearly intersect. The next image element is then assigned to the column set containing the image element that was just assigned; that is, the two are considered to lie essentially in the same column.

If the abscissa range of the image element that was just assigned is 10-100 and the abscissa range of the next image element is 200-260, the two ranges do not intersect. The next image element is then assigned to a new column set; that is, the two are considered not to belong to the same column. Steps G and H are repeated until all image elements in a given row set have been assigned; the next row set is then divided, until all row sets have been divided.

It should be noted that steps E to H are performed for each row set. The column sets obtained from each row set can be regarded as row-column sets with respect to the whole page.
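A minimal sketch of the two-pass division follows: rows first, then columns within each row, yielding row-column sets. The generic helper `split_by_axis`, the box layout `(x0, y0, x1, y1)`, and the axis directions (x to the right, y downward) are assumptions made for this example.

```python
def split_by_axis(boxes, lo, hi):
    """Sort boxes by coordinate index `lo` and chain each box to the one
    assigned just before it when their [lo, hi] ranges intersect."""
    groups = []
    previous = None
    for box in sorted(boxes, key=lambda b: b[lo]):
        if (previous is not None
                and previous[lo] <= box[hi] and box[lo] <= previous[hi]):
            groups[-1].append(box)   # ranges intersect: same set
        else:
            groups.append([box])     # no intersection: start a new set
        previous = box
    return groups


def row_column_sets(boxes):
    """Divide boxes (x0, y0, x1, y1) into row sets, then divide each
    row set into column sets; every column set is a row-column set."""
    rows = split_by_axis(boxes, 1, 3)        # ordinate pass (row sets)
    return [column
            for row in rows
            for column in split_by_axis(row, 0, 2)]  # abscissa pass


boxes = [
    (10, 10, 100, 100),   # abscissa range 10-100
    (20, 20, 150, 50),    # abscissa range 20-150, intersects 10-100
    (200, 30, 260, 90),   # abscissa range 200-260, same row, new column
    (10, 200, 60, 260),   # separate row further down the page
]
sets = row_column_sets(boxes)
```

Here the first three boxes share one row set because their ordinate ranges chain together, and that row set then splits into two column sets on the abscissa pass.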

S103: Extract all image elements in each set as a whole as an image.

After the row and column division, the image elements in each row-column set are very close to one another both horizontally and vertically, so they most likely make up the same picture. All image elements in each row-column set can therefore be extracted as a whole.

Specifically, the extraction can proceed as follows:

obtain the outer contour of each row-column set;

according to the outer contour, extract all image elements in the row-column set as one picture.

More specifically, for ease of understanding and implementation, obtaining the outer contour of each row-column set can mean obtaining its bounding rectangle; all image elements in the row-column set are then captured as a whole according to that bounding rectangle.
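Computing the bounding rectangle of one row-column set is straightforward, as the sketch below suggests. The function name `bounding_rectangle` and the box layout `(x0, y0, x1, y1)` are assumptions; the actual capture of the page region is left out because it depends on whichever PDF rendering library is used.

```python
def bounding_rectangle(boxes):
    """Smallest axis-aligned rectangle (x0, y0, x1, y1) enclosing every box."""
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)
    return (x0, y0, x1, y1)


# One row-column set made of two overlapping image elements.
group = [(10, 10, 100, 100), (20, 20, 150, 50)]
rect = bounding_rectangle(group)
```

The resulting rectangle would then be the region rendered or cropped from the page to produce the single extracted picture.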

In summary, the disclosed method divides the image elements of a file page into rows and columns according to their position information and extracts each resulting row-column set as a whole, so that the extracted images are easy to edit and the extraction is efficient.

Corresponding to the disclosed method for extracting images from a PDF file page, the present invention also discloses a system for extracting images from a PDF file page.

Fig. 2 is a structural diagram of the system for extracting images from a PDF file page according to an embodiment of the present invention. As shown in Fig. 2, the system comprises:

a position information acquisition module 201, configured to obtain position information of each image element in the PDF file page;

a set division module 202, configured to divide all image elements in the page into different sets according to the position information; and

an extraction module 203, configured to extract all image elements in each set as a whole as an image.

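In practice, the position information acquisition module 201 can comprise:

a coordinate information acquisition unit, configured to obtain the coordinates of the top-left vertex of each image element in the PDF file page and to record those coordinates as the reference point of that image element.

The set division module 202 can comprise:

a row set division unit, configured to divide the image elements in the horizontal direction to obtain one or more row sets; and

a row-column set division unit, configured to divide the image elements in each row set in the vertical direction to obtain row-column sets.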

The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and for the parts that are the same or similar the embodiments can be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.

Specific examples are used herein to explain the principle and implementation of the present invention. The description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for extracting images from a PDF file page, characterized by comprising:

extracting all image elements in each set as a whole as an image.

2. The method according to claim 1, characterized in that obtaining the position information of each image element in the PDF file page comprises:

3. The method according to claim 1, characterized in that dividing all image elements in the page into different sets according to the position information comprises:

4. The method according to claim 3, characterized in that dividing the image elements in the horizontal direction to obtain one or more row sets comprises:

5. The method according to claim 3, characterized in that dividing the image elements in each row set in the vertical direction to obtain the row-column sets comprises:

6. The method according to any one of claims 3 to 5, characterized in that extracting all image elements in each row-column set as a whole as an image comprises:

obtaining the outer contour of each row-column set;

7. The method according to claim 6, characterized in that obtaining the outer contour of each row-column set and extracting all image elements in the row-column set as one picture according to the outer contour comprises:

obtaining the bounding rectangle of each row-column set;

8. A system for extracting images from a PDF file page, characterized by comprising:

9. The system according to claim 8, characterized in that the position information acquisition module comprises:

10. The system according to claim 8, characterized in that the set division module comprises: