CN102541905B

CN102541905B - Attribute processing method and device for PDF files

Info

Publication number: CN102541905B
Application number: CN201010605620.XA
Authority: CN
Inventors: 张立业; 卢秀琴
Original assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Priority date: 2010-12-15
Filing date: 2010-12-15
Publication date: 2015-11-25
Anticipated expiration: 2030-12-15
Also published as: CN102541905A

Abstract

The invention provides an attribute processing method for PDF files, comprising the following steps: parsing a PDF file and obtaining its attributes according to a preset attribute dictionary, wherein the preset attribute dictionary contains the specific attributes of the PDF files that are expected to be tested, and the specific attributes are used for searching; and adding the obtained attributes of each PDF file, together with its file name, to a database as one record. The invention also provides an attribute processing device for PDF files, comprising: an acquisition module for parsing a PDF file and obtaining its attributes according to the preset attribute dictionary, wherein the preset attribute dictionary contains the specific attributes of the PDF files that are expected to be tested, and the specific attributes are used for searching; and a recording module for adding the obtained attributes of each PDF file, together with its file name, to a database as one record. The invention saves labor cost and improves efficiency.

Description

Attribute processing method and device for PDF files

Technical field

The present invention relates to the field of printing, and in particular to an attribute processing method and device for PDF (Portable Document Format) files.

Background art

When testing software for the printing industry, it is often necessary to select, from a large existing collection of sample PDF files, those files that have a certain specific attribute (key) or a certain set of specific attributes, so that targeted testing can be carried out.

At present there are two methods for screening out PDF files with specific attributes. The first is to encode the important attributes of a file directly in its file name after the PDF file is produced, and later to screen by file name. However, because the system places fairly strict limits on file-name length and permitted characters, only a few attributes can be listed this way, and queries that screen on combinations of attributes are hard to implement. The second is to open each PDF file manually every time a test is run and check its attributes one by one; this process is very time-consuming and inefficient.

Because such testing is fairly frequent and subject to strict time constraints, neither of the two prior-art methods is practical.

Summary of the invention

The present invention aims to provide an attribute processing method and device for PDF files, so as to solve the low efficiency of existing methods for screening PDF files by attribute.

An embodiment of the invention provides an attribute processing method for PDF files, comprising the following steps: presetting an attribute dictionary, wherein the specific attributes of the PDF files expected to be tested are added to the attribute dictionary as the attributes to be searched for; obtaining the attributes of a PDF file; and adding the obtained attributes of each PDF file, together with its file name, to a database as one record. Obtaining the attributes of the PDF file comprises: parsing the PDF file to obtain a file header, content streams and a file dictionary; and obtaining the attributes of the PDF file from the file header, the content streams and the file dictionary. Obtaining the attributes from the file header, the content streams and the file dictionary comprises: traversing all dictionary objects in the file header, the content streams and the file dictionary, and determining during the traversal whether each traversed dictionary object has an attribute in the preset attribute dictionary.

An embodiment of the invention also provides an attribute processing device for PDF files, comprising: an attribute dictionary setting module for presetting an attribute dictionary, wherein the specific attributes of the PDF files expected to be tested are added to the attribute dictionary as the attributes to be searched for; an acquisition module for obtaining the attributes of a PDF file; and a recording module for adding the obtained attributes of each PDF file, together with its file name, to a database as one record. The acquisition module comprises: a PDF file parsing module for parsing the PDF file to obtain a file header, content streams and a file dictionary; and a PDF file dictionary parsing module for obtaining the attributes of the PDF file from the file header, the content streams and the file dictionary. The PDF file dictionary parsing module is configured to traverse all dictionary objects in the file header, the content streams and the file dictionary, and to determine during the traversal whether each traversed dictionary object has an attribute in the preset attribute dictionary.

Because the attribute processing method and device of the above embodiments record PDF file attributes in a database, which makes later queries convenient, they overcome the low efficiency of existing methods for screening PDF files by attribute, thereby saving labor cost and improving efficiency.

Brief description of the drawings

The accompanying drawings described herein provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of an attribute processing method for PDF files according to an embodiment of the invention;

Fig. 2 is a flowchart of an attribute processing method for PDF files according to a preferred embodiment of the invention;

Fig. 3 is a schematic diagram of an attribute processing device for PDF files according to an embodiment of the invention;

Fig. 4 is a schematic diagram of an attribute processing device for PDF files according to a preferred embodiment of the invention.

Detailed description of the embodiments

The invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

Fig. 1 is a flowchart of an attribute processing method for PDF files according to an embodiment of the invention, comprising the following steps:

Step S10: obtain the attributes of a PDF file;

Step S20: add the obtained attributes of each PDF file, together with its file name, to a database as one record.

In the prior art, each PDF file is opened manually every time a test is run and its attributes are checked one by one, which is time-consuming and inefficient. Because this attribute processing method records PDF file attributes in a database for later querying, the files no longer need to be opened manually for each test. This overcomes the low efficiency of existing methods for screening PDF files by attribute, saving labor cost and improving efficiency.
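As a purely illustrative sketch of step S20, the record for one PDF file could be stored as a row in a relational table. The patent does not name a database or a schema; the use of SQLite, the table name `pdf_attributes` and the three attribute columns below are assumptions made only for this example.

```python
import sqlite3

# Hypothetical attribute columns; the patent does not fix a schema.
ATTRIBUTE_COLUMNS = ("pdf_version", "page_count", "encrypted")


def create_table(conn):
    """Create one table with a filename column plus one column per attribute."""
    cols = ", ".join(f'"{name}"' for name in ATTRIBUTE_COLUMNS)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS pdf_attributes ("filename" TEXT PRIMARY KEY, {cols})'
    )


def store_record(conn, filename, attributes):
    """Add the attributes of one PDF file and its file name as a single record."""
    values = [filename] + [attributes.get(name) for name in ATTRIBUTE_COLUMNS]
    placeholders = ", ".join("?" for _ in values)
    conn.execute(f"INSERT INTO pdf_attributes VALUES ({placeholders})", values)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
create_table(conn)
store_record(conn, "sample.pdf", {"pdf_version": "1.4", "page_count": 3, "encrypted": 0})
```

Keeping one row per file, keyed by file name, is what later allows files to be selected by attribute without reopening them.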

Preferably, step S10 comprises: parsing the PDF file to obtain a file header, content streams (contents) and a file dictionary; and obtaining the attributes of the PDF file from the file header, the content streams and the file dictionary. Because this parsing can be carried out by computer software, manual analysis of PDF files is eliminated entirely, which greatly reduces labor cost and improves efficiency. Of course, in a basic embodiment of the invention, the attributes of a PDF file may also be obtained by manual analysis.
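One attribute that can be read from the file header alone is the PDF version, since a PDF file begins with a header of the form `%PDF-x.y`. The following is a minimal sketch under that assumption; the function name `read_pdf_version` and the choice to look only at the first 1024 bytes are illustrative and are not taken from the patent.

```python
import re

# A PDF header looks like "%PDF-1.4"; capture the version number.
HEADER_RE = re.compile(rb"%PDF-(\d+\.\d+)")


def read_pdf_version(data: bytes):
    """Return the version string found in the file header, or None if absent."""
    match = HEADER_RE.search(data[:1024])  # assumption: header lies near the start
    return match.group(1).decode("ascii") if match else None
```

For example, `read_pdf_version(b"%PDF-1.4\n...")` yields `"1.4"`, and input without a header yields `None`.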

Preferably, obtaining the attributes of the PDF file from the file header, the content streams and the file dictionary comprises: traversing all dictionary objects in the file header, the content streams and the file dictionary, and determining during the traversal whether each traversed dictionary object has an attribute in the preset attribute dictionary. In this preferred embodiment, the attributes to be searched for are preset in an attribute dictionary, which speeds up the program's search for PDF attributes.
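The traversal can be sketched as follows, assuming the PDF has already been parsed into nested Python dicts and lists that stand in for PDF dictionary and array objects. That in-memory representation, the function names and the sample keys are assumptions for illustration. The sketch also assumes a tree without cycles; a real PDF contains indirect references (for example a page's Parent entry) that would need cycle protection.

```python
def iter_dicts(obj):
    """Yield every dict nested anywhere inside obj (through dicts and lists)."""
    if isinstance(obj, dict):
        yield obj
        for value in obj.values():
            yield from iter_dicts(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_dicts(item)


def find_attributes(root, attribute_dictionary):
    """Return the attributes from attribute_dictionary that occur as a key somewhere."""
    found = set()
    for dictionary_object in iter_dicts(root):
        found.update(key for key in attribute_dictionary if key in dictionary_object)
    return found


# Hypothetical parsed document used only to exercise the sketch.
document = {
    "Type": "Catalog",
    "OutputIntents": [{"S": "GTS_PDFX"}],
    "Pages": {"Kids": [{"Type": "Page", "Annots": [{"Subtype": "Link"}]}]},
}
result = find_attributes(document, {"OutputIntents", "Annots", "OCProperties"})
```

Here `result` contains `OutputIntents` and `Annots` but not `OCProperties`, because no dictionary in the sample carries that key.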

Preferably, the attribute processing method for PDF files further comprises: presetting the attribute dictionary, wherein the specific attributes of the PDF files expected to be tested are added to the attribute dictionary as the attributes to be searched for. In this preferred embodiment, because the attribute dictionary is preset according to the purpose of testing the PDF files, the results of PDF attribute analysis are guaranteed to be usable for testing those files. Moreover, because the attribute dictionary can be set manually, the PDF attribute analysis is adjustable: when the test purpose changes, the analysis procedure itself need not be modified, and only the attribute dictionary needs to be updated. Because the attribute dictionary is customized as needed, the scheme is also highly extensible: if a new attribute requirement arises, it suffices to modify the dictionary and then use the system again to parse the PDF files and store the results in the database.

Preferably, determining during the traversal whether a traversed dictionary object has an attribute in the preset attribute dictionary comprises: for the dictionary object currently being traversed, determining whether it has any attribute of the attribute dictionary that the PDF file has not yet been determined to have; attributes of the attribute dictionary that the PDF file has already been determined to have are not checked again. According to this preferred embodiment, when the attribute dictionary contains multiple attributes and a dictionary object is found, during traversal of the PDF file's dictionary objects, to have a certain attribute of the attribute dictionary, that attribute need not be checked for the remaining dictionary objects; only the other attributes of the attribute dictionary still need to be checked. This clearly improves execution efficiency and, when the number of PDF files is especially large, speeds up attribute processing significantly.
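This pruning can be sketched with a set of attributes that are still undetermined. The function name, the `checks` counter (added only to make the saving visible) and the early exit once nothing is pending are illustrative assumptions rather than details taken from the patent.

```python
def find_attributes_pruned(dictionary_objects, attribute_dictionary):
    """Check each dictionary object only against attributes not yet found."""
    pending = set(attribute_dictionary)  # attributes not yet determined
    found = set()
    checks = 0  # number of membership tests performed
    for dictionary_object in dictionary_objects:
        if not pending:
            break  # every attribute is already determined
        for key in list(pending):
            checks += 1
            if key in dictionary_object:
                found.add(key)
                pending.discard(key)  # never test this attribute again
    return found, checks


objects = [{"A": 1}, {"B": 2}, {"A": 3, "C": 4}]
found, checks = find_attributes_pruned(objects, {"A", "B", "Z"})
```

With three objects and three attributes, an unpruned search performs 9 membership tests; the pruned version above performs 3 + 2 + 1 = 6.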

Preferably, step S10 comprises: obtaining a path from an input character string; and traversing all PDF files under the path to obtain the attributes of each traversed PDF file. According to this preferred embodiment, the user only needs to input a path, and attribute processing is carried out automatically on all PDF files under that path, which reduces the user's manual work and improves efficiency.
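A minimal sketch of this path handling follows. The "|" separator and the ".pdf" suffix filter come from the description of modules 12 and 14 later in this document; treating only existing directories as valid paths, descending into subdirectories and matching the suffix case-insensitively are assumptions of the example.

```python
import os


def split_paths(raw, separator="|"):
    """Split the input string on the separator and keep only existing directories."""
    candidates = (part.strip() for part in raw.split(separator))
    return [path for path in candidates if path and os.path.isdir(path)]


def iter_pdf_files(path):
    """Yield every file under path whose name ends in .pdf (case-insensitive)."""
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)
```

Each file yielded by `iter_pdf_files` would then be handed to the parsing step, one at a time.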

Preferably, the attributes include at least one of the following:

Document-level attributes: document type; PDF version; whether the file is pre-separated; total page count; whether an OutputIntent exists; whether the file is submitted in stream mode; whether OptionalContent is processed; whether annotations are processed (AnnotationProcessed); whether the file is encrypted; whether FeiTeng encryption is used; PDFXVersion; whether the cross-reference table is a stream object; whether there are multiple cross-reference tables; existence of a Content that is a stream object; existence of a Content that is an array object; existence of a Content that is an empty object.

Annotation content attributes: annotation type (WidgetType, Link, FreeText, CirCle, Polygon, Ployline, Highlight, Underline, Squiggly, StrikeOut, Stamp, Caret, Ink, FileAttachment, sound, Movie, PrinterMark, TrapNet, WaterMark, ThreeD); whether the Widge can be output; type of the N object in the AP dictionary (stream object, dictionary object, other object).

Optional content attributes: optional object type (OCG, OCMenberShip); whether a MemberShip determines the OC state; OC state (ON, OFF, UnDenfined); MemberShip computation rule (VE, ANYON, ANYOFF, ALLON, ALLOFF).

Image object attributes: image type (Normal, InlineImage, Mask, explictMask, ColorkeyMask, Smask); bit depth (1, 2, 4, 6, 8, 16); whether an image with a height of 1 exists; whether an image with a width of 1 exists; X-direction resolution; Y-direction resolution; whether a default Decode exists; rendering intent; overprint mode; whether overprinted; whether assembled at the front end; image processing type; whether scaled at the front end; image scaling algorithm; whether scanned from left to right; whether scanned from top to bottom; Transfer type; whether clipped; number of color planes; whether deformed; whether UCR is present; whether BG is present; halftone type; whether a Transfer exists in the halftone; halftone Spot function type; bHasTwoSquaresThreshold.

Shading attributes: shading type; whether a background color is defined; overprint mode; whether a BBox is defined; whether UCR is present; whether it is a type 2 Pattern; Transfer type; whether there are multiple output functions; whether overprinted; whether BG is present; function type; whether multi-output; whether multi-input; whether a Range entry exists.

Path attributes: path type; whether a closed SubPath exists; whether a curve exists; whether a null vector exists; whether a fixed-point number overflows; painting operator; Transfer type; whether overprinted; whether UCR is present; whether multiple SubPaths exist; whether an unclosed SubPath exists; whether buffered; whether Flatness is less than the default value; existence of nearly vertical or horizontal straight lines; overprint mode; Flatness and whether the path is a curve; whether BG is present.

Font attributes: font type (Type0, Type1, Type3, TrueType); font name; base font name; font encoding type; width table type; whether the font file is embedded; font PaintType; whether a bold effect is synthesized; whether an italic effect is synthesized; whether it is an OpenType font; whether it is a non-indirect-reference object; whether it is a Symbolic font.

Hidden element attributes: types of elements that have an OC attribute; types of hidden elements (StrokeElement, FillElement, TextElement, ShadingElement, XobjectElement); whether nested in multiple layers of MarkedContent.

Font content attributes: TextRenderMode; TextKnockOut; existence of Type3 characters that will enter the cache; whether Type3 characters that cannot be cached exist; existence of Type3 characters containing an Image; existence of Type3 characters containing a Form; existence of Type3 characters containing a Font; existence of Type1 characters containing a seac instruction; existence of Type1 characters containing a StemHint; existence of Type1 characters containing a CounterHint; whether the width table information in the dictionary is inconsistent with the metrics in the font file; TransferType; existence of TrueType characters containing an Instruction; whether UCR is present; overprint mode; whether BG is present; whether overprinted; font type.

Color space type: CS_DeviceGray, CS_DeviceRGB, CS_DeviceCMYK, CS_CalGray, CS_CalRGB, CS_ICCBased, CS_Separation, CS_DeviceN, CS_Indexed, CS_Lab, CS_Pattern.

Function attributes: function type (SampleFunc, ExpFunc, StitchFunc, PSFunc); whether multi-output; whether multi-input; whether a Range entry exists.

Transparency attributes: elements in a transparency group; elements containing spot colors; containing a softImageMask; parent transparency attributes; the transparency group's own attributes (Isolated, Konckout, PageGroup); transparency graphics-state attributes (BlendMode, AIS, OP, OPM, SoftMask type, background color).

FilterType: ASCIIHEX, ASCII85, RLE, LZW, FLATE, FAX, DCT, JBIG2, CRYPT, SUBFILE, RESTREAM, SPECIAL, JPX.

The preferred embodiments of the invention include, but are not limited to, the above attributes.

Fig. 2 is a flowchart of an attribute processing method for PDF files according to a preferred embodiment of the invention, which combines the schemes of the above embodiments.

For a character string input by the user, in order to automatically parse and compare the attributes of all PDF files under every path the string contains, and to generate database records for unified management, the steps shown in Fig. 2 carry out the following process:

Step S1: split the input character string to obtain the valid paths.

Step S2: traverse all PDF files under each path.

Step S3: parse the traversed files one by one.

Step S4: perform the following operations on the PDF file currently being parsed:

Step S41: analyze the dictionary objects of the PDF file by performing the following operations:

Step S411: obtain the dictionary objects of the PDF file.

Step S412: search whether the PDF dictionary objects contain the specified attributes.

Step S413: record the search results.

Step S42: analyze the content stream of each page dictionary in the PDF file by performing the following operations:

Step S421: obtain the content stream in the page dictionary object of the PDF file.

Step S422: search whether the page content stream contains the specified attributes.

Step S423: record the search results.

Step S5: determine whether all pages of the PDF file have been analyzed; if not, continue with step S3.

Step S6: if all pages of the PDF file have been analyzed, generate a data record in the specified database table and fill the record, in the prescribed format, with the recorded attributes of the PDF file.

Step S7: determine whether all PDF files under the specified path have been analyzed; if not, continue with steps S2 to S6 above. If so, end the process.
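Steps S4 to S7 can be sketched as below, again assuming each PDF has already been parsed into a plain Python structure. The layout `{"dictionary": ..., "pages": [{"content": ...}]}`, the split of the attribute dictionary into dictionary keys and content-stream operators, and the whitespace tokenization of content streams are all simplifying assumptions for illustration, not details taken from the patent.

```python
def analyze_pdf(parsed, dict_keys, stream_operators):
    """Build one flat record of booleans for a single parsed PDF (steps S4 to S6)."""
    record = {name: False for name in list(dict_keys) + list(stream_operators)}
    # S41: look for the specified keys in the document dictionary.
    for key in dict_keys:
        if key in parsed["dictionary"]:
            record[key] = True
    # S42 and S5: look for the specified operators in every page's content stream.
    for page in parsed["pages"]:
        tokens = set(page["content"].split())  # naive tokenization, assumption
        for operator in stream_operators:
            if operator in tokens:
                record[operator] = True
    return record


def analyze_all(parsed_files, dict_keys, stream_operators):
    """Produce one record per file, each carrying its file name (steps S6 and S7)."""
    return [
        {"filename": name, **analyze_pdf(parsed, dict_keys, stream_operators)}
        for name, parsed in parsed_files.items()
    ]


# Hypothetical parsed file: a catalog with an OutputIntents entry and two pages.
sample = {
    "dictionary": {"Type": "Catalog", "OutputIntents": []},
    "pages": [
        {"content": "q 1 0 0 1 0 0 cm /Im0 Do Q"},
        {"content": "BI /W 1 /H 1 ID x EI"},
    ],
}
records = analyze_all({"sample.pdf": sample}, ["OutputIntents", "OCProperties"], ["BI", "sh"])
```

The sample yields a single record in which `OutputIntents` and `BI` (the inline-image operator) are true while `OCProperties` and `sh` are false.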

Fig. 3 is a schematic diagram of an attribute processing device for PDF files according to an embodiment of the invention, comprising:

Acquisition module 10, for obtaining the attributes of a PDF file;

Recording module 20, for adding the obtained attributes of each PDF file, together with its file name, to a database as one record.

In the prior art, each PDF file is opened manually every time a test is run and its attributes are checked one by one, which is time-consuming and inefficient. Because this attribute processing device records PDF file attributes in a database for later querying, the files no longer need to be opened manually for each test. This overcomes the low efficiency of existing methods for screening PDF files by attribute, saving labor cost and improving efficiency.

Preferably, acquisition module 10 comprises: a PDF file parsing module for parsing the PDF file to obtain a file header, content streams and a file dictionary; and a PDF file dictionary parsing module for obtaining the attributes of the PDF file from the file header, the content streams and the file dictionary. Because this parsing can be carried out by computer software, manual analysis of PDF files is eliminated entirely, which greatly reduces labor cost and improves efficiency. Of course, in a basic embodiment of the invention, the attributes of a PDF file may also be obtained by manual analysis.

Preferably, acquisition module 10 comprises: a file path acquisition module for obtaining a path from an input character string; and a path traversal and PDF file extraction module for traversing all PDF files under the path to obtain the attributes of each traversed PDF file. According to this preferred embodiment, the user only needs to input a path, and attribute processing is carried out automatically on all PDF files under that path, which reduces the user's manual work and improves efficiency.

Fig. 4 is a schematic diagram of an attribute processing device for PDF files according to a preferred embodiment of the invention, which combines the schemes of the above embodiments. This attribute processing device comprises:

File path acquisition module 12, path traversal and PDF file extraction module 14, PDF file parsing module 22, PDF file dictionary parsing module 24, page content stream parsing module 26, attribute search module 28, PDF attribute recording module 32, and database record generation module 34, wherein:

File path acquisition module 12 obtains each valid file path from the input character string. For example, file path acquisition module 12 splits the input character string into multiple valid paths by searching for the special separator "|", and then passes each valid path in turn to the subsequent modules for processing.

Path traversal and PDF file extraction module 14 traverses each PDF file under a specified path. For example, for each valid path passed in, the module traverses every file under it, filters by file suffix, and passes each file with the ".pdf" suffix in turn to the subsequent modules for processing.

PDF file parsing module 22 determines by parsing whether the PDF file contains the defined specified attributes. It comprises the PDF file dictionary parsing module and the page content stream parsing module.

PDF file dictionary parsing module 24 obtains the dictionary of the PDF file and searches whether it contains the defined attributes. For example, PDF file dictionary parsing module 24 obtains the dictionary objects of the PDF file passed in, calls the attribute search module to search whether these dictionary objects contain the defined attributes, and records the basic file information and the search results.

Page content stream parsing module 26 splits out the content stream in each page dictionary, processes the obtained page content streams one by one, and searches whether they contain the defined attributes. For example, page content stream parsing module 26 splits out the content stream in the dictionary object of each page of the PDF file, passes the content stream of each page in turn to subsequent processing, calls the attribute search module to search whether the page content stream contains the defined attributes, and records the search results.

Attribute search module 28 searches whether a specified attribute exists in a specific dictionary object.

The above file path acquisition module 12, path traversal and PDF file extraction module 14, PDF file parsing module 22, PDF file dictionary parsing module 24, page content stream parsing module 26 and attribute search module 28 implement acquisition module 10 in Fig. 3.

PDF attribute recording module 32 saves the basic information of the PDF file and the specific attributes found in the file.

Database record generation module 34 writes the saved PDF attribute record into the specified database table in the form of a database record. For example, the database record generation module adds a new database record to the specified database, collates and merges the previously recorded PDF attribute search results, and fills in the database record according to the specified format.

The above PDF attribute recording module 32 and database record generation module 34 implement recording module 20 in Fig. 3.

Because the whole process of this preferred embodiment can run in batches without human intervention, and all steps from extracting the files to parsing them and storing the results are completed automatically, a great deal of labor cost is saved and efficiency is improved. Moreover, once the results are stored, the stored content can be screened quickly and simply at any time, and screening on arbitrary combinations of attributes can be realized, which eases management and also makes finer-grained screening possible.
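Screening on a combination of attributes then reduces to a database query with several conditions. The sketch below assumes SQLite, a table named `pdf_attributes` and three hypothetical attribute columns; none of these names come from the patent, and the column names passed to `screen` are assumed to come from the trusted attribute dictionary rather than from untrusted input.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE pdf_attributes "
    "(filename TEXT, encrypted INTEGER, has_output_intent INTEGER, page_count INTEGER)"
)
conn.executemany(
    "INSERT INTO pdf_attributes VALUES (?, ?, ?, ?)",
    [("a.pdf", 0, 1, 3), ("b.pdf", 1, 1, 10), ("c.pdf", 0, 0, 1), ("d.pdf", 0, 1, 1)],
)


def screen(conn, **conditions):
    """Return the file names whose record matches every column=value condition."""
    sql = "SELECT filename FROM pdf_attributes"
    if conditions:
        sql += " WHERE " + " AND ".join(f'"{column}" = ?' for column in conditions)
    sql += " ORDER BY filename"
    return [row[0] for row in conn.execute(sql, list(conditions.values()))]


unencrypted_with_intent = screen(conn, encrypted=0, has_output_intent=1)
single_page_subset = screen(conn, encrypted=0, has_output_intent=1, page_count=1)
```

Adding one more condition narrows the first result (`a.pdf`, `d.pdf`) to `d.pdf` alone, which is the kind of refinement that file-name-based screening makes difficult.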

As can be seen from the above description, the above embodiments of the invention overcome the low efficiency of existing methods for screening PDF files by attribute, thereby saving labor cost and improving efficiency.

Obviously, those skilled in the art should understand that the modules or steps of the invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The invention is thus not restricted to any specific combination of hardware and software.

The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention. For those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. An attribute processing method for PDF files, characterized by comprising the following steps:

presetting an attribute dictionary, wherein the specific attributes of the PDF files expected to be tested are added to the attribute dictionary as the attributes to be searched for;

obtaining the attributes of a PDF file;

adding the obtained attributes of each said PDF file, together with its file name, to a database as one record;

wherein obtaining the attributes of the PDF file comprises:

parsing the PDF file to obtain a file header, content streams and a file dictionary;

obtaining the attributes of the PDF file from the file header, the content streams and the file dictionary;

wherein obtaining the attributes of the PDF file from the file header, the content streams and the file dictionary comprises:

traversing all dictionary objects in the file header, the content streams and the file dictionary, and determining during the traversal whether each traversed dictionary object has an attribute in the preset attribute dictionary.

2. The method according to claim 1, characterized in that determining during the traversal whether each traversed dictionary object has an attribute in the preset attribute dictionary comprises:

for the dictionary object currently being traversed, determining whether it has any attribute of the attribute dictionary that the PDF file has not yet been determined to have, wherein attributes of the attribute dictionary that the PDF file has already been determined to have are not checked again.

3. The method according to claim 1, characterized in that obtaining the attributes of the PDF file comprises:

obtaining a path from an input character string;

traversing all PDF files under the path to obtain the attributes of each traversed PDF file.

4. The method according to any one of claims 1-3, characterized in that the attributes include at least one of the following:

document type, PDF version, whether the file is pre-separated, total page count, whether an OutputIntent exists, whether the file is submitted in stream mode, whether OptionalContent is processed, whether annotations are processed (AnnotationProcessed), whether the file is encrypted, whether FeiTeng encryption is used, PDFXVersion, whether the cross-reference table is a stream object, whether there are multiple cross-reference tables, existence of a Content that is a stream object, existence of a Content that is an array object, existence of a Content that is an empty object, annotation content attributes, optional content attributes, image object attributes, shading attributes, path attributes, font attributes, font content attributes, color space type, function attributes, transparency attributes, FilterType.

5. An attribute processing device for PDF files, characterized by comprising:

an attribute dictionary setting module for presetting an attribute dictionary, wherein the specific attributes of the PDF files expected to be tested are added to the attribute dictionary as the attributes to be searched for;

an acquisition module for obtaining the attributes of a PDF file;

a recording module for adding the obtained attributes of each said PDF file, together with its file name, to a database as one record;

wherein the acquisition module comprises:

a PDF file parsing module for parsing the PDF file to obtain a file header, content streams and a file dictionary;

a PDF file dictionary parsing module for obtaining the attributes of the PDF file from the file header, the content streams and the file dictionary;

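wherein the PDF file dictionary parsing module is configured to:

traverse all dictionary objects in the file header, the content streams and the file dictionary, and determine during the traversal whether each traversed dictionary object has an attribute in the preset attribute dictionary.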

6. The device according to claim 5, characterized in that the acquisition module comprises:

a file path acquisition module for obtaining a path from an input character string;

a path traversal and PDF file extraction module for traversing all PDF files under the path to obtain the attributes of each traversed PDF file.