CN104732228A

CN104732228A - Detection and correction method for messy codes of PDF (portable document format) document

Info

Publication number: CN104732228A
Application number: CN201510181385.0A
Authority: CN
Inventors: 邹季英; 梁洵; 袁仁慧
Original assignee: TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd; TONGFANG KNOWLEDGE NETWORK DIGITAL PUBLICATION TECHNOLOGY Co Ltd
Current assignee: TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd; TONGFANG KNOWLEDGE NETWORK DIGITAL PUBLICATION TECHNOLOGY Co Ltd
Priority date: 2015-04-16
Filing date: 2015-04-16
Publication date: 2015-06-24
Anticipated expiration: 2035-04-16
Also published as: CN104732228B

Abstract

The invention discloses a method for detecting and correcting garbled characters ("messy codes") in a PDF (Portable Document Format) document. The method comprises: extracting all font features in the PDF document; classifying the fonts into normal fonts, garbled fonts and undetermined fonts according to the font features; extracting the dot-matrix images of the characters in the undetermined fonts, computing the similarity between each dot-matrix image and its corresponding code using a garbled-character detection algorithm based on image statistical features, and judging from the similarity whether each character in the undetermined fonts is normal or garbled; performing vertical and horizontal editing and correction on the garbled characters in the undetermined fonts and in the garbled fonts; and modifying the PDF document with the correction result to remove the garbled characters. By combining font features with character image features, the method detects garbled characters automatically; by combining vertical and horizontal editing, it reduces the labor and time spent on correction. Garbled characters are removed effectively, their interference with subsequent fragmentation processing is avoided, processing efficiency and quality are improved, and processing cost is reduced.

Description

A method for detecting and correcting garbled characters in PDF documents

Technical field

The present invention relates to methods for detecting and correcting garbled characters during the fragmentation processing of PDF documents, and in particular to a method for detecting and correcting garbled Chinese and English characters in PDF documents.

Background

PDF (Portable Document Format) is an electronic file format that is independent of the operating system platform. It has become a widely used and well-suited document format for electronic document distribution and digital information dissemination.

In the fragmentation processing (metadata indexing) of PDF documents, text must be extracted from the document. Text extraction here means copying characters from the document and pasting them at a specified location. Normally the displayed content is correct and matches the extracted text. When the displayed content does not match the extracted text, that is, the display is correct but the extraction is wrong, the phenomenon is called garbling of the PDF document. When the extracted text contains a large number of garbled characters, the indexer must key in the content to be indexed word by word and sentence by sentence. When only a few or isolated garbled characters are mixed in and are hard to spot, the indexer must spend a great deal of time checking the extracted text to guarantee indexing quality. Garbling therefore seriously reduces the efficiency and quality of metadata indexing.

Garbling also seriously affects the accuracy of data content in the secondary processing of electronic documents. With the development of computer and network technology, digital dissemination has become the mainstream way of circulating information. Digital dissemination must satisfy the need to convert between electronic documents of different types and formats, for example between PDF and WORD or EPUB. The following may occur during PDF conversion: a PDF document whose page text displays correctly is converted to another format, and garbled characters appear in the converted document. Although the garbled characters in the converted document can be found and corrected by manual inspection, manual inspection is time-consuming and laborious, and when a small number of garbled characters are scattered through a document they are easily missed by the human eye, which harms the accuracy of the data content and lowers processing quality.

During PDF fragmentation processing, if garbled-character detection and correction is performed on the document first, so that garbled characters are found and corrected at the source, their adverse effect on subsequent processing can be avoided. Detecting and correcting garbled characters in PDF documents is therefore very necessary. At present, few mature methods for solving the PDF garbling problem have been published. A related technique combines OCR (Optical Character Recognition) with PDF text extraction to improve extraction accuracy. OCR is a technology that uses character recognition to convert the image of a character into the character's internal computer code. It comprises image preprocessing, layout analysis, character segmentation and single-character recognition. PDF text extraction mainly uses the single-character recognition part of OCR. In garbled-character detection, applying OCR single-character recognition indiscriminately to every character in a document is very costly. For example, for a PDF document in which most characters are normal and only a few are garbled, applying OCR single-character recognition to every character inevitably spends a great deal of time recognizing normal characters.

Summary of the invention

To solve the above technical problem, the object of the present invention is to provide a method for detecting and correcting garbled characters in PDF documents. The method combines font features with the image statistical features of characters to detect garbled characters automatically, eliminates the interference of garbled characters with PDF fragmentation processing, improves processing quality and reduces processing cost.

The object of the present invention is achieved by the following technical solution:

A method for detecting and correcting garbled characters in PDF documents, comprising:

extracting all font features in the PDF document;

classifying the fonts, according to the font features, into normal fonts, garbled fonts and undetermined fonts;

extracting the dot-matrix images of the characters in the undetermined fonts, computing the similarity between each dot-matrix image and its corresponding code using a garbled-character detection algorithm based on image statistical features, and judging from the similarity whether each character in the undetermined fonts is a normal character or a garbled character;

performing vertical and horizontal editing and correction on the garbled characters in the undetermined fonts and the garbled characters in the garbled fonts;

modifying the PDF document with the correction result to remove the garbled characters.

Compared with the prior art, one or more embodiments of the present invention may have the following advantages:

The font features of the PDF document and the image features of the characters are used as two complementary angles, which further improves the efficiency of garbled-character detection;

Because detection is performed per font, a character that recurs in the same font needs to be checked only once, which replaces the inefficient approach of extracting and re-checking text character by character, sentence by sentence and page by page;

In garbled-character detection, the advantage of the detection algorithm based on image statistical features over OCR single-character recognition is that the former is guided by the character code and uses image features to make the judgment: it looks up the statistical features of the corresponding dot-matrix image in the feature library using the code of the current character, and judges whether the current character is garbled from the similarity between the character's dot-matrix image and those statistical features. The latter recognizes the character directly from its dot-matrix image and then compares the recognition result with the character code. OCR single-character recognition generally works in two stages, coarse recognition and fine recognition: coarse recognition narrows the range of candidates and fine recognition determines the final result. In garbled-character detection the character code has already fixed the range, so no coarse recognition is needed. The detection algorithm based on image statistical features is therefore simpler and cheaper in time and effort than OCR single-character recognition, and better suited to garbled-character detection.

Combining vertical and horizontal editing reduces the time spent on manual editing and improves the efficiency of correction.
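The saving from per-font detection can be illustrated with a minimal sketch. It assumes that each extracted character is represented as a hypothetical `(font_name, char_code)` pair; the font names and codes below are invented for illustration and do not come from the source.

```python
from collections import Counter


def glyphs_to_check(occurrences):
    """Collapse repeated (font, code) occurrences into unique entries.

    Each unique pair needs to be checked only once, however often it
    recurs in the document. The count is kept so a later correction can
    report how many occurrences it affects.
    """
    return Counter(occurrences)


# Hypothetical extraction result: five character occurrences in two fonts.
page_text = [("F1", 0x21), ("F1", 0x22), ("F1", 0x21), ("F2", 0x21), ("F1", 0x21)]
to_check = glyphs_to_check(page_text)
print(len(page_text), "occurrences,", len(to_check), "checks")
```

The same code under two different fonts is kept as two entries, because the mapping from code to glyph is a property of each font.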

Brief description of the drawings

Fig. 1 is a flow chart of the method for detecting and correcting garbled characters in PDF documents.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to an embodiment and the accompanying drawing.

As shown in Fig. 1, the flow of the method for detecting and correcting garbled characters in PDF documents comprises:

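extracting all font features in the PDF document;

classifying the fonts, according to the font features, into normal fonts, garbled fonts and undetermined fonts;

extracting the dot-matrix images of the characters in the undetermined fonts, computing the similarity between each dot-matrix image and its corresponding code using the garbled-character detection algorithm based on image statistical features, and judging from the similarity whether each character in the undetermined fonts is a normal character or a garbled character;

performing vertical and horizontal editing and correction on the garbled characters in the undetermined fonts and the garbled characters in the garbled fonts;

modifying the PDF document with the correction result to remove the garbled characters.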

The above font features include: the font type, the font encoding scheme, whether a mapping exists between the current encoding and a standard encoding, whether the font is embedded, and so on. Font types fall mainly into two kinds: composite fonts (Composite Font) and simple fonts (Simple Font). They differ in that the former describes characters with multi-byte codes while the latter usually describes characters with single-byte codes. Font encoding schemes can be divided into standard encodings and custom encodings. A standard encoding is a published, conventionally established encoding, such as EUC (Extended Unix Code), UCS-2 (Universal Multiple-Octet Coded Character Set 2) or ANSI (American National Standards Institute); a custom encoding is an unpublished, private encoding. The mapping between the current encoding and a standard encoding means that, when the font encoding is custom, this mapping can be used to convert the private custom codes into published standard codes. An embedded font is one for which resources such as the glyphs (Glyph) of all characters of the font used by the document are stored in the PDF document according to certain rules. A non-embedded font, by contrast, does not embed font resources in the document; the resources it uses come from outside the document, for example from system fonts.
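The four font features listed above could be carried in a small record such as the following sketch. The class and field names are assumptions chosen for illustration, and the font name in the example is hypothetical; the source does not prescribe any data structure.

```python
from dataclasses import dataclass
from enum import Enum


class FontType(Enum):
    COMPOSITE = "composite"  # characters described with multi-byte codes
    SIMPLE = "simple"        # characters usually described with single-byte codes


class Encoding(Enum):
    STANDARD = "standard"  # published encodings such as EUC, UCS-2, ANSI
    CUSTOM = "custom"      # private, unpublished encoding


@dataclass(frozen=True)
class FontFeatures:
    name: str
    font_type: FontType
    encoding: Encoding
    has_standard_mapping: bool  # a mapping from current codes to a standard encoding exists
    embedded: bool              # glyph resources are stored inside the PDF


# Hypothetical example: an embedded composite font with a private encoding
# and no mapping to a standard encoding.
example = FontFeatures(
    name="ABCDEE+ExampleFont",
    font_type=FontType.COMPOSITE,
    encoding=Encoding.CUSTOM,
    has_standard_mapping=False,
    embedded=True,
)
print(example)
```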

The real cause of garbled characters in a PDF document is that an embedded font in the document uses a custom encoding but either lacks a mapping to a standard encoding or has an incorrect mapping.

For a normal font, all characters under the font are regarded as normal characters and no processing is performed;

For a garbled font, all characters under the font are judged to be garbled;

For an undetermined font, the dot-matrix images of all characters of the font used by the document and their corresponding codes (Unicode codes) are extracted, the similarity between the two is computed with the garbled-character detection algorithm based on image statistical features, and characters that are not similar are judged to be garbled.
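The source does not spell out the exact rule that assigns a font to one of the three classes. The sketch below is one plausible reading of the stated root cause, offered as an assumption only: an embedded font with a custom encoding and no mapping is treated as garbled, one whose mapping exists but may be wrong is treated as undetermined, and everything else is treated as normal. The function and enum names are hypothetical.

```python
from enum import Enum


class FontClass(Enum):
    NORMAL = "normal"              # all characters accepted as-is
    GARBLED = "garbled"            # all characters judged garbled
    UNDETERMINED = "undetermined"  # characters checked one by one against their images


def classify_font(embedded: bool, custom_encoding: bool, has_standard_mapping: bool) -> FontClass:
    """Assumed classification rule derived from the stated root cause."""
    # Garbling is attributed to embedded fonts with a custom encoding,
    # so any other font is assumed to be safe.
    if not (embedded and custom_encoding):
        return FontClass.NORMAL
    # No mapping to a standard encoding: extraction cannot succeed.
    if not has_standard_mapping:
        return FontClass.GARBLED
    # A mapping exists but may be wrong: defer to image-based detection.
    return FontClass.UNDETERMINED
```

A real implementation would likely use further evidence, for example whether the mapping covers every code used by the document.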

The garbled-character detection algorithm based on image statistical features adopts the idea of statistical pattern recognition. Image samples of each character (mainly Chinese and English characters) in different fonts and at different sizes are collected, image features are extracted, the distribution of the features in feature space is determined, and this distribution is described with a statistical model. The number of image samples per character must reach a certain level to be statistically meaningful; in this embodiment the lower limit on the number of image samples per character is 100. Information such as the image features and statistical model of each character is saved according to certain rules as a feature library for later use.
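Building one entry of the feature library might look like the sketch below. The source names neither the image features nor the statistical model, so both are assumptions here: the features are ink densities on a coarse grid, and the model is a per-feature mean and variance (a diagonal Gaussian). Only the lower limit of 100 samples comes from the embodiment.

```python
from statistics import fmean, pvariance

MIN_SAMPLES = 100  # lower limit on image samples per character, from the embodiment


def zone_densities(bitmap, zones=2):
    """Fraction of ink pixels in each cell of a zones x zones grid.

    bitmap is a list of rows of 0/1 pixels and is assumed to be at least
    zones pixels high and wide.
    """
    h, w = len(bitmap), len(bitmap[0])
    feats = []
    for zr in range(zones):
        for zc in range(zones):
            rows = range(zr * h // zones, (zr + 1) * h // zones)
            cols = range(zc * w // zones, (zc + 1) * w // zones)
            cells = [bitmap[r][c] for r in rows for c in cols]
            feats.append(sum(cells) / len(cells))
    return feats


def build_model(samples):
    """Estimate per-feature mean and variance from the samples of one character."""
    if len(samples) < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} samples, got {len(samples)}")
    vectors = [zone_densities(b) for b in samples]
    dims = list(zip(*vectors))  # one tuple of observed values per feature
    # A small floor keeps the variance strictly positive for later use.
    return {"mean": [fmean(d) for d in dims],
            "var": [pvariance(d) + 1e-4 for d in dims]}
```

The feature library would then be a dictionary from character code to such a model, for example `feature_db[0x4E2D] = build_model(samples_for_that_character)`.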

To compute the similarity, the dot-matrix image features of the character under test are first extracted; the statistical information of the corresponding character is then looked up in the feature library using the character code; finally, the probability that the image features under test occur under the corresponding statistical model is estimated. A higher probability indicates a higher degree of similarity, and a lower probability a lower degree of similarity. A threshold can be preset, and a character whose probability falls below the threshold is judged to be garbled.
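The lookup-then-score procedure just described can be sketched as follows. As an assumption, the statistical model is a diagonal Gaussian stored as per-feature `mean` and `var` lists, and the score is the log-likelihood of the features under that model; the source specifies neither. The feature values, the model for code `0x4E2D` and the threshold of `0.0` are invented for illustration.

```python
import math


def log_likelihood(features, mean, var):
    """Log-density of the features under an independent Gaussian per feature."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(features, mean, var)
    )


def is_garbled(features, code, feature_db, threshold):
    """Judge a character by how well its image features fit the model of its code.

    Returns None when the feature library has no entry for the code,
    because no judgment is possible in that case.
    """
    model = feature_db.get(code)
    if model is None:
        return None
    return log_likelihood(features, model["mean"], model["var"]) < threshold


# Hypothetical library with a single two-feature model.
feature_db = {0x4E2D: {"mean": [0.5, 0.5], "var": [0.01, 0.01]}}
print(is_garbled([0.52, 0.48], 0x4E2D, feature_db, threshold=0.0))  # features fit the model
print(is_garbled([0.90, 0.10], 0x4E2D, feature_db, threshold=0.0))  # features far from the model
```

Because the character code selects exactly one model, only one comparison is made per character, which is the saving over two-stage OCR recognition that the description points out.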

The above editing and correction exports all garbled characters to an editing tool for correction. Vertical editing and correction gathers identical characters from different fonts and modifies them together in a batch. Horizontal editing and correction gathers the different characters of the same font and modifies them manually by comparing glyph image and text.
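The two groupings can be sketched as below. The source does not say how identical characters in different fonts are recognized, so the sketch assumes each garbled character carries a hypothetical `glyph_key` (for instance a hash of its dot-matrix image) that is equal for identical shapes. The font names, codes and keys are invented for illustration.

```python
from collections import defaultdict


def vertical_groups(garbled):
    """Gather identical characters from different fonts for one batch fix."""
    groups = defaultdict(list)
    for font, code, glyph_key in garbled:
        groups[glyph_key].append((font, code))
    return dict(groups)


def horizontal_groups(garbled):
    """Gather the different characters of each font for side-by-side review."""
    groups = defaultdict(list)
    for font, code, glyph_key in garbled:
        groups[font].append((code, glyph_key))
    return dict(groups)


# Hypothetical garbled characters: (font name, current code, glyph key).
garbled = [("F1", 0x21, "g-zhong"), ("F2", 0x35, "g-zhong"), ("F1", 0x22, "g-wen")]
print(vertical_groups(garbled))
print(horizontal_groups(garbled))
```

In this example a single decision on `g-zhong` corrects entries in both `F1` and `F2`, while the horizontal view lists everything that remains to be reviewed within `F1`.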

Following the structure of PDF files, the correction result is used, font by font, to create or update the mapping table between the current codes and the standard codes of all characters under each embedded font, which removes the garbled characters.
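Applying the correction result font by font can be sketched with plain dictionaries. The layout `{font: {code: text}}` is an assumption for illustration; in PDF terms such a table plausibly corresponds to a font's ToUnicode mapping, although the source does not name it. Writing the updated table back into the PDF file is outside the scope of this sketch.

```python
def apply_corrections(mapping_tables, corrections):
    """Create or update per-font code-to-Unicode tables from corrections.

    mapping_tables: {font_name: {char_code: unicode_text}}
    corrections:    {(font_name, char_code): unicode_text}
    Returns new tables and leaves the input untouched.
    """
    updated = {font: dict(table) for font, table in mapping_tables.items()}
    for (font, code), text in corrections.items():
        # setdefault creates the table for a font that had no mapping at all.
        updated.setdefault(font, {})[code] = text
    return updated


# Hypothetical input: F1 has a wrong entry, F2 has no table yet.
tables = {"F1": {0x21: "?", 0x22: "文"}}
fixes = {("F1", 0x21): "中", ("F2", 0x35): "中"}
print(apply_corrections(tables, fixes))
```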

Although the embodiment of the present invention is disclosed as above, its content is adopted only to facilitate understanding of the present invention and is not intended to limit it. Any person skilled in the art to which the present invention belongs may make any modification or change in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the scope of patent protection of the present invention shall still be as defined by the appended claims.

Claims

1. A method for detecting and correcting garbled characters in PDF documents, characterized in that the method comprises:

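extracting all font features in the PDF document;

classifying the fonts, according to the font features, into normal fonts, garbled fonts and undetermined fonts;

extracting the dot-matrix images of the characters in the undetermined fonts, computing the similarity between each dot-matrix image and its corresponding code using a garbled-character detection algorithm based on image statistical features, and judging from the similarity whether each character in the undetermined fonts is a normal character or a garbled character;

performing vertical and horizontal editing and correction on the garbled characters in the undetermined fonts and the garbled characters in the garbled fonts;

modifying the PDF document with the correction result to remove the garbled characters.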

2. The method for detecting and correcting garbled characters in PDF documents according to claim 1, characterized in that the font features include: the font type, the font encoding scheme, whether a mapping exists between the current encoding and a standard encoding, and whether the font is embedded.

3. The method for detecting and correcting garbled characters in PDF documents according to claim 1, characterized in that the garbled-character detection algorithm based on image statistical features comprises: an image feature extraction algorithm, a parameter estimation method for the statistical model, and an image feature matching algorithm.

4. The method for detecting and correcting garbled characters in PDF documents according to claim 1, characterized in that if the dot-matrix image of a character in the undetermined fonts is similar to its corresponding code, the character is judged to be a normal character; otherwise it is judged to be a garbled character.

5. The method for detecting and correcting garbled characters in PDF documents according to claim 1, characterized in that:

the vertical editing and correction gathers identical characters from different fonts and modifies them together in a batch;

the horizontal editing and correction gathers the different characters of the same font and modifies them by comparing glyph image and text.

6. The method for detecting and correcting garbled characters in PDF documents according to claim 1, characterized in that modifying the PDF document with the correction result is performed font by font: the correction result is used to create or update the mapping table between the current codes and the standard codes of all characters under each embedded font of the PDF document, so as to remove the garbled characters.