CN104732228A - Detection and correction method for messy codes of PDF (portable document format) document - Google Patents

Info

Publication number
CN104732228A
Authority
CN
China
Prior art keywords
font
mess code
character
detection
pdf document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510181385.0A
Other languages
Chinese (zh)
Other versions
CN104732228B (en)
Inventor
邹季英
梁洵
袁仁慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
TONGFANG KNOWLEDGE NETWORK DIGITAL PUBLICATION TECHNOLOGY Co Ltd
Original Assignee
TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
TONGFANG KNOWLEDGE NETWORK DIGITAL PUBLICATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-04-16
Filing date
2015-04-16
Publication date
2015-06-24
Application filed by TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd and TONGFANG KNOWLEDGE NETWORK DIGITAL PUBLICATION TECHNOLOGY Co Ltd
Priority to CN201510181385.0A
Publication of CN104732228A
Application granted
Publication of CN104732228B
Legal status: Active

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method for detecting and correcting garbled characters in a PDF (Portable Document Format) document. The method includes: extracting all font features in the PDF document; dividing the fonts into normal fonts, garbled fonts and undetermined fonts according to the font features; extracting dot-matrix images of the characters in the undetermined fonts, calculating the similarity between each dot-matrix image and its corresponding code with a garbled-character detection algorithm based on image statistical features, and judging each character in the undetermined fonts to be normal or garbled according to the similarity; performing vertical and horizontal proofreading correction on the garbled characters in the undetermined fonts and in the garbled fonts; and modifying the PDF document according to the correction result to remove the garbled characters. Combining font features with character image features enables automatic detection of garbled characters, and combining vertical with horizontal proofreading reduces the labor and time spent on correction. Garbled characters are removed effectively, so they no longer interfere with subsequent fragmentation processing; processing efficiency and quality improve and processing cost falls.

Description

A method for detecting and correcting garbled characters in PDF documents
Technical field
The present invention relates to a method for detecting and correcting garbled characters ("mess code") during the fragmentation processing of PDF documents, and in particular to a method for detecting and correcting garbled Chinese and English characters in PDF documents.
Background
PDF (Portable Document Format) is an electronic file format that is independent of the operating system platform. It has become a widely used document format for electronic document distribution and digital information dissemination.
During the fragmentation processing (metadata indexing) of a PDF document, text must be extracted from the document. Text extraction means copying characters from the document and pasting them to a specified location. Normally the document displays correctly and the displayed content matches the extracted text. When the displayed content does not match the extracted text, that is, the display is correct but the extraction is wrong, the PDF document is said to contain garbled characters. When the extracted text contains many garbled characters, the indexer must key in the indexed content character by character. When only a few garbled characters are mixed in and are hard to spot, the indexer must spend a great deal of time checking the extracted text to ensure indexing quality. Garbled characters therefore seriously reduce the efficiency and quality of metadata indexing.
Garbled characters also seriously affect the accuracy of data content in the secondary processing of electronic documents. With the development of computer and network technology, digital dissemination has become the mainstream way of circulating information. Digital dissemination requires conversion between electronic documents of different types and formats, for example between PDF and Word or EPUB. The following can happen during conversion: a PDF document whose page text displays correctly is converted to another format, and garbled characters appear in the converted document. The converted document can be checked manually to find and correct garbled characters, but manual inspection is time-consuming and laborious, and a small number of garbled characters scattered through a document are easily missed by the human eye, which harms the accuracy of the data content and lowers processing quality.
If garbled characters are detected and corrected before a PDF document is fragmented, they are found and fixed at the source and cannot harm subsequent processing. Detecting and correcting garbled characters in PDF documents is therefore necessary. At present, few mature methods for solving the garbled-character problem in PDF documents have been published. A related technique combines OCR (Optical Character Recognition) with PDF text extraction to improve extraction accuracy. OCR uses character recognition to convert the image of a character into the character's internal computer code. It comprises image preprocessing, layout analysis, character segmentation and single-character recognition, and PDF text extraction mainly uses the single-character recognition step. Applying single-character recognition indiscriminately to every character of a document for garbled-character detection is very costly. For example, in a PDF document in which most characters are normal and only a few are garbled, running OCR single-character recognition on every character inevitably spends most of the time recognizing normal characters.
Summary of the invention
To solve the above technical problem, the object of the present invention is to provide a method for detecting and correcting garbled characters in PDF documents. The method combines font features with the image statistical features of characters to detect garbled characters automatically, removes the interference of garbled characters with the fragmentation processing of PDF documents, improves processing quality and reduces processing cost.
The object of the present invention is achieved by the following technical solution:
A method for detecting and correcting garbled characters in a PDF document, comprising:
Extracting all font features in the PDF document;
Dividing the fonts into normal fonts, garbled fonts and undetermined fonts according to the font features;
Extracting the dot-matrix images of the characters in the undetermined fonts, calculating the similarity between each dot-matrix image and its corresponding code with a garbled-character detection algorithm based on image statistical features, and judging each character in the undetermined fonts to be normal or garbled according to the similarity;
Performing vertical and horizontal proofreading correction on the garbled characters in the undetermined fonts and the garbled characters in the garbled fonts;
Modifying the PDF document according to the correction result to remove the garbled characters.
Compared with the prior art, one or more embodiments of the present invention may have the following advantages:
Detection works from two complementary angles, the font features of the PDF document and the image features of its characters, which further improves detection efficiency;
Detection is performed per font, so a character that recurs within the same font needs to be checked only once. This avoids the inefficient approach of extracting and re-checking text page by page, sentence by sentence and character by character;
Compared with OCR single-character recognition, the detection algorithm based on image statistical features is guided by the character code and uses image features only to verify it: the statistical features of the dot-matrix image that corresponds to the current character's code are looked up in a feature library, and the current character is judged garbled or not from the similarity between its dot-matrix image and those statistical features. OCR single-character recognition instead recognizes the character directly from the dot-matrix image and then compares the recognition result with the character code. It generally works in two stages: coarse recognition, which narrows the range of candidates, and fine recognition, which determines the final result. In garbled-character detection the character code already fixes the candidate, so no coarse recognition is needed. The detection algorithm based on image statistical features is therefore simpler and faster than OCR single-character recognition and better suited to garbled-character detection;
Combining vertical and horizontal proofreading reduces the time spent on manual proofreading and improves correction efficiency.
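The saving from per-font detection can be illustrated with a minimal sketch. The `pages` structure (a list of pages, each a list of font name and character code pairs) is a hypothetical stand-in for whatever a PDF parser would produce; it is not part of the disclosed method.

```python
from collections import defaultdict

def unique_codes_by_font(pages):
    """Collect each (font, character code) pair once across all pages.

    `pages` is a hypothetical structure: a list of pages, each a list of
    (font_name, char_code) tuples in reading order.
    """
    codes = defaultdict(set)
    for page in pages:
        for font_name, char_code in page:
            codes[font_name].add(char_code)
    return dict(codes)

pages = [
    [("F1", 0x41), ("F1", 0x42), ("F2", 0x41)],
    [("F1", 0x41), ("F2", 0x41), ("F2", 0x43)],
]
to_check = unique_codes_by_font(pages)
total_occurrences = sum(len(page) for page in pages)       # characters on the pages
total_checks = sum(len(c) for c in to_check.values())      # characters actually checked
```

In this toy input, six character occurrences reduce to four checks; in a real document, where common characters recur many times, the reduction would typically be much larger.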
Brief description of the drawings
Fig. 1 is a flow chart of the method for detecting and correcting garbled characters in a PDF document.
Detailed description
To make the object, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to an embodiment and the accompanying drawing.
As shown in Fig. 1, the method for detecting and correcting garbled characters in a PDF document comprises:
Extracting all font features in the PDF document;
Dividing the fonts into normal fonts, garbled fonts and undetermined fonts according to the font features;
Extracting the dot-matrix images of the characters in the undetermined fonts, calculating the similarity between each dot-matrix image and its corresponding code with a garbled-character detection algorithm based on image statistical features, and judging each character in the undetermined fonts to be normal or garbled according to the similarity;
Performing vertical and horizontal proofreading correction on the garbled characters in the undetermined fonts and the garbled characters in the garbled fonts;
Modifying the PDF document according to the correction result to remove the garbled characters.
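The five steps above can be sketched as a small driver function. This is only an outline under stated assumptions: the font records are plain dictionaries, and `classify_font`, `char_is_garbled` and `proofread` are hypothetical callables standing in for the classification, image-based detection and proofreading steps.

```python
def run_pipeline(fonts, classify_font, char_is_garbled, proofread):
    """Outline of the flow: classify fonts, detect, proofread, update mappings.

    fonts: {font_name: {"codes": [...], "to_unicode": {code: text}, ...}}
    The three callables are placeholders supplied by the caller.
    """
    garbled = []  # (font_name, code) pairs that need correction
    for name, font in fonts.items():
        category = classify_font(font)          # "normal", "garbled" or "undetermined"
        if category == "normal":
            continue                            # normal fonts are left untouched
        for code in font["codes"]:
            # every character of a garbled font is garbled; characters of an
            # undetermined font are checked one by one
            if category == "garbled" or char_is_garbled(font, code):
                garbled.append((name, code))
    corrections = proofread(garbled)            # {(font_name, code): corrected text}
    for (name, code), text in corrections.items():
        fonts[name]["to_unicode"][code] = text  # fold corrections into the mapping
    return garbled

fonts = {
    "F1": {"category": "normal", "codes": [1, 2], "to_unicode": {1: "a", 2: "b"}},
    "F2": {"category": "garbled", "codes": [1], "to_unicode": {}},
    "F3": {"category": "undetermined", "codes": [1, 2], "to_unicode": {1: "x", 2: "?"}},
}
found = run_pipeline(
    fonts,
    classify_font=lambda f: f["category"],
    char_is_garbled=lambda f, c: f["to_unicode"].get(c) == "?",
    proofread=lambda items: {item: "Z" for item in items},
)
```

The stub callables in the usage example exist only to make the sketch runnable; each stands in for a step that the description below treats in more detail.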
The font features above include: font type, font encoding scheme, whether a mapping exists between the current encoding and a standard encoding, whether the font is embedded, and so on. There are two main font types: composite fonts (Composite Font) and simple fonts (Simple Font). They differ in that a composite font describes characters with multi-byte codes, whereas a simple font usually describes characters with single-byte codes. Font encoding schemes are either standard or custom. A standard encoding is a published, conventional encoding such as EUC (Extended Unix Code), UCS2 (Universal Multiple-Octet Coded Character Set 2) or ANSI (American National Standards Institute); a custom encoding is an unpublished, private encoding. The mapping between the current encoding and a standard encoding means that, when the font encoding is custom, this mapping can be used to convert the private custom codes into published standard codes. An embedded font is one for which the resources of all characters of the font that the document uses, such as glyph shapes (Glyph), are stored in the PDF document according to certain rules. A non-embedded font, by contrast, does not store its font resources in the document; the resources it uses come from outside the document, for example from system fonts.
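As an illustration of these four features, the sketch below reads them from a PDF font dictionary that has already been parsed into a plain Python dict. The keys `Subtype`, `Encoding`, `ToUnicode`, `DescendantFonts`, `FontDescriptor` and `FontFile`/`FontFile2`/`FontFile3` are entries of PDF font dictionaries; the dict representation, the `FontFeatures` class, the short `STANDARD_ENCODINGS` list and the choice to treat a `ToUnicode` entry as the mapping to a standard encoding are assumptions made for the example.

```python
from dataclasses import dataclass

# Illustrative, not exhaustive: a few published encoding names used in PDF files.
STANDARD_ENCODINGS = {"WinAnsiEncoding", "MacRomanEncoding", "GBK-EUC-H", "UniGB-UCS2-H"}

@dataclass
class FontFeatures:
    name: str
    is_composite: bool       # composite (Type0, multi-byte codes) vs. simple font
    standard_encoding: bool  # encoding is one of the published names above
    has_mapping: bool        # a code-to-Unicode mapping (ToUnicode) is present
    is_embedded: bool        # the font program is stored inside the PDF

def extract_features(name, font):
    """Derive the four features from a font dictionary given as a plain dict."""
    is_composite = font.get("Subtype") == "Type0"
    # For a composite font the descriptor belongs to the descendant CIDFont.
    holder = font["DescendantFonts"][0] if is_composite else font
    descriptor = holder.get("FontDescriptor", {})
    is_embedded = any(k in descriptor for k in ("FontFile", "FontFile2", "FontFile3"))
    encoding = font.get("Encoding")
    # Identity-H and encoding dictionaries carry no character meaning by
    # themselves, so they count as non-standard here (an assumption).
    standard = isinstance(encoding, str) and encoding in STANDARD_ENCODINGS
    return FontFeatures(name, is_composite, standard, "ToUnicode" in font, is_embedded)

composite = {
    "Subtype": "Type0",
    "Encoding": "Identity-H",
    "DescendantFonts": [{"FontDescriptor": {"FontFile2": b"font program bytes"}}],
}
simple = {"Subtype": "TrueType", "Encoding": "WinAnsiEncoding", "FontDescriptor": {}}
f1 = extract_features("F1", composite)
f2 = extract_features("F2", simple)
```

Here `f1` describes an embedded composite font with a non-standard encoding and no mapping, and `f2` a non-embedded simple font with a standard encoding.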
The root cause of garbled characters in a PDF document is that an embedded font in the document uses a custom encoding but either lacks the mapping to a standard encoding or carries an incorrect mapping.
For a normal font as defined above, all characters of the font are regarded as normal characters and no processing is performed;
For a garbled font, all characters of the font are judged to be garbled;
For an undetermined font, the dot-matrix images and corresponding codes (Unicode codes) of all characters of the font that the document uses are extracted, the similarity between each image and its code is calculated with the detection algorithm based on image statistical features, and characters whose image and code are dissimilar are judged to be garbled.
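The text does not spell out the exact rule that assigns a font to one of the three categories. The sketch below is one plausible reading of the root cause stated above, for embedded fonts only: a standard encoding is trusted, a custom encoding without a mapping cannot be decoded, and a custom encoding with a mapping must be verified because the mapping may be wrong. The function name and its two boolean parameters are hypothetical.

```python
def classify_font(standard_encoding, has_mapping):
    """Assumed classification rule for an embedded font (not taken from the source).

    standard_encoding: the font uses a published encoding
    has_mapping: a mapping from the current codes to standard codes exists
    """
    if standard_encoding:
        return "normal"        # codes already have a published meaning
    if not has_mapping:
        return "garbled"       # custom codes with nothing to decode them
    return "undetermined"      # mapping exists but may be wrong; check each character

categories = [
    classify_font(True, False),
    classify_font(False, False),
    classify_font(False, True),
]
```

Only fonts in the last category incur the cost of image-based checking, which is the point of classifying by font features first.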
The detection algorithm based on image statistical features follows the approach of statistical pattern recognition. Image samples of each character (mainly Chinese and English characters) are collected in different fonts and sizes, image features are extracted, the distribution of these features in feature space is determined, and the distribution is described with a statistical model. The number of image samples per character must be large enough to be statistically meaningful; in this embodiment the lower limit is 100 image samples per character. The image features, statistical models and related information of the characters are saved according to certain rules as a feature library for later use.
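A minimal sketch of building such a feature library follows. The source does not name the image features or the statistical model, so both are assumptions here: the feature vector is the ink density in each cell of a grid laid over a square 0/1 bitmap, and the model is a per-feature mean and variance. Only the lower limit of 100 samples comes from the text.

```python
from statistics import mean, pvariance

MIN_SAMPLES = 100  # lower limit on samples per character stated in this embodiment

def grid_features(bitmap, cells=2):
    """Ink density in each cell of a cells x cells grid over a square 0/1 bitmap."""
    step = len(bitmap) // cells
    features = []
    for gy in range(cells):
        for gx in range(cells):
            ink = sum(bitmap[y][x]
                      for y in range(gy * step, (gy + 1) * step)
                      for x in range(gx * step, (gx + 1) * step))
            features.append(ink / (step * step))
    return features

def build_feature_library(samples_by_char):
    """Map each character to (means, variances) of its sample feature vectors."""
    library = {}
    for char, bitmaps in samples_by_char.items():
        if len(bitmaps) < MIN_SAMPLES:
            continue  # too few samples to be statistically meaningful
        columns = list(zip(*(grid_features(b) for b in bitmaps)))
        library[char] = ([mean(c) for c in columns], [pvariance(c) for c in columns])
    return library

left_half = [[1, 1, 0, 0] for _ in range(4)]   # toy 4x4 glyph: ink in the left half
library = build_feature_library({"x": [left_half] * 100, "y": [left_half] * 99})
```

In the toy call, "x" enters the library and "y" is skipped for having only 99 samples. Real samples would vary across fonts and sizes, which is what gives the variances their meaning.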
To calculate similarity, the dot-matrix image features of the character under test are first extracted, the statistical information of the corresponding character is then looked up in the feature library by the character's code, and finally the probability of the extracted image features under the corresponding statistical model is estimated. A higher probability indicates higher similarity, and a lower probability indicates lower similarity. A threshold can be preset; a character whose probability falls below the threshold is judged to be garbled.
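The lookup-then-score procedure can be sketched as follows, assuming (the source does not say) that the statistical model is an independent Gaussian per feature and that similarity is scored as a log density. The library layout, the variance floor and the threshold value are illustrative choices.

```python
import math

def log_likelihood(features, means, variances, floor=1e-3):
    """Log density of a feature vector under an independent Gaussian per feature."""
    total = 0.0
    for x, mu, var in zip(features, means, variances):
        var = max(var, floor)  # guard against zero variance
        total += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return total

def is_garbled(features, code_char, library, threshold):
    """Check the glyph's features against the model of the character its code claims."""
    stats = library.get(code_char)
    if stats is None:
        return None  # no model for this character; leave it for manual review
    return log_likelihood(features, *stats) < threshold

library = {"A": ([0.5, 0.5], [0.01, 0.01])}    # hypothetical two-feature model
THRESHOLD = -5.0                                # hypothetical preset threshold
matches = is_garbled([0.5, 0.5], "A", library, THRESHOLD)
mismatch = is_garbled([0.0, 1.0], "A", library, THRESHOLD)
unknown = is_garbled([0.5, 0.5], "B", library, THRESHOLD)
```

Because the character code selects a single model, the check is one score and one comparison per character, with no search over candidate characters as in OCR recognition.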
The proofreading correction described above exports all garbled characters to a proofreading tool for correction. Vertical proofreading pools the identical characters of different fonts and modifies them together in a batch; horizontal proofreading pools the different characters of the same font, which are compared against their images and modified manually.
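One way to prepare the two work lists for a proofreading tool is sketched below. How identical characters are recognized across fonts is not stated in the source; the sketch assumes a hypothetical `glyph_key` (for example a hash of the normalized glyph image) and assumes that glyphs shared by several fonts go to vertical batches while the rest are grouped per font.

```python
from collections import defaultdict

def group_for_proofreading(garbled):
    """Split garbled characters into vertical batches and horizontal groups.

    garbled: list of (font_name, code, glyph_key) tuples; glyph_key is a
    hypothetical identifier that is equal for visually identical glyphs.
    """
    by_glyph = defaultdict(list)
    for font, code, glyph_key in garbled:
        by_glyph[glyph_key].append((font, code))

    vertical = {}                    # glyph_key -> same character in several fonts
    horizontal = defaultdict(list)   # font_name -> its remaining characters
    for glyph_key, members in by_glyph.items():
        if len({font for font, _ in members}) > 1:
            vertical[glyph_key] = members        # fix once, apply to every font
        else:
            for font, code in members:
                horizontal[font].append((code, glyph_key))
    return vertical, dict(horizontal)

garbled = [("F1", 1, "g1"), ("F2", 7, "g1"), ("F1", 2, "g2"), ("F1", 3, "g3")]
vertical, horizontal = group_for_proofreading(garbled)
```

In the example, glyph "g1" occurs in two fonts and can be corrected once for both, while the two remaining characters of font "F1" are reviewed side by side against their images.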
Given the structure of PDF files, the garbled characters can be removed by using the correction result to create or update, font by font, the mapping table between the current codes and the standard codes of all characters of each embedded font.
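In a PDF file such a per-font mapping table is commonly stored as a ToUnicode CMap, whose entries are written as `beginbfchar` / `endbfchar` blocks. The sketch below renders corrected mappings in that form; it assumes that the ToUnicode CMap is the table being updated, produces only the `bfchar` fragment rather than a complete CMap stream, and leaves writing the stream back into the PDF to a PDF library that is not shown. The function name and parameters are hypothetical.

```python
def bfchar_section(mapping, code_bytes=2):
    """Render {char_code: unicode_text} as beginbfchar/endbfchar blocks.

    code_bytes: width of a character code (2 for typical composite fonts,
    1 for simple fonts). Entries are split into blocks of at most 100,
    the conventional limit for CMap blocks.
    """
    items = sorted(mapping.items())
    lines = []
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            src = format(code, f"0{code_bytes * 2}X")        # source code as hex
            dst = text.encode("utf-16-be").hex().upper()     # target as UTF-16BE hex
            lines.append(f"<{src}> <{dst}>")
        lines.append("endbfchar")
    return "\n".join(lines)

corrected = {0x0001: "中", 0x0002: "A"}   # hypothetical correction result for one font
section = bfchar_section(corrected)
```

For the two corrections above, the fragment maps code `<0001>` to `<4E2D>` and code `<0002>` to `<0041>`, so a reader that honors the updated table extracts the characters that are displayed.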
Although the embodiment of the present invention is disclosed above, it is described only to aid understanding of the invention and is not intended to limit it. Any person skilled in the art to which the invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention remains as defined by the appended claims.

Claims (6)

1. A method for detecting and correcting garbled characters in a PDF document, characterized in that the method comprises:
Extracting all font features in the PDF document;
Dividing the fonts into normal fonts, garbled fonts and undetermined fonts according to the font features;
Extracting the dot-matrix images of the characters in the undetermined fonts, calculating the similarity between each dot-matrix image and its corresponding code with a garbled-character detection algorithm based on image statistical features, and judging each character in the undetermined fonts to be normal or garbled according to the similarity;
Performing vertical and horizontal proofreading correction on the garbled characters in the undetermined fonts and the garbled characters in the garbled fonts;
Modifying the PDF document according to the correction result to remove the garbled characters.
2. The method for detecting and correcting garbled characters in a PDF document as claimed in claim 1, characterized in that the font features comprise: font type, font encoding scheme, whether a mapping exists between the current encoding and a standard encoding, and whether the font is embedded.
3. The method for detecting and correcting garbled characters in a PDF document as claimed in claim 1, characterized in that the garbled-character detection algorithm based on image statistical features comprises: an image feature extraction algorithm, a parameter estimation method for the statistical model, and an image feature matching algorithm.
4. The method for detecting and correcting garbled characters in a PDF document as claimed in claim 1, characterized in that, if the dot-matrix image of a character in the undetermined font is similar to its corresponding code, the character is judged to be a normal character; otherwise it is judged to be a garbled character.
5. The method for detecting and correcting garbled characters in a PDF document as claimed in claim 1, characterized in that:
The vertical proofreading correction pools the identical characters of different fonts and modifies them together in a batch;
The horizontal proofreading correction pools the different characters of the same font and modifies them by comparing image and text.
6. The method for detecting and correcting garbled characters in a PDF document as claimed in claim 1, characterized in that modifying the PDF document according to the correction result comprises, font by font, using the correction result to create or update the mapping table between the current codes and the standard codes of all characters of the embedded fonts of the PDF document, so as to remove the garbled characters.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title | Status
CN201510181385.0A (CN104732228B, en) | 2015-04-16 | 2015-04-16 | Method for detecting and correcting garbled characters in PDF documents | Active

Publications (2)

Publication Number | Publication Date
CN104732228A | 2015-06-24
CN104732228B | 2018-03-30

Family

ID=53456103

Country Status (1)

Country Link
CN (1) CN104732228B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02278104A (en) * 1989-04-19 1990-11-14 Fuji Electric Co Ltd Detecting method for angle of inclination of document image
CN101110072A (en) * 2007-08-21 2008-01-23 无敌科技(西安)有限公司 Device and method for automatic identifying literal code
CN104346616A (en) * 2013-08-09 2015-02-11 北大方正集团有限公司 Character recognition device and character recognition method
CN104424010A (en) * 2013-09-06 2015-03-18 北大方正集团有限公司 Method and system for detecting and repairing text document messy codes
CN104424165A (en) * 2013-09-06 2015-03-18 北大方正集团有限公司 Messy code detection method and system for text documents

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105488471A (en) * 2015-11-30 2016-04-13 北大方正集团有限公司 Character pattern recognition method and device
CN105488471B (en) * 2015-11-30 2019-03-29 北大方正集团有限公司 A kind of font recognition methods and device
CN105718554A (en) * 2016-01-19 2016-06-29 深圳市天朗时代科技有限公司 Document collaboration conversion method and system
US10261729B1 (en) 2018-02-27 2019-04-16 Ricoh Company, Ltd. Document manipulation mechanism
CN110728115B (en) * 2018-07-17 2024-01-26 珠海金山办公软件有限公司 Document content messy code identification method and device and electronic equipment
CN110728115A (en) * 2018-07-17 2020-01-24 珠海金山办公软件有限公司 Disordered code identification method and device for document content and electronic equipment
CN110728111A (en) * 2018-07-17 2020-01-24 珠海金山办公软件有限公司 Messy code repairing method and device for document content, terminal equipment and server
CN108985289A (en) * 2018-07-18 2018-12-11 百度在线网络技术(北京)有限公司 Messy code detection method and device
CN110765826A (en) * 2018-07-27 2020-02-07 珠海金山办公软件有限公司 Method and device for identifying messy codes in Portable Document Format (PDF)
CN109684962B (en) * 2018-12-14 2023-04-18 苏州梦想人软件科技有限公司 AR electronic book quality detection method
CN109684962A (en) * 2018-12-14 2019-04-26 苏州梦想人软件科技有限公司 AR e-book quality determining method
CN111695327B (en) * 2019-02-28 2024-01-26 珠海金山办公软件有限公司 Method and device for repairing messy codes, electronic equipment and readable storage medium
CN111695327A (en) * 2019-02-28 2020-09-22 珠海金山办公软件有限公司 Method and device for repairing messy codes, electronic equipment and readable storage medium
CN111144107B (en) * 2019-12-25 2022-08-09 福建天晴在线互动科技有限公司 Messy code identification method based on slicing algorithm
CN111144107A (en) * 2019-12-25 2020-05-12 福建天晴在线互动科技有限公司 Messy code identification method based on slicing algorithm
CN111401362A (en) * 2020-03-06 2020-07-10 上海眼控科技股份有限公司 Tampering detection method, device, equipment and storage medium for vehicle VIN code
CN113627129A (en) * 2020-05-08 2021-11-09 珠海金山办公软件有限公司 Character copying method and device, electronic equipment and readable storage medium
CN113627129B (en) * 2020-05-08 2024-06-21 珠海金山办公软件有限公司 Text copying method and device, electronic equipment and readable storage medium
CN113158745A (en) * 2021-02-02 2021-07-23 北京惠朗时代科技有限公司 Disorder code document picture identification method and system based on multi-feature operator
CN113158745B (en) * 2021-02-02 2024-04-02 北京惠朗时代科技有限公司 Multi-feature operator-based messy code document picture identification method and system
CN114529930B (en) * 2022-01-13 2024-03-01 上海森亿医疗科技有限公司 PDF restoration method, storage medium and device based on nonstandard mapping fonts
CN114529930A (en) * 2022-01-13 2022-05-24 上海森亿医疗科技有限公司 PDF repairing method based on non-standard mapping font, storage medium and equipment
CN114519858B (en) * 2022-02-16 2023-09-05 北京百度网讯科技有限公司 Document image recognition method and device, storage medium and electronic equipment
CN114519858A (en) * 2022-02-16 2022-05-20 北京百度网讯科技有限公司 Document image recognition method and device, storage medium and electronic equipment
CN114629707A (en) * 2022-03-16 2022-06-14 深信服科技股份有限公司 Method and device for detecting messy codes, electronic equipment and storage medium
CN114629707B (en) * 2022-03-16 2024-05-24 深信服科技股份有限公司 Disorder code detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN104732228B (en) 2018-03-30

Similar Documents

Publication Publication Date Title
CN104732228A (en) Detection and correction method for messy codes of PDF (portable document format) document
CN101782896B (en) PDF character extraction method combined with OCR technology
CN108415887B (en) Method for converting PDF file into OFD file
CN107622230B (en) PDF table data analysis method based on region identification and segmentation
CN111428474A (en) Language model-based error correction method, device, equipment and storage medium
US11615635B2 (en) Heuristic method for analyzing content of an electronic document
JPH0798765A (en) Direction-detecting method and image analyzer
CN112580308A (en) Document comparison method and device, electronic equipment and readable storage medium
JP5664174B2 (en) Apparatus and method for extracting circumscribed rectangle of character from portable electronic file
CN104462068B (en) Character conversion system and character conversion method
KR102345498B1 (en) Line segmentation method
WO2019041527A1 (en) Method of extracting chart in document, electronic device and computer-readable storage medium
CN108038093B (en) PDF character extraction method and device
EP2845147B1 (en) Re-digitization and error correction of electronic documents
CN110543844A (en) metadata extraction method for government affair metadata PDF file
CN111368695A (en) Table structure extraction method
CN113723270A (en) File processing method and device based on RPA and AI
US10643022B2 (en) PDF extraction with text-based key
CN104516859B (en) A kind of word modification method and system
CN112949290B (en) Text error correction method and device and communication equipment
CN117058157A (en) CAD drawing cutting and labeling method
CN110442843B (en) Character replacement method, system, computer device and computer readable storage medium
CN112699634B (en) Typesetting processing method of electronic book, electronic equipment and storage medium
CN103942182B (en) A kind of English text form optimization method and device
CN105335346A (en) PDF (Portable Document Format) document text extracting method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant