CN104463153A - Method and system for increasing recognition rate of characters in format file - Google Patents

Info

Publication number
CN104463153A
Authority
CN
China
Prior art keywords
character
coding
book
format document
universal standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310450972.6A
Other languages
Chinese (zh)
Other versions
CN104463153B (en)
Inventor
董宁
耿蕾蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Founder Apabi Technology Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd, Beijing Founder Apabi Technology Co Ltd filed Critical Peking University Founder Group Co Ltd
Priority to CN201310450972.6A priority Critical patent/CN104463153B/en
Publication of CN104463153A publication Critical patent/CN104463153A/en
Application granted granted Critical
Publication of CN104463153B publication Critical patent/CN104463153B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a method and system for improving the character recognition rate in a format document. For each preset character in the format document, the original character encoding is compared with the universal standard character encoding to obtain an encoding comparison result. Probability statistics are performed on the comparison results of multiple preset characters to obtain a probability value, which is compared with a threshold. If the probability value exceeds the threshold, the characters obtained by looking up the original character encodings in the universal standard character code library are displayed; otherwise, the characters recognized by OCR are displayed. By using probability statistics to choose between these two sources, the accuracy of character recognition is effectively improved.

Description

A method and system for improving the character recognition rate in a format document
Technical field
The present invention relates to a method for improving the text recognition rate, and specifically to a method and system for improving the character recognition rate in a format document.
Background
To ensure a good reading experience, the typeset files that publishers of books and periodicals release before printing are generally format documents. A format document is a file that explicitly records information such as the position, glyph bitmap, font, size, and color of each character; it may also record the encoding of each character. Because a format document describes the glyph bitmaps and the relative positions of the characters, it is stable: a reader viewing it in any computing environment sees the same visual result as in the printed book or periodical. The most commonly used format document is PDF.
Although some format documents record character encodings, characters are generally displayed according to their glyph bitmaps, not according to their encodings. When text is extracted from a format document, the recorded character encodings may have been produced with either a universal standard encoding or a custom encoding. For any particular format document, the encoding scheme of its characters is therefore uncertain, and the text cannot be reliably obtained from the encodings alone.
For this reason, the prior art usually extracts the characters of a format document with OCR (Optical Character Recognition). However, because OCR itself has a limited recognition rate, text recognized by OCR often has a high error rate, which impairs reading.
Summary of the invention
The technical problem to be solved by the present invention is therefore to overcome the high error rate of prior-art character recognition based on OCR, by providing a method and system for improving the character recognition rate in a format document.
To solve the above technical problem, the present invention provides a method for improving the character recognition rate in a format document,
comprising the following steps:
comparing, for each preset character in the format document, its original character encoding with its universal standard character encoding, to obtain an encoding comparison result of either "same encoding" or "different encoding";
performing probability statistics on the encoding comparison results of a plurality of the preset characters, to obtain a probability value that the preset characters use the universal standard character encoding;
comparing the probability value with a threshold; if the probability value exceeds the threshold, displaying each preset character as the character obtained by looking up its original character encoding in the universal standard character code library; otherwise, directly displaying the character recognized for that preset character by OCR.
In the method for improving the character recognition rate in a format document, before the step of obtaining the encoding comparison result, the method further comprises the following steps:
extracting the glyph bitmap of each preset character in the format document;
extracting the original character encoding of each preset character in the format document;
performing OCR on the glyph bitmap to obtain a recognized character;
looking up the recognized character in the universal standard character code library to obtain its universal standard character encoding.
In the method for improving the character recognition rate in a format document, before the step of extracting the original character encoding, the method further comprises the following step:
screening the characters in the format document that have an original character encoding and taking them as the preset characters.
In the method for improving the character recognition rate in a format document, after the step of screening the characters that have an original character encoding as preset characters, the method further comprises the following step:
assigning an ID number to each preset character.
In the method for improving the character recognition rate in a format document, after the step of extracting the original character encoding of each preset character in the format document, the method further comprises the following step:
establishing an original character encoding table, and storing the ID of each preset character together with its corresponding original character encoding in the original character encoding table.
In the method for improving the character recognition rate in a format document, after the step of obtaining the universal standard character encoding, the method further comprises the following step:
establishing a standard character encoding table, and storing the ID of each preset character together with its corresponding standard character encoding in the standard character encoding table.
In the method for improving the character recognition rate in a format document, before the probability value is compared with the threshold and the corresponding operation is performed, the method further comprises the following step:
establishing an editable interface for displaying, modifying, and confirming the characters.
A system for improving the character recognition rate in a format document comprises an encoding comparison device, a probability statistics device, and a probability-threshold comparison device, wherein:
the encoding comparison device is configured to compare, for each preset character in the format document, its original character encoding with its universal standard character encoding, to obtain an encoding comparison result of either "same encoding" or "different encoding";
the probability statistics device is configured to perform probability statistics on the encoding comparison results of a plurality of the preset characters, to obtain a probability value that the preset characters use the universal standard character encoding;
the probability-threshold comparison device is configured to compare the probability value with a threshold; if the probability value exceeds the threshold, each preset character is displayed as the character obtained by looking up its original character encoding in the universal standard character code library; otherwise, the character recognized for that preset character by OCR is displayed directly.
The system for improving the character recognition rate in a format document further comprises a glyph bitmap extraction device, an original character encoding extraction device, an OCR recognition device, and a universal standard character encoding mapping device, wherein:
the glyph bitmap extraction device is configured to extract the glyph bitmap of each preset character in the format document;
the original character encoding extraction device is configured to extract the original character encoding of each preset character in the format document;
the OCR recognition device is configured to perform OCR on the extracted glyph bitmap to obtain a recognized character;
the universal standard character encoding mapping device is configured to look up the recognized character in the universal standard character code library to obtain its universal standard character encoding.
The system for improving the character recognition rate in a format document further comprises a preset character screening device, configured to screen the characters in the format document that have an original character encoding and take them as the preset characters.
The system for improving the character recognition rate in a format document further comprises an ID numbering device, configured to assign an ID number to each preset character.
The system for improving the character recognition rate in a format document further comprises an original character encoding table establishing device, configured to establish an original character encoding table and to store the ID of each preset character together with its corresponding original character encoding in the original character encoding table.
The system for improving the character recognition rate in a format document further comprises a standard character encoding table establishing device, configured to establish a standard character encoding table and to store the ID of each preset character together with its corresponding standard character encoding in the standard character encoding table.
The system for improving the character recognition rate in a format document further comprises an editable interface establishing device, configured to establish an editable interface for displaying, modifying, and confirming the characters.
Compared with the prior art, the above technical solution of the present invention has the following advantages:
1. In the method and system of the present invention, the original character encoding of each preset character in the format document is compared with its universal standard character encoding to obtain an encoding comparison result of "same encoding" or "different encoding"; probability statistics are performed on a plurality of these comparison results to obtain a probability value; and the probability value is compared with a threshold. If the threshold is exceeded, the characters obtained by looking up the original character encodings in the universal standard character code library are displayed; otherwise, the characters recognized by OCR are displayed. By using probability statistics to choose between these two sources, the invention effectively improves the accuracy of character recognition.
2. In the method and system of the present invention, before the step of obtaining the encoding comparison result, the following steps are also performed: extracting the glyph bitmap of each preset character in the format document; extracting the original character encoding of each preset character in the format document; performing OCR on the glyph bitmap to obtain a recognized character; and looking up the recognized character in the universal standard character code library to obtain its universal standard character encoding. OCR yields the recognized character, from which the universal standard character encoding is readily obtained. The OCR recognition device is a commercially available general-purpose module and has the advantage of low cost.
3. In the method and system of the present invention, before the step of extracting the original character encoding, the characters in the format document that have an original character encoding are screened as preset characters. This screening reduces the number of characters whose glyph bitmaps must be extracted, which effectively shortens the run time of the invention and improves operating efficiency. The invention also assigns an ID number to each preset character; the ID makes it easy to keep each preset character in accurate one-to-one correspondence with its original character encoding and its recognized character. The invention further establishes an original character encoding table and a standard character encoding table. The former manages the original character encodings and the latter manages the standard character encodings, which reduces the run time of the invention.
4. In the method and system of the present invention, an editable interface is also established. The editable interface can display, modify, and confirm the displayed characters, so that wrongly displayed characters can be corrected manually.
Brief description of the drawings
To make the content of the present invention easier to understand, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings, in which:
Fig. 1 is a flow chart of a method for improving the character recognition rate in a format document according to one embodiment of the present invention;
Fig. 2 is a structural block diagram of a system for improving the character recognition rate in a format document according to one embodiment of the present invention.
Detailed description
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the embodiments described here serve only to illustrate and explain the present invention and do not limit it.
Embodiment 1
As one embodiment of the present invention, as shown in Fig. 1, a method for improving the character recognition rate in a format document comprises the following steps:
Comparing, for each preset character in the format document, its original character encoding with its universal standard character encoding, to obtain an encoding comparison result of either "same encoding" or "different encoding".
Performing probability statistics on the encoding comparison results of a plurality of the preset characters, to obtain a probability value that the preset characters use the universal standard character encoding.
Comparing the probability value with a threshold. If the probability value exceeds the threshold, each preset character is displayed as the character obtained by looking up its original character encoding in the universal standard character code library. Otherwise, the character recognized for that preset character by OCR is displayed directly.
By means of probability statistics, the present invention chooses between displaying the characters obtained by looking up the original character encodings in the universal standard character code library and displaying the characters recognized by OCR. When the preset characters use the universal standard character encoding, the characters obtained from the original character encodings replace the OCR-recognized characters. Because characters obtained by looking up original character encodings in the universal standard character code library are more accurate than OCR results, the invention improves the overall accuracy of text recognition.
Embodiment 2
As one embodiment of the present invention, on the basis of Embodiment 1, before the step of obtaining the encoding comparison result, the method further comprises the following steps:
Extracting the glyph bitmap of each preset character in the format document.
Performing OCR on the extracted glyph bitmap to obtain a recognized character.
Looking up the recognized character in the universal standard character code library to obtain its universal standard character encoding. Here, the universal standard character encoding is the Chinese national standard GB2312.
Extracting the original character encoding of each preset character in the format document.
The steps of obtaining the universal standard character encoding and obtaining the original character encoding may be performed at the same time or in either order: the universal standard character encoding may be obtained first and the original character encoding second, or the reverse. The object of the invention is achieved as long as both encodings have been obtained before the comparison.
OCR yields the recognized character, from which the universal standard character encoding is readily obtained.
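Looking up the universal standard encoding of a recognized character can be illustrated with Python's built-in `gb2312` codec. This is a hedged sketch: the helper names `gb2312_code` and `gb2312_char` are hypothetical, and the OCR step is assumed to have already produced a Unicode string.

```python
from typing import Optional


def gb2312_code(recognized_char: str) -> Optional[bytes]:
    """Return the GB2312 bytes of a character, or None if GB2312 has no code for it."""
    try:
        return recognized_char.encode("gb2312")
    except UnicodeEncodeError:
        return None


def gb2312_char(code: bytes) -> Optional[str]:
    """Inverse lookup: decode GB2312 bytes to a character, or None if the bytes are invalid."""
    try:
        return code.decode("gb2312")
    except UnicodeDecodeError:
        return None
```

For example, `gb2312_code("中")` returns `b"\xd6\xd0"`, and `gb2312_char(b"\xd6\xd0")` returns `"中"`. Returning `None` for characters outside GB2312 lets the later comparison treat them as "different encoding" rather than fail.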
Embodiment 3
As one embodiment of the present invention, on the basis of Embodiment 2, before the step of extracting the original character encoding, the method further comprises the following step:
Screening the characters in the format document that have an original character encoding and taking them as the preset characters. This screening reduces the number of characters whose glyph bitmaps must be extracted, which effectively shortens the run time of the invention and improves operating efficiency.
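The screening step amounts to a filter over the characters of the document. The sketch below assumes a hypothetical `PageCharacter` record in which a missing encoding is represented as `None`; the actual data layout of a format document is not specified in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageCharacter:
    """One character on the page (hypothetical structure for illustration)."""
    glyph_bitmap: bytes
    original_code: Optional[bytes]  # None when the document records no encoding


def screen_preset_characters(chars: List[PageCharacter]) -> List[PageCharacter]:
    """Keep only the characters that carry an original encoding."""
    return [c for c in chars if c.original_code is not None]
```

Only the characters that survive this filter need to go through glyph bitmap extraction and OCR for the comparison.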
Embodiment 4
As one embodiment of the present invention, on the basis of Embodiment 3, after the step of screening the characters in the format document that have an original character encoding as preset characters, the method further comprises the following step:
Assigning an ID number to each preset character. The ID makes it easy to keep each preset character in accurate one-to-one correspondence with its original character encoding and its recognized character.
Embodiment 5
As one embodiment of the present invention, on the basis of Embodiment 4, after the step of extracting the original character encoding of each preset character in the format document, the method further comprises the following step:
Establishing an original character encoding table, and storing the ID of each preset character together with its corresponding original character encoding in the original character encoding table. The original character encoding table manages the original character encodings and reduces the run time of the invention.
Embodiment 6
As one embodiment of the present invention, on the basis of Embodiment 4 or Embodiment 5, after the step of obtaining the universal standard character encoding, the method further comprises the following step:
Establishing a standard character encoding table, and storing the ID of each preset character together with its corresponding standard character encoding in the standard character encoding table. The standard character encoding table manages the standard character encodings and reduces the run time of the invention.
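The ID numbering and the two tables of Embodiments 4 to 6 can be sketched with plain dictionaries keyed by ID. The function names and the choice of sequential integer IDs are assumptions made for illustration; the text does not prescribe a storage format.

```python
from typing import Dict, List, Tuple


def build_tables(
    preset_chars: List[Tuple[bytes, bytes]]
) -> Tuple[Dict[int, bytes], Dict[int, bytes]]:
    """Assign sequential IDs and store each encoding under its character's ID.

    Each input item is (original_code, standard_code) for one preset character.
    """
    original_table: Dict[int, bytes] = {}
    standard_table: Dict[int, bytes] = {}
    for char_id, (original_code, standard_code) in enumerate(preset_chars):
        original_table[char_id] = original_code
        standard_table[char_id] = standard_code
    return original_table, standard_table


def compare_by_id(
    original_table: Dict[int, bytes], standard_table: Dict[int, bytes]
) -> Dict[int, bool]:
    """Compare the two tables ID by ID; True means the encodings are identical."""
    shared_ids = original_table.keys() & standard_table.keys()
    return {i: original_table[i] == standard_table[i] for i in sorted(shared_ids)}
```

Keying both tables by the same ID means the comparison never depends on the order in which the two encodings were obtained.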
Embodiment 7
As one embodiment of the present invention, on the basis of the above embodiments, before the probability value is compared with the threshold and the corresponding operation is performed, the method further comprises the following step:
Establishing an editable interface for displaying, modifying, and confirming the characters.
The editable interface can display, modify, and confirm the displayed characters, so that wrongly displayed characters can be corrected manually.
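The data behind such an interface can be as small as one record per displayed character. The sketch below is a hypothetical model only (the class name `ReviewItem` and its fields are assumptions); it covers the display, modify, and confirm operations and leaves the actual user interface out.

```python
from dataclasses import dataclass


@dataclass
class ReviewItem:
    """A displayed character awaiting manual review (hypothetical model)."""
    char_id: int
    displayed: str
    confirmed: bool = False

    def modify(self, corrected: str) -> None:
        """Replace the displayed character; the change must be confirmed again."""
        self.displayed = corrected
        self.confirmed = False

    def confirm(self) -> None:
        """Mark the displayed character as verified by the user."""
        self.confirmed = True
```

Resetting `confirmed` on every modification is a design choice of this sketch, so that an edited character is never treated as verified by accident.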
As one embodiment of the present invention, on the basis of the above embodiments, the threshold is 90%.
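As a worked illustration of the 90% threshold, assume the probability value is estimated as the fraction of preset characters whose encodings match (an assumption; the text says only "probability statistics"). With N preset characters, of which m have matching encodings:

```latex
p = \frac{m}{N}, \qquad
\text{displayed characters come from }
\begin{cases}
\text{the original encodings} & \text{if } p > 0.90,\\
\text{the OCR result} & \text{otherwise.}
\end{cases}
```

For instance, with N = 200 and m = 186, p = 0.93 exceeds the threshold, so the original encodings are used. With m = 170, p = 0.85 does not exceed it, so the OCR result is displayed.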
Embodiment 8
As one embodiment of the present invention, as shown in Fig. 2, a system for improving the character recognition rate in a format document comprises an encoding comparison device, a probability statistics device, and a probability-threshold comparison device, wherein:
The encoding comparison device is configured to compare, for each preset character in the format document, its original character encoding with its universal standard character encoding, to obtain an encoding comparison result of either "same encoding" or "different encoding".
The probability statistics device is configured to perform probability statistics on the encoding comparison results of a plurality of the preset characters, to obtain a probability value that the preset characters use the universal standard character encoding.
The probability-threshold comparison device is configured to compare the probability value with a threshold. If the probability value exceeds the threshold, each preset character is displayed as the character obtained by looking up its original character encoding in the universal standard character code library. Otherwise, the character recognized for that preset character by OCR is displayed directly.
By using probability statistics to choose between displaying the characters obtained from the original character encodings and displaying the characters recognized by OCR, the invention effectively improves the accuracy of text recognition.
Embodiment 9
As one embodiment of the present invention, on the basis of Embodiment 8, the system further comprises a glyph bitmap extraction device, an original character encoding extraction device, an OCR recognition device, and a universal standard character encoding mapping device, wherein:
The glyph bitmap extraction device is configured to extract the glyph bitmap of each preset character in the format document.
The original character encoding extraction device is configured to extract the original character encoding of each preset character in the format document.
The OCR recognition device is configured to perform OCR on the extracted glyph bitmap to obtain a recognized character.
The universal standard character encoding mapping device is configured to look up the recognized character in the universal standard character code library to obtain its universal standard character encoding.
OCR yields the recognized character, from which the universal standard character encoding is readily obtained. The OCR recognition device is a commercially available general-purpose module and has the advantage of low cost.
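How the devices of Embodiments 8 and 9 fit together can be sketched as one function that takes the OCR module as a pluggable callable. This is an illustrative assumption rather than the patented design: the name `recognize_text`, the pair layout of the input, and the fraction-of-matches probability are all hypothetical, and no real OCR library is called.

```python
from typing import Callable, List, Tuple


def recognize_text(
    preset_chars: List[Tuple[bytes, bytes]],
    ocr: Callable[[bytes], str],
    threshold: float = 0.9,
) -> str:
    """Return the document text from original encodings or from OCR.

    preset_chars holds (glyph_bitmap, original_code) pairs; ocr is any
    callable mapping one glyph bitmap to one recognized character.
    """
    if not preset_chars:
        return ""
    recognized = [ocr(bitmap) for bitmap, _ in preset_chars]
    matches = 0
    for (_, original_code), char in zip(preset_chars, recognized):
        try:
            standard_code = char.encode("gb2312")
        except UnicodeEncodeError:
            standard_code = None  # OCR produced a character outside GB2312
        if original_code == standard_code:
            matches += 1
    probability = matches / len(preset_chars)
    if probability > threshold:
        # The document most likely uses GB2312, so trust its own encodings.
        return "".join(
            code.decode("gb2312", errors="replace") for _, code in preset_chars
        )
    return "".join(recognized)
```

Passing the OCR module in as a callable mirrors the text's point that it is an interchangeable general-purpose component. When most encodings agree with the OCR result, the few characters that OCR misread are corrected by the document's own encodings.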
Embodiment 10
As one embodiment of the present invention, on the basis of Embodiment 9, the system further comprises a preset character screening device, configured to screen the characters in the format document that have an original character encoding and take them as the preset characters. The preset character screening device reduces the number of characters whose glyph bitmaps must be extracted, which effectively shortens the run time of the invention and improves operating efficiency.
Embodiment 11
As one embodiment of the present invention, on the basis of Embodiment 10, the system further comprises an ID numbering device, configured to assign an ID number to each preset character. The ID numbering device makes it easy to keep each preset character in accurate one-to-one correspondence with its original character encoding and its recognized character.
Embodiment 12
As one embodiment of the present invention, on the basis of Embodiment 11, the system further comprises an original character encoding table establishing device, configured to establish an original character encoding table and to store the ID of each preset character together with its corresponding original character encoding in the original character encoding table. The original character encoding table establishing device manages the original character encodings and reduces the run time of the invention.
Embodiment 13
As one embodiment of the present invention, on the basis of Embodiment 11 or Embodiment 12, the system further comprises a standard character encoding table establishing device, configured to establish a standard character encoding table and to store the ID of each preset character together with its corresponding standard character encoding in the standard character encoding table. The standard character encoding table establishing device manages the standard character encodings and reduces the run time of the invention.
Embodiment 14
As one embodiment of the present invention, on the basis of any one of Embodiments 8 to 13, the system further comprises an editable interface establishing device, configured to establish an editable interface for displaying, modifying, and confirming the characters. The editable interface can display, modify, and confirm the displayed characters, so that wrongly displayed characters can be corrected manually.
As one embodiment of the present invention, on the basis of the above embodiments, the threshold is 90%.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present invention is described with reference to flow charts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and any combination of flows and/or blocks, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor produce means for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Claims (10)

1. A method for improving the character recognition rate in a format document, characterized in that it comprises the following steps:
Comparing the original character encoding and the universal standard character encoding corresponding to the same book character in the format document to obtain an encoding comparison result indicating that the two encodings are either identical or different;
Performing probability statistics on the encoding comparison results corresponding to a plurality of the book characters to obtain a probability value that the book characters adopt the universal standard character encoding;
Comparing the probability value with a threshold; if the probability value exceeds the threshold, displaying each book character as the character obtained by looking up its original character encoding in the universal standard character encoding library; otherwise, directly displaying the character recognized for that book character by OCR.
2. The method for improving the character recognition rate in a format document according to claim 1, characterized in that, before the step of obtaining the encoding comparison result, it further comprises the following steps:
Extracting the glyph bitmap of each book character in the format document;
Extracting the original character encoding of each book character in the format document;
Performing OCR on the glyph bitmap to obtain a recognized character;
Looking up the recognized character in the universal standard character encoding library to obtain the universal standard character encoding.
3. The method for improving the character recognition rate in a format document according to claim 2, characterized in that, before the step of extracting the original character encoding, it further comprises the following step:
Selecting, as the book characters, the characters in the format document that have an original character encoding.
4. The method for improving the character recognition rate in a format document according to claim 3, characterized in that, after the step of selecting the characters in the format document that have an original character encoding as the book characters, it further comprises the following step:
Assigning an ID number to each book character.
5. The method for improving the character recognition rate in a format document according to claim 4, characterized in that, after the step of extracting the original character encoding of each book character in the format document, it further comprises the following step:
Establishing an original character encoding table, and storing the ID of each book character together with its corresponding original character encoding in the original character encoding table.
6. The method for improving the character recognition rate in a format document according to claim 4 or 5, characterized in that, after the step of obtaining the universal standard character encoding, it further comprises the following step:
Establishing a standard character encoding table, and storing the ID of each book character together with its corresponding standard character encoding in the standard character encoding table.
7. The method for improving the character recognition rate in a format document according to any one of claims 1-6, characterized in that, before the probability value is compared with the threshold and the corresponding operation is performed, it further comprises the following step:
Establishing an editable interface for displaying, modifying, and confirming the characters.
8. A system for improving the character recognition rate in a format document, characterized in that it comprises an encoding comparison device, a probability statistics device, and a probability value and threshold comparison device, wherein:
The encoding comparison device is used for comparing the original character encoding and the universal standard character encoding corresponding to the same book character in the format document to obtain an encoding comparison result indicating that the two encodings are either identical or different;
The probability statistics device is used for performing probability statistics on the encoding comparison results corresponding to a plurality of the book characters to obtain a probability value that the book characters adopt the universal standard character encoding;
The probability value and threshold comparison device is used for comparing the probability value with a threshold; if the probability value exceeds the threshold, each book character is displayed as the character obtained by looking up its original character encoding in the universal standard character encoding library; otherwise, the character recognized for that book character by OCR is displayed directly.
9. The system for improving the character recognition rate in a format document according to claim 8, characterized in that it further comprises a glyph bitmap extraction device, an original character encoding extraction device, an OCR recognition device, and a universal standard character encoding lookup device, wherein:
The glyph bitmap extraction device is used for extracting the glyph bitmap of each book character in the format document;
The original character encoding extraction device is used for extracting the original character encoding of each book character in the format document;
The OCR recognition device is used for performing OCR on the extracted glyph bitmap to obtain a recognized character;
The universal standard character encoding lookup device is used for looking up the recognized character in the universal standard character encoding library to obtain the universal standard character encoding.
10. The system for improving the character recognition rate in a format document according to claim 9, characterized in that it further comprises a book character screening device, which is used for selecting, as the book characters, the characters in the format document that have an original character encoding.
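
The decision logic recited in claims 1 and 2 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that Unicode code points stand in for the universal standard character encoding library, that OCR results are already available as single characters, and that the probability value is the fraction of book characters whose original encoding matches the standard encoding of the OCR result. The names `BookCharacter`, `standard_code`, `standard_probability`, and `resolve_text`, and the default threshold of 0.9, are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BookCharacter:
    char_id: int        # ID number assigned to the character (claim 4)
    original_code: int  # original character encoding stored in the document
    ocr_char: str       # single character recognized by OCR from the glyph bitmap


def standard_code(ch: str) -> int:
    """Look up the standard encoding of a recognized character.

    Assumption: Unicode code points play the role of the universal
    standard character encoding library.
    """
    return ord(ch)


def standard_probability(chars: List[BookCharacter]) -> float:
    """Fraction of characters whose original encoding equals the standard
    encoding of the OCR result (the 'identical' comparison results)."""
    if not chars:
        return 0.0
    identical = sum(
        1 for c in chars if c.original_code == standard_code(c.ocr_char)
    )
    return identical / len(chars)


def resolve_text(chars: List[BookCharacter], threshold: float = 0.9) -> str:
    """Choose between the embedded encodings and the OCR output."""
    if standard_probability(chars) > threshold:
        # The document very likely uses the standard encoding, so decode
        # the original codes and ignore occasional OCR misreads.
        return "".join(chr(c.original_code) for c in chars)
    # Otherwise the embedded codes are unreliable (for example a custom
    # font encoding), so fall back to the OCR result.
    return "".join(c.ocr_char for c in chars)
```

In this sketch, a document whose embedded codes are standard but whose OCR pass misreads one character still resolves to the correct text, while a document with a private font encoding falls back to the OCR output.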
CN201310450972.6A 2013-09-25 2013-09-25 Method and system for improving the character recognition rate in a format document Expired - Fee Related CN104463153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310450972.6A CN104463153B (en) 2013-09-25 2013-09-25 Method and system for improving the character recognition rate in a format document

Publications (2)

Publication Number Publication Date
CN104463153A true CN104463153A (en) 2015-03-25
CN104463153B CN104463153B (en) 2018-09-04

Family

ID=52909169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310450972.6A Expired - Fee Related CN104463153B (en) 2013-09-25 2013-09-25 Method and system for improving the character recognition rate in a format document

Country Status (1)

Country Link
CN (1) CN104463153B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038093A (en) * 2017-11-10 2018-05-15 万兴科技股份有限公司 PDF text extraction methods and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5955579A (en) * 1982-09-24 1984-03-30 Fujitsu Ltd Character recognizer
JPH06187505A (en) * 1992-12-21 1994-07-08 Hitachi Ltd Data entry system/method
CN101782896A (en) * 2009-01-21 2010-07-21 汉王科技股份有限公司 PDF character extraction method combined with OCR technology
CN102194503A (en) * 2010-03-12 2011-09-21 腾讯科技(深圳)有限公司 Player and character code detection method and device for subtitle file

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5955579B2 (en) * 2011-07-21 2016-07-20 日東電工株式会社 Protection sheet for glass etching

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038093A (en) * 2017-11-10 2018-05-15 万兴科技股份有限公司 PDF text extraction methods and device
CN108038093B (en) * 2017-11-10 2021-06-15 深圳市亿图软件有限公司 PDF character extraction method and device

Also Published As

Publication number Publication date
CN104463153B (en) 2018-09-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220620

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: FOUNDER APABI TECHNOLOGY Ltd.

Address before: 100871, Beijing, Haidian District Cheng Fu Road 298, founder building, 9 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: FOUNDER APABI TECHNOLOGY Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180904