CN109447055A - An OCR-based method for recognizing characters of similar shape - Google Patents

An OCR-based method for recognizing characters of similar shape

Info

Publication number
CN109447055A
Authority
CN
China
Prior art keywords
character
font
identification
character recognition
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811211186.XA
Other languages
Chinese (zh)
Other versions
CN109447055B (en
Inventor
席敬
焦勇
伏虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gansu Wanwei Co
Original Assignee
Gansu Wanwei Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gansu Wanwei Co
Priority to CN201811211186.XA (CN109447055B)
Publication of CN109447055A
Application granted
Publication of CN109447055B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to the field of computer technology, in particular to pattern recognition and deep learning, and more specifically to an OCR-based method for recognizing characters of similar shape. It departs from conventional character recognition in that both the text of a character and its font can be identified. By comparing against multiple samples and adding threshold screening, it substantially improves text recognition accuracy and also identifies the character's font effectively. The method is particularly suited to recognizing characters with similar shapes and similar fonts, achieving accurate recognition of both character shape and font. Each character is segmented to a size of 96*96 pixels by horizontal and vertical segmentation, which facilitates the extraction of pixel feature information, avoids interference between adjacent characters, and improves recognition efficiency. The inventors segmented each character in a variety of pictures, such as books, newspapers, clothing and screenshots, into 96*96 pixels and extracted the character pixel feature information, with an extraction rate close to 100%.

Description

An OCR-based method for recognizing characters of similar shape
Technical field
The invention relates to the field of computer technology, in particular to pattern recognition and deep learning, and more specifically to an OCR-based method for recognizing characters of similar shape.
Background art
Optical character recognition (OCR) combines optical technology and computer technology to convert images of text printed on paper into text files. OCR can be used for the automatic scanning and long-term storage of bank bills, large volumes of documents, archive files, tax forms and similar paperwork.
OCR is usually measured by recognition rate, recognition speed, layout understanding and layout reproduction. The technology achieves a fairly good recognition rate on ordinary characters, but technical problems remain for Chinese characters, which are rich in structure and fonts. In particular, for characters of similar shape, such as (noon, dry, dry) and (running, bubble, cannon), recognition efficiency is low and precision is poor. Moreover, the prior art cannot distinguish different fonts of the same character shape: errors occur easily when the same character in different fonts is recognized, repeated runs give different results, and manual intervention is sometimes needed for error correction, which greatly reduces recognition accuracy.
Summary of the invention
The invention provides an OCR-based method for recognizing characters of similar shape that offers a high recognition rate, fast recognition and high accuracy.
The technical solution adopted by the invention to solve the technical problem is as follows:
An OCR-based method for recognizing characters of similar shape, comprising the following steps:
A. Preprocessing of the original OCR image
Correct the text of tilted characters, remove noise from the picture, and convert the picture to a grayscale image through contrast and gamma correction;
B. Picture text detection
Extract character pixel feature information from the preprocessed grayscale image. The extraction is performed with a CNN, and the result is converted into a feature vector in one-hot encoded form, which serves as the basis on which the character recognition module recognizes the character pixel feature information;
C. Recognition calculation
Use the different fonts of a standard character library as training samples n, each font being denoted n1, n2, ..., and calculate the Euclidean distance D_n1, D_n2, ... of each font of the training samples. The character recognition module uses the ***-Inception-v4 architecture to recognize the picture text to be recognized as recognition sample p, and calculates the Euclidean distance D_P of recognition sample p. The comparison thresholds a1, a2, ... between the recognition sample and the training samples of the different fonts are then calculated with the following formula: [formula not reproduced];
D. Recognition of character text and font
Select the training sample whose comparison threshold among a1, a2, ... lies within 0.4-0.6, and output the text and font of the corresponding recognized character.
In step B, the character pixel feature information is extracted from the preprocessed grayscale image by segmenting each character to a size of 96*96 pixels through horizontal and vertical segmentation.
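The horizontal and vertical segmentation described here can be sketched with projection profiles. This is a minimal illustration, not the patented implementation: it assumes dark text on a light background, and the function names, the `ink_threshold` value and the nearest-neighbour resize are all hypothetical choices. Only the 96*96 cell size comes from the text.

```python
import numpy as np

CELL = 96  # target cell size in pixels, from the text


def runs(mask):
    """Return (start, stop) index pairs for each run of True values in a 1-D mask."""
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(edges[0::2], edges[1::2]))


def resize_nearest(img, size=CELL):
    """Nearest-neighbour resize of a 2-D array to size x size (assumed method)."""
    rows = np.arange(size) * img.shape[0] // size
    cols = np.arange(size) * img.shape[1] // size
    return img[rows][:, cols]


def segment_characters(gray, ink_threshold=128):
    """Split a grayscale page into 96x96 character cells."""
    ink = gray < ink_threshold  # True where a pixel belongs to a stroke
    cells = []
    # Horizontal segmentation: rows containing ink form text lines.
    for top, bottom in runs(ink.any(axis=1)):
        line = ink[top:bottom]
        # Vertical segmentation: columns containing ink form characters.
        for left, right in runs(line.any(axis=0)):
            cells.append(resize_nearest(gray[top:bottom, left:right]))
    return cells
```

A plain projection split like this can cut a character with a left-right structure into two pieces, so a practical segmenter would need additional merging rules that the text does not describe.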
In step C, the training samples n are 16 fonts of the 3,755 characters of the national-standard level-1 character set.
In step D, the training sample whose comparison threshold among a1, a2, ... is closest to 0.5 is selected, and the text and font of the corresponding recognized character are output.
In step C, the character recognition module uses the ***-Inception-v4 architecture, in which the 5*5 two-dimensional convolution kernel is split into 1*5 and 5*1 one-dimensional convolution kernels.
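The effect of splitting a 5*5 kernel into 1*5 and 5*1 kernels can be illustrated numerically. The sketch below is an assumption-laden toy, not the network from the text: it uses a single channel, random weights, and a hand-written `conv2d_valid` helper. It shows that the two one-dimensional passes need 10 weights instead of 25, and that they reproduce the 5*5 result exactly when the 5*5 kernel is the outer product of the two one-dimensional kernels and no nonlinearity sits between them.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation, written out for clarity."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.normal(size=(12, 12))
col = rng.normal(size=(5, 1))  # 5*1 kernel
row = rng.normal(size=(1, 5))  # 1*5 kernel

# One 5*5 kernel built as the outer product of the two 1-D kernels.
full = conv2d_valid(image, col @ row)
# The same result from a 1*5 pass followed by a 5*1 pass.
factored = conv2d_valid(conv2d_valid(image, row), col)

weights_full = 5 * 5          # 25 weights per input/output channel pair
weights_factored = 1 * 5 + 5 * 1  # 10 weights per input/output channel pair
```

In a trained network the two one-dimensional layers are learned directly and usually have an activation between them, so they are not an exact replacement for an arbitrary 5*5 kernel; the saving in weights is what carries over.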
The beneficial effects of the invention are:
1. It departs from conventional character recognition in that both the text of a character and its font can be identified. By comparing against multiple samples and adding threshold screening, it substantially improves text recognition accuracy and also identifies the character's font effectively. It is particularly suited to recognizing characters with similar shapes and similar fonts, achieving accurate recognition of both character shape and font.
2. Each character is segmented to a size of 96*96 pixels by horizontal and vertical segmentation, which facilitates the extraction of pixel feature information, avoids interference between adjacent characters, and improves recognition efficiency. The inventors segmented each character in a variety of pictures, such as books, newspapers, clothing and screenshots, into 96*96 pixels and extracted the character pixel feature information, with an extraction rate close to 100%.
3. Among the comparison thresholds a1, a2, ..., the invention selects the training sample closest to 0.5 and outputs the text and font of the corresponding recognized character, which improves recognition accuracy and avoids manual error correction.
4. The character recognition module uses the ***-Inception-v4 architecture and splits the 5*5 two-dimensional convolution kernel into 1*5 and 5*1 one-dimensional convolution kernels, which not only prevents over-fitting but also increases nonlinear expressive capacity and preserves the polymorphism of character features.
Brief description of the drawings
Fig. 1 is a schematic diagram of the recognition process of the invention.
Specific embodiment
An OCR-based method for recognizing characters of similar shape, comprising the following steps:
A. Preprocessing of the original OCR image
Correct the text of tilted characters, remove noise from the picture, and convert the picture to a grayscale image through contrast and gamma correction;
B. Picture text detection
Extract character pixel feature information from the preprocessed grayscale image. The extraction is performed with a CNN, and the result is converted into a feature vector in one-hot encoded form, which serves as the basis on which the character recognition module recognizes the character pixel feature information;
C. Recognition calculation
Use the different fonts of a standard character library as training samples n, each font being denoted n1, n2, ..., and calculate the Euclidean distance D_n1, D_n2, ... of each font of the training samples. The character recognition module uses the ***-Inception-v4 architecture to recognize the picture text to be recognized as recognition sample p, and calculates the Euclidean distance D_P of recognition sample p. The comparison thresholds a1, a2, ... between the recognition sample and the training samples of the different fonts are then calculated with the following formula: [formula not reproduced];
D. Recognition of character text and font
Select the training sample whose comparison threshold among a1, a2, ... lies within 0.4-0.6, and output the text and font of the corresponding recognized character.
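Part of step A can be sketched as follows. This is a minimal illustration under stated assumptions: it covers only grayscale conversion, contrast stretching and gamma correction, and leaves out tilt correction and noise removal. The order of operations, the BT.601 luma weights and the default `gamma=0.8` are assumptions, since the text gives no parameters.

```python
import numpy as np


def to_gray(rgb):
    """Convert an H x W x 3 uint8 RGB image to float grayscale in [0, 1].

    Uses ITU-R BT.601 luma weights (an assumed choice).
    """
    weights = np.array([0.299, 0.587, 0.114])
    return (rgb.astype(np.float64) / 255.0) @ weights


def stretch_contrast(gray):
    """Linearly rescale so the darkest pixel is 0 and the brightest is 1."""
    lo, hi = gray.min(), gray.max()
    if hi == lo:
        return np.zeros_like(gray)  # flat image: nothing to stretch
    return (gray - lo) / (hi - lo)


def gamma_correct(gray, gamma=0.8):
    """Apply power-law (gamma) correction to an image with values in [0, 1]."""
    return np.power(gray, gamma)


def preprocess(rgb, gamma=0.8):
    """Grayscale conversion, then contrast stretch, then gamma correction."""
    return gamma_correct(stretch_contrast(to_gray(rgb)), gamma)
```

The output is a floating-point grayscale image in [0, 1]; scaling back to 8-bit values would be a separate step if the later stages expect it.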
In step B, the character pixel feature information is extracted from the preprocessed grayscale image by segmenting each character to a size of 96*96 pixels through horizontal and vertical segmentation.
In step C, the training samples n are 16 fonts of the 3,755 characters of the national-standard level-1 character set.
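The size of this reference set follows from simple arithmetic: 16 fonts of 3,755 characters give 60,080 character-font samples (the GB 2312 level-1 set has 3,755 characters). The sketch below shows one hypothetical way to index those samples and build one-hot vectors of the kind mentioned in step B. The flat indexing layout and the helper names are assumptions, not taken from the text.

```python
import numpy as np

NUM_CHARACTERS = 3755  # level-1 character set size, from the text
NUM_FONTS = 16         # fonts per character, from the text


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at the given index."""
    vec = np.zeros(size, dtype=np.float32)
    vec[index] = 1.0
    return vec


def sample_id(char_index, font_index):
    """Flat index of a (character, font) pair (hypothetical layout)."""
    return char_index * NUM_FONTS + font_index


total_samples = NUM_CHARACTERS * NUM_FONTS  # 3755 * 16 = 60080
```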
In step D, the training sample whose comparison threshold among a1, a2, ... is closest to 0.5 is selected, and the text and font of the corresponding recognized character are output.
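The selection rule of step D can be sketched as below, taking the comparison thresholds as given inputs, since the formula that produces them is not available here. The function name, the dictionary layout and the example values are hypothetical. The 0.4-0.6 window and the target of 0.5 come from the text.

```python
def select_match(thresholds, low=0.4, high=0.6, target=0.5):
    """Pick the training sample whose comparison threshold is closest to target.

    thresholds maps a (character, font) pair to its comparison threshold a_i.
    Returns the best pair inside [low, high], or None if no value qualifies.
    """
    candidates = {key: a for key, a in thresholds.items() if low <= a <= high}
    if not candidates:
        return None
    return min(candidates, key=lambda key: abs(candidates[key] - target))


# Illustrative values only; they are not results from the source.
example = {
    ("do", "Song"): 0.52,
    ("do", "Heiti"): 0.58,
    ("in", "Song"): 0.35,
    ("noon", "Song"): 0.71,
}
best = select_match(example)
```

Returning `None` when no threshold falls inside the window keeps the "no confident match" case explicit rather than forcing a guess.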
In step C, the character recognition module uses the ***-Inception-v4 architecture, in which the 5*5 two-dimensional convolution kernel is split into 1*5 and 5*1 one-dimensional convolution kernels.
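For the Euclidean distances used in step C, a minimal sketch of the distance computation itself is given below. The text does not state from which reference point D_n1, D_n2, ... and D_P are measured, and the formula that turns them into comparison thresholds is not available, so the `reference_point` and the feature vectors here are purely hypothetical and the sketch stops at the distances.

```python
import numpy as np


def euclidean_distance(u, v):
    """Euclidean (L2) distance between two feature vectors of equal shape."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    if u.shape != v.shape:
        raise ValueError("feature vectors must have the same shape")
    return float(np.linalg.norm(u - v))


# Hypothetical feature vectors; real ones would come from the CNN in step B.
reference_point = [0.0, 0.0, 0.0]
font_vectors = {"n1": [1.0, 2.0, 2.0], "n2": [0.0, 3.0, 4.0]}
font_distances = {
    name: euclidean_distance(reference_point, vec)
    for name, vec in font_vectors.items()
}
```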
Comparative test 1
The character "do" in the Song typeface is used as the test case:
Character-shape distractors: "in" and "noon";
Font distractors: the Heiti and Fangsong (imitation Song) typefaces;
The test method is as follows: nine pictures are selected, namely "do" in Song, Heiti and Fangsong; "in" in Song, Heiti and Fangsong; and "noon" in Song, Heiti and Fangsong. Manual identification gives 3 "do", 3 "in" and 3 "noon";
Repeated comparative tests were carried out against the free Chinese edition of Hanwang OCR from the ZOL software download site and orc software v8.1 from the Starting Point software center site. The results are as follows:
Run      The present invention    Hanwang OCR             orc software v8.1
First    do 3, in 3, noon 3       do 4, in 2, noon 3      do 3, in 3, noon 3
Second   do 3, in 3, noon 3       do 5, in 3, noon 1      do 1, in 4, noon 4
Third    do 3, in 3, noon 3       do 3, in 3, noon 3      do 2, in 2, noon 5
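The counts reported in the table above can be checked against the manually identified ground truth of 3 "do", 3 "in" and 3 "noon". The sketch below is an illustrative analysis, not part of the source: the helper names are hypothetical, and because only per-character counts are reported, `misassigned` gives a lower bound on misread pictures (two errors that swap labels would cancel in the counts).

```python
GROUND_TRUTH = {"do": 3, "in": 3, "noon": 3}

# Counts transcribed from the comparison table, one dict per run.
RESULTS = {
    "present invention": [
        {"do": 3, "in": 3, "noon": 3},
        {"do": 3, "in": 3, "noon": 3},
        {"do": 3, "in": 3, "noon": 3},
    ],
    "Hanwang OCR": [
        {"do": 4, "in": 2, "noon": 3},
        {"do": 5, "in": 3, "noon": 1},
        {"do": 3, "in": 3, "noon": 3},
    ],
    "orc software v8.1": [
        {"do": 3, "in": 3, "noon": 3},
        {"do": 1, "in": 4, "noon": 4},
        {"do": 2, "in": 2, "noon": 5},
    ],
}


def misassigned(counts, truth=GROUND_TRUTH):
    """Lower bound on misread pictures: half the total absolute count deviation."""
    return sum(abs(counts[key] - truth[key]) for key in truth) // 2


def is_stable(runs):
    """True if every run produced the same counts as the first run."""
    return all(run == runs[0] for run in runs)


summary = {
    name: {"stable": is_stable(runs), "errors": [misassigned(r) for r in runs]}
    for name, runs in RESULTS.items()
}
```

On these counts only the invention's column is identical across the three runs; each of the other two tools matches the ground truth in exactly one run.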
Analysis of results: in terms of recognizing the character in a single picture, the invention, the free Chinese edition of Hanwang OCR from the ZOL software download site, and orc software v8.1 from the Starting Point software center site can all recognize the character text. The prior art, however, is unstable: distractors affect the existing recognition software to some degree, the recognition results vary between runs, and manual error correction is needed. For the 9 pictures, the invention selects the comparison threshold within 0.4-0.6 that is closest to 0.5. If a picture is seriously unclear or cannot be recognized effectively, the comparison threshold falls between 0.1-0.3 or 0.7-0.9, which provides an automatic error-correction prompt.
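The threshold bands described in this analysis can be expressed as a small classifier. This is a sketch under stated assumptions: the function name and the returned labels are hypothetical, and the text does not say how values outside the named bands (for example between 0.3 and 0.4) are handled, so those are reported as unspecified rather than guessed.

```python
def classify_threshold(a):
    """Map a comparison threshold to the outcome described in the text."""
    if 0.4 <= a <= 0.6:
        return "accept"  # candidate for the closest-to-0.5 selection
    if 0.1 <= a <= 0.3 or 0.7 <= a <= 0.9:
        return "prompt_for_correction"  # unclear or unrecognizable picture
    return "unspecified"  # band not covered by the text


labels = [classify_threshold(a) for a in (0.5, 0.2, 0.8, 0.35)]
```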
Comparative test 2
The invention was compared with the technology of Google's application for distributed optical character recognition and distributed machine language translation (CN201580029025.7). The invention judges the font by how close the comparison threshold is to 0.5, whereas the technology of the reference document CN201580029025.7 cannot judge the font.
In summary, the invention is particularly suited to recognizing characters with similar shapes and similar fonts, achieving accurate recognition of both character shape and font. No technique providing this dual accurate recognition is disclosed in the prior art. In addition, the method of the invention can be conveniently embedded in existing software, which greatly reduces the development difficulty of recognition software while maintaining recognition efficiency.

Claims (5)

1. An OCR-based method for recognizing characters of similar shape, characterized by comprising the following steps:
A. Preprocessing of the original OCR image
Correct the text of tilted characters, remove noise from the picture, and convert the picture to a grayscale image through contrast and gamma correction;
B. Picture text detection
Extract character pixel feature information from the preprocessed grayscale image. The extraction is performed with a CNN, and the result is converted into a feature vector in one-hot encoded form, which serves as the basis on which the character recognition module recognizes the character pixel feature information;
C. Recognition calculation
Use the different fonts of a standard character library as training samples n, each font being denoted n1, n2, ..., and calculate the Euclidean distance D_n1, D_n2, ... of each font of the training samples. The character recognition module uses the ***-Inception-v4 architecture to recognize the picture text to be recognized as recognition sample p, and calculates the Euclidean distance D_P of recognition sample p. The comparison thresholds a1, a2, ... between the recognition sample and the training samples of the different fonts are then calculated with the following formula: [formula not reproduced];
D. Recognition of character text and font
Select the training sample whose comparison threshold among a1, a2, ... lies within 0.4-0.6, and output the text and font of the corresponding recognized character.
2. The OCR-based method for recognizing characters of similar shape according to claim 1, characterized in that in step B the character pixel feature information is extracted from the preprocessed grayscale image by segmenting each character to a size of 96*96 pixels through horizontal and vertical segmentation.
3. The OCR-based method for recognizing characters of similar shape according to claim 1, characterized in that in step C the training samples n are 16 fonts of the 3,755 characters of the national-standard level-1 character set.
4. The OCR-based method for recognizing characters of similar shape according to claim 1, characterized in that in step D the training sample whose comparison threshold among a1, a2, ... is closest to 0.5 is selected, and the text and font of the corresponding recognized character are output.
5. The OCR-based method for recognizing characters of similar shape according to claim 1, characterized in that in step C the character recognition module uses the ***-Inception-v4 architecture, in which the 5*5 two-dimensional convolution kernel is split into 1*5 and 5*1 one-dimensional convolution kernels.
CN201811211186.XA 2018-10-17 2018-10-17 OCR (optical character recognition) -based character similarity recognition method Active CN109447055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811211186.XA CN109447055B (en) 2018-10-17 2018-10-17 OCR (optical character recognition) -based character similarity recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811211186.XA CN109447055B (en) 2018-10-17 2018-10-17 OCR (optical character recognition) -based character similarity recognition method

Publications (2)

Publication Number Publication Date
CN109447055A (en) 2019-03-08
CN109447055B (en) 2022-05-03

Family

ID=65547338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811211186.XA Active CN109447055B (en) 2018-10-17 2018-10-17 OCR (optical character recognition) -based character similarity recognition method

Country Status (1)

Country Link
CN (1) CN109447055B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110443269A (en) * 2019-06-17 2019-11-12 平安信托有限责任公司 A kind of document comparison method and device
CN110781898A (en) * 2019-10-21 2020-02-11 南京大学 Unsupervised learning method for Chinese character OCR post-processing
CN111626281A (en) * 2020-04-27 2020-09-04 国家电网有限公司 Chinese annotation information identification method and system for paper image map based on adaptive learning
CN111860317A (en) * 2020-07-20 2020-10-30 青岛特利尔环保集团股份有限公司 Boiler operation data acquisition method, system, equipment and computer medium
CN116597453A (en) * 2023-05-16 2023-08-15 暗物智能科技(广州)有限公司 Shape near word single word recognition method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0498978A1 (en) * 1991-02-13 1992-08-19 International Business Machines Corporation Mechanical recognition of characters in cursive script
CN1979529A (en) * 2005-12-09 2007-06-13 佳能株式会社 Optical character recognization
CN101331520A (en) * 2005-12-19 2008-12-24 微软公司 Stroke contrast in font hinting
CN101782896A (en) * 2009-01-21 2010-07-21 汉王科技股份有限公司 PDF character extraction method combined with OCR technology
CN102707222A (en) * 2012-05-15 2012-10-03 中国电子科技集团公司第五十四研究所 Abnormal frequency point identification method based on character string comparison
CN104462068A (en) * 2013-09-12 2015-03-25 北大方正集团有限公司 Character conversion system and method
CN105335689A (en) * 2014-08-06 2016-02-17 阿里巴巴集团控股有限公司 Character recognition method and apparatus
CN106611174A (en) * 2016-12-29 2017-05-03 成都数联铭品科技有限公司 OCR recognition method for unusual fonts

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
NIKOLA LJUBESIC et al.: "Language Identification: How to Distinguish Similar Languages?", 《2007 29TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY INTERFACES》 *
周凤香: "工业生产线标签字符识别***的设计与实现", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *
杨富元: "血袋字符高速识别***的研究", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *
杨振罡: "视频超分辨率重建技术在人脸识别中的应用", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *

Also Published As

Publication number Publication date
CN109447055B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN109447055A (en) An OCR-based method for recognizing characters of similar shape
US10489682B1 (en) Optical character recognition employing deep learning with machine generated training data
US11977534B2 (en) Automated document processing for detecting, extracting, and analyzing tables and tabular data
Singh Optical character recognition techniques: a survey
US10896357B1 (en) Automatic key/value pair extraction from document images using deep learning
Naz et al. Segmentation techniques for recognition of Arabic-like scripts: A comprehensive survey
US11379690B2 (en) System to extract information from documents
CN112434690A (en) Method, system and storage medium for automatically capturing and understanding elements of dynamically analyzing text image characteristic phenomena
Clausner et al. ICDAR2019 competition on recognition of early Indian printed documents–REID2019
Peanho et al. Semantic information extraction from images of complex documents
Ranjan et al. OCR using computer vision and machine learning
CN111639566A (en) Method and device for extracting form information
CN112464845A (en) Bill recognition method, equipment and computer storage medium
Li et al. Multilingual text detection with nonlinear neural network
Lakshmi et al. An optical character recognition system for printed Telugu text
Nayak et al. Odia running text recognition using moment-based feature extraction and mean distance classification technique
Shihab et al. Badlad: A large multi-domain bengali document layout analysis dataset
Cascianelli et al. Learning to read L’Infinito: handwritten text recognition with synthetic training data
Igorevna et al. Document image analysis and recognition: a survey
CN109508712A (en) A kind of Chinese written language recognition methods based on image
Shah et al. A math formula extraction and evaluation framework for PDF documents
Kumar et al. Survey paper of script identification of Telugu language using OCR
Panichkriangkrai et al. Character segmentation and transcription system for historical Japanese books with a self-proliferating character image database
Hartel et al. An ocr pipeline and semantic text analysis for comics
Kumar et al. Line based robust script identification for Indian languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 730000 Zhangsutan 553, Chengguan District, Lanzhou City, Gansu Province

Applicant after: China Power World Wide Information Technology Co.,Ltd.

Address before: 730000 Zhangsutan 553, Chengguan District, Lanzhou City, Gansu Province

Applicant before: GANSU WANWEI CO.

GR01 Patent grant