CN109447055A

CN109447055A - An OCR-based method for recognizing similar-shaped characters

Info

Publication number: CN109447055A
Application number: CN201811211186.XA
Authority: CN
Inventors: 席敬; 焦勇; 伏虎
Original assignee: Gansu Wanwei Co
Current assignee: Gansu Wanwei Co
Priority date: 2018-10-17
Filing date: 2018-10-17
Publication date: 2019-03-08
Anticipated expiration: 2038-10-17
Also published as: CN109447055B

Abstract

The present invention relates to the field of computer technology, in particular to pattern recognition and deep learning, and more specifically to an OCR-based method for recognizing similar-shaped characters. It departs from traditional font recognition in that both the character text and its font can be identified. Comparing against multiple samples and adding threshold screening not only substantially improves text recognition accuracy but also effectively identifies the character's font. The method is particularly suited to recognizing characters with similar glyphs and similar fonts, achieving accurate recognition of both glyph and font. Each character is segmented to a size of 96*96 pixels by horizontal and vertical segmentation, which facilitates the extraction of pixel feature information, avoids interference between adjacent characters, and effectively improves recognition efficiency. The inventors segmented each character in multiple pictures of books, newspapers, clothing and screenshots to 96*96 pixels for the extraction of character pixel feature information, and the extraction rate was close to 100%.

Description

An OCR-based method for recognizing similar-shaped characters

Technical field

The present invention relates to the field of computer technology, in particular to pattern recognition and deep learning, and more specifically to an OCR-based method for recognizing similar-shaped characters.

Background art

Optical character recognition (OCR) combines optical technology and computer technology to convert images printed on paper into text files. OCR can be used for the automatic scanning and long-term storage of documents such as bank bills, large volumes of document material, archive files and tax forms.

OCR is usually measured by recognition rate, recognition speed, layout understanding and degree of layout restoration. The technology achieves a relatively good recognition rate for general characters, but technical problems remain for Chinese characters, which are rich in structure and glyph forms. This is especially true of similar-shaped characters such as (noon, dry, dry) and (running, bubble, big gun), for which recognition efficiency is low and precision is not high. Furthermore, the prior art cannot distinguish the same glyph rendered in different fonts. Errors occur very easily when the same glyph is recognized in different fonts, repeated recognition gives different results, and manual intervention is sometimes needed for error correction, which greatly reduces recognition accuracy.

Summary of the invention

The present invention provides an OCR-based method for recognizing similar-shaped characters that offers a high recognition rate, fast recognition and high accuracy.

The technical solution adopted by the present invention to solve this technical problem is as follows:

An OCR-based method for recognizing similar-shaped characters, comprising the following steps:

A. Preprocessing of the original OCR image

Tilted text is corrected, noise in the picture is removed, and the picture's contrast and gamma are corrected before it is converted into a grayscale image;
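As a non-limiting illustration, part of step A can be sketched in plain Python. The sketch covers grayscale conversion, noise removal and gamma correction only; tilt correction is omitted. The 3*3 median filter, the gamma value of 0.8, the processing order and all function names are assumptions made for this example and are not specified in the source.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples (0-255) to rows of grey levels (0-255)."""
    # ITU-R BT.601 luma weights, a common choice for grayscale conversion.
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]


def median_denoise(gray):
    """3*3 median filter (assumed denoiser); border pixels are left unchanged."""
    height, width = len(gray), len(gray[0])
    out = [row[:] for row in gray]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = sorted(gray[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = window[4]  # middle of the 9 sorted values
    return out


def gamma_correct(gray, gamma=0.8):
    """Apply v -> 255 * (v / 255) ** gamma; gamma=0.8 is an assumed value."""
    return [[round(255 * (v / 255) ** gamma) for v in row] for row in gray]


def preprocess(rgb_image):
    """Assumed order: grayscale first, so later filters work on one channel."""
    return gamma_correct(median_denoise(to_grayscale(rgb_image)))
```

Converting to grayscale first is a simplification chosen for the sketch; the source lists grayscale conversion as the last operation of the step.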

B. Picture text detection

Character pixel feature information is extracted from the preprocessed grayscale image using a CNN neural network and converted into feature vectors in one-hot encoded form, which serve as the basis on which the character recognition module identifies character pixel feature information;
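For illustration only, a one-hot encoded vector can be produced as follows. The source does not state exactly which quantity is one-hot encoded; this sketch assumes it is a class index taken from a fixed list of classes, and the names `build_index` and `one_hot` are hypothetical.

```python
def build_index(classes):
    """Map each class label to a fixed position in the vector."""
    return {label: position for position, label in enumerate(classes)}


def one_hot(position, size):
    """Return a vector of length `size` with a single 1 at `position`."""
    if not 0 <= position < size:
        raise ValueError("position out of range")
    vector = [0] * size
    vector[position] = 1
    return vector


# Hypothetical class list: three fonts of one character.
classes = ["Songti", "Heiti", "Fangsong"]
index = build_index(classes)
encoded = one_hot(index["Heiti"], len(classes))
```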

C. Recognition calculation

Different fonts of a standard character library are used as training samples n, the individual fonts being denoted n₁, n₂, …, and the Euclidean distance D_n1, D_n2, … of each font of the training samples is calculated. The character recognition module, which uses the ***-Inception-v4 framework, recognizes the picture text to be identified as recognition sample p, and the Euclidean distance D_P of the recognition sample p is calculated. The comparison thresholds a₁, a₂, … between the recognition sample and the training samples of the different fonts are then calculated by a formula (the formula itself is not reproduced in this text);
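The calculation in step C can be sketched as below, under two explicit assumptions, because the formula is absent from this text. First, each Euclidean distance is taken to be the Euclidean norm of a feature vector. Second, the comparison threshold is assumed to be the ratio a_i = D_ni / (D_ni + D_P), which equals 0.5 when the two distances match and is therefore consistent with the later selection of the value closest to 0.5. Neither assumption is confirmed by the source, and all function names are hypothetical.

```python
import math


def euclidean_norm(features):
    """Euclidean length of a feature vector (assumed meaning of D)."""
    return math.sqrt(sum(value * value for value in features))


def comparison_threshold(d_font, d_sample):
    """ASSUMED formula: a = D_n / (D_n + D_P); not taken from the source."""
    total = d_font + d_sample
    if total == 0:
        return 0.5  # both distances are zero, so they match exactly
    return d_font / total


def font_thresholds(font_features, sample_features):
    """Return {font name: comparison threshold} for one recognition sample."""
    d_p = euclidean_norm(sample_features)
    return {font: comparison_threshold(euclidean_norm(vector), d_p)
            for font, vector in font_features.items()}
```

Under this assumed ratio every threshold lies between 0 and 1, which matches the ranges quoted in the description.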

D. Recognition of character text and font

From the comparison thresholds a₁, a₂, …, a training sample whose threshold lies within 0.4-0.6 is selected, and the corresponding recognized character text and font are output.

In step B, character pixel feature information is extracted from the preprocessed grayscale image by segmenting each character to a size of 96*96 pixels through horizontal segmentation and vertical segmentation.
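One possible way to realise horizontal and vertical segmentation is projection-profile splitting, sketched below on a binary image in which 1 marks ink and 0 marks background. The use of projection profiles, the nearest-neighbour resize and all function names are assumptions for this example; the source only states that characters are segmented to 96*96 pixels.

```python
def ink_runs(profile):
    """Return (start, end) pairs for consecutive non-zero profile entries."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_characters(binary):
    """Split into text lines by rows, then into characters by columns."""
    boxes = []
    row_profile = [sum(row) for row in binary]
    for top, bottom in ink_runs(row_profile):          # horizontal segmentation
        band = binary[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]
        for left, right in ink_runs(col_profile):      # vertical segmentation
            boxes.append((top, bottom, left, right))
    return boxes


def crop(image, box):
    top, bottom, left, right = box
    return [row[left:right] for row in image[top:bottom]]


def resize_nearest(image, size=96):
    """Nearest-neighbour resize to size*size pixels."""
    height, width = len(image), len(image[0])
    return [[image[y * height // size][x * width // size]
             for x in range(size)] for y in range(size)]
```

A purely column-based split can cut a character with separated left and right components into two pieces, so a practical implementation would likely need a merging rule that is not described in the source.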

In step C, the training samples n are 16 fonts of the 3755 characters of the national standard level-1 character library.

In step D, the training sample whose comparison threshold among a₁, a₂, … is closest to 0.5 is selected, and the text and font of the corresponding recognized character are output.
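The selection rule of step D can be illustrated with a short sketch: keep the candidates whose comparison threshold lies within 0.4-0.6 and return the one closest to 0.5. The dictionary layout keyed by (character, font) pairs, the function name `select_candidate` and the sample values are hypothetical.

```python
def select_candidate(thresholds, low=0.4, high=0.6, target=0.5):
    """Return the key whose threshold is in [low, high] and nearest to target.

    `thresholds` maps (character, font) pairs to comparison thresholds.
    Returns None when no threshold falls inside the accepted range.
    """
    accepted = {key: a for key, a in thresholds.items() if low <= a <= high}
    if not accepted:
        return None
    return min(accepted, key=lambda key: abs(accepted[key] - target))


# Hypothetical thresholds for one recognition sample.
sample_thresholds = {
    ("dry", "Songti"): 0.52,
    ("dry", "Heiti"): 0.45,
    ("noon", "Fangsong"): 0.80,
}
best = select_candidate(sample_thresholds)
```

Returning `None` for an empty candidate set is a design choice of the sketch; it lets the caller decide how to report a sample that cannot be matched.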

In step C, the character recognition module uses the ***-Inception-v4 framework, in which the 5*5 two-dimensional convolution kernel is split into 1*5 and 5*1 one-dimensional convolution kernels.
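The effect of splitting a 5*5 kernel into 1*5 and 5*1 kernels can be shown numerically. The sketch below applies a 1*5 pass followed by a 5*1 pass and compares the result with a single 5*5 pass whose kernel is the outer product of the two; it also counts the weights (10 instead of 25). The kernel values and the test image are arbitrary illustrative choices. In a network an activation would commonly sit between the two one-dimensional convolutions, in which case the two forms are no longer identical; that detail is omitted here.

```python
def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation on nested lists of integers."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(k_h) for j in range(k_w))
             for x in range(out_w)]
            for y in range(out_h)]


row_kernel = [[1, 2, 3, 2, 1]]            # 1*5 kernel
col_kernel = [[1], [0], [-1], [0], [1]]   # 5*1 kernel
# Equivalent 5*5 kernel: outer product of the column and row kernels.
full_kernel = [[c[0] * r for r in row_kernel[0]] for c in col_kernel]

image = [[(7 * y + 3 * x) % 11 for x in range(8)] for y in range(8)]

direct = conv2d_valid(image, full_kernel)
two_pass = conv2d_valid(conv2d_valid(image, row_kernel), col_kernel)

params_full = 5 * 5           # weights in one 5*5 kernel
params_split = 1 * 5 + 5 * 1  # weights in the 1*5 plus 5*1 pair
```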

The beneficial effects of the invention are as follows:

1. It departs from traditional font recognition in that both the character text and its font can be identified. Comparing against multiple samples and adding threshold screening not only substantially improves text recognition accuracy but also effectively identifies the character's font. The method is particularly suited to recognizing characters with similar glyphs and similar fonts, achieving accurate recognition of both glyph and font.

2. Each character is segmented to a size of 96*96 pixels by horizontal segmentation and vertical segmentation, which facilitates the extraction of pixel feature information, avoids interference between adjacent characters, and effectively improves recognition efficiency. The inventors segmented each character in multiple pictures of books, newspapers, clothing and screenshots to 96*96 pixels for the extraction of character pixel feature information, and the extraction rate was close to 100%.

3. From the comparison thresholds a₁, a₂, …, the present invention selects the training sample closest to 0.5 and outputs the text and font of the corresponding recognized character, which improves recognition accuracy and avoids manual error correction.

4. The character recognition module uses the ***-Inception-v4 framework, in which the 5*5 two-dimensional convolution kernel is split into 1*5 and 5*1 one-dimensional convolution kernels. This not only prevents over-fitting but also increases nonlinear expressive capacity and preserves the diversity of character features.

Brief description of the drawings

Fig. 1 is a schematic diagram of recognition according to the invention.

Specific embodiment

A. Preprocessing of the original OCR image

B. Picture text detection

C. Recognition calculation

D. Recognition of character text and font

Comparative test 1

The test case uses three characters, with "dry" in the Song typeface (Songti) as the target:

The glyph distractors are set to "in" and "noon";

The font distractors are set to the Hei typeface (Heiti) and the imitation Song typeface (Fangsong);

The test method is as follows: 9 pictures are selected, namely "dry" in Songti, Heiti and Fangsong; "in" in Songti, Heiti and Fangsong; and "noon" in Songti, Heiti and Fangsong. Manual identification gives 3 "dry", 3 "in" and 3 "noon";

Repeated comparative tests were carried out against the free Chinese version of Hanwang OCR from the ZOL software download website and orc software v8.1 from the starting point software centre website. The specific comparison results are as follows:

Test run	The present invention	Hanwang OCR	orc software v8.1
First	dry 3, in 3, noon 3	dry 4, in 2, noon 3	dry 3, in 3, noon 3
Second	dry 3, in 3, noon 3	dry 5, in 3, noon 1	dry 1, in 4, noon 4
Third	dry 3, in 3, noon 3	dry 3, in 3, noon 3	dry 2, in 2, noon 5
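To make the instability in the table easier to see, the reported counts can be compared with the manually identified counts of 3, 3 and 3. The sketch below sums the absolute deviation per test run; this deviation measure is an illustrative choice and is not a metric used in the source. Each tuple lists the counts for the three test characters in the order in which the table gives them.

```python
MANUAL_COUNTS = (3, 3, 3)  # manual identification of the 9 pictures

# Counts per run (first, second, third), copied from the table.
REPORTED = {
    "present invention": [(3, 3, 3), (3, 3, 3), (3, 3, 3)],
    "Hanwang OCR": [(4, 2, 3), (5, 3, 1), (3, 3, 3)],
    "orc software v8.1": [(3, 3, 3), (1, 4, 4), (2, 2, 5)],
}


def count_deviation(counts, reference=MANUAL_COUNTS):
    """Sum of absolute differences between reported and manual counts."""
    return sum(abs(c - r) for c, r in zip(counts, reference))


deviations = {name: [count_deviation(run) for run in runs]
              for name, runs in REPORTED.items()}
```

A deviation of 0 means the counts for that run agree with manual identification; any non-zero value means at least one picture was assigned to the wrong character.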

Analysis of results: for glyph recognition on a single picture, the present invention, the free Chinese version of Hanwang OCR from the ZOL software download website, and orc software v8.1 from the starting point software centre website can all recognize the glyph text. The prior art, however, is unstable: the distractors affect the existing recognition software to some extent, its recognition results vary between runs, and manual error correction is needed. For each of the 9 pictures, the present invention selects the comparison threshold within 0.4-0.6 that is closest to 0.5. If a picture is seriously unclear or cannot be effectively recognized, its comparison threshold falls within 0.1-0.3 or 0.7-0.9, which provides an automatic error-correction prompt.
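The threshold bands described above can be expressed as a small classifier. The band limits come from the text; the label strings, the function name and the handling of values outside the stated bands (for example 0.35) are assumptions, since the source does not say how such values are treated.

```python
def classify_threshold(a):
    """Map a comparison threshold to an action according to the stated bands."""
    if 0.4 <= a <= 0.6:
        return "accept"        # candidate for output, nearest to 0.5 wins
    if 0.1 <= a <= 0.3 or 0.7 <= a <= 0.9:
        return "prompt"        # picture unclear: raise an error-correction prompt
    return "unspecified"       # band not covered by the source (assumption)


labels = [classify_threshold(a) for a in (0.5, 0.2, 0.8, 0.35)]
```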

Comparative test 2

The present invention is compared with the technology for distributed optical character recognition and distributed machine language translation of Google's application (CN201580029025.7). The present invention determines the font by how close the comparison threshold is to 0.5, whereas the technology of reference document CN201580029025.7 cannot determine the font.

In summary, the present solution is particularly suited to recognizing characters with similar glyphs and similar fonts, achieving accurate recognition of both glyph and font. No technique for such dual accurate recognition is disclosed in the prior art. Furthermore, the method of the invention is easy to embed in existing software, which greatly reduces the development difficulty of recognition software while maintaining recognition efficiency.

Claims

1. An OCR-based method for recognizing similar-shaped characters, characterized by comprising the following steps:

A. Preprocessing of the original OCR image

B. Picture text detection

C. Recognition calculation

D. Recognition of character text and font

From the comparison thresholds a₁, a₂, …, a training sample whose threshold lies within 0.4-0.6 is selected, and the text and font of the corresponding recognized character are output.

2. The OCR-based method for recognizing similar-shaped characters according to claim 1, characterized in that, in step B, character pixel feature information is extracted from the preprocessed grayscale image by segmenting each character to a size of 96*96 pixels through horizontal segmentation and vertical segmentation.

3. The OCR-based method for recognizing similar-shaped characters according to claim 1, characterized in that, in step C, the training samples n are 16 fonts of the 3755 characters of the national standard level-1 character library.

4. The OCR-based method for recognizing similar-shaped characters according to claim 1, characterized in that, in step D, the training sample whose comparison threshold among a₁, a₂, … is closest to 0.5 is selected, and the text and font of the corresponding recognized character are output.

5. The OCR-based method for recognizing similar-shaped characters according to claim 1, characterized in that, in step C, the character recognition module uses the ***-Inception-v4 framework, in which the 5*5 two-dimensional convolution kernel is split into 1*5 and 5*1 one-dimensional convolution kernels.