CN107704859A

CN107704859A - A kind of character recognition method based on deep learning training framework

Info

Publication number: CN107704859A
Application number: CN201711057406.3A
Authority: CN
Inventors: 张钦宇; 洪靖轩; 韩啸; 雷飞; 肖乔; 赵鹏
Original assignee: Shenzhen Graduate School Harbin Institute of Technology
Current assignee: Shenzhen Graduate School Harbin Institute of Technology
Priority date: 2017-11-01
Filing date: 2017-11-01
Publication date: 2018-02-16

Abstract

The invention provides a kind of character recognition method based on deep learning training framework, comprise the following steps：S1, input picture is shot by camera；S2, picture is input to the Text region model formed by deep learning, obtains corresponding word content.The beneficial effects of the invention are as follows：Recognition accuracy is improved, simplifies identification condition, is advantageous to the identification of Chinese character.

Description

A kind of character recognition method based on deep learning training framework

Technical field

The present invention relates to character recognition method, more particularly to a kind of Text region side based on deep learning training framework Method.

Background technology

Character recognition technology, it is a key areas of application of pattern recognition.Start the fifties to inquire into normal words identification Method, and develop optical character recognition reader.There is the practical machine using magnetic ink and sytlized font the sixties.60 years For the later stage, there are multiple fonts and handwritten text cognitron, but accuracy of identification and machine performance are all very undesirable.70 years Generation, the basic theories and the high performance Text region machine of development of main study text identification, and focus on the research of Chinese Character Recognition. Nowadays character recognition technology has improved a lot.Nonetheless, now to mix chart discrimination be not still it is too high, Mess code wrongly written character still often occurs.Therefore this project realizes block letter Chinese and English using the method for the neutral net of deep learning High-precision identification is carried out with numeral, while realizes the efficient identification of hand-written English and numeral.

Existing Text region has following several algorithms most in use：

Strokelets:A Learned Multi-scale Representation for Scene Text Recognition (CVPR 2014) learns middle level stroke feature by clustering image block, is then voted using Hough (HOG) Algorithm detects character.On the basis of stroke feature and HOG features, character classification is carried out using random forest grader.

End-to-end scene text recognition (2011) use for reference the general target detection side of computer vision Method, it is proposed that a new text recognition system.They are given using the space constraint relation between character confidence level and character Go out most probable detection and recognition result.But the algorithm is only used for the detection identification of the text of horizontal direction arrangement.

End-to-End Text Recognition with Hybrid HMM Maxout Models (2013) and PhotoOCR:Reading Text in Uncontrolled Conditions (2013) et al. pass through two unsupervised classification Technology or the grader for having supervision, word image is divided into potential character zone.

End-to-End Text Recognition with Hybrid HMM Maxout Models (2013) use one Kind complexity, the CNN networks of segmentation, correction and character recognition are included, the HMM of fixed lexicon is used in combination (HMM) final recognition result, is generated.

PhotoOCR systems use the neural network classifier based on HOG features, and the candidate result that segmentation obtains is carried out Marking, using the Beam searching algorithms for combining N gram language models (N-gram), obtain candidate characters set.Finally, further Candidate characters combination is resequenced using language model and shape.

Deep Features for Text Spotting (2014) combine one non-textual grader of text, character point Class device, binary language mode classifiers, the dense scanning based on sliding window is carried out to whole figure.Finally combine fixed word Allusion quotation, the word in picture is analyzed.

Existing character recognition technology has the following disadvantages：

(1) recognizer degree of intelligence is relatively low, low to Printed Font Recognition efficiency, and classifying hand-written characters completely can not be effective Know.Producing the reason for such is algorithm so that based on the feature manually extracted, and is limited to the feature extraction degree of accuracy of people, this Sample is difficult fundamentally to obtain important breakthrough.

(2) the cumbersome load of identification process, derailed completely with real life.Such as the Text region skill of current Han Wang company Art, it is necessary to using specific scanning device, by text page by page be scanned identification, be entered among specific computer software, By machine recognition, the artificial check and correction of complexity is also carried out.Interactive application based on this series of process and current mobile terminal It is not inconsistent completely.

(3) identification for Chinese character is the difficult point of whole industry.Based on the new high-tech enterprises such as American Universities, Microsoft Industry technology leader, there is more in-depth study with hand-written English for block letter is English, but at home, for printing The research of body Chinese character and handwritten Chinese character does not break through significantly also, contrast America and Europe researcher technical scheme, database magnitude, Many reverse side such as application product also have significance difference away from.

The content of the invention

In order to solve the problems of the prior art, the invention provides a kind of word knowledge based on deep learning training framework Other method.

The invention provides a kind of character recognition method based on deep learning training framework, comprise the following steps：

S1, input picture is shot by camera；

S2, picture is input to the Text region model formed by deep learning, obtains corresponding word content.

As a further improvement on the present invention, the deep learning process of the Text region model includes：Construct convolution god Through network and convolutional neural networks solution is carried out, convolutional neural networks, which solve, includes procedure below：

(1) training group, is selected, randomly seeks N number of sample respectively from sample set as training group；

(2), by each weights, threshold value, be set to it is small close to 0 random value, and initialize Accuracy Controlling Parameter and study Rate；

(3), take an input pattern to be added to convolutional neural networks from training group, and provide its target output vector；

(4) intermediate layer output vector, is calculated, calculates the reality output vector of network；

(5) element in output vector, is calculated into output error compared with the element in object vector；For The hidden unit in intermediate layer also calculates error；

(6) adjustment amount of each weights and the adjustment amount of threshold value, are calculated successively；

(7) weights and adjustment threshold value, are adjusted；

(8), after M is undergone, whether judge index meets required precision, if be unsatisfactory for, returns (3), continues iteration； If satisfaction is put into next step；(9), training terminates, and weights and threshold value are preserved hereof；Now, each weights have been Reach stable, grader has been formed, has been trained again, directly exports weights from file and threshold value is trained, be not required to Initialized.

The beneficial effects of the invention are as follows：By such scheme, recognition accuracy is improved, simplifies identification condition, favorably In the identification of Chinese character.

Brief description of the drawings

Fig. 1 is the basic procedure of convolutional neural networks.

Embodiment

The invention will be further described for explanation and embodiment below in conjunction with the accompanying drawings.

Text region is divided into classical mode and deep learning pattern, is usually extraction in advance in the pattern-recognition of classics Feature.After extracting all multiple features, correlation analysis carried out to these features, find the feature that can most represent character, remove pair Classify unrelated and autocorrelative feature.However, the extraction of these features too relies on the experience and subjective consciousness of people, extract The difference of feature is very big on classification performance influence, or even the order of the feature of extraction can also influence last classification performance.Meanwhile The quality of image preprocessing also influences whether the feature of extraction.So, how using feature extraction, this process is adaptive as one Answer, the process of self study, the optimal feature of classification performance is found by machine learning

The unit extraction image local feature of each hidden layer of convolutional Neural member, maps it onto a plane, feature is reflected Penetrate function and use activation primitive of the sigmoid functions as convolutional network so that Feature Mapping has shift invariant.Each god It is connected through member with the local receptor field of preceding layer.Notice that above we say, be not that locally-attached neuron weights are identical, and Be same plane layer neuron weights it is identical, have displacement, the rotational invariance of same degree.Closelyed follow after each feature extraction One to be used for asking local average and the subsampling layer of second extraction.This distinctive structure of feature extraction twice causes network pair Input sample has higher distortion tolerance.That is, convolutional neural networks pass through local receptor field, shared weights and Asia Sample to ensure the robustness of image alignment shifting, scaling, distortion.Fig. 1 is the basic procedure of convolutional neural networks.

The invention provides a kind of character recognition method based on deep learning training framework, including

(1) deep learning --- convolutional neural networks

CNN was proposed by the Yann LeCun of New York University in 1998.CNN is substantially a multi-layer perception (MLP), its into The reason for work(, key was the mode of local connection and shared weights used by it, and the quantity of the weights on the one hand reduced makes Obtain network to be easy to optimize, on the other hand reduce the risk of over-fitting.CNN is one kind in neutral net, and its weights are shared Network structure is allowed to be more closely similar to biological neural network, reduces the complexity of network model, reduces the quantity of weights.This is excellent What point showed when the input of network is multidimensional image becomes apparent, and allows input of the image directly as network, avoids Complicated feature extraction and data reconstruction processes in tional identification algorithm.There are numerous advantages, such as network in two dimensional image processing Voluntarily abstract image feature it can include color, texture, the topological structure of shape and image；

Convolutional neural networks overall architecture：Convolutional neural networks are a kind of supervised learning neutral nets of multilayer, hidden layer Convolutional layer and pond sample level be to realize the nucleus module of convolutional neural networks feature extraction functions.The network model by using Gradient descent method minimizes loss function and the weight parameter in network is successively reversely adjusted, and is improved by frequently repetitive exercise The precision of network.The low hidden layer of convolutional neural networks is alternately made up of convolutional layer and maximum pond sample level, and high level is to connect entirely The hidden layer and logistic regression grader of the corresponding conventional multilayer perceptron of layer.The input of first full articulamentum be by convolutional layer and Sub-sampling layer carries out the characteristic image that feature extraction obtains.Last layer of output layer is a grader, can be returned using logic Return, Softmax returns even SVMs and input picture is classified.

Convolutional neural networks structure includes：Convolutional layer, down-sampled layer, full linking layer.Each layer has multiple characteristic patterns, each Characteristic pattern extracts a kind of feature of input by a kind of convolution filter, and each characteristic pattern has multiple neurons.

Convolutional layer：The use of the important feature that the reason for convolutional layer is convolution algorithm is that, by convolution algorithm, can make Original signal feature strengthens, and reduces noise.

Down-sampled layer：Using it is down-sampled the reason for be that according to the principle of image local correlation, sub-sampling is carried out to image Amount of calculation can be reduced, while keeps image rotation consistency.

The purpose of sampling is mainly to obscure the particular location of feature, because after some feature is found out, its particular location Inessential, we only need this feature and other relative positions, such as one " 8 ", above we have obtained When one " o ", we require no knowledge about its particular location in image, it is only necessary to know below it and be one " o " we just It is known that be one ' 8' because in picture " 8 " in picture it is to the left or it is to the right do not affect us and recognize it, it is this Obscuring the strategy of particular location, the picture that deformed and distort can be identified.

Full articulamentum：Connected entirely using softmax, the picture that obtained activation value i.e. convolutional neural networks extract is special Sign.

Map number of convolutional layer is specified in netinit, and the map of convolutional layer size is by convolution kernel and upper What one layer of input map size determined, it is assumed that the map sizes of last layer are that n*n, the size of convolution kernel are k*k, then this layer Map sizes are (n-k+1) * (n-k+1).

Sample level is a sampling processing to last layer map, and sample mode here is to the adjacent small of last layer map Region carries out aggregate statistics, area size scale*scale, and some realizations are the maximums for taking zonule, and in ToolBox The realization in face is the average using 2*2 zonules.Pay attention to, the calculation window of convolution has calculation window that is overlapping, and sampling Do not have overlapping, sampling is calculated inside ToolBox and with convolution (conv2 (A, K, ' valid')) come what is realized, convolution kernel is 2* 2, each element is 1/4, and removing has overlapping part in the convolution results being calculated.

CNN basic structure includes two kinds of special neuronal layers, and one is convolutional layer, and the input of each neuron is with before One layer of local-connection, and extract the local feature；The second is pond layer, for asking local susceptibility and Further Feature Extraction Computation layer.This structure of feature extraction twice reduces feature resolution, reduces the number of parameters that needs optimize.

CNN is Partially connected networks, and its bottom is feature extraction layer (convolutional layer), followed by pond layer (Pooling), It may then continue with increase convolution, pond or full articulamentum.For the CNN of pattern classification, generally used in final layer softmax.

Generally, CNN structure type is：Input layer -->Conv layers -->Pooling layers -->(repetition Conv, Pooling layers) ... -->FC (Full-connected) layer -->Output result.The integral multiple that layer size is generally 2 is commonly entered, Such as 32,64,96,224,384.Usual convolutional layer uses less filter, such as 3*3, and maximum is also with regard to 5*5.Pooling layers are used In convolution results are carried out with reduction dimension, such as selection 2*2 region carries out reduction dimension to convolutional layer, then selects 2*2 regions Maximum as output, the first half that the dimension of such convolutional layer is just reduced to.

Usually, CNN basic structure includes two layers, and one is characterized extract layer, the input of each neuron with it is previous The local acceptance region of layer is connected, and extracts the local feature.After the local feature is extracted, it is between further feature Position relationship is also decided therewith；The second is Feature Mapping layer, each computation layer of network is made up of multiple Feature Mappings, often Individual Feature Mapping is a plane, and the weights of all neurons are equal in plane.Feature Mapping structure is small using influence function core Activation primitive of the sigmoid functions as convolutional network so that Feature Mapping has shift invariant.Further, since one Neuron on mapping face shares weights, thus reduces the number of network freedom parameter.Each in convolutional neural networks Convolutional layer all followed by one is used for asking the computation layer of local average and second extraction, this distinctive feature extraction structure twice Reduce feature resolution.

Input layer reads in the image by simple regularization (unified size).Unit in each layer is by preceding layer The unit of one group of small local neighbor is as input.This local perceptron for connecting viewpoint and deriving from early stage, and and Local sensing that Hubel, Wiesel have found from the vision system of cats, directionally selective neuron are consistent.Pass through office Portion perceives field, and neuron can extract some basic visual signatures, such as directed edge, end point, corner etc..These features Then used by the neuron of higher.Also, equally also tend to fit suitable for some local foundation characteristic withdrawal device For whole image.By using this feature, convolutional neural networks are distributed in each diverse location of image but tool using a component There is the unit of identical weight vector, to obtain the feature of image and form a width characteristic pattern (Feature Map).In each position Put, the unit from different characteristic figure obtains respective different types of feature.Different units in one characteristic pattern are restricted to Same operation is carried out to the local data of each diverse location in input figure.It is this operation be equal to by input picture for One small core carries out convolution.Multiple characteristic patterns with different weight vectors are generally comprised in one convolutional layer so that same One position can obtain a variety of different features.For example, first hidden layer includes 4 characteristic patterns, each characteristic pattern is by 5*5 Local sensing region form.Once a feature is detected, as long as it does not change relative to the relative position of other features Become, then its absolute position in the picture just becomes not being especially important.Therefore, each convolutional layer is followed by a drop Sample level.Down-sampled layer carries out local average and down-sampled operation, reduces the resolution ratio of characteristic pattern, while it is defeated to reduce network Go out the sensitivity for displacement and deformation.Second hidden layer carries out the 2*2 down-sampled operation of equalization.Follow-up convolution Layer and down-sampled layer are all alternately distributed connection, form " double pyramids " structure：The increasing number of characteristic pattern, Er Qiete The resolution ratio of sign figure gradually reduces.

Generally in convolutional neural networks, convolutional layer and down-sampled layer alternately link together, and are calculated for reducing Time simultaneously progressively sets up higher space and data structure invariance, and these are special by smaller down-sampled coefficient Property is maintained.

CNN disaggregated model and the difference of conventional model are that it can be directly by a width two dimensional image input model In, then provide classification results in output end.It is advantageous that the pretreatment of complexity is not required to, by feature extraction, pattern classification It is put into completely in a black box, parameter needed for network is obtained by constantly optimizing, required classification, net is provided in output layer Network core is exactly structure design and the solution of network of network.Many algorithms performance is higher than ever for this solution structure.

CNN is the neutral net of a multilayer, and every layer is made up of multiple two dimensional surfaces, and each plane is by multiple independent god Formed through member.Comprising simply member (S- members) and complexity member (C- members) in network, S- members, which condense together, forms S- faces, the polymerization of S- faces S- layers are formed together.Similar relation be present between C- members, C- faces and C- layers.The center section of network is by S- layers and C- layer strings Connect and form, for input stage containing only one layer, it directly receives two-dimensional view mode.Sample characteristics extraction step has been inserted into convolutional Neural In the interconnection architecture of network model.

Typically, S is characterized extract layer, and the input of each neuron is connected with the local receptor field of preceding layer, and extracts The local feature, once the local feature is extracted, its position relationship between further feature is just determined；C is feature Mapping layer, each computation layer of network are made up of multiple Feature Mappings, and each Feature Mapping is a plane, all god in plane Weights through member are identical.Feature Mapping structure is using activation letter of the small Sigmoid functions of influence function core as convolutional network Number so that Feature Mapping has shift invariant.Because the neuron weights on each mapping face are shared, reduce network oneself By number of parameters, the complexity that network parameter selects is reduced.Each feature extraction layer (S- layers) in CNN follows one For seeking the computation layer of local average and second extraction (C- layers), this distinctive structure of feature extraction twice is identifying network When have higher distortion tolerance to input sample.

CNN networks also have middle convolutional layer, sample layer and full articulamentum are straight by original image except input and output layer Connect and be input to input layer, the size of original image determines the size of input vector, and neuron extracts the local feature of image, often Individual neuron is all connected with the local receptor field of preceding layer, by the sampling layer (S) that is alternately present and convolutional layer (C) and last Full articulamentum, the output of network is provided in output layer.There are several characteristic patterns in convolutional layer and sampling layer, each layer has multiple Plane, in every layer in the neuron extraction image of each plane specific region local feature, such as edge feature, direction character etc., The weights of S- layer neurons are constantly corrected in training.Neuron weights in same aspect are identical, can so there is identical journey The displacement of degree, rotational invariance.Because weights are shared, rolled up so the mapping from a plane to next plane can be regarded as Product computing, S- layers are considered as fuzzy filter, play a part of Further Feature Extraction.Spatial resolution between hidden layer and hidden layer Successively decrease, the number of planes contained by every layer is incremented by, and so can be used for detecting more characteristic informations.

In convolutional layer, the characteristic pattern of preceding layer and the core that can learn carry out convolution, and the result of convolution is by activation letter Output after number forms the neuron of this layer, so as to form this layer of characteristic pattern.Convolutional layer is separated out existing, convolutional layer with sampling interlayer The characteristic pattern of each output may be with the convolution opening relationships of several characteristic patterns of preceding layer.Each characteristic pattern can have difference Convolution kernel.The main task of convolutional layer is exactly to select each angle character of preceding layer characteristic pattern to make its tool from different angles There is shift invariant.The essence of convolution is exactly that the characteristic pattern of preceding layer is handled, to obtain the characteristic pattern of this layer.Sampling Layer main function is the spatial resolution for reducing network, and the torsion of skew and image is eliminated by reducing the spatial resolution of image It is bent.

The number of parameters of hidden layer and the neuron number of hidden layer are unrelated, only and wave filter size and wave filter species it is more Rare pass.The neuron number of hidden layer, it and original image, that is, size (neuron number), the size of wave filter inputted It is all relevant with the sliding step of wave filter in the picture.

(2) convolutional neural networks solve

CNN realizes the displacement of identification image, scaling and distortion consistency, i.e. local receptive field, power by three methods The shared and secondary sampling of value.Local receptive field refers to the neuron of each layer network and the god in a small neighbourhood of last layer Connected through unit, by local receptive field, each neuron can extract primary visual signature, such as direction line segment, end points and Angle point etc.；Weights are shared to cause CNN to have less parameter, it is necessary to relatively little of training data；Secondary sampling can reduce feature Resolution ratio, realize to displacement, scaling and other forms distortion consistency.After convolutional layer generally with sample for one time layer come Reduce and calculate time, the consistency established on space and structure.

Network has been constructed afterwards, it is necessary to be solved to network, if the allocation of parameters as traditional neural network, often One connection can all have unknown parameter.And CNN shares using weights, so pass through the neuron on a width characteristic pattern Share same weights can and greatly reduce free parameter, this can be used for detecting what identical feature represented in different angle Effect.Generally all it is that sampling layer is alternately present with convolutional layer in network design, the preceding layer of full articulamentum is usually convolutional layer. In CNN, right value update is to be based on back-propagation algorithm.

CNN is inherently a kind of mapping for being input to output, and it can learn reflecting between substantial amounts of input and output Relation is penetrated, without the accurate mathematic(al) representation between any input and output, as long as pattern is to convolution net known to using Network training, network just have the mapping ability between inputoutput pair.What convolutional network performed is supervised training, so its Sample set be by shaped like：Input vector, the vector of preferable output vector is to composition.All these vectors are right, should all come Network is come from i.e. by the actual " RUN " structure of simulation system, they can gather to come from actual motion system.Starting Before training, all power should all be initialized with some different random numbers." small random number " is used for ensureing that network will not Enter saturation state because weights are excessive, so as to cause failure to train；" difference " is used for ensureing that network can normally learn.It is real On border, if with identical number deinitialization weight matrix, network learning disabilities.

Training algorithm mainly includes four steps, and this four step is divided into two stages：

First stage, forward propagation stage：

(1) sample, is taken from sample set, inputs network；

(2) corresponding reality output, is calculated；In this stage, information, by conversion step by step, is sent to output from input layer Layer.This process is also the process that network performs in normal execution after completing to train.

Second stage, back-propagation stage：

(1) difference of the reality output with corresponding preferable output, is calculated；

(2), weight matrix is adjusted by the method for minimization error.

The work in the two stages should typically be controlled by required precision.

The training process of network is as follows：

(3), take an input pattern to be added to network from training group, and provide its target output vector；

(5) element in output vector, is calculated into output error compared with the element in object vector；For The hidden unit in intermediate layer is also required to calculate error；

(7) weights and adjustment threshold value, are adjusted；

(8), after M is undergone, whether judge index meets required precision, if be unsatisfactory for, returns (3), continues iteration； If satisfaction is put into next step；

(9), training terminates, and weights and threshold value are preserved hereof.At this moment it is considered that each weights have reached steady Fixed, grader has been formed.It is trained again, directly exports weights from file and threshold value is trained, it is not necessary to carry out Initialization.

(3)：The application and its principle that gesture recognition system is realized

By on the basis of above-mentioned theory, can further realize that abundant in content, diversification of forms function should With.

1. the exploitation of the mobile terminal APP based on Text region

The technology that is described above is realized, it is necessary to application using suitable platform progress technology.With reference to current market Situation, the mobile terminal APP for developing Text region are undoubtedly best selection.

By developing mobile terminal technology, user can be carried out quickly and easily picture by mobile phone camera and be inputted, can be with Save the scanner input of very complicated.Shoot picture after, by picture import corresponding to APP softwares, formed by deep learning Text region model, it is possible to draw corresponding word content, be the most efficiently actual application scheme of the present invention.

A kind of character recognition method based on deep learning training framework provided by the invention, its feature are as follows：

(1) deep learning algorithm is with thinking that Text region is combined, Model Calculating Method, parameter, the model generated, And application interface

(2) for block letter English, handwritten form English, printing digital, handwriting digital, block letter Chinese, handwritten form Chinese carries out a set of identification technology in all directions, it is intended to realizes the application broken through on Text region algorithm with mobile terminal.

The invention provides a kind of character recognition method based on deep learning training framework, depth technically make use of The method of habit does Text region, and frontline technology, method is advanced, the manually extraction feature of conventional method is updated to utilize Deep learning training pattern independently extracts feature, and this is an adaptive, process for self study.

It is to be based on character recognition technology the invention provides a kind of character recognition method based on deep learning training framework The word in image is identified and changed with the nerual network technique of deep learning, is by the optics input mode such as take pictures The word of various newpapers and periodicals, books, manuscript and other printed matters is converted into image information, recycles character recognition technology by image Information is converted into the computer input technology that can be used.Can be widely used in future a large amount of written historical materials, archives folder, The typing of official documents and correspondence and process field.

Above content is to combine specific preferred embodiment further description made for the present invention, it is impossible to is assert The specific implementation of the present invention is confined to these explanations.For general technical staff of the technical field of the invention, On the premise of not departing from present inventive concept, some simple deduction or replace can also be made, should all be considered as belonging to the present invention's Protection domain.

Claims

1. a kind of character recognition method based on deep learning training framework, it is characterised in that comprise the following steps：

S1, input picture is shot by camera；

2. the character recognition method according to claim 1 based on deep learning training framework, it is characterised in that the text The deep learning process of word identification model includes：Construction convolutional neural networks simultaneously carry out convolutional neural networks solution, convolutional Neural Solution To The Network includes procedure below：

(2), by each weights, threshold value, be set to it is small close to 0 random value, and initialize Accuracy Controlling Parameter and learning rate；

(5) element in output vector, is calculated into output error compared with the element in object vector；For centre The hidden unit of layer also calculates error；

(7) weights and adjustment threshold value, are adjusted；

(8), after M is undergone, whether judge index meets required precision, if be unsatisfactory for, returns (3), continues iteration；If Satisfaction is put into next step；

(9), training terminates, and weights and threshold value are preserved hereof；Now, each weights have reached stable, and grader is Through being formed, it is trained again, directly exports weights from file and threshold value is trained, it is not necessary to initialized.