CN105654129A - Optical character sequence recognition method - Google Patents

Optical character sequence recognition method

Info

Publication number
CN105654129A
CN105654129A (application CN201511020570.8A)
Authority
CN
China
Prior art keywords
neural network
recurrence
word
data
neural networks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201511020570.8A
Other languages
Chinese (zh)
Inventor
刘世林
何宏靖
陈炳章
吴雨浓
姚佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Business Big Data Technology Co Ltd
Original Assignee
Chengdu Business Big Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Business Big Data Technology Co Ltd filed Critical Chengdu Business Big Data Technology Co Ltd
Priority to CN201511020570.8A
Publication of CN105654129A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211: Selection of the most significant subset of features
    • G06F18/2111: Selection of the most significant subset of features by using evolutionary computational techniques, e.g. genetic algorithms
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Physiology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Character Discrimination (AREA)

Abstract

The invention belongs to the field of image character recognition and relates to an optical character sequence recognition method. The method combines CNN (convolutional neural network) and RNN (recurrent neural network) techniques: a CNN extracts features from a whole picture containing multiple characters, and the same features are fed to an RNN and reused recursively at every time step, so that the characters can be predicted continuously. This removes the need to segment the picture before OCR (optical character recognition), simplifies the preprocessing stage of image character recognition, and significantly improves recognition efficiency. Because the RNN recursively consumes the output of the previous time step, model training also learns a language model of the dependencies between characters and words, so no separate language model has to be built for post-processing after individual characters are recognized. The method therefore improves the recognition accuracy of character and word sequences while further improving the processing efficiency of character recognition.

Description

An optical character sequence recognition method
Technical field
The present invention relates to the field of image character recognition, and in particular to an optical character sequence recognition method.
Background technology
With the development of society, a large demand has arisen for digitizing paper media such as ancient books, documents, bills, and business cards. Digitization here is not limited to "photographing" with a scanner or camera; more importantly, it means converting these paper documents into readable, editable electronic documents, which requires performing image character recognition on the scanned pictures. Traditional image character recognition is optical character recognition (OCR): characters are recognized from the electronic image produced by scanning the paper document to be recognized. Given the variability of scanning quality, of the paper document itself (printing quality, font clarity, font regularity, and so on), and of content and layout (the arrangement of the text, e.g. plain text versus table text and bills), the actual performance of OCR is not always satisfactory. Moreover, different paper documents impose different accuracy requirements: bill recognition, for example, demands very high accuracy, because a single misrecognized digit can have fatal consequences. Traditional OCR cannot meet such high-precision recognition requirements.
A conventional OCR pipeline includes picture segmentation, feature extraction, single-character recognition, and similar processing steps. Picture segmentation alone involves a large amount of image preprocessing, such as slant rectification, background denoising, and single-character extraction. These steps are not only tedious and time-consuming but may also discard much of the useful information in the picture. When the picture to be recognized contains a sequence of multiple characters, a traditional OCR method must cut the original sequence into small pictures each containing a single character and recognize them separately. This approach has two major problems. First, single-character segmentation is difficult, especially for Chinese characters composed of left and right radicals, for mixtures of letters, digits, and symbols, or in the presence of background noise, character distortion, or touching characters; once segmentation goes wrong, an accurate recognition result is hard to obtain. Second, recognizing a character sequence cut into single-character sub-pictures fails to exploit the dependencies between the characters and words of natural language. Although an extra language model can be used to refine the recognition result, the language model and the recognizer are built independently, so the improvement this kind of refinement brings is limited.
Facing this huge recognition demand, a fast and efficient image character recognition method is urgently needed.
Summary of the invention
The object of the present invention is to overcome the above deficiencies of the prior art and to provide an optical character sequence recognition method. The invention applies convolutional neural network (CNN) and recurrent neural network (RNN) techniques: the CNN extracts features from the whole picture containing multiple characters, and the same features are then fed to the RNN and reused recursively, so that multiple characters can be predicted continuously. The optical character sequence recognition realized by this method overcomes the drawback that a picture must be segmented before OCR, which greatly improves the efficiency of image character recognition. Furthermore, because the RNN recursively consumes the recognition result and output data of the previous round during both training and application, a language model of the dependencies between characters and words is learned at the same time, so the method improves the recognition accuracy of character and word sequences while further improving recognition efficiency.
To achieve the above object, the present invention provides the following technical scheme:
An optical character sequence recognition method, comprising the following steps:
(1) Build a convolutional neural network and a recurrent neural network model, where the input signal of the recurrent neural network at each time step comprises: the sample feature data extracted by the convolutional neural network, the output data of the recurrent neural network at the previous time step, and the vector data converted from the character or word recognized by the recurrent neural network at the previous time step;
(2) Train the convolutional neural network and the recurrent neural network model with a training sample set;
(3) Input the image character sequence to be recognized into the trained convolutional neural network and recurrent neural network; the convolutional neural network extracts the feature data of the picture to be recognized, which is input into the recurrent neural network; through the step-by-step iteration of the recurrent neural network, the complete recognition result of the image character sequence is output.
Specifically, the forward-pass formulas of the recurrent neural network used in this method are as follows:
$$a_h^t = \sum_{i=1}^{I} w_{ih} x_i^t + \sum_{l=1}^{V} w_{lh} v_l^{t-1} + \sum_{h'=1}^{H} w_{h'h} b_{h'}^{t-1}$$
$$b_h^t = \theta(a_h^t)$$
$$a_k^t = \sum_{h=1}^{H} w_{hk} b_h^t$$
$$y_k^t = \frac{\exp(a_k^t)}{\sum_{k'=1}^{K} \exp(a_{k'}^t)}$$
where $I$ is the dimensionality of the input vector, $V$ is the dimensionality of the vectorized character or word, $H$ is the number of hidden-layer neurons, and $K$ is the number of output-layer neurons; $x$ is the feature data extracted by the convolutional neural network, and $v$ is the vector data into which the character or word recognized by the RNN is converted through the dictionary mapping table. $a_h^t$ is the input of hidden-layer neuron $h$ of the recurrent neural network at the current time step, and $b_h^t$ is its output; $w_{ih}$, $w_{lh}$, $w_{h'h}$ are the weight parameters corresponding to $a_h^t$. $a_k^t$ is the input of output-layer neuron $k$ of the recurrent neural network at the current time step; $w_{hk}$ is the weight corresponding to each output-layer neuron; $y_k^t$ is the output of output-layer neuron $k$ at the current time step. $y_k^t$ is a probability value: the ratio of the output value of the corresponding neuron to the sum of the output values of all output-layer neurons at the current time step.
As the formulas show, the input data of a hidden-layer neuron in the recurrent neural network used by this method has three components: the sample features extracted by the CNN, the output data of the RNN hidden layer at the previous time step, and the vectorization (through the dictionary mapping table) of the RNN prediction result, i.e. the character or word recognized at the previous time step. Therefore, when predicting the character (or word) at the current time step, the recurrent neural network relies both on the features of the image and on the features of the previous output (the language model).
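As an illustration, the following minimal sketch implements one such time step with NumPy. The dimensions, the random initialization, and the choice of tanh for the activation $\theta(\cdot)$ are assumptions made for the example; the patent fixes none of them.

```python
import numpy as np

I, V, H, K = 256, 64, 128, 4000   # feature dim, word-vector dim, hidden, output
rng = np.random.default_rng(0)
W_ih = rng.normal(0.0, 0.01, (I, H))   # CNN feature -> hidden (w_ih)
W_lh = rng.normal(0.0, 0.01, (V, H))   # previous word vector -> hidden (w_lh)
W_hh = rng.normal(0.0, 0.01, (H, H))   # previous hidden -> hidden (w_h'h)
W_hk = rng.normal(0.0, 0.01, (H, K))   # hidden -> output (w_hk)

def rnn_step(x, v_prev, b_prev):
    """Compute a_h^t, b_h^t, a_k^t and y_k^t for one time step."""
    a_h = x @ W_ih + v_prev @ W_lh + b_prev @ W_hh
    b_h = np.tanh(a_h)                      # b_h^t = theta(a_h^t), tanh assumed
    a_k = b_h @ W_hk
    e = np.exp(a_k - a_k.max())             # numerically stable softmax
    return b_h, e / e.sum()                 # y_k^t sums to 1 over the output layer

x = rng.normal(size=I)     # whole-picture feature vector from the CNN
b, y = rnn_step(x, np.zeros(V), np.zeros(H))   # v^0 = 0, b^0 = 0 at t = 1
print(int(y.argmax()))     # class index with the largest probability
```

Note how the three input terms of $a_h^t$ correspond exactly to the three components listed above.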
Further, in this method the parameters $w_{ih}$, $w_{lh}$, $w_{h'h}$ used during forward signal propagation are all shared across time steps, which avoids a linear growth of model complexity and the overfitting it could cause.
Further, the present invention uses the above forward pass to propagate the data layer by layer through the convolutional and recurrent neural networks and obtains the recognition (prediction) data at the output layer; when the prediction deviates from the annotation of the training sample, every weight in the network is adjusted by the classical error backpropagation algorithm.
Further, during training, the training result of the neural network is checked against a development set so that the training direction can be adjusted in time and overfitting prevented; throughout training, only the model with the highest recognition accuracy on the development set is retained.
Further, the neural network training process of this optical character sequence recognition method comprises the following steps:
(2-1) Input the manually annotated training samples into the convolutional neural network;
(2-2) Perform feature extraction on the input training samples with the convolutional neural network;
(2-3) Input the feature data extracted by the convolutional neural network, as the first data, into the recurrent neural network at the first time step;
(2-4) Output the first prediction data from the computation of the recurrent neural network at the first time step; obtain from the first prediction data the character or word recognized by the recurrent neural network at this time step, defined as the first recognition result;
(2-5) Convert the first recognition result into its corresponding vector data;
(2-6) Use the first data, the first prediction data, and the vectorized first recognition result as the input of the recurrent neural network at the second time step; output the second prediction data from the computation of the recurrent neural network, and obtain the corresponding second recognition result from it;
(2-7) Convert the second recognition result into its corresponding vector data;
(2-8) Use the first data, the second prediction data, and the vectorized second recognition result as the input of the recurrent neural network at the third time step;
Recur step by step in this way until the preset number of recursions is reached or a null value is output, then end the recognition; recording the character (or word) predicted by the RNN at each time step in order yields the complete string content (see the sketch below).
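Continuing the previous sketch, the step-by-step recursion above can be written as a loop. The mapping table E, the value MAX_STEPS = 20, and the use of row 0 for the padding/null word <SP> are assumptions for the example.

```python
E = rng.normal(0.0, 0.01, (K, V))   # dictionary mapping table: one row per word
MAX_STEPS = 20                      # preset recursion count (max sentence length)
SP = 0                              # assumed index of the <SP>/null word

def recognize(x):
    """Decode one character/word sequence from a single CNN feature vector."""
    b_prev, v_prev, result = np.zeros(H), np.zeros(V), []
    for _ in range(MAX_STEPS):
        b_prev, y = rnn_step(x, v_prev, b_prev)
        k = int(y.argmax())         # this step's recognition result
        if k == SP:                 # a null output ends the recognition
            break
        result.append(k)
        v_prev = E[k]               # vectorize the result for the next step
    return result

print(recognize(x))                 # indices of the recognized sequence
```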
Specifically, in steps (2-5) and (2-7) the vectorization is performed through a dictionary mapping table. The dictionary mapping table is a two-dimensional matrix: its number of rows is the size of the dictionary, and its number of columns (the dimensionality of the row vectors) is set according to the size of the dictionary and the scale of the data. The purpose of the table is to featurize, i.e. vectorize, characters (or words). Put simply, the dictionary mapping table is just a two-dimensional matrix in which each row vector corresponds to one character or one word, and this correspondence between row vectors and characters/words is fixed when the table is built.
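A toy, self-contained sketch of such a table follows; the vocabulary, the row width of 8, and the random initialization are all illustrative assumptions.

```python
import numpy as np

vocab = ["<SP>", "0", "1", "2", "A", "B", "word"]   # assumed toy dictionary
row = {w: i for i, w in enumerate(vocab)}           # word -> row number
table = np.random.default_rng(0).normal(0.0, 0.01, (len(vocab), 8))

def vectorize(word):
    """Map a recognized character/word to its fixed row vector."""
    return table[row[word]]

print(vectorize("A"))   # the row vector corresponding to "A"
```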
Further, when building the dictionary mapping table, if the unit to be recognized is the word, the natural-language text can first be segmented into words, e.g. splitting the character sequence of a sentence such as "this thing is very good" into its constituent words.
Further, model training includes normalizing the training sample pictures and annotating them manually, where normalization includes setting the maximum possible number of characters (or words) in a picture sentence, e.g. setting the sentence length to 20.
Further, during normalization, to avoid distorting the data, resizing is done with equal proportions, and any region missing from the target size is padded with the background colour.
Further, the normalized pictures are annotated manually; if an annotated sentence has fewer than 20 characters, a special word <SP> is used to pad it to length 20. Then 75% of the data is chosen at random as the training set and 25% as the development set.
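A sketch of this sample preparation under the stated assumptions: pad each annotation to length 20 with <SP>, then split 75% / 25% into training and development sets. The file names, labels, and seed are made up for the example.

```python
import random

MAX_LEN = 20

def pad(tokens):
    """Pad an annotated sentence to MAX_LEN tokens with the special word <SP>."""
    return tokens + ["<SP>"] * (MAX_LEN - len(tokens))

samples = [("img_%04d.png" % i, pad(["x"] * (i % 15 + 1))) for i in range(1000)]
random.seed(0)
random.shuffle(samples)
cut = int(0.75 * len(samples))
train_set, dev_set = samples[:cut], samples[cut:]
print(len(train_set), len(dev_set))   # 750 250
```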
Compared with the prior art, the beneficial effects of the present invention are as follows. The invention provides an optical character sequence recognition method that uses a convolutional neural network to extract holistic features from the character-sequence picture to be recognized and inputs the extracted feature data, as the first data, into the recurrent neural network at every time step. The image character sequence recognition realized by this method extracts the overall features of the picture with the convolutional neural network and recognizes the whole character sequence without single-character segmentation or noise filtering. Compared with traditional OCR, the invention avoids the irreversible recognition errors that inaccurate character segmentation can cause, greatly simplifies the preprocessing stage of image character recognition, and significantly improves the efficiency of character recognition.
In addition, the method recognizes the characters of a sequence continuously through the recurrent neural network. When the recurrent neural network recognizes characters, its input signal at each time step also comprises the output data of the previous time step and the vector data converted from the character or word recognized at the previous time step. Thus, when recognizing the corresponding character, the recurrent neural network at each time step relies not only on the overall picture features extracted by the convolutional neural network but also on the output data and recognition result of the previous time step. In this way, while the individual characters are recognized, the language model of the dependencies between characters and words is learned along with them. Compared with OCR methods, there is no longer any need to build an extra language model to refine the single-character recognition results, which simplifies the post-processing of recognized text; recognition is more efficient, and the results are more accurate and reliable.
In short, the method simplifies the processing pipeline of image character sequence recognition and significantly improves recognition efficiency and accuracy, letting developers focus on model tuning and data preparation and thereby improving development efficiency. The method has very high practical value and broad application prospects in the field of image character recognition.
Brief description of the drawings:
Fig. 1 is a schematic diagram of the implementation process of the method.
Fig. 2 is a schematic diagram of the convolutional neural network structure.
Fig. 3 is a schematic diagram of the signal flow in the character sequence recognition process of the method.
Embodiment
The present invention is described in further detail below with reference to test examples and embodiments. This should not be interpreted as limiting the scope of the above subject matter of the invention to the following embodiments; all techniques realized based on the content of the present invention belong to the scope of the present invention.
The present invention provides an optical character sequence recognition method. The invention applies convolutional neural network (CNN) and recurrent neural network (RNN) techniques: the CNN extracts features from the whole picture containing multiple characters, and the same features are then fed to the RNN and reused recursively, so that multiple characters can be predicted continuously. The optical character sequence recognition realized by this method overcomes the drawback that a picture must be segmented before OCR, greatly improves the efficiency of image character recognition, and lets developers focus on model tuning and data preparation, improving development efficiency. Furthermore, because the RNN recursively consumes the recognition result and output data of the previous round during both training and application, the language model of the dependencies between characters and words is learned at the same time, so the method improves the recognition accuracy of character and word sequences while further improving recognition efficiency.
To achieve the above object, the present invention provides the following technical scheme: an optical character sequence recognition method, as shown in Fig. 1, comprising the following steps:
(1) Build a convolutional neural network and a recurrent neural network model, where the input signal of the recurrent neural network at each time step comprises: the sample feature data extracted by the convolutional neural network, the output data of the recurrent neural network at the previous time step, and the vector data converted from the character or word recognized by the recurrent neural network at the previous time step. As shown in Fig. 2, the convolutional neural network is mainly used for the automatic learning of picture features. Each feature map (the vertical rectangles in the figure) is generated by its own convolution kernel (the small rectangular box in Fig. 2, shared within its designated feature map) performing a preliminary feature extraction, and the subsampling layers sample the features extracted by the convolutional layers, mainly to remove the redundancy in those features. In brief, the convolutional neural network extracts the different features of the picture with its convolutional layers and samples them with its subsampling layers to remove redundant information (one convolutional neural network can contain multiple convolutional layers, subsampling layers, and fully connected layers); finally, a fully connected layer concatenates the different feature maps into the final feature of the full picture. The method uses one convolutional neural network to extract features from the whole picture in a single pass, completely avoiding the irreversible recognition errors that picture segmentation may cause.
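A minimal sketch of such a feature extractor, written with PyTorch (an assumed framework; the patent names none). The layer sizes and the 32x200 input are illustrative: convolution layers produce feature maps, pooling layers subsample them, and a fully connected layer concatenates the maps into one feature vector for the whole picture.

```python
import torch
import torch.nn as nn

class CNNFeatureExtractor(nn.Module):
    def __init__(self, feature_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # subsampling layer
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                     # subsampling layer
        )
        self.fc = nn.Linear(64 * 8 * 50, feature_dim)  # full connection

    def forward(self, img):                      # img: (batch, 1, 32, 200)
        maps = self.features(img)                # stacked feature maps
        return self.fc(maps.flatten(1))          # one vector per whole picture

feats = CNNFeatureExtractor()(torch.randn(1, 1, 32, 200))
print(feats.shape)                               # torch.Size([1, 256])
```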
(2) Train the convolutional neural network and the recurrent neural network model with a training sample set.
(3) Input the image character sequence to be recognized into the trained convolutional neural network and recurrent neural network; the convolutional neural network extracts the feature data of the picture to be recognized, which is input into the recurrent neural network; through the step-by-step iteration of the recurrent neural network, the complete recognition result of the image character sequence is output.
Specifically, the forward-pass formulas of the recurrent neural network used in this method are as follows:
$$a_h^t = \sum_{i=1}^{I} w_{ih} x_i^t + \sum_{l=1}^{V} w_{lh} v_l^{t-1} + \sum_{h'=1}^{H} w_{h'h} b_{h'}^{t-1}$$
$$b_h^t = \theta(a_h^t)$$
$$a_k^t = \sum_{h=1}^{H} w_{hk} b_h^t$$
$$y_k^t = \frac{\exp(a_k^t)}{\sum_{k'=1}^{K} \exp(a_{k'}^t)}$$
where $I$ is the dimensionality of the input vector, $V$ is the dimensionality of the vectorized character or word, $H$ is the number of hidden-layer neurons, and $K$ is the number of output-layer neurons; $x$ is the feature data extracted by the convolutional neural network, and $v$ is the vector data into which the character or word recognized by the RNN is converted through the dictionary mapping table (in particular $v^0 = 0$). $a_h^t$ is the input of hidden-layer neuron $h$ of the recurrent neural network at the current time step, $b_h^t$ is its output ($b^0 = 0$), and $\theta(\cdot)$ is the activation function mapping $a_h^t$ to $b_h^t$. $w_{ih}$, $w_{lh}$, $w_{h'h}$ are the weight parameters corresponding to $a_h^t$; in the forward pass these parameters are all shared across time steps. Sharing across time steps means that during forward signal propagation the values of $w_{ih}$, $w_{lh}$, $w_{h'h}$ are the same at every time step (not that $w_{ih} = w_{lh} = w_{h'h}$); the RNN at different time steps uses identical $w_{ih}$, $w_{lh}$, $w_{h'h}$ values, which reduces the complexity of the model parameters and avoids the linear growth of model complexity that could cause overfitting. $a_k^t$ is the input of output-layer neuron $k$ of the recurrent neural network at the current time step; $w_{hk}$ is the weight corresponding to each output-layer neuron; $y_k^t$ is the output of output-layer neuron $k$ at the current time step. $y_k^t$ is a probability value: the ratio of the output value of the corresponding neuron to the sum of the output values of all output-layer neurons at the current time step. Generally, the class corresponding to the output neuron with the largest $y_k^t$ is directly selected as the recognition result of the recurrent neural network at this time step.
As the formulas show, the input data of a hidden-layer neuron in the recurrent neural network used by this method has three components: the sample features extracted by the CNN, the output data of the RNN hidden layer at the previous time step, and the vectorization (through the dictionary mapping table) of the RNN prediction result, i.e. the character or word recognized at the previous time step. Therefore, when predicting the character (or word) at the current time step, the recurrent neural network relies both on the features of the image and on the features of the previous output (the language model).
Further, the present invention uses the above forward pass to propagate the data layer by layer through the convolutional and recurrent neural networks and obtains the recognition (prediction) data at the output layer. When the prediction deviates from the annotation of the training sample, every weight in the network is adjusted by the classical error backpropagation algorithm: the error is propagated backwards layer by layer and apportioned to all the neurons of each layer, yielding the error signal of each neuron, which is then used to revise each neuron's weight. Propagating the data layer by layer with the forward pass and gradually revising the neuron weights with the backward pass is precisely the training process of the neural network; this process is repeated until the prediction accuracy reaches the set threshold, at which point training stops and the neural network model can be considered trained.
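A sketch of one such training step. PyTorch autograd (an assumed framework choice) stands in for the hand-derived backpropagation; cnn and rnn are assumed to be modules like the sketches above, with rnn returning per-time-step scores of shape (batch, MAX_LEN, K), and cross-entropy is an assumed choice for measuring the deviation from the annotation.

```python
import torch
import torch.nn as nn

def train_step(cnn, rnn, img, target, optimizer):
    """One forward pass plus one round of error backpropagation."""
    loss_fn = nn.CrossEntropyLoss()
    feats = cnn(img)                          # one feature vector per picture
    logits = rnn(feats)                       # scores for every time step
    loss = loss_fn(logits.flatten(0, 1),      # deviation from the annotation
                   target.flatten())
    optimizer.zero_grad()
    loss.backward()                           # propagate the error backwards
    optimizer.step()                          # revise every neuron's weights
    return float(loss)
```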
Further, during training, the training result of the neural network is checked against a development set so that the training direction can be adjusted in time and overfitting prevented; throughout training, only the model with the highest recognition accuracy on the development set is retained.
Further, as shown in Fig. 3, the neural network training process of this optical character sequence recognition method comprises the following steps:
(2-1) Input the manually annotated training samples into the convolutional neural network;
(2-2) Perform feature extraction on the input training samples with the convolutional neural network;
(2-3) Input the feature data extracted by the convolutional neural network, as the first data, into the recurrent neural network at the first time step;
(2-4) Output the first prediction data from the computation of the recurrent neural network at the first time step; obtain from the first prediction data the character or word recognized by the recurrent neural network at this time step, defined as the first recognition result;
(2-5) Convert the first recognition result into its corresponding vector data;
(2-6) Use the first data, the first prediction data, and the vectorized first recognition result as the input of the recurrent neural network at the second time step; output the second prediction data from the computation of the recurrent neural network, and obtain the corresponding second recognition result from it;
(2-7) Convert the second recognition result into its corresponding vector data;
(2-8) Use the first data, the second prediction data, and the vectorized second recognition result as the input of the recurrent neural network at the third time step;
Recur step by step in this way: the feature data extracted by the CNN (the first data), the output data (prediction data) of the RNN at the previous time step, and the vector corresponding to the character or word (recognition result) recognized by the RNN at the previous time step together form the input data of the RNN at the current time step, and the RNN's prediction outputs one character (or word). The recognition ends when the preset number of recursions is reached; recording the character (or word) predicted by the RNN at each time step in order yields the complete string content.
Specifically, in steps (2-5) and (2-7) the vectorization is performed through a dictionary mapping table. The dictionary mapping table is a two-dimensional matrix: its number of rows is the size of the dictionary, and its number of columns is set according to the size of the dictionary and the scale of the data. The purpose of the table is to featurize, i.e. vectorize, characters (or words). Put simply, the dictionary mapping table is just a two-dimensional matrix in which each row vector corresponds to one character or one word, and this correspondence between row vectors and characters/words is fixed when the table is built.
Further, when building the dictionary mapping table, the natural-language text can first be segmented into words, e.g. splitting the character sequence of a sentence such as "this thing is very good" into its constituent words.
Further, model training includes normalizing the training sample pictures and annotating them manually. Normalizing the samples makes their basic parameters uniform, reducing irrelevant data complexity during model training and helping to simplify the training process. Normalization includes setting the maximum possible number of characters (or words) in a picture sentence, e.g. setting the sentence length to 20. The length of the character sequence to be recognized corresponds to the maximum number of recursions of the recurrent neural network: the maximum character count set when preparing the training samples can correspond to the preset maximum recursion count of the recurrent neural network, which increases the stability and predictability of the model.
Further, during normalization, to avoid distorting the data, resizing is done with equal proportions, and any region missing from the target size is padded with the background colour.
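A sketch of this picture normalization under stated assumptions: Pillow as the imaging library, a 200x32 target size, and white (255) as the background colour are all illustrative choices.

```python
from PIL import Image

def normalize(img, target=(200, 32), bg=255):
    """Equal-proportion resize, then pad to the target size with background."""
    img = img.convert("L")                         # grayscale, assumed
    w, h = img.size
    scale = min(target[0] / w, target[1] / h)      # equal-proportion zoom
    img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    canvas = Image.new("L", target, bg)            # background-colour canvas
    canvas.paste(img, ((target[0] - img.size[0]) // 2,
                       (target[1] - img.size[1]) // 2))
    return canvas
```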
Further, the normalized pictures are annotated manually; if the number of characters (or words) in an annotated sentence is smaller than the set maximum (less than 20), a special word is used to pad it, e.g. padding sample pictures with fewer than 20 characters (or words) with "<SP>" up to a length of 20 characters (or words).
Further, after the above normalization and manual annotation, 75% of the data is chosen at random as the training sample set and 25% as the development sample set. During training, the neural network only retains the model with the highest recognition accuracy on the development set, and the uniform format of development and training samples helps improve the training efficiency of the neural network.

Claims (9)

1. An optical character sequence recognition method, characterized by comprising the following steps:
(1) building a convolutional neural network and a recurrent neural network model, where the input signal of the recurrent neural network at each time step comprises: the sample feature data extracted by the convolutional neural network, the output data of the recurrent neural network at the previous time step, and the vector data converted from the character or word recognized by the recurrent neural network at the previous time step;
(2) training the convolutional neural network and the recurrent neural network model with a training sample set;
(3) inputting the image character sequence to be recognized into the trained convolutional neural network and recurrent neural network, and outputting the complete recognition result of the image character sequence to be recognized.
2. the method for claim 1, it is characterised in that: the recurrent neural networks model used in present method adopts following forward algorithm formula:
$$a_h^t = \sum_{i=1}^{I} w_{ih} x_i^t + \sum_{l=1}^{V} w_{lh} v_l^{t-1} + \sum_{h'=1}^{H} w_{h'h} b_{h'}^{t-1}$$
$$b_h^t = \theta(a_h^t)$$
$$a_k^t = \sum_{h=1}^{H} w_{hk} b_h^t$$
$$y_k^t = \frac{\exp(a_k^t)}{\sum_{k'=1}^{K} \exp(a_{k'}^t)}$$
where $I$ is the dimensionality of the input vector, $V$ is the dimensionality of the vectorized character or word, $H$ is the number of hidden-layer neurons, and $K$ is the number of output-layer neurons; $x$ is the feature data extracted by the convolutional neural network, and $v$ is the vector data converted from the character or word recognized by the recurrent neural network; $a_h^t$ is the input of hidden-layer neuron $h$ of the recurrent neural network at the current time step, and $b_h^t$ is its output; $a_k^t$ is the input of output-layer neuron $k$ of the recurrent neural network at the current time step, and $y_k^t$ is its output; $y_k^t$ is a probability value, namely the ratio of the output value of the corresponding neuron to the sum of the output values of all output-layer neurons at the current time step.
3. The method of claim 2, characterized in that the values of $w_{ih}$, $w_{lh}$, $w_{h'h}$ used at each time step during forward signal propagation are identical.
4. The method of claim 3, characterized in that during training the training result of the neural network is checked against a development set, and only the convolutional neural network and recurrent neural network model with the highest recognition accuracy on the development set is retained.
5. The method of any one of claims 1 to 3, characterized by comprising the following steps:
(2-1) inputting the manually annotated training samples into the convolutional neural network;
(2-2) performing feature extraction on the input training samples with the convolutional neural network;
(2-3) inputting the feature data extracted by the convolutional neural network, as the first data, into the recurrent neural network at the first time step;
(2-4) outputting the first prediction data from the computation of the recurrent neural network at the first time step, and obtaining from the first prediction data the character or word recognized by the recurrent neural network at this time step, defined as the first recognition result;
(2-5) converting the first recognition result into its corresponding vector data;
(2-6) using the first data, the first prediction data, and the vectorized first recognition result as the input of the recurrent neural network at the second time step, outputting the second prediction data from the computation of the recurrent neural network, and obtaining the corresponding second recognition result;
(2-7) converting the second recognition result into its corresponding vector data;
(2-8) using the first data, the second prediction data, and the vectorized second recognition result as the input of the recurrent neural network at the third time step;
recurring step by step in this way until the preset number of recursions is reached or a null value is output, then ending the computation.
6. The method of claim 5, characterized in that in steps (2-5) and (2-7) the vectorization is performed through a dictionary mapping table, the dictionary mapping table being a two-dimensional matrix in which each row vector corresponds to one character or one word, the correspondence between row vectors and characters/words being fixed when the table is built.
7. The method of claim 6, characterized in that when building the dictionary mapping table, if the basic unit is the word, the natural-language text is segmented into words.
8. The method of claim 7, characterized in that when preparing the training and development samples, the sample pictures are normalized, the normalization comprising: setting the maximum number of characters or words allowed in a picture to be recognized.
9. The method of claim 8, characterized in that when manually annotating the normalized samples, if the number of characters in a sample picture is smaller than the set maximum, the characters in the sample picture are padded with the <SP> mark.
CN201511020570.8A 2015-12-30 2015-12-30 Optical character sequence recognition method Pending CN105654129A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511020570.8A CN105654129A (en) 2015-12-30 2015-12-30 Optical character sequence recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511020570.8A CN105654129A (en) 2015-12-30 2015-12-30 Optical character sequence recognition method

Publications (1)

Publication Number Publication Date
CN105654129A true CN105654129A (en) 2016-06-08

Family

ID=56478266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511020570.8A Pending CN105654129A (en) 2015-12-30 2015-12-30 Optical character sequence recognition method

Country Status (1)

Country Link
CN (1) CN105654129A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080152217A1 (en) * 2006-05-16 2008-06-26 Greer Douglas S System and method for modeling the neocortex and uses therefor
CN104794501A (en) * 2015-05-14 2015-07-22 清华大学 Mode identification method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
丛爽: "前向递归神经网络" [Forward Recurrent Neural Networks], 《智能控制系统及其应用》 [Intelligent Control Systems and Their Applications] *
宣森炎等: "基于联合卷积和递归神经网络的交通标志识别" [Traffic Sign Recognition Based on Joint Convolutional and Recursive Neural Networks], 《传感器与微系统》 [Transducer and Microsystem Technologies] *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107516096A (en) * 2016-06-15 2017-12-26 阿里巴巴集团控股有限公司 A kind of character identifying method and device
CN107844794A (en) * 2016-09-21 2018-03-27 北京旷视科技有限公司 Image-recognizing method and device
CN106570521A (en) * 2016-10-24 2017-04-19 中国科学院自动化研究所 Multi-language scene character recognition method and recognition system
CN106570521B (en) * 2016-10-24 2020-04-28 中国科学院自动化研究所 Multilingual scene character recognition method and recognition system
CN107085730A (en) * 2017-03-24 2017-08-22 深圳爱拼信息科技有限公司 A kind of deep learning method and device of character identifying code identification
CN109409392A (en) * 2017-08-18 2019-03-01 广州极飞科技有限公司 The method and device of picture recognition
CN107992941A (en) * 2017-12-28 2018-05-04 武汉璞华大数据技术有限公司 A kind of contract terms sorting technique
CN108647310A (en) * 2018-05-09 2018-10-12 四川高原之宝牦牛网络技术有限公司 Identification model method for building up and device, character recognition method and device
CN109242796A (en) * 2018-09-05 2019-01-18 北京旷视科技有限公司 Character image processing method, device, electronic equipment and computer storage medium
CN109214386A (en) * 2018-09-14 2019-01-15 北京京东金融科技控股有限公司 Method and apparatus for generating image recognition model
CN111178495A (en) * 2018-11-10 2020-05-19 杭州凝眸智能科技有限公司 Lightweight convolutional neural network for detecting very small objects in images
CN109753966A (en) * 2018-12-16 2019-05-14 初速度(苏州)科技有限公司 A kind of Text region training system and method
CN109582972A (en) * 2018-12-27 2019-04-05 信雅达***工程股份有限公司 A kind of optical character identification error correction method based on natural language recognition
CN109582972B (en) * 2018-12-27 2023-05-16 信雅达科技股份有限公司 Optical character recognition error correction method based on natural language recognition
WO2020248471A1 (en) * 2019-06-14 2020-12-17 华南理工大学 Aggregation cross-entropy loss function-based sequence recognition method
CN110598703A (en) * 2019-09-24 2019-12-20 深圳大学 OCR (optical character recognition) method and device based on deep neural network
CN110598703B (en) * 2019-09-24 2022-12-20 深圳大学 OCR (optical character recognition) method and device based on deep neural network
CN112801085A (en) * 2021-02-09 2021-05-14 沈阳麟龙科技股份有限公司 Method, device, medium and electronic equipment for recognizing characters in image

Similar Documents

Publication Publication Date Title
CN105654129A (en) Optical character sequence recognition method
CN105654135A (en) Image character sequence recognition system based on recurrent neural network
CN105654130A (en) Recurrent neural network-based complex image character sequence recognition system
CN105678293A (en) Complex image and text sequence identification method based on CNN-RNN
Mathew et al. Docvqa: A dataset for vqa on document images
CN105654127A (en) End-to-end-based picture character sequence continuous recognition method
CN105678292A (en) Complex optical text sequence identification system based on convolution and recurrent neural network
CN110807328B (en) Named entity identification method and system for legal document multi-strategy fusion
CN105678300A (en) Complex image and text sequence identification method
Kafle et al. Answering questions about data visualizations using efficient bimodal fusion
CN104966097A (en) Complex character recognition method based on deep learning
CN111259897B (en) Knowledge-aware text recognition method and system
CN109492099A (en) It is a kind of based on field to the cross-domain texts sensibility classification method of anti-adaptive
CN107220506A (en) Breast cancer risk assessment analysis system based on deep convolutional neural network
CN106446954A (en) Character recognition method based on depth learning
CN110866388A (en) Publishing PDF layout analysis and identification method based on mixing of multiple neural networks
Calvo-Zaragoza et al. End-to-end optical music recognition using neural networks
CN110276069A (en) A kind of Chinese braille mistake automatic testing method, system and storage medium
CN114781392A (en) Text emotion analysis method based on BERT improved model
CN110674777A (en) Optical character recognition method in patent text scene
CN111966812A (en) Automatic question answering method based on dynamic word vector and storage medium
CN114579746A (en) Optimized high-precision text classification method and device
CN117011638A (en) End-to-end image mask pre-training method and device
Engin et al. Multimodal deep neural networks for banking document classification
CN112164040A (en) Steel surface defect identification method based on semi-supervised deep learning algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160608

RJ01 Rejection of invention patent application after publication