CN110717336A - Scene text recognition method based on semantic relevance prediction and attention decoding - Google Patents

Scene text recognition method based on semantic relevance prediction and attention decoding

Info

Publication number
CN110717336A
CN110717336A
Authority
CN
China
Prior art keywords
semantic
neural network
network model
deep neural
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910898753.1A
Other languages
Chinese (zh)
Inventor
陈晓雪
金连文
王天玮
毛慧芸
朱远志
罗灿杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201910898753.1A
Publication of CN110717336A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a scene text recognition method based on semantic relevance prediction and attention decoding, comprising the following steps. S1, data acquisition: acquire a synthetic training data set, a real evaluation data set and a common root statistical table, the common root statistical table serving as semantic guidance. S2, data processing: stretch-transform the synthetic training data set and the real evaluation data set to a uniform standard. S3, deep neural network model training. S4, scene text recognition: input the scene text image to be recognized into the deep neural network model, which accurately recognizes it and returns a string of characters as the recognition result. The semantic relevance prediction module of the invention uses the common root statistical table as semantic guidance to provide more accurate high-order prior information for the semantic attention mechanism; the learned parameters better fit the image characteristics of real scene text, and the recognition accuracy is higher.

Description

Scene text recognition method based on semantic relevance prediction and attention decoding
Technical Field
The invention relates to the technical field of pattern recognition and artificial intelligence, and in particular to a scene text recognition method based on semantic relevance prediction and attention decoding.
Background
Text carries a large amount of accurate and rich semantic information that is valuable in many practical applications, such as intelligent retrieval, autonomous driving, and assistive devices for visually impaired people. Scene text recognition has therefore long been one of the research topics in the field of computer vision. Unlike optical character recognition in scanned documents, scene text recognition is very challenging because of the variety of text fonts, low image resolution, and the susceptibility of images to lighting and shadow variations. In recent years, the rapid development of deep neural networks has greatly advanced innovative applications of artificial intelligence. Deep neural network models, particularly those based on the attention mechanism, have achieved strong performance in scene text recognition. An attention-based recognition network focuses on text regions and implicitly embeds high-order prior information about adjacent characters, thereby providing a high-order statistical language model for the subsequent transcription process and improving recognition performance. However, the attention mechanism widely used in existing scene text recognition lacks selectivity over this high-order prior information: it supplies equally weighted prior guidance in all recognition situations, so the correlation between strongly related characters is not strengthened, nor is the correlation between unrelated characters weakened.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a scene text recognition method based on semantic relevance prediction and attention decoding that offers high recognition accuracy, no additional computational overhead in the test stage, and high recognition speed.
The purpose of the invention is realized by the following technical scheme:
a scene text recognition method based on semantic relatedness prediction and attention decoding comprises the following steps:
S1, data acquisition: acquiring a synthetic training data set, a real evaluation data set and a common root statistical table, the common root statistical table being used as semantic guidance;
S2, data processing: stretch-transforming the synthetic training data set and the real evaluation data set to a uniform standard;
S3, deep neural network model training: inputting the normalized synthetic training data set, the corresponding annotated text data and the common root statistical table into the deep neural network model for training, wherein the annotated text data and the semantic guidance are used for supervised parameter learning during training; the deep neural network model comprises a semantic relevance prediction module and a semantic attention mechanism decoding module;
S4, scene text recognition: inputting the scene text image to be recognized into the deep neural network model, which accurately recognizes it and returns a string of characters as the recognition result.
Preferably, the scene text in the synthetic training data set and the real evaluation data set occupies more than two thirds of the area of the scene text image; the text part of the synthetic training data set covers N different font styles, with N ≥ 2; the real evaluation data set is captured with a camera; and the common root statistical table contains 707 common roots whose lengths range from 2 to 10 characters.
Preferably, the stretch transform in step S2 is a bilinear interpolation or downsampling operation.
Preferably, step S3 includes:
S31, constructing a deep neural network model;
S32, setting the parameters for deep neural network model training: number of iterations 1,000,000, optimizer Adadelta, learning rate 1.0;
S33, training the deep neural network under the set initialization parameters.
Preferably, the model structure of the deep neural network model is as shown in Table 1:
Table 1. Model structure of the deep neural network model [available only as an image in the original publication]
Table 2. Model structure of the residual layer [available only as an image in the original publication]
The model structure of the residual layer within the deep neural network model is shown in Table 2; the nonlinear layers in the residual layer all use the ReLU activation function, and the downsampling layer is implemented by a convolution layer and a batch normalization layer.
Preferably, step S4 comprises: the scene text image to be recognized is passed through the deep convolutional neural network model to obtain a robust high-level feature representation, and the semantic relevance prediction module, using the common root statistical table as semantic guidance, predicts the semantic relevance parameters of adjacent characters; the semantic attention mechanism decoding module then performs transcription and correction based on the adjacent-character semantic relevance parameters and the high-level features of the text image, yielding a string of characters as the recognition result.
Preferably, between steps S3 and S4 the method further comprises testing the deep neural network model: the real evaluation data set is input into the deep neural network model, which recognizes it and returns a string of characters as the recognition result; if the recognition result is consistent with the annotated text data corresponding to the real evaluation data set, the recognition capability of the deep neural network model meets the preset requirement.
Compared with the prior art, the invention has the following advantages:
(1) The deep neural network model comprises a semantic relevance prediction module and a semantic attention mechanism decoding module. The semantic relevance prediction module uses the root statistical table as semantic guidance to predict the semantic relevance parameters of adjacent characters, providing more accurate high-order prior information for the semantic attention mechanism; the learned parameters better fit the image characteristics of real scene text, and the recognition accuracy is higher.
(2) The semantic attention mechanism relies only on the common root statistical table as semantic guidance, so the semantic relevance annotation requires no manual labeling, saving considerable manpower and material resources; in practical applications, recognition accuracy is effectively improved.
(3) A back-propagation algorithm automatically adjusts the convolution kernel parameters, yielding more robust filters that can cope with application conditions such as image blur, perspective transformation, and lighting changes.
(4) Compared with manual processing, the scheme completes scene text recognition automatically, saving manpower and material resources.
(5) Compared with traditional computer-vision attention mechanism methods, the method constructs semantic relevance selectively, and is simple to implement, highly accurate, free of additional computational overhead in the test stage, and fast at recognition.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart illustrating a scene text recognition method based on semantic relatedness prediction and attention decoding according to the present invention.
Detailed Description
The invention is further illustrated by the following figures and examples.
Referring to FIG. 1, a scene text recognition method based on semantic relevance prediction and attention decoding comprises:
S1, data acquisition: acquire a synthetic training data set, a real evaluation data set and a common root statistical table, the common root statistical table serving as semantic guidance. The scene text in both the synthetic training data set and the real evaluation data set occupies more than two thirds of the area of the scene text image; the text part of the synthetic training data set covers N different font styles, with N ≥ 2, and the synthetic training data set is allowed to contain a certain degree of lighting and resolution variation. The real evaluation data set is captured with a camera; during shooting, the text in the normalized scene text image occupies more than two thirds of the image area, and a certain amount of tilt and blur is allowed. The common root statistical table contains 707 common roots whose lengths range from 2 to 10 characters. Both the training data set and the real evaluation data set cover a variety of font styles, lighting changes and resolution changes.
the natural scene picture or image refers to a picture or image obtained by an electronic device such as a mobile phone, for example, a street view image such as a street sign or a signboard. Scene character recognition refers to recognizing character information in a natural scene picture. Because characters in natural scene pictures are rich in display forms, the image background is complex, the resolution ratio is low and the like, the difficulty is far higher than that of character recognition in traditional scanned document images.
S2, data processing: stretch-transform the synthetic training data set and the real evaluation data set to a uniform 32 x 100 standard so that the deep neural network model can process batches in parallel; the stretch transform is a bilinear interpolation or downsampling operation.
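As an illustration, the following minimal sketch (assuming PyTorch, with images already loaded as float tensors in (N, C, H, W) layout; the sample input size is an arbitrary assumption) performs the 32 x 100 bilinear stretch transform described above:

    import torch
    import torch.nn.functional as F

    def normalize_batch(images: torch.Tensor) -> torch.Tensor:
        """Stretch-transform a batch of scene text images to the uniform 32 x 100 standard."""
        return F.interpolate(images, size=(32, 100), mode="bilinear", align_corners=False)

    crop = torch.rand(1, 3, 48, 160)    # one 48 x 160 RGB crop (shape is an assumption)
    print(normalize_batch(crop).shape)  # torch.Size([1, 3, 32, 100])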
S3, deep neural network model training: input the normalized synthetic training data set, the corresponding annotated text data and the common root statistical table into the deep neural network model for training; the annotated text data and the semantic guidance are used for supervised parameter learning. The deep neural network model comprises a semantic relevance prediction module and a semantic attention mechanism decoding module; the semantic relevance prediction module uses the root statistical table as semantic guidance to predict the semantic relevance parameters of adjacent characters, providing more accurate high-order prior information for the semantic attention mechanism.
In step S3, the corresponding annotated text data refers to the annotation of the text contained in an image of the synthetic training data set. For example, if a street-view image contains the word "china", the annotated text data of that image is "china". Each image corresponds to one line of annotated text data.
Between steps S3 and S4, the method further comprises testing the deep neural network model: the real evaluation data set is input into the deep neural network model, which recognizes it and returns a string of characters as the recognition result; if the recognition result is consistent with the annotated text data corresponding to the real evaluation data set, the recognition capability of the deep neural network model meets the preset requirement.
S4, scene text recognition: input the scene text image to be recognized into the deep neural network model, which accurately recognizes it and returns a string of characters as the recognition result.
It should be noted that the procedures for testing the deep neural network model and for scene text recognition are identical; they differ only in the images fed to the model. The test uses text images from the real evaluation data set, whose texts are known in advance; if the recognition results match these known texts, the recognition capability of the model is good. Scene text recognition instead takes a scene text image to be recognized, feeds it into the deep neural network model whose recognition capability has been verified, and returns a string of characters as the text contained in the image.
Further, step S4 comprises: the scene text image to be recognized is passed through the deep convolutional neural network model to obtain a robust high-level feature representation, and the semantic relevance prediction module, using the common root statistical table as semantic guidance, predicts the semantic relevance parameters of adjacent characters; the semantic attention mechanism decoding module then performs transcription and correction based on the adjacent-character semantic relevance parameters and the high-level features of the text image, yielding a string of characters as the recognition result.
In the present embodiment, step S3 includes:
S31, constructing a deep neural network model;
S32, setting the parameters for deep neural network model training: number of iterations 1,000,000, optimizer Adadelta, learning rate 1.0;
S33, training the deep neural network under the set initialization parameters.
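The training loop of S32-S33 can be sketched as follows (a hedged illustration assuming PyTorch; the stand-in model and data replace the real network of Tables 1-2, which is not reproduced here, and the loss is a placeholder for the function L defined later):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Linear(10, 5)  # stand-in for the deep neural network model of S31
    optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)  # S32: Adadelta, learning rate 1.0

    for iteration in range(1_000_000):            # S32: 1,000,000 iterations
        x = torch.randn(32, 10)                   # stand-in training batch
        target = torch.randint(0, 5, (32,))       # stand-in labels
        loss = F.cross_entropy(model(x), target)  # stand-in for the loss L defined below
        optimizer.zero_grad()
        loss.backward()                           # back-propagation, per the training strategy
        optimizer.step()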
The model structure of the deep neural network model is shown in Table 1:
Table 1. Model structure of the deep neural network model [available only as an image in the original publication]
Table 2. Model structure of the residual layer [available only as an image in the original publication]
The model structure of the residual layer within the deep neural network model is shown in Table 2; the nonlinear layers in the residual layer all use the ReLU activation function, and the downsampling layer is implemented by a convolution layer and a batch normalization layer. The stride of the last three residual layers is changed from 2 x 2 to 2 x 1, which better suits the aspect ratio of scene text images and facilitates the extraction of robust spatial features.
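A minimal sketch of one such residual layer follows, assuming PyTorch; the channel counts and the two-convolution main branch are illustrative assumptions, since Tables 1-2 are available only as images in the original publication:

    import torch
    import torch.nn as nn

    class ResidualLayer(nn.Module):
        def __init__(self, in_ch: int, out_ch: int, stride=(2, 1)):
            super().__init__()
            # Main branch: two 3x3 convolutions with BN; nonlinearities are ReLU.
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
                nn.BatchNorm2d(out_ch),
            )
            # Downsampling shortcut: a convolution layer plus a batch normalization
            # layer, as described in the text.
            self.down = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.body(x) + self.down(x))

    # A 2x1 stride halves the height while preserving the width,
    # matching the wide aspect ratio of scene text images.
    feat = ResidualLayer(64, 128, stride=(2, 1))(torch.rand(1, 64, 16, 100))
    print(feat.shape)  # torch.Size([1, 128, 8, 100])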
The semantic relevance prediction module uses the common root statistical table as semantic guidance to provide more accurate high-order prior information for the semantic attention mechanism. After removing duplicate roots and single-letter roots, the common root statistical table contains 707 common roots in total. Root lengths are mainly distributed between 2 and 10 characters; roots of 3 to 4 characters account for the largest share, about 71.99%, with typical examples such as 'ing' and 'ane'. Very few roots exceed 8 characters.
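The preparation of the root table can be illustrated as follows (a sketch under the assumption that the raw list is available in memory; the inline sample list is hypothetical and far smaller than the real 707-entry table):

    from collections import Counter

    # Hypothetical raw root list; the actual table holds many more entries.
    raw = ["ing", "ane", "a", "at", "ation", "ion", "for", "form", "or", "in", "ing"]

    # Remove duplicate and single-letter roots, as described above; after this
    # filtering the table used by the method contains 707 common roots.
    roots = sorted({r for r in raw if len(r) >= 2})

    # Check the length distribution (in the real table, 3-4 character roots
    # account for about 71.99%, and very few roots exceed 8 characters).
    length_hist = Counter(len(r) for r in roots)
    print(len(roots), dict(sorted(length_hist.items())))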
Given the input picture I and its ground-truth annotation g = (g_1, g_2, ..., g_L), let the symbol γ̂ = (γ̂_1, γ̂_2, ..., γ̂_{L-1}) denote the ground-truth annotation of the high-order prior information score γ_t; each value of γ̂ represents the semantic relevance between a pair of adjacent characters, so the vector γ̂ has length L-1. The annotation γ̂ is constructed as follows:
let the scene text picture label information be "information", and the character length be 11 characters, so
Figure BDA0002211119430000094
The character length is 10 characters. If two adjacent characters form the root, thenIncreases by 1 and vice versa by 0. The annotation information "contains 7 roots in total, which are 'at', 'position', 'or', 'for', 'form', 'in' and 'ion', respectively, and the above process is repeated to obtain the final high-level semantic vector
Figure BDA0002211119430000096
Is [1, 0, 2, 3, 1,0, 2, 1, 2]. In the course of the deep neural network training process,
Figure BDA0002211119430000097
is normalized to the interval [0, 1 ]]. The process does not require manual labeling.
A semantic prior loss function L_p is further defined as

L_p = MSELoss(γ, γ̂),   (1)

where MSELoss denotes the mean square error between the predicted values and the ground-truth labels.
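The label construction and the prior loss L_p can be sketched as follows (assuming PyTorch; the root-matching rule, counting every occurrence of every root covering an adjacent character pair, is our reading of the description, and the small root list is just the subset relevant to the example word):

    import torch
    import torch.nn.functional as F

    ROOTS = ["in", "for", "form", "or", "at", "ation", "ion"]  # roots found in "information"

    def adjacent_pair_counts(word: str, roots=ROOTS) -> torch.Tensor:
        counts = [0] * (len(word) - 1)           # one entry per adjacent character pair
        for root in roots:
            start = word.find(root)
            while start != -1:                   # every occurrence of this root
                for j in range(start, start + len(root) - 1):
                    counts[j] += 1               # pair (j, j+1) lies inside the root
                start = word.find(root, start + 1)
        return torch.tensor(counts, dtype=torch.float)

    gamma_hat = adjacent_pair_counts("information")
    print(gamma_hat.tolist())                    # [1, 0, 2, 3, 1, 0, 2, 1, 2, 2]
    gamma_hat = gamma_hat / gamma_hat.max().clamp(min=1)  # normalize to [0, 1]

    # Semantic prior loss L_p, equation (1): mean square error between the
    # module's prediction and the constructed labels.
    predicted = torch.rand_like(gamma_hat)       # stand-in prediction
    L_p = F.mse_loss(predicted, gamma_hat)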
The semantic attention mechanism decoding module then performs targeted transcription and correction based on the semantic relevance parameters and the high-level feature representation produced by the deep convolutional neural network, yielding a string of characters as the recognition result.
Let F_e(I) = (h_1, h_2, ..., h_N) denote the encoding process of the deep convolutional neural network. The decoding module based on the semantic attention mechanism aligns the prediction sequence y = (y_1, y_2, ..., y_T) with the ground truth g = (g_1, g_2, ..., g_L), where T denotes the maximum decoding step. At time t, the output y_t of the recognition model can be expressed as

y_t = Softmax(W_o s_t + b_o),   (2)
where s_t is the hidden state of the Gated Recurrent Unit (GRU) at time t. The GRU is a variant of the recurrent neural network that is often used to model long-range semantic dependencies of text sequences. s_t is computed as

s_t = GRU((p'_t, c_t), s_{t-1}).   (3)
Here p'_t is derived from the embedding p_t of the previous output y_{t-1}. Unlike the conventional attention mechanism, the semantic attention mechanism computes p'_t as

p'_t = γ_t p_t,   (4)

where γ_t reflects the degree of correlation between the adjacent characters y_t and y_{t-1}: a larger γ_t indicates stronger semantic correlation between the adjacent characters, a smaller γ_t indicates weaker correlation, and γ_t = 0 means the adjacent characters are semantically unrelated. Accordingly, γ_t is computed as
γ_t = f_emb(c_t, c_{t-1}),   (5)
further, a priori function fembThe calculation method is that,
femb(ct,ct-1)=σ(VcTanh(Wpct-1+Wcct+bc), (6)
where σ is the Sigmoid activation function and the symbol c_t denotes the semantic vector, given by the weighted sum of the features:

c_t = Σ_{j=1}^{N} α_{t,j} h_j,   (7)

where the symbol N denotes the length of the feature sequence and α_{t,j} is the weight vector of the attention mechanism, conventionally given by

α_{t,j} = exp(e_{t,j}) / Σ_{k=1}^{N} exp(e_{t,k}),   (8)

e_{t,j} = f_attn(s_{t-1}, h_j),   (9)
where the alignment function f_attn is computed as

f_attn(s_{t-1}, h_j) = V_a Tanh(W_s s_{t-1} + W_f h_j + b).   (10)
The quantities W_o, b_o, V_a, W_s, W_f, b, V_c, W_p, W_c and b_c above are all learnable parameters. When the recognition model predicts the end-of-sequence token EOS, the semantic attention mechanism decoding module finishes the transcription process.
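One decoding step implementing equations (2)-(10) can be sketched as follows (assuming PyTorch; all dimensions, and the use of an embedding layer for p_t, are illustrative assumptions):

    import torch
    import torch.nn as nn

    class SemanticAttentionDecoderStep(nn.Module):
        def __init__(self, feat_dim=512, hidden=256, emb=128, num_classes=97):
            super().__init__()
            self.embed = nn.Embedding(num_classes, emb)         # p_t: embedding of y_{t-1}
            self.gru = nn.GRUCell(emb + feat_dim, hidden)       # eq. (3)
            self.W_s = nn.Linear(hidden, hidden, bias=False)    # eq. (10)
            self.W_f = nn.Linear(feat_dim, hidden)              # eq. (10), carries bias b
            self.V_a = nn.Linear(hidden, 1, bias=False)         # eq. (10)
            self.W_p = nn.Linear(feat_dim, hidden, bias=False)  # eq. (6)
            self.W_c = nn.Linear(feat_dim, hidden)              # eq. (6), carries bias b_c
            self.V_c = nn.Linear(hidden, 1, bias=False)         # eq. (6)
            self.W_o = nn.Linear(hidden, num_classes)           # eq. (2)

        def forward(self, y_prev, s_prev, c_prev, H):
            # H: encoder features F_e(I) = (h_1 ... h_N), shape (B, N, feat_dim).
            e = self.V_a(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_f(H)))  # eqs. (9)-(10)
            alpha = torch.softmax(e, dim=1)                                        # eq. (8)
            c = (alpha * H).sum(dim=1)                                             # eq. (7)
            gamma = torch.sigmoid(self.V_c(torch.tanh(self.W_p(c_prev) + self.W_c(c))))  # eqs. (5)-(6)
            p = gamma * self.embed(y_prev)                                         # eq. (4)
            s = self.gru(torch.cat([p, c], dim=-1), s_prev)                        # eq. (3)
            return self.W_o(s), s, c, gamma                                        # eq. (2), pre-softmax

    # One step: batch of 2, N = 25 feature columns; decoding repeats until EOS.
    step = SemanticAttentionDecoderStep()
    H = torch.rand(2, 25, 512)
    y_prev = torch.zeros(2, dtype=torch.long)   # e.g. a start token
    s_prev, c_prev = torch.zeros(2, 256), torch.zeros(2, 512)
    logits, s, c, gamma = step(y_prev, s_prev, c_prev, H)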
The attention mechanism loss function, denoted L_attn, is expressed as

L_attn = -Σ_{t=1}^{L} log P(g_t | I, θ),   (11)

where θ denotes all learnable parameters of the deep neural network model.
Combining the semantic prior loss L_p provided by the semantic relevance prediction module, the final optimization objective of the deep recognition network is defined as

L = L_attn + λ L_p,   (12)

where the hyper-parameter λ balances the attention loss against the semantic prior loss; in the experiments λ is set to the constant 1.
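The combined objective of equations (11)-(12) can be sketched as follows (assuming PyTorch; cross-entropy over the ground-truth characters is used as the negative log-likelihood of equation (11)):

    import torch
    import torch.nn.functional as F

    def total_loss(char_logits, char_targets, gamma_pred, gamma_labels, lam=1.0):
        # L_attn: negative log-likelihood of the ground-truth characters (eq. (11)),
        # implemented here as cross-entropy over the decoder's per-step logits.
        L_attn = F.cross_entropy(char_logits, char_targets)
        # L_p: mean square error of the semantic relevance prediction (eq. (1)).
        L_p = F.mse_loss(gamma_pred, gamma_labels)
        return L_attn + lam * L_p  # eq. (12), with lambda = 1 in the experiments

    # Example with stand-in tensors: 10 decoding steps, 97 character classes.
    loss = total_loss(torch.randn(10, 97), torch.randint(0, 97, (10,)),
                      torch.rand(9), torch.rand(9))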
During network model training, a back-propagation algorithm is used: the gradient is computed at the last layer and propagated layer by layer to update all parameters of the network model. The training strategy is supervised: a general deep recognition network is trained with the synthetic image data, the corresponding annotations and the root table. The input of the recognition model is a normalized scene text image, the output is the character sequence in the image, and the training loss is the function L above.
The scene text recognition of this scheme can be used for automatic guideboard recognition, intelligent retrieval, archiving of image data, and the like.
The scene text recognition method based on semantic relevance prediction and attention decoding makes full use of the semantic guidance of the common root table and, building on the learning capability of the deep network model and the physical meaning of the back-propagated residual, learns the distribution of the data samples to provide accurate scene text recognition. The method is simple to implement, highly accurate, free of additional computational overhead in the test stage, and fast at recognition, and therefore has good practical value.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited thereto; any other modification or equivalent substitution that does not depart from the technical spirit of the present invention falls within the scope of the present invention.

Claims (7)

1. A scene text recognition method based on semantic relevance prediction and attention decoding, characterized by comprising the following steps:
S1, data acquisition: acquiring a synthetic training data set, a real evaluation data set and a common root statistical table;
S2, data processing: stretch-transforming the synthetic training data set and the real evaluation data set to a uniform standard;
S3, deep neural network model training: inputting the normalized synthetic training data set, the corresponding annotated text data and the common root statistical table into the deep neural network model for training, wherein the deep neural network model comprises a semantic relevance prediction module and a semantic attention mechanism decoding module, and the semantic relevance prediction module uses the root statistical table as semantic guidance to predict the semantic relevance parameters of adjacent characters;
S4, scene text recognition: inputting the scene text image to be recognized into the deep neural network model, which accurately recognizes it and returns a string of characters as the recognition result.
2. The scene text recognition method based on semantic relevance prediction and attention decoding of claim 1, wherein: the scene text in the synthetic training data set and the real evaluation data set occupies more than two thirds of the area of the scene text image; the text part of the synthetic training data set covers N different font styles, with N ≥ 2; the real evaluation data set is captured with a camera; and the common root statistical table contains 707 common roots whose lengths range from 2 to 10 characters.
3. The scene text recognition method based on semantic relevance prediction and attention decoding of claim 1, wherein: the stretch transform in step S2 is a bilinear interpolation or downsampling operation.
4. The scene text recognition method based on semantic relevance prediction and attention decoding of claim 1, wherein step S3 comprises:
S31, constructing a deep neural network model;
S32, setting the parameters for deep neural network model training: number of iterations 1,000,000, optimizer Adadelta, learning rate 1.0;
S33, training the deep neural network under the set initialization parameters.
5. The scene text recognition method based on semantic relevance prediction and attention decoding of claim 4, wherein the model structure of the deep neural network model is as shown in Table 1:
Table 1. Model structure of the deep neural network model [available only as an image in the original publication]
Table 2. Model structure of the residual layer [available only as an image in the original publication]
The model structure of the residual layer within the deep neural network model is shown in Table 2; the nonlinear layers in the residual layer all use the ReLU activation function, and the downsampling layer is implemented by a convolution layer and a batch normalization layer.
6. The scene text recognition method based on semantic relevance prediction and attention decoding of claim 1, wherein step S4 comprises:
the method comprises the steps that a scene text image to be recognized obtains high-level feature expression with robustness through a deep convolutional neural network model, and a semantic relevancy prediction module predicts to obtain semantic relevancy parameters of adjacent characters by taking a common root statistical table as semantic guidance; and the semantic attention mechanism decoding module performs transcription and correction according to the adjacent character semantic relatedness parameter and the high-level feature expression of the text image to obtain a string of characters as a recognition result.
7. The scene text recognition method based on semantic relevance prediction and attention decoding of claim 1, further comprising, between steps S3 and S4: testing the deep neural network model;
the deep neural network model test comprises the following steps: inputting the real evaluation data set into a deep neural network model, accurately identifying the real evaluation data set by the deep neural network model, and returning a string of characters as an identification result; and if the recognition result is consistent with the labeled text data corresponding to the real evaluation data set, the recognition capability of the deep neural network model reaches the preset requirement.
CN201910898753.1A 2019-09-23 2019-09-23 Scene text recognition method based on semantic relevance prediction and attention decoding Pending CN110717336A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910898753.1A CN110717336A (en) 2019-09-23 2019-09-23 Scene text recognition method based on semantic relevance prediction and attention decoding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910898753.1A CN110717336A (en) 2019-09-23 2019-09-23 Scene text recognition method based on semantic relevance prediction and attention decoding

Publications (1)

Publication Number Publication Date
CN110717336A true CN110717336A (en) 2020-01-21

Family

ID=69210752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910898753.1A Pending CN110717336A (en) 2019-09-23 2019-09-23 Scene text recognition method based on semantic relevance prediction and attention decoding

Country Status (1)

Country Link
CN (1) CN110717336A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097049A (en) * 2019-04-03 2019-08-06 中国科学院计算技术研究所 A kind of natural scene Method for text detection and system
CN110147763A (en) * 2019-05-20 2019-08-20 哈尔滨工业大学 Video semanteme dividing method based on convolutional neural networks

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428593A (en) * 2020-03-12 2020-07-17 北京三快在线科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN113553885A (en) * 2020-04-26 2021-10-26 复旦大学 Natural scene text recognition method based on generation countermeasure network
CN111860116A (en) * 2020-06-03 2020-10-30 南京邮电大学 Scene identification method based on deep learning and privilege information
CN111783705A (en) * 2020-07-08 2020-10-16 厦门商集网络科技有限责任公司 Character recognition method and system based on attention mechanism
CN111783705B (en) * 2020-07-08 2023-11-14 厦门商集网络科技有限责任公司 Character recognition method and system based on attention mechanism
CN113673507A (en) * 2020-08-10 2021-11-19 广东电网有限责任公司 Electric power professional equipment nameplate recognition algorithm
CN111967471A (en) * 2020-08-20 2020-11-20 华南理工大学 Scene text recognition method based on multi-scale features
CN111967470A (en) * 2020-08-20 2020-11-20 华南理工大学 Text recognition method and system based on decoupling attention mechanism
CN112990196A (en) * 2021-03-16 2021-06-18 北京大学 Scene character recognition method and system based on hyper-parameter search and two-stage training
CN112990196B (en) * 2021-03-16 2023-10-24 北京大学 Scene text recognition method and system based on super-parameter search and two-stage training
CN113743291A (en) * 2021-09-02 2021-12-03 南京邮电大学 Method and device for detecting text in multiple scales by fusing attention mechanism
CN113743291B (en) * 2021-09-02 2023-11-07 南京邮电大学 Method and device for detecting texts in multiple scales by fusing attention mechanisms
CN118072973A (en) * 2024-04-15 2024-05-24 慧医谷中医药科技(天津)股份有限公司 Intelligent inquiry method and system based on medical knowledge base

Similar Documents

Publication Publication Date Title
CN110717336A (en) Scene text recognition method based on semantic relevance prediction and attention decoding
CN110083831B (en) Chinese named entity identification method based on BERT-BiGRU-CRF
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN111967471A (en) Scene text recognition method based on multi-scale features
Mathew et al. Benchmarking scene text recognition in Devanagari, Telugu and Malayalam
CN111967470A (en) Text recognition method and system based on decoupling attention mechanism
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN111950528B (en) Graph recognition model training method and device
CN112819686A (en) Image style processing method and device based on artificial intelligence and electronic equipment
CN112257716A (en) Scene character recognition method based on scale self-adaption and direction attention network
CN110472248A (en) A kind of recognition methods of Chinese text name entity
CN114492646A (en) Image-text matching method based on cross-modal mutual attention mechanism
Wu et al. STR transformer: a cross-domain transformer for scene text recognition
Selvam et al. A transformer-based framework for scene text recognition
CN113886615A (en) Hand-drawn image real-time retrieval method based on multi-granularity association learning
CN116740362B (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
CN111242114B (en) Character recognition method and device
CN110909645B (en) Crowd counting method based on semi-supervised manifold embedding
CN117292126A (en) Building elevation analysis method and system using repeated texture constraint and electronic equipment
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN115546553A (en) Zero sample classification method based on dynamic feature extraction and attribute correction
CN114694133A (en) Text recognition method based on combination of image processing and deep learning
CN113362088A (en) CRNN-based telecommunication industry intelligent customer service image identification method and system
CN114298047A (en) Chinese named entity recognition method and system based on stroke volume and word vector
CN113361277A (en) Medical named entity recognition modeling method based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200121)