CN112070114A - Scene character recognition method and system based on Gaussian constraint attention mechanism network - Google Patents

Scene character recognition method and system based on Gaussian constraint attention mechanism network

Info

Publication number
CN112070114A
CN112070114A
Authority
CN
China
Prior art keywords
original
feature vector
dimensional
time step
hidden state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010767079.6A
Other languages
Chinese (zh)
Other versions
CN112070114B (en)
Inventor
王伟平 (Wang Weiping)
乔峙 (Qiao Zhi)
秦绪功 (Qin Xugong)
周宇 (Zhou Yu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202010767079.6A priority Critical patent/CN112070114B/en
Publication of CN112070114A publication Critical patent/CN112070114A/en
Application granted granted Critical
Publication of CN112070114B publication Critical patent/CN112070114B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a scene character recognition method and system based on a Gaussian constrained attention mechanism network, relating to the field of image information recognition. Visual features are extracted from the picture to be recognized to obtain a two-dimensional feature map; the two-dimensional feature map is converted into a one-dimensional feature sequence, from which global semantic information is extracted; the global semantic information is input at the first time step to initialize the decoding hidden state, and at each time step an original attention weight is calculated from the hidden state and the two-dimensional feature map, with a weighted sum of the features giving the original weighted feature vector; a two-dimensional Gaussian distribution mask is constructed from the hidden state and the original weighted feature vector and multiplied with the original attention weight to obtain a corrected attention weight, from which a corrected weighted feature vector is obtained; the original weighted feature vector and the corrected weighted feature vector are fused to predict the characters of the picture to be recognized, which alleviates attention drift.

Description

Scene character recognition method and system based on Gaussian constraint attention mechanism network
Technical Field
The invention relates to the field of image information recognition, in particular to a scene character recognition method and system based on a Gaussian constrained attention mechanism network.
Background
Text detection and recognition in scene images has been a research hotspot in recent years. Character recognition is the core of the whole pipeline: its task is to transcribe the characters in an image into a form a computer can edit directly. With the development of deep learning, the field has advanced rapidly. Inspired by machine translation, the current mainstream methods are based on an encoder-decoder structure: the encoder extracts rich visual features through a convolutional neural network and a recurrent neural network, and the decoder obtains the required features through an attention mechanism to predict each character in order along the text sequence.
However, the prior art has the following defects:
1. At each decoding time step, character recognition only needs the specific region of the current character in the text image, and existing methods do not fully exploit this property of text recognition.
2. Existing methods do not constrain the attention weights but let the model predict them freely; on some images this causes attention drift, i.e., the weights cannot concentrate on a specific character.
3. Some existing approaches supervise the attention weights with Gaussian-distributed labels of each character's position, thereby constraining them implicitly. However, because no explicit constraint is introduced into the computation, attention drift still occurs on some images.
Disclosure of Invention
The invention aims to provide a scene character recognition method and system based on a Gaussian constrained attention mechanism network, which introduce an explicit constraint into the computation of the attention weights to correct the original attention weight, so that the corrected attention weight concentrates more on the region corresponding to a character, alleviating attention drift.
In order to achieve the purpose, the invention adopts the following technical scheme:
a scene character recognition method based on a Gaussian constraint attention mechanism network comprises the following steps:
extracting visual features of the picture to be identified to obtain a two-dimensional feature map;
converting the two-dimensional feature map into a one-dimensional feature sequence, and extracting global semantic information according to the one-dimensional feature sequence;
inputting global semantic information into a first time step to initialize a decoding hidden state, calculating an original attention weight according to the hidden state and a two-dimensional feature map in each time step, and obtaining an original weighted feature vector by weighting and summing the weights;
constructing a two-dimensional Gaussian distribution mask according to the hidden state and the original weighted feature vector, multiplying the mask by the original attention weight to obtain a corrected attention weight, and obtaining a corrected weighted feature vector according to the weight;
and fusing the original weighted feature vector and the corrected weighted feature vector together to predict characters of the picture to be recognized.
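The final step above fuses the original and corrected weighted feature vectors to predict a character. A minimal numpy sketch follows; the concatenation-plus-linear-layer fusion and all shapes are assumptions for illustration, since the patent only states that the two vectors are "fused together".

```python
import numpy as np

# Hypothetical sizes: feature dimension C and character-set size (illustrative).
C, num_classes = 512, 37
glimpse_orig = np.random.rand(C)     # original weighted feature vector
glimpse_refined = np.random.rand(C)  # corrected weighted feature vector

# Assumed fusion: concatenate the two glimpses, then apply a linear classifier.
W_fuse = np.random.rand(num_classes, 2 * C) * 0.01
fused = np.concatenate([glimpse_orig, glimpse_refined])
logits = W_fuse @ fused

# Softmax over the character set gives the prediction for this time step.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_char_idx = int(probs.argmax())
```

In a trained model `W_fuse` would be learned; here it is random, so only the shapes and normalization are meaningful.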
A scene character recognition system based on a Gaussian constraint attention mechanism network comprises:
the characteristic extraction module comprises a multilayer residual error network and is responsible for extracting visual characteristics of the picture to be identified to obtain a two-dimensional characteristic diagram;
the encoder module comprises a unidirectional two-layer long short-term memory network (LSTM) and is responsible for converting the two-dimensional feature map into a one-dimensional feature sequence and inputting the one-dimensional feature sequence into the LSTM to extract global semantic information;
the decoder module comprises an attention-based unidirectional two-layer long short-term memory network (AM-LSTM), and is responsible for initializing the hidden state of the AM-LSTM at the first time step based on the global semantic information, calculating the original attention weight at each time step according to the hidden state of the AM-LSTM and the two-dimensional feature map, obtaining an original weighted feature vector by weighted summation, and fusing the original weighted feature vector and the corrected weighted feature vector to predict the characters of the picture to be recognized;
and the correction module based on Gaussian constraint is responsible for constructing a two-dimensional Gaussian distribution mask according to the hidden state of the AM-LSTM and the original weighted feature vector, multiplying the mask by the original attention weight to obtain a corrected attention weight, and obtaining a corrected weighted feature vector according to the weight.
Further, the feature extraction module includes a 31-layer residual network.
Further, the encoder module is responsible for performing maximum pooling on the two-dimensional feature map and converting the two-dimensional feature map into a one-dimensional feature sequence.
Further, the decoder module is responsible for updating the hidden state of the AM-LSTM at each time step starting from the second time step based on the decoding result of the previous time step.
Furthermore, the correction module based on the Gaussian constraint is responsible for concatenating the hidden state of the AM-LSTM with the original weighted feature vector, predicting a set of Gaussian distribution parameters through a fully connected layer, and constructing the two-dimensional Gaussian distribution mask from these parameters, wherein the parameters comprise a mean and a variance.
Further, the system is trained by optimizing a character recognition loss and an attention weight loss, wherein the character recognition loss is a cross-entropy loss between the predicted character probabilities and the recognition labels, and the attention weight loss is an L1 regression loss between the predicted character attention distribution and the character position labels.
Compared with existing methods, the invention provides a brand-new correction module based on a Gaussian constraint, which predicts a Gaussian mask to correct the original attention weight. Since the characters in a character recognition task usually have regular shapes, the model predicts a Gaussian mask as an explicit constraint to correct the original attention weight. The corrected attention weight concentrates more on the region corresponding to the character, which alleviates attention drift. Experiments show that the invention achieves superior performance on existing datasets, and the proposed module is flexible enough to be used in existing attention-based methods.
Drawings
Fig. 1 is a schematic structural diagram of a scene character recognition network based on a gaussian constraint attention mechanism network according to an embodiment.
Fig. 2 is a schematic diagram of a decoder according to an embodiment.
Fig. 3 is a comparison graph of visualization of recognition results of the present invention and the prior art method.
Detailed Description
In order to make the technical solution of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
The embodiment discloses a scene character recognition method and system based on a Gaussian constrained attention mechanism network (GCAN). As shown in FIG. 1, GCAN is a recognition model based on a two-dimensional attention mechanism that introduces a brand-new Gaussian constrained correction module (GCRM). The input of the GCRM is the feature weighted by the original, unconstrained attention weight, and its output is the feature weighted by the corrected attention weight. The two feature vectors are fused and then used to predict the character of the current decoding time step, where a time step refers to one step of the iterative decoding; decoding proceeds step by step, with each step predicting the corresponding character of the word. The system consists of four parts: a feature extraction module, an encoder module, a decoder module, and a correction module based on a Gaussian constraint.
The feature extraction module consists of a 31-layer residual network, which extracts rich visual features for the subsequent encoding and decoding processes.
The encoder module consists of a unidirectional two-layer long short-term memory network (LSTM). The two-dimensional feature map output by the feature extraction module is first max-pooled along the vertical direction to obtain a one-dimensional feature sequence. The one-dimensional feature sequence is then input into the LSTM to extract context information. The output of the encoder module is the hidden state of the LSTM at the last time step, which serves as the global semantic information that guides the decoder.
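The encoder's 2D-to-1D conversion can be sketched as follows; the feature-map shape is an illustrative assumption, not taken from the patent.

```python
import numpy as np

# Hypothetical H x W x C feature map from the feature extractor.
H, W, C = 8, 25, 512
feature_map = np.random.rand(H, W, C)

# Max-pool along the vertical direction, collapsing each column of the map
# into a single feature vector and yielding a sequence of W vectors.
feature_sequence = feature_map.max(axis=0)  # shape (W, C)

# An LSTM would then consume this sequence; its hidden state at the last
# time step serves as the global semantic information.
```

In the full system this pooling sits between the residual network and the encoder LSTM.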
The decoder module consists of an attention-based unidirectional two-layer long short-term memory network (AM-LSTM for short); its structure is shown in FIG. 2. The global semantic information output by the encoder is input at the first decoding time step, and at each subsequent time step the decoding result of the previous time step is input to update the hidden state of the AM-LSTM decoder. At each time step, the original attention weight is calculated from the hidden state of the AM-LSTM and the feature map output by the feature extraction module, and the feature map is summed with these weights to obtain the original weighted feature vector. Finally, the original weighted feature vector and the corrected weighted feature vector are fused to predict the character of the current time step.
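The per-step attention computation can be sketched as an additive (Bahdanau-style) 2D attention; the patent does not give exact formulas, so the projection sizes, the scoring function, and all weight matrices below are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative shapes: feature map (H, W, C) and decoder hidden size D.
H, W, C, D = 8, 25, 512, 256
feats = np.random.rand(H, W, C)    # feature map from the extractor
hidden = np.random.rand(D)         # decoder AM-LSTM hidden state
W_f = np.random.rand(C, D) * 0.01  # assumed projection of the features
W_h = np.random.rand(D, D) * 0.01  # assumed projection of the hidden state
v = np.random.rand(D) * 0.01       # assumed scoring vector

# Attention energies over every spatial position, normalized to weights.
scores = np.tanh(feats @ W_f + hidden @ W_h) @ v       # (H, W)
alpha = softmax(scores.ravel()).reshape(H, W)          # original attention weights

# Weighted sum over all positions gives the original weighted feature vector.
glimpse = (alpha[..., None] * feats).sum(axis=(0, 1))  # (C,)
```

The `alpha` map is what the Gaussian mask of the correction module later multiplies.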
At each decoding time step, the correction module based on the Gaussian constraint concatenates the hidden state of the AM-LSTM at the corresponding time step with the original weighted feature vector, predicts a set of Gaussian distribution parameters (mean and variance) through a fully connected layer, constructs a two-dimensional Gaussian distribution as a mask from these parameters, and finally multiplies the mask with the original attention weight to obtain the corrected attention weight, from which a new corrected weighted feature vector is computed. Compared with the original attention weight, the corrected attention is more concentrated, which alleviates attention drift.
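The mask construction and correction can be sketched as follows. The mean and variance values below are placeholders standing in for the output of the fully connected layer, and the renormalization of the corrected weights is an assumption; the patent only specifies the mask multiplication.

```python
import numpy as np

# Placeholder Gaussian parameters, assumed to come from a fully connected
# layer applied to [hidden state; original weighted feature vector].
H, W = 8, 25
mu_x, mu_y = 12.0, 4.0
var_x, var_y = 6.0, 3.0

# Axis-aligned 2D Gaussian mask over the feature-map grid, peaking at the mean.
ys, xs = np.mgrid[0:H, 0:W]
mask = np.exp(-((xs - mu_x) ** 2 / (2 * var_x) + (ys - mu_y) ** 2 / (2 * var_y)))

# Multiply the original attention weights by the mask to concentrate them.
alpha = np.random.rand(H, W)
alpha /= alpha.sum()        # stand-in for the original attention weights
refined = alpha * mask
refined /= refined.sum()    # renormalize (an assumption)
```

The corrected weighted feature vector is then the `refined`-weighted sum of the feature map, exactly as for the original glimpse.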
The whole process of recognizing the scene characters by adopting the method and the system comprises the following steps:
1. and (4) extracting visual features of the input picture through a feature extraction module to obtain a two-dimensional feature map.
2. The extracted visual features are passed through an encoder module to extract global semantic information, which is then input into a decoder module.
3. The decoder module adopts an attention mechanism: it calculates the original attention weight from its hidden state and the feature map output by the feature extraction module, and then obtains the original weighted feature vector by weighted summation.
4. The original weighted feature vector and the hidden state are input into the correction module based on the Gaussian constraint, which corrects the original attention weight predicted by the decoder with a two-dimensional Gaussian mask to obtain the corrected attention weight and then the corrected weighted feature vector.
5. And fusing the corrected weighted feature vector and the original weighted feature vector together to predict a corresponding character.
6. The whole model is trained by optimizing a character recognition loss and an attention weight loss. The character recognition loss is a cross-entropy loss between the predicted character probabilities and the recognition labels, and the attention weight loss is an L1 regression loss between the predicted character attention distribution and the character position labels.
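The two losses can be sketched as follows; the exact forms (reduction, weighting between the two terms) are assumptions, as the patent only names the loss types.

```python
import numpy as np

def cross_entropy(pred_probs, target_idx, eps=1e-12):
    # Recognition loss: negative log-likelihood of the ground-truth character.
    return -np.log(pred_probs[target_idx] + eps)

def l1_attention_loss(pred_attn, gt_attn):
    # Attention loss: L1 regression between the predicted attention map and
    # the Gaussian label built from the character position (mean reduction assumed).
    return np.abs(pred_attn - gt_attn).mean()

probs = np.array([0.1, 0.7, 0.2])     # predicted character distribution
target = 1                             # ground-truth character index
pred_attn = np.full((8, 25), 1 / 200)  # predicted attention map (uniform here)
gt_attn = np.full((8, 25), 1 / 200)    # Gaussian position label (identical here)

total = cross_entropy(probs, target) + l1_attention_loss(pred_attn, gt_attn)
```

With identical attention maps the L1 term vanishes and the total reduces to the cross-entropy term alone.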
Extensive experiments were conducted to evaluate the effectiveness of GCAN. GCAN is trained on two synthetic datasets, Syn90K and SynthText, and tested on several mainstream scene text datasets. IIIT5K has 3000 images, mostly high-quality horizontal images; SVT has 647 images, mostly horizontal text; SVT-Perspective (SVTP) has 645 images, in which most of the text is strongly distorted; ICDAR2013 (IC13) has 1015 images, mostly high-quality horizontal text; ICDAR2015 (IC15) has 1811 images, mostly arbitrarily shaped, low-quality text images; CUTE has 288 images, mostly high-quality curved text.
Table 1 compares the effects of the GCAN modules; the results show that the proposed GCRM brings an obvious improvement, while character-position supervision alone cannot obviously improve the existing method. Table 2 compares the invention with other mainstream methods on the test datasets; the invention achieves the best performance on multiple datasets, which proves its effectiveness. Fig. 3 visualizes the recognition results and attention weights of the conventional method and of the invention: for each recognized picture on the left, the first line on the right is the recognition result of the conventional method and the second line is the recognition result of the invention; the white rings in the picture mark the attended positions, and the letter below is the recognized character. It can be seen that the invention effectively resolves attention drift and obtains better recognition results.
Table 1 comparative experiments on the respective modules
(table image not reproduced)
TABLE 2 comparison of GCAN with other methods on individual datasets
(table image not reproduced)
The above embodiments are only intended to illustrate the technical solution of the present invention, but not to limit it, and a person skilled in the art can modify the technical solution of the present invention or substitute it with an equivalent, and the protection scope of the present invention is subject to the claims.

Claims (10)

1. A scene character recognition method based on a Gaussian constraint attention mechanism network is characterized by comprising the following steps:
extracting visual features of the picture to be identified to obtain a two-dimensional feature map;
converting the two-dimensional feature map into a one-dimensional feature sequence, and extracting global semantic information according to the one-dimensional feature sequence;
inputting global semantic information into a first time step to initialize a decoding hidden state, calculating an original attention weight according to the hidden state and a two-dimensional feature map in each time step, and obtaining an original weighted feature vector by weighting and summing the weights;
constructing a two-dimensional Gaussian distribution mask according to the hidden state and the original weighted feature vector, multiplying the mask by the original attention weight to obtain a corrected attention weight, and obtaining a corrected weighted feature vector according to the weight;
and fusing the original weighted feature vector and the corrected weighted feature vector together to predict characters of the picture to be recognized.
2. The method of claim 1, wherein, starting from the second time step, the decoding result of the previous time step is input at each time step to update the hidden state.
3. A scene character recognition system based on a Gaussian constraint attention mechanism network is characterized by comprising:
the characteristic extraction module comprises a multilayer residual error network and is responsible for extracting visual characteristics of the picture to be identified to obtain a two-dimensional characteristic diagram;
the encoder module comprises a unidirectional two-layer long short-term memory network (LSTM) and is responsible for converting the two-dimensional feature map into a one-dimensional feature sequence, inputting the one-dimensional feature sequence into the LSTM to extract global semantic information, and outputting the hidden state of the LSTM at the last time step;
the decoder module comprises an attention-based unidirectional two-layer long short-term memory network (AM-LSTM), and is responsible for updating the hidden state of the AM-LSTM at each time step based on the global semantic information, calculating the original attention weight at each time step according to the hidden state of the AM-LSTM and the two-dimensional feature map, obtaining an original weighted feature vector by weighted summation, and fusing the original weighted feature vector and the corrected weighted feature vector to predict the characters of the picture to be recognized;
and the correction module based on Gaussian constraint is responsible for constructing a two-dimensional Gaussian distribution mask according to the hidden state of the AM-LSTM and the original weighted feature vector, multiplying the mask by the original attention weight to obtain a corrected attention weight, and obtaining a corrected weighted feature vector according to the weight.
4. The system of claim 3, wherein the feature extraction module comprises a 31-layer residual network.
5. The system of claim 3, wherein the encoder module is responsible for max pooling the two-dimensional feature map into a one-dimensional feature sequence.
6. The system of claim 3, wherein the decoder module is responsible for inputting the global semantic information at the first decoding time step, and from the second time step onward updating the hidden state of the AM-LSTM at each time step according to the decoding result of the previous time step.
7. The system of claim 3, wherein the correction module based on the Gaussian constraint is responsible for concatenating the hidden state of the AM-LSTM with the original weighted feature vector and then predicting a set of Gaussian distribution parameters through a fully connected layer, and using the parameters to construct the two-dimensional Gaussian distribution mask.
8. The system of claim 7, wherein the parameters of the gaussian distribution include a mean and a variance.
9. The system of claim 3, wherein the system optimizes training by calculating a character recognition penalty and an attention weight penalty.
10. The system of claim 9, wherein the character recognition penalty is optimized by calculating a cross entropy penalty between the predicted character probability and the recognition token, and the attention weight penalty is optimized by calculating an L1 regression penalty between the predicted character attention distribution and the character position token.
CN202010767079.6A 2020-08-03 2020-08-03 Scene character recognition method and system based on Gaussian constraint attention mechanism network Active CN112070114B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010767079.6A CN112070114B (en) 2020-08-03 2020-08-03 Scene character recognition method and system based on Gaussian constraint attention mechanism network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010767079.6A CN112070114B (en) 2020-08-03 2020-08-03 Scene character recognition method and system based on Gaussian constraint attention mechanism network

Publications (2)

Publication Number Publication Date
CN112070114A true CN112070114A (en) 2020-12-11
CN112070114B CN112070114B (en) 2023-05-16

Family

ID=73657592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010767079.6A Active CN112070114B (en) 2020-08-03 2020-08-03 Scene character recognition method and system based on Gaussian constraint attention mechanism network

Country Status (1)

Country Link
CN (1) CN112070114B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541501A (en) * 2020-12-18 2021-03-23 北京中科研究院 Scene character recognition method based on visual language modeling network
CN113065561A (en) * 2021-03-15 2021-07-02 国网河北省电力有限公司 Scene text recognition method based on fine character segmentation
CN113221874A (en) * 2021-06-09 2021-08-06 上海交通大学 Character recognition system based on Gabor convolution and linear sparse attention
CN113591546A (en) * 2021-06-11 2021-11-02 中国科学院自动化研究所 Semantic enhanced scene text recognition method and device
CN114463675A (en) * 2022-01-11 2022-05-10 北京市农林科学院信息技术研究中心 Underwater fish group activity intensity identification method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5710833A (en) * 1995-04-20 1998-01-20 Massachusetts Institute Of Technology Detection, recognition and coding of complex objects using probabilistic eigenspace analysis
CN111428727A (en) * 2020-03-27 2020-07-17 华南理工大学 Natural scene text recognition method based on sequence transformation correction and attention mechanism

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5710833A (en) * 1995-04-20 1998-01-20 Massachusetts Institute Of Technology Detection, recognition and coding of complex objects using probabilistic eigenspace analysis
CN111428727A (en) * 2020-03-27 2020-07-17 华南理工大学 Natural scene text recognition method based on sequence transformation correction and attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
何鎏一 (HE Liuyi) et al., "基于深度学习的光照不均匀文本图像的识别***" [Deep-learning-based recognition *** for text images with uneven illumination], 《计算机应用与软件》 (Computer Applications and Software) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541501A (en) * 2020-12-18 2021-03-23 北京中科研究院 Scene character recognition method based on visual language modeling network
CN112541501B (en) * 2020-12-18 2021-09-07 北京中科研究院 Scene character recognition method based on visual language modeling network
CN113065561A (en) * 2021-03-15 2021-07-02 国网河北省电力有限公司 Scene text recognition method based on fine character segmentation
CN113221874A (en) * 2021-06-09 2021-08-06 上海交通大学 Character recognition system based on Gabor convolution and linear sparse attention
CN113591546A (en) * 2021-06-11 2021-11-02 中国科学院自动化研究所 Semantic enhanced scene text recognition method and device
CN113591546B (en) * 2021-06-11 2023-11-03 中国科学院自动化研究所 Semantic enhancement type scene text recognition method and device
CN114463675A (en) * 2022-01-11 2022-05-10 北京市农林科学院信息技术研究中心 Underwater fish group activity intensity identification method and device

Also Published As

Publication number Publication date
CN112070114B (en) 2023-05-16

Similar Documents

Publication Publication Date Title
CN111753827B (en) Scene text recognition method and system based on semantic enhancement encoder and decoder framework
CN112070114A (en) Scene character recognition method and system based on Gaussian constraint attention mechanism network
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN109948691B (en) Image description generation method and device based on depth residual error network and attention
CN110533044B (en) Domain adaptive image semantic segmentation method based on GAN
CN107239801A (en) Video attribute represents that learning method and video text describe automatic generation method
CN106570464A (en) Human face recognition method and device for quickly processing human face shading
CN111967471A (en) Scene text recognition method based on multi-scale features
CN114022882B (en) Text recognition model training method, text recognition device, text recognition equipment and medium
CN114998673B (en) Dam defect time sequence image description method based on local self-attention mechanism
CN113221879A (en) Text recognition and model training method, device, equipment and storage medium
CN113807340B (en) Attention mechanism-based irregular natural scene text recognition method
CN107463928A (en) Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM
CN109766918A (en) Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN115132201A (en) Lip language identification method, computer device and storage medium
CN116206314A (en) Model training method, formula identification method, device, medium and equipment
CN111144407A (en) Target detection method, system, device and readable storage medium
CN117058266B (en) Handwriting word generation method based on skeleton and outline
Li Research on methods of english text detection and recognition based on neural network detection model
CN111814508A (en) Character recognition method, system and equipment
CN111259197A (en) Video description generation method based on pre-coding semantic features
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
CN113095319B (en) Multidirectional scene character detection method and device based on full convolution angular point correction network
CN113722536B (en) Video description method based on bilinear adaptive feature interaction and target perception
CN114495076A (en) Character and image recognition method with multiple reading directions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant