AU2021104479A4 - Text recognition method and system based on decoupled attention mechanism - Google Patents
Abstract
The present invention discloses a text recognition method and system based on a decoupled attention mechanism, mainly comprising a feature encoding module, a convolutional alignment module and a text decoding module. The feature encoding module extracts visual features from the input image with a deep convolutional neural network. The convolutional alignment module replaces the traditional score-based recurrent alignment module: it takes multiscale visual features from the feature encoding module as input and generates attention maps channel by channel with a fully convolutional neural network. The text decoding module obtains the final prediction result by combining the feature map and the attention map through a gated recurrent unit. The method is simple to implement and offers high recognition accuracy, effectiveness, flexibility and robustness. It performs strongly in text recognition fields such as scene text recognition and handwritten text recognition, and has good practical application value.
FIGURES
[Figure 1: Feature encoding module, Convolutional alignment module, Text decoding module]
Description
Text recognition method and system based on decoupled attention mechanism
The present invention belongs to the technical field of pattern recognition and
artificial intelligence, and particularly relates to a method for accurate image
recognition associated with deep neural networks.
In recent years, text recognition has attracted broad research interest. Thanks to deep learning and research on sequence problems, many text recognition techniques have achieved remarkable success. Connectionist temporal classification (CTC) techniques and attention mechanism techniques are two popular approaches to sequence problems; among them, attention mechanism techniques have shown more outstanding performance and have been widely studied in recent years.
Attention mechanism techniques were first proposed in solving machine
translation problems and gradually used to deal with scene text recognition problems.
Since then, attention mechanism techniques have dominated a part of the development
in the field of text recognition. Attention mechanism techniques in text recognition are
used to align and recognize characters. In previous work, the alignment operations of
attention mechanism techniques were always combined with decoding operations.
Specifically, the alignment operations of traditional attention mechanism techniques utilize two types of information. One is the feature map, the visual information obtained by encoding the image with the encoder. The other is the historical decoding information, which can be the hidden state during recurrence or the embedding vector of the previous decoding result. The main idea behind the attention mechanism technique is matching: given a portion of features in the feature map, it computes an attention score by measuring how well this portion of features matches the historical decoding information.
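The matching step just described can be sketched numerically. The snippet below is a minimal illustration, not the patent's implementation: attention weights are a softmax over dot-product scores between each feature-map position and the historical decoding vector, which is exactly the coupling the invention later removes.

```python
import numpy as np

def conventional_attention_step(feature_map, history):
    """Score-based alignment used by traditional attention decoders.

    feature_map : (N, C) array, N spatial positions with C channels.
    history     : (C,) array, hidden state / embedding from the
                  previous decoding step.
    Returns softmax attention weights over the N positions.
    """
    scores = feature_map @ history      # how well each position matches
    scores -= scores.max()              # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

# Positions 0 and 2 match the history equally well, so the attention
# mass is split between them: the "similar substring" ambiguity.
feature_map = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
history = np.array([1.0, 0.0])
w = conventional_attention_step(feature_map, history)
```

Because two positions score identically against the same history, the weights tie between them, which is the mechanism behind the substring-jumping failure described next.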
Traditional attention mechanism techniques often face serious alignment problems because coupling the alignment and decoding operations inevitably accumulates and propagates errors. Match-based alignment is easily disturbed by the decoding result: for example, when a string contains two similar substrings, the historical decoding information tends to make the attention jump from one substring to the other. This is why the literature observes that attention mechanism techniques have difficulty aligning long sequences: the longer the sequence, the more likely it is to contain similar substrings. This encourages us to decouple the alignment operation from the historical decoding information and thereby mitigate this negative effect.
The object of the present invention is to provide a method and system for text recognition based on a decoupled attention mechanism, which decouples the conventional attention mechanism module into an alignment module and a text decoding module. By aligning first and then recognizing, it avoids the accumulation and propagation of decoding errors and solves the existing alignment problem.
To achieve the above purpose, the present invention provides the following
solutions:
A text recognition method based on a decoupled attention mechanism,
comprising the following steps:
S1, extracting image features from the text image and encoding them to obtain a feature map.
S2, aligning the feature map to obtain a target image, constructing a deep
convolutional neural network model, processing the target image to obtain an
attention map based on the deep convolutional neural network model and conducting
training.
S3, accurate text recognition of the feature map and the attention map based on
the deep convolutional neural network recognition model.
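As a high-level sketch, the three steps compose as below; the function names are illustrative placeholders, not identifiers from the patent.

```python
def recognize(text_image, encoder, aligner, decoder):
    """Decoupled-attention pipeline: the aligner sees only the
    feature map (S2), never the decoder's history, so decoding
    errors cannot feed back into alignment."""
    feature_map = encoder(text_image)            # S1: encode image
    attention_maps = aligner(feature_map)        # S2: align from features only
    return decoder(feature_map, attention_maps)  # S3: recognize

# Toy stand-ins just to show the data flow between the three modules.
result = recognize(3, lambda x: x + 1, lambda f: f * 2, lambda f, a: (f, a))
```

The key structural point is that `aligner` takes only `feature_map` as input, in contrast to the conventional coupled design.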
Preferably, the text image is a scene text image and/or a handwritten text image.
Preferably, the scene text image and/or handwritten text image is characterized in that:
scene text image features include the scene text training data set and the scene text real evaluation data set, which cover many different font styles, light and shadow variations and resolution variations;
handwritten text image features include the handwritten text real training data set and the handwritten text real evaluation data set, which contain different writing styles.
Preferably, the text portion of the scene text image training data set is complete and occupies more than two-thirds of the image area, contains a variety of different font styles, and allows for light and shadow variations as well as resolution variations.
Preferably, the scene text real evaluation data set is captured by cell phones or special hardware camera equipment. During shooting, the text in the normalized scene text image should occupy more than two-thirds of the image area; skew and blur are allowed, and the captured scene text images should cover a wide range of application scenes with different font styles.
Preferably, the real training data of the handwritten text and the real evaluation data of the handwritten text are written and collected by different people; the training data and the evaluation data are independent of each other.
Preferably, the text image alignment processing method is:
Stretching and transforming the scene text training data set and the scene text
real evaluation data set image data to a uniform size.
The handwritten text real training data set and the handwritten text real
evaluation data set are downscaled by keeping the original image scale, and then the
surrounding area is filled until the uniform size.
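The two preprocessing routes above can be sketched as follows. This is a hedged nearest-neighbour illustration, not the patent's actual resizing code; function names and the rounding choices are assumptions.

```python
import numpy as np

def stretch_resize(img, out_h, out_w):
    """Stretch a grayscale image to a uniform size (scene text route),
    using nearest-neighbour index mapping."""
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

def pad_resize(img, out_h, out_w, fill=0):
    """Scale keeping the original aspect ratio, then fill the
    surrounding area to the uniform size (handwritten text route)."""
    h, w = img.shape
    scale = min(out_h / h, out_w / w)
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))
    canvas = np.full((out_h, out_w), fill, dtype=img.dtype)
    canvas[:new_h, :new_w] = stretch_resize(img, new_h, new_w)
    return canvas
```

Stretching distorts the aspect ratio but fills the whole canvas; padding preserves stroke proportions, which matters more for handwriting.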
Preferably, the deep convolutional neural network is constructed as follows:
Extracting multi-scale visual features based on feature encoding.
Constructing the deep convolutional neural network model by convolution and deconvolution with a fully convolutional neural network.
In the deconvolution phase, each output feature is summed with the corresponding feature map from the convolution phase.
The convolution process downsamples and the deconvolution process upsamples; all convolution and deconvolution processes except the last are followed by a nonlinear layer using the ReLU function.
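The convolution/deconvolution structure with skip sums can be illustrated at the shape level. In this sketch, average pooling stands in for a stride-2 convolution and nearest-neighbour repetition stands in for deconvolution; the real network learns these operators.

```python
import numpy as np

def downsample(x):
    """Stride-2 average pooling along the width (stand-in for a
    stride-2 convolution)."""
    return 0.5 * (x[:, 0::2] + x[:, 1::2])

def upsample(x):
    """Nearest-neighbour 2x upsampling along the width (stand-in
    for a deconvolution)."""
    return np.repeat(x, 2, axis=1)

def conv_align(features):
    """Convolutional alignment sketch: downsample twice, then
    upsample twice, summing each deconvolution output with the
    matching convolution-stage feature map (the skip sums)."""
    d1 = downsample(features)      # convolution stage
    d2 = downsample(d1)
    u1 = upsample(d2) + d1         # deconvolution stage + skip sum
    u2 = upsample(u1) + features
    return u2
```

The skip sums let the upsampling path recover fine spatial detail lost in downsampling, which is what makes per-character attention maps sharp.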
Preferably, the network structure of the deep convolutional neural network model
is an input layer, a convolutional layer, and a residual layer.
Preferably, the residual layer is divided into a first convolutional layer, a first
batch normalization layer, a first nonlinear layer, a second convolutional layer, a
second batch normalization layer, a downsampling layer, and a second nonlinear
layer.
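A minimal numerical sketch of that residual layer follows. Channel-mixing matrix products stand in for the 1*1/3*3 convolutions, and the batch normalization has no learned scale or shift; all names and shapes are illustrative assumptions.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Per-channel normalisation (inference-style, no learned affine)."""
    return (x - x.mean(axis=0, keepdims=True)) / np.sqrt(
        x.var(axis=0, keepdims=True) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def residual_layer(x, w1, w2, w_down):
    """conv1 -> BN1 -> ReLU -> conv2 -> BN2, plus a downsampling
    shortcut (conv + BN), joined before the second ReLU.
    x: (N, C_in); w1/w2/w_down: channel-mixing matrices standing
    in for the convolutions."""
    out = batch_norm(x @ w1)            # first conv + first BN
    out = relu(out)                     # first nonlinear layer
    out = batch_norm(out @ w2)          # second conv + second BN
    shortcut = batch_norm(x @ w_down)   # downsampling layer
    return relu(out + shortcut)         # second nonlinear layer

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))
out = residual_layer(x,
                     w1=rng.normal(size=(2, 3)),
                     w2=rng.normal(size=(3, 3)),
                     w_down=rng.normal(size=(2, 3)))
```

The shortcut uses its own convolution plus batch normalization so that its channel count matches the main branch before the sum.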
Preferably, a back propagation algorithm is used in the training of the deep
convolutional neural network model in S2 to update all parameters of the network
model by calculating the transfer gradient from the last layer, layer by layer.
Preferably, the deep convolutional neural network model is trained in a supervised manner: a generic deep network recognition model is trained using the text image data and the corresponding annotation information.
Preferably, the input image of the deep convolutional neural network model is a
handwritten text image and/or a scene text image, and the output is a sequence of
characters in the text image and/or the scene text image.
Preferably, the parameters of the deep convolutional neural network model
training are set as follows:
Deep convolutional neural network iteration count of 1,000,000.
Deep convolutional neural network optimizer is Adadelta.
The learning rate of the deep convolutional neural network is 1.0.
Deep convolutional neural network learning rate update strategy: reduced to
one-tenth of the original at 50% and 75% of the total number of iterations,
respectively.
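The stated schedule (base learning rate 1.0, divided by 10 at 50% and again at 75% of the 1,000,000 iterations) can be written directly:

```python
def learning_rate(step, total_steps=1_000_000, base_lr=1.0):
    """Learning-rate schedule from the text: reduced to one-tenth
    at 50% and again at 75% of the total iteration count."""
    lr = base_lr
    if step >= total_steps // 2:
        lr /= 10          # 1.0 -> 0.1 at 500,000 iterations
    if step >= total_steps * 3 // 4:
        lr /= 10          # 0.1 -> 0.01 at 750,000 iterations
    return lr
```

Adadelta normally adapts its own effective step size; here the scheduled value is the global multiplier applied on top of it.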
Preferably, the specific method of S3 text recognition is:
$F_{x,y}$ represents the feature map and $a_{t,x,y}$ represents the attention map at moment $t$ obtained by convolutional alignment; the semantic vector $c_t$ is calculated by equation (1):

$c_t = \sum_{x=1}^{W} \sum_{y=1}^{H} a_{t,x,y} F_{x,y}$    (1)

where $W$ and $H$ are the width and height of the feature map. At moment $t$, the output $y_t$ is:

$y_t = W h_t + b$    (2)

where $W$ and $b$ are parameters and $h_t$ represents the hidden state of the gated recurrent unit at moment $t$.

The calculation of $h_t$ is expressed as:

$h_t = \mathrm{GRU}((e_{t-1}, c_t), h_{t-1})$    (3)

where $e_{t-1}$ represents the encoding vector of the previous output $y_{t-1}$; the final loss function is calculated as:

$\mathrm{Loss} = -\sum_{t=1}^{T} \log P(g_t \mid I, \theta)$    (4)

where $\theta$ represents all learnable parameters of the deep neural network model and $g_t$ represents the sample label value at moment $t$.
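Equations (1)-(3) can be exercised with a small NumPy sketch. The GRU cell below is a generic minimal implementation with biases omitted, not the patent's exact parameterization, and all tensor sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Wr, Wh):
    """Minimal GRU cell; each weight acts on the concatenated input
    and hidden state (biases omitted for brevity)."""
    xh = np.concatenate([x, h])
    z = sigmoid(Wz @ xh)                                   # update gate
    r = sigmoid(Wr @ xh)                                   # reset gate
    return (1 - z) * h + z * np.tanh(Wh @ np.concatenate([x, r * h]))

def decode_step(A_t, F, e_prev, h_prev, Wz, Wr, Wh, W_out, b_out):
    """One decoding step:
    c_t = sum over x,y of a_{t,x,y} F_{x,y}   (1)
    h_t = GRU((e_{t-1}, c_t), h_{t-1})        (3)
    y_t = W h_t + b                           (2)
    A_t: (H, W) attention map; F: (H, W, C) feature map."""
    c_t = np.einsum('xy,xyc->c', A_t, F)   # context vector from attention
    h_t = gru_cell(np.concatenate([e_prev, c_t]), h_prev, Wz, Wr, Wh)
    return W_out @ h_t + b_out, h_t

rng = np.random.default_rng(0)
A_t = rng.random((2, 3))               # attention map, H=2, W=3
F = rng.normal(size=(2, 3, 4))         # feature map, C=4 channels
y_t, h_t = decode_step(A_t, F,
                       e_prev=rng.normal(size=2),   # embedding of y_{t-1}
                       h_prev=np.zeros(5),          # hidden size 5
                       Wz=rng.normal(size=(5, 11)), # 11 = 2 + 4 + 5
                       Wr=rng.normal(size=(5, 11)),
                       Wh=rng.normal(size=(5, 11)),
                       W_out=rng.normal(size=(6, 5)),  # 6 character classes
                       b_out=np.zeros(6))
```

Note that the attention map $A_t$ arrives precomputed from the alignment module; the decoder only consumes it, which is the decoupling.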
A system for text recognition based on a decoupled attention mechanism,
comprising a feature encoding module, a convolutional alignment module and a text
decoding module.
Feature encoding module extracts visual features from text images based on deep
convolutional neural networks.
The convolutional alignment module extracts multiscale visual features from the
feature encoding module and generates attention maps channel-by-channel via deep
convolutional neural networks.
The text decoding module obtains the final prediction result by combining the feature map and attention map through the gated recurrent unit.
Preferably, the network structure of the deep convolutional neural network unit is
an input layer unit, a convolutional layer unit, and a residual layer unit.
Preferably, the residual layer unit is divided into a first convolutional layer unit, a
first batch normalization layer unit, a first nonlinear layer unit, a second convolutional
layer unit, a second batch normalization layer unit, a downsampling layer unit, and a
second nonlinear layer unit.
Preferably, the nonlinear layer units within the residual layer unit all use the ReLU activation function.
Preferably, the downsampling layer unit is implemented through the
convolutional layer unit and the batch normalization layer unit.
Technical effects of the present invention:
(1) The present invention decouples the conventional attention mechanism module. Compared with traditional attention mechanism techniques, the present invention does not require information fed back from the decoding stage to perform alignment, avoiding the accumulation and propagation of decoding errors and thus enabling higher recognition accuracy.
(2) The invention is simple to use, it can be easily embedded into other models,
and it is also flexible enough to freely convert between one-dimensional text and
two-dimensional text.
(3) A back-propagation algorithm is used to automatically adjust the
convolutional kernel parameters, resulting in a more robust filter that can adapt to a
variety of complex environments.
(4) Compared with the manual method, the present invention can automatically
complete the recognition of scene text and handwritten text, which can save
manpower and material resources.
(5) The present invention can provide more reliable alignment performance in
the attention mechanism by decoupling the attention algorithm, especially when faced
with long text, the present invention has more robust characteristics compared with
the traditional attention mechanism.
In order to illustrate more clearly the technical solutions in the embodiments of
the invention or in the prior art, the following is a brief description of the
accompanying figures that need to be used in the embodiments. Obviously, the figures
in the following description are only some embodiments of the present invention, and
other figures may be obtained from these figures for those of ordinary skill in the art
without creative labor.
Figure 1 is the structural block diagram of the deep convolutional network
recognition model of the present invention.
Figure 2 is a flow chart of the text recognition method based on the decoupled
attention mechanism of the present invention.
The technical solutions in the embodiments of the present invention will be
clearly and completely described below in conjunction with the accompanying figures
in the embodiments of the present invention. Obviously, the described embodiments
are only a part of the embodiments in the present invention, and not all of them. Based
on the embodiments in the present invention, all other embodiments obtained by a
person of ordinary skill in the art without making creative labor belong to the scope of
protection of the present invention.
Example 1: A text recognition system based on a decoupled attention mechanism,
as shown in Figure 1, comprising a feature encoding module, a convolutional
alignment module and a text decoding module.
Feature encoding module extracts visual features from text images based on deep convolutional neural networks.
The convolutional alignment module extracts multiscale visual features from the
feature encoding module and generates attention maps channel-by-channel via deep
convolutional neural networks.
The text decoding module obtains the final prediction result by combining the feature map and attention map through the gated recurrent unit.
As shown in Figure 2, the specific steps of the text recognition method based on
the decoupled attention mechanism are:
In the first step, the scene text image and/or handwritten text image is encoded
by feature extraction through the feature encoding module to form a feature map.
Scene text image features include the scene text training data set and the scene text real evaluation data set, which cover many different font styles, light and shadow variations and resolution variations.
Handwritten text image features include the handwritten text real training data set and the handwritten text real evaluation data set, which contain different writing styles.
The scene text image training data have a complete text portion occupying more than two-thirds of the image, contain a variety of different font styles, and allow for some degree of light and shadow variation and resolution variation.
The real evaluation data set of scene text is obtained from camera equipment such as cell phones and special hardware. During shooting, the text in the normalized scene text image should occupy more than two-thirds of the image area; a certain degree of skew and blur is allowed, and the captured scene text images should cover a wide range of application scenarios with different font styles.
The real training data of handwritten text and the real evaluation data of handwritten text are written and collected by different people; the training data and evaluation data are independent of each other.
In the second step, the scene text images and/or handwritten text images are
convolutionally aligned through the convolutional alignment module, the structure of
which is shown in Table 1.
Stretching and transforming the image data of the scene text training data set and
the scene text real evaluation data set to a uniform size.
The handwritten text real training data set and the handwritten text real evaluation data set are downscaled while keeping the original image scale, and the surrounding area is then filled to the uniform size.
Table 1

Network layer | Specific operation | Down/up sampling ratio
---|---|---
Convolutional layer | [Convolution kernel 3*3, number of channels 64] * 5 | 2*1
Deconvolution layer (upsampling) | [Convolution kernel 3*3, number of channels 64] * 4 | 2*1
Deconvolution layer (upsampling) | Convolution kernel 3*3, number of channels maxT | 2*1
Nonlinear layer | |
The deep convolutional neural network was constructed as shown in Table 2 and trained. The construction is as follows: visual features are extracted from the scene text images and/or handwritten text images with a convolutional neural network; multi-scale visual features from the feature encoding module are taken as input and processed by convolution and deconvolution through a fully convolutional neural network, where each output feature is summed with the corresponding feature map from the convolution phase. The convolution process downsamples and the deconvolution process upsamples; all convolution and deconvolution processes except the last are followed by a nonlinear layer using the ReLU function. The number of output channels of the last deconvolution layer is maxT, determined by the text type: 25 for scene text and 150 for handwritten text. The final nonlinear layer uses a Sigmoid function to keep the output attention map between 0 and 1. A back-propagation algorithm is used to train the deep neural network model, updating all parameters of the network model by calculating the transfer gradient from the last layer, layer by layer.
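The maxT-channel Sigmoid head reduces to a one-liner; the map dimensions used below (8*32) are illustrative assumptions.

```python
import numpy as np

def attention_maps(head_out):
    """Apply the final Sigmoid so every attention map lies in (0, 1);
    head_out has shape (maxT, H, W), one channel per decoding step."""
    return 1.0 / (1.0 + np.exp(-head_out))

maps = attention_maps(np.zeros((25, 8, 32)))  # maxT = 25 for scene text
```

Each of the maxT channels is the alignment for one output character position, so a longer maximum text length simply means more output channels.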
Table 2

Network layer | Specific operations | Scene text, one-dimensional | Scene text, two-dimensional | Handwritten text
---|---|---|---|---
(columns give the downsampling ratio per network layer)
Input layer | | | |
Convolutional layer | Convolution kernel 3*3, number of channels 32 | 2*1 | 1*1 | 1*1
Residual layers | [Convolution kernel 1*1, channels 32; Convolution kernel 3*3, channels 32] * 3 | 2*2 | 2*2 | 2*2
Residual layers | [Convolution kernel 1*1, channels 64; Convolution kernel 3*3, channels 64] * 4 | 2*2 | 2*2 | 1*1
Residual layers | [Convolution kernel 1*1, channels 128; Convolution kernel 3*3, channels 128] * 6 | 2*1 | 2*1 | 2*2
Residual layers | [Convolution kernel 1*1, channels 256; Convolution kernel 3*3, channels 256] * 6 | 2*2 | 2*1 | 1*1
Residual layers | [Convolution kernel 1*1, channels 512; Convolution kernel 3*3, channels 512] * 3 | 2*2 | 2*1 | 1*1
Table 3

Network layer (residual layer) | Specific operations
---|---
Convolutional layer | Convolution kernel 1*1, step size 1*1
Batch normalization layer |
Nonlinear layer |
Convolutional layer | Convolution kernel 3*3, step size 1*1, padding 1*1
Batch normalization layer |
Downsampling layer | Convolution kernel 1*1 + batch normalization layer
Nonlinear layer |
As shown in Table 3, the residual layer is divided into a first convolutional layer, a first batch normalization layer, a first nonlinear layer, a second convolutional layer, a second batch normalization layer, a downsampling layer, and a second nonlinear layer.
ReLU activation function is used for all nonlinear layers within the residual
layer.
Downsampling layer implemented by convolutional and batch normalization
layers.
The deep neural network model is trained in a supervised manner: a generic deep network recognition model is trained using text image data and the corresponding annotation information.
The input images to the deep neural network model are handwritten text images
and/or scene text images, and the output is a sequence of characters in the text images
and/or scene text images.
The parameters of the deep neural network model training are set as follows:
The number of iterations of the deep neural network is 1,000,000.
The deep neural network optimizer is Adadelta.
The learning rate of the deep neural network is 1.0.
Deep neural network learning rate update strategy: reduced to one-tenth of the
original at 50% and 75% of the total number of iterations, respectively.
In the third step, text recognition is performed by the feature map and attention
map through the text recognition module, and the images are accurately recognized by
inputting the feature map and attention map, based on a deep network recognition
model with decoupled attention mechanism.
The specific methods for performing text recognition are:
$F_{x,y}$ represents the feature map and $a_{t,x,y}$ represents the attention map at moment $t$ obtained by convolutional alignment; the semantic vector $c_t$ is calculated by equation (1):

$c_t = \sum_{x=1}^{W} \sum_{y=1}^{H} a_{t,x,y} F_{x,y}$    (1)

where $W$ and $H$ are the width and height of the feature map. At moment $t$, the output $y_t$ is:

$y_t = W h_t + b$    (2)

where $W$ and $b$ are parameters and $h_t$ represents the hidden state of the gated recurrent unit at moment $t$.

The calculation of $h_t$ is expressed as:

$h_t = \mathrm{GRU}((e_{t-1}, c_t), h_{t-1})$    (3)

where $e_{t-1}$ represents the encoding vector of the previous output $y_{t-1}$; the final loss function is calculated as:

$\mathrm{Loss} = -\sum_{t=1}^{T} \log P(g_t \mid I, \theta)$    (4)

where $\theta$ represents all learnable parameters of the deep neural network model and $g_t$ represents the sample label value at moment $t$.
Given an input text image, the deep network recognition model based on the decoupled attention mechanism accurately recognizes the image and obtains the words in the text image.
The above described embodiments are only a description of the preferred way of
the present invention, not a limitation on the scope of the present invention. Without departing from the spirit of the design of the present invention, all kinds of deformations and improvements made to the technical solutions of the present invention by a person of ordinary skill in the art shall belong to the scope of protection determined in the claims of the present invention.
Claims (10)
1. A text recognition method based on a decoupled attention mechanism,
characterized in that it comprises the following steps:
S1, extracting image features based on the text image and encoding to obtain a
feature map;
S2, aligning the feature map to obtain a target image, constructing a deep
convolutional neural network model, processing the target image to obtain an
attention map based on the deep convolutional neural network model and conducting
training;
S3, accurate text recognition of the feature map and the attention map based on
the deep convolutional neural network recognition model.
2. A text recognition method based on a decoupled attention mechanism
according to claim 1, characterized in that:
the text image is a scene text image and/or a handwritten text image;
the scene text image and/or handwritten text image is characterized in:
scene text image features including scene text training data set and scene text
real evaluation data set, scene text training data set and scene text real evaluation data
set covering many different font styles, light and shadow variations and resolution
variations;
handwritten text image features including handwritten text real training data set
and handwritten text real evaluation data set, handwritten text real training data set
and handwritten text real evaluation data set containing different writing styles.
3. A text recognition method based on a decoupled attention mechanism
according to claim 2, characterized in that:
the text portion of the scene text training data set is complete and occupies more
than two-thirds of the image area, contains a variety of different font styles, and
allows for coverage of light and shadow variations as well as resolution variations;
the scene text real evaluation data set is obtained by cell phones or special hardware camera equipment; during the shooting process, the text in the normalized scene text image should occupy more than two-thirds of the image area; skew and blur are allowed, and the captured scene text images should cover a wide range of application scenes with different font styles;
the real training data of the handwritten text and the real evaluation data of the handwritten text are written and collected by different people; the training data and the evaluation data are independent of each other.
4. A text recognition method based on a decoupled attention mechanism
according to claim 2, characterized in that:
the text image alignment processing method is:
stretching and transforming the scene text training data set and the scene text real
evaluation data set image data to a uniform size;
the handwritten text real training data set and the handwritten text real evaluation
data set are downscaled by keeping the original image scale, and then the surrounding
area is filled until the uniform size.
5. A text recognition method based on a decoupled attention mechanism according to claim 1, characterized in that: in S2, the deep convolutional neural network construction method is:
extracting multi-scale visual features based on feature encoding;
constructing the deep convolutional neural network model by convolution and deconvolution with a fully convolutional neural network;
in the deconvolution phase, each output feature is summed with the corresponding feature map from the convolution phase;
the convolution process downsamples, the deconvolution process upsamples, and all convolution and deconvolution processes except the last are followed by a nonlinear layer using the ReLU function;
preferably, the network structure of the deep convolutional neural network model is an input layer, a convolutional layer, and a residual layer;
preferably, the residual layer is divided into a first convolutional layer, a first batch normalization layer, a first nonlinear layer, a second convolutional layer, a second batch normalization layer, a downsampling layer, and a second nonlinear layer.
6. A text recognition method based on a decoupled attention mechanism
according to claim 1, characterized in that:
a back propagation algorithm is used in the training of the deep convolutional
neural network model in S2 to update all parameters of the network model by
calculating the transfer gradient from the last layer, layer by layer;
the deep convolutional neural network model is trained in a supervised manner: a generic deep network recognition model is trained using the text image data and the corresponding annotation information; the input image of the deep convolutional neural network model is a handwritten text image and/or a scene text image, and the output is a sequence of characters in the text image and/or the scene text image.
7. A text recognition method based on a decoupled attention mechanism
according to claim 6, characterized in that:
the parameters of the deep convolutional neural network model training are set as
follows:
the deep convolutional neural network iteration count of 1,000,000;
the deep convolutional neural network optimizer is Adadelta;
the learning rate of the deep convolutional neural network is 1.0;
the deep convolutional neural network learning rate update strategy: reduced to
one-tenth of the original at 50% and 75% of the total number of iterations,
respectively.
8. A text recognition method based on a decoupled attention mechanism
according to claim 1, characterized in that:
the specific method of S3 text recognition is:
F_{x,y} represents the feature map, α_{t,x,y} represents the attention map at
moment t obtained by convolutional alignment, and the semantic vector c_t is
calculated by equation (1):

c_t = Σ_{x=1}^{W} Σ_{y=1}^{H} α_{t,x,y} · F_{x,y}    (1)

where W and H are the width and height of the feature map; at moment t, the
output y_t is:

y_t = W·h_t + b    (2)

where W and b are parameters and h_t represents the hidden-layer state of the
gated recurrent unit at moment t; the calculation of h_t is expressed as:

h_t = GRU((e_{t-1}, c_t), h_{t-1})    (3)

where e_{t-1} represents the encoding vector of the previous output y_{t-1};
the final loss function is calculated as:

Loss = −Σ_{t=1}^{T} log P(g_t | I, θ)    (4)

where θ represents all learnable parameters of the deep neural network model
and g_t represents the sample label value at moment t.
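The decoding equations of claim 8 can be checked numerically. The NumPy sketch below uses illustrative array sizes; the hidden state h_t is a stand-in for the GRU state of equation (3), not a real recurrent computation:

```python
import numpy as np

rng = np.random.default_rng(0)
W_dim, H_dim, C = 4, 2, 8                 # feature-map width, height, channels
F = rng.normal(size=(W_dim, H_dim, C))    # feature map F_{x,y}
alpha = rng.random((W_dim, H_dim))
alpha /= alpha.sum()                      # attention map normalized over x, y

# eq. (1): semantic vector c_t = sum over x, y of alpha_{t,x,y} * F_{x,y}
c_t = np.einsum('whc,wh->c', F, alpha)

# eq. (2): y_t = W h_t + b, with a placeholder hidden state h_t
K = 5                                     # number of character classes
W_p = rng.normal(size=(K, C))
b = rng.normal(size=K)
h_t = c_t                                 # stand-in for the GRU hidden state
y_t = W_p @ h_t + b

# eq. (4): one term of the loss, -log P(g_t | I, theta), via a softmax
p = np.exp(y_t - y_t.max())
p /= p.sum()
g_t = 2                                   # illustrative ground-truth label
loss_t = -np.log(p[g_t])
```

The semantic vector collapses the spatial dimensions of the feature map into a single C-dimensional vector per decoding step, which is what decouples alignment from decoding.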
9. A system for text recognition based on a decoupled attention mechanism,
characterized in that it includes a feature encoding module, a convolutional alignment
module, and a text decoding module;
the feature encoding module extracts visual features from text images based on
deep convolutional neural networks;
the convolutional alignment module extracts multiscale visual features from the
feature encoding module and generates attention maps channel by channel via
deep convolutional neural networks;
the text decoding module obtains the final prediction result by combining the
feature map and the attention map through the gated recurrent unit.
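A minimal sketch of how the three claimed modules could be wired together, assuming PyTorch; every layer size, class name, and parameter below is an illustrative assumption, not part of the claim:

```python
import torch
import torch.nn as nn

class DecoupledAttentionRecognizer(nn.Module):
    """Toy wiring of the claimed feature encoding, convolutional
    alignment, and text decoding modules."""
    def __init__(self, channels=32, max_steps=8, num_classes=37):
        super().__init__()
        # feature encoding module: CNN visual features
        self.encoder = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        # convolutional alignment module: one attention map per decoding
        # step, generated channel by channel and normalized spatially
        self.align = nn.Conv2d(channels, max_steps, 1)
        # text decoding module: GRU cell plus output projection
        self.gru = nn.GRUCell(channels + num_classes, channels)
        self.fc = nn.Linear(channels, num_classes)
        self.num_classes = num_classes

    def forward(self, img):
        B = img.size(0)
        F = self.encoder(img)                        # (B, C, H, W)
        A = self.align(F).flatten(2).softmax(-1)     # (B, T, H*W)
        Ff = F.flatten(2)                            # (B, C, H*W)
        h = F.new_zeros(B, F.size(1))
        e_prev = F.new_zeros(B, self.num_classes)
        outputs = []
        for t in range(A.size(1)):
            c_t = (Ff * A[:, t:t + 1]).sum(-1)       # semantic vector, eq. (1)
            h = self.gru(torch.cat([e_prev, c_t], -1), h)  # eq. (3)
            y_t = self.fc(h)                         # output, eq. (2)
            e_prev = y_t.softmax(-1)
            outputs.append(y_t)
        return torch.stack(outputs, 1)               # (B, T, num_classes)
```

The alignment branch depends only on the feature map, so attention generation is decoupled from the decoder's history, which is the core of the claimed mechanism.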
10. A system for text recognition based on a decoupled attention mechanism of
claim 9, characterized in that:
the network structure of the deep convolutional neural network unit is an input
layer unit, a convolutional layer unit, and a residual layer unit;
the residual layer unit is divided into a first convolutional layer unit, a first batch
normalization layer unit, a first nonlinear layer unit, a second convolutional layer unit,
a second batch normalization layer unit, a downsampling layer unit, and a second
nonlinear layer unit;
the nonlinear layer units within the residual layer unit all use the ReLU
activation function;
the downsampling layer unit is implemented through the convolutional layer unit
and the batch normalization layer unit.
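The residual layer unit of claim 10 can be sketched in PyTorch as follows; channel counts, kernel sizes, and stride are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Sketch of the claimed residual layer unit: first conv -> first BN ->
    first ReLU -> second conv -> second BN, added to a downsampling branch
    (itself a conv plus BN, per the claim), then the second ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # downsampling layer unit implemented through a convolutional
        # layer unit and a batch normalization layer unit
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu2 = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu1(self.bn1(self.conv1(x)))))
        return self.relu2(out + self.down(x))
```

With a stride of 2 the downsampling branch halves the spatial resolution so the shortcut shape matches the main branch before the addition.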
Figure 1 (drawings, sheet 1 of 2)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021104479A AU2021104479A4 (en) | 2021-07-23 | 2021-07-23 | Text recognition method and system based on decoupled attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2021104479A4 true AU2021104479A4 (en) | 2021-08-26 |
Family
ID=77369696
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2021104479A Active AU2021104479A4 (en) | 2021-07-23 | 2021-07-23 | Text recognition method and system based on decoupled attention mechanism |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2021104479A4 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114170468A (en) * | 2022-02-14 | 2022-03-11 | 阿里巴巴达摩院(杭州)科技有限公司 | Text recognition method, storage medium and computer terminal |
CN114170468B (en) * | 2022-02-14 | 2022-05-31 | 阿里巴巴达摩院(杭州)科技有限公司 | Text recognition method, storage medium and computer terminal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111967470A (en) | Text recognition method and system based on decoupling attention mechanism | |
CN110135366B (en) | Shielded pedestrian re-identification method based on multi-scale generation countermeasure network | |
Gao et al. | MLNet: Multichannel feature fusion lozenge network for land segmentation | |
CN113158862B (en) | Multitasking-based lightweight real-time face detection method | |
CN112288011B (en) | Image matching method based on self-attention deep neural network | |
CN114187450A (en) | Remote sensing image semantic segmentation method based on deep learning | |
CN113780149A (en) | Method for efficiently extracting building target of remote sensing image based on attention mechanism | |
CN112150493A (en) | Semantic guidance-based screen area detection method in natural scene | |
CN109635726B (en) | Landslide identification method based on combination of symmetric deep network and multi-scale pooling | |
CN107423747A (en) | A kind of conspicuousness object detection method based on depth convolutional network | |
CN115082675B (en) | Transparent object image segmentation method and system | |
CN111062329B (en) | Unsupervised pedestrian re-identification method based on augmented network | |
CN112862690A (en) | Transformers-based low-resolution image super-resolution method and system | |
CN114724155A (en) | Scene text detection method, system and equipment based on deep convolutional neural network | |
CN110969089A (en) | Lightweight face recognition system and recognition method under noise environment | |
CN110633706B (en) | Semantic segmentation method based on pyramid network | |
AU2021104479A4 (en) | Text recognition method and system based on decoupled attention mechanism | |
Huan et al. | MAENet: multiple attention encoder–decoder network for farmland segmentation of remote sensing images | |
Cheng et al. | A survey on image semantic segmentation using deep learning techniques | |
CN117727046A (en) | Novel mountain torrent front-end instrument and meter reading automatic identification method and system | |
CN117726954A (en) | Sea-land segmentation method and system for remote sensing image | |
You et al. | Boundary-aware multi-scale learning perception for remote sensing image segmentation | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN113793267B (en) | Self-supervision single remote sensing image super-resolution method based on cross-dimension attention mechanism | |
CN113222016B (en) | Change detection method and device based on cross enhancement of high-level and low-level features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) |