AU2021104479A4 - Text recognition method and system based on decoupled attention mechanism - Google Patents


Publication number
AU2021104479A4
Authority
AU
Australia
Prior art keywords
text
layer
neural network
convolutional neural
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
AU2021104479A
Inventor
Lianwen JIN
Huiyun MAO
Qianying Wang
Tianwei WANG
Yaqiang Wu
Zhiyuan Zhu
Current Assignee
Guangdong Artificial Intelligence And Digital Economy Laboratory Guangzhou China
South China University of Technology SCUT
Lenovo Beijing Ltd
Original Assignee
Guangdong Artificial Intelligence And Digital Economy Laboratory Guangzhou
South China University of Technology SCUT
Lenovo Beijing Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Artificial Intelligence And Digital Economy Laboratory Guangzhou, South China University of Technology SCUT, Lenovo Beijing Ltd filed Critical Guangdong Artificial Intelligence And Digital Economy Laboratory Guangzhou
Priority to AU2021104479A priority Critical patent/AU2021104479A4/en
Application granted granted Critical
Publication of AU2021104479A4 publication Critical patent/AU2021104479A4/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/18Extraction of features or characteristics of the image
    • G06V30/18162Extraction of features or characteristics of the image related to a structural representation of the pattern
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63Scene text, e.g. street names
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/146Aligning or centring of the image pick-up or image-field
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/20Combination of acquisition, preprocessing or recognition functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation
    • G06F40/16Automatic learning of transformation rules, e.g. from examples
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]


Abstract

The present invention discloses a text recognition method and system based on a decoupled attention mechanism, mainly comprising a feature encoding module, a convolutional alignment module and a text decoding module. The feature encoding module extracts visual features from the input image with a deep convolutional neural network. The convolutional alignment module replaces the traditional score-based recursive alignment module: it takes multiscale visual features from the feature encoding module as input and generates attention maps channel by channel using a fully convolutional neural network. The text decoding module obtains the final prediction result by combining the feature map and attention map through the gated recursive unit. The method is simple to implement and offers high recognition accuracy, effectiveness, flexibility and robustness. It performs strongly in various text recognition fields such as scene text recognition and handwritten text recognition, with good practical application value.

Description

FIGURES
Figure 1: feature encoding module, convolutional alignment module, text decoding module.
Text recognition method and system based on decoupled attention mechanism
TECHNICAL FIELD
The present invention belongs to the technical field of pattern recognition and
artificial intelligence, and particularly relates to a method for accurate image
recognition associated with deep neural networks.
BACKGROUND
In recent years, text recognition has attracted broad research interest. Thanks to deep learning and research on sequence problems, many text recognition techniques have achieved remarkable success. Connectionist temporal classification (CTC) techniques and attention mechanism techniques are two popular approaches to sequence problems; among them, attention mechanism techniques have shown more outstanding performance and have been widely studied in recent years.
Attention mechanism techniques were first proposed in solving machine
translation problems and gradually used to deal with scene text recognition problems.
Since then, attention mechanism techniques have dominated a part of the development
in the field of text recognition. Attention mechanism techniques in text recognition are
used to align and recognize characters. In previous work, the alignment operations of
attention mechanism techniques were always combined with decoding operations.
Specifically, the alignment operations of traditional attention mechanism techniques
utilize two types of information to achieve this. One is the feature map, which is the
visual information obtained from encoding the image by the encoder. The second is
the historical decoding information, which can be the hidden layer state during recursion, or the embedding vector of the previous decoding result. The main idea behind the attention mechanism technique is matching, that is, given a portion of features in the feature map, it calculates an attention score by scoring how well this portion of features matches the historical decoded information.
Traditional attention mechanism techniques often face serious alignment problems due to the inevitable accumulation and propagation of errors caused by coupling the alignment and decoding operations. Match-based alignment operations are easily affected by the decoding result: for example, when a string contains two similar substrings, the historical decoding information tends to make the attention jump from one substring to the other. This is why it has been observed in the literature that attention mechanism techniques have difficulty aligning long sequences; the longer the sequence, the more likely it is to contain similar substrings. This motivates finding a way to decouple the alignment operation from the historical decoding information, thus mitigating this negative effect.
SUMMARY
The object of the present invention is to provide a method and system for text recognition based on a decoupled attention mechanism, which decouples a conventional attention mechanism module into an alignment module and a text decoding module. It avoids the accumulation and propagation of decoding errors, and solves the existing alignment problem by aligning first and then recognizing.
To achieve the above purpose, the present invention provides the following
solutions:
A text recognition method based on a decoupled attention mechanism,
comprising the following steps:
S1, extracting image features based on the text image and encoding to obtain a
feature map.
S2, aligning the feature map to obtain a target image, constructing a deep
convolutional neural network model, processing the target image to obtain an
attention map based on the deep convolutional neural network model and conducting
training.
S3, accurate text recognition of the feature map and the attention map based on
the deep convolutional neural network recognition model.
Preferably, the text image is a scene text image and/or a handwritten text image.
Preferably, the scene text image and/or handwritten text image is characterized in that:
Scene text image features include a scene text training data set and a scene text real evaluation data set, both covering many different font styles, light and shadow variations and resolution variations.
Handwritten text image features include a handwritten text real training data set and a handwritten text real evaluation data set, both containing different writing styles.
Preferably, the text portion of the scene text image training data set is complete and occupies more than two-thirds of the image area, contains a variety of different font styles, and allows for coverage of light and shadow variations as well as resolution variations.
Preferably, the scene text real evaluation data set is captured with cell phones or special hardware camera equipment; during shooting, the text in the normalized scene text image should occupy more than two-thirds of the image area. Skew and blur are allowed, and the captured scene text images should cover a wide range of application scenes with different font styles.
Preferably, the real training data of the handwritten text and the real evaluation
data of the handwritten text are written and collected by different people, the training
data and the evaluation data are independent from each other.
Preferably, the text image alignment processing method is:
Stretching and transforming the scene text training data set and the scene text
real evaluation data set image data to a uniform size.
The handwritten text real training data set and the handwritten text real
evaluation data set are downscaled by keeping the original image scale, and then the
surrounding area is filled until the uniform size.
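The two normalization routes above can be sketched as follows; the target size, nearest-neighbour resampling and zero fill value are illustrative assumptions, not values taken from the patent (grayscale images assumed).

```python
import numpy as np

def stretch_resize(img, out_h, out_w):
    """Scene text route: stretch the image to a uniform size
    (nearest-neighbour resampling for brevity)."""
    h, w = img.shape[:2]
    ys = np.arange(out_h) * h // out_h
    xs = np.arange(out_w) * w // out_w
    return img[ys][:, xs]

def pad_resize(img, out_h, out_w, fill=0):
    """Handwritten text route: downscale keeping the original image
    scale, then fill the surrounding area up to the uniform size."""
    h, w = img.shape[:2]
    scale = min(out_h / h, out_w / w)
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
    resized = stretch_resize(img, nh, nw)
    out = np.full((out_h, out_w), fill, dtype=img.dtype)
    top, left = (out_h - nh) // 2, (out_w - nw) // 2
    out[top:top + nh, left:left + nw] = resized
    return out
```

Both routes produce images of one uniform size, so scene and handwritten samples can be batched in the same way.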
Preferably, the deep convolutional neural network is constructed as follows:
Extracting multi-scale visual features based on feature coding.
Constructing deep convolutional neural network models by convolution and
deconvolution with fully convolutional neural networks.
In the deconvolution phase, each output feature is summed with the corresponding feature map from the convolution phase.
The convolution process downsamples and the deconvolution process upsamples; all convolution and deconvolution processes except the last one are followed by a nonlinear layer using the ReLU function.
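A minimal PyTorch sketch of such a fully convolutional alignment network follows. The channel counts, strides and two-stage depth are assumptions for brevity (per Table 1 the actual module uses five convolutions and four deconvolutions), and the final Sigmoid keeps the attention maps between 0 and 1 as described later in the embodiment.

```python
import torch
import torch.nn as nn

class ConvAlignModule(nn.Module):
    """Sketch of the convolutional alignment module: strided convolutions
    downsample, deconvolutions upsample, each deconvolution output is
    summed with the matching convolution-stage feature map, and the last
    layer emits maxT attention channels through a Sigmoid."""
    def __init__(self, in_ch=64, max_t=25):
        super().__init__()
        self.down1 = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=(2, 1), padding=1), nn.ReLU())
        self.down2 = nn.Sequential(
            nn.Conv2d(64, 64, 3, stride=(2, 1), padding=1), nn.ReLU())
        self.up1 = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 3, stride=(2, 1), padding=1,
                               output_padding=(1, 0)), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(64, max_t, 3, stride=(2, 1),
                                      padding=1, output_padding=(1, 0))

    def forward(self, feats):
        d1 = self.down1(feats)
        d2 = self.down2(d1)
        u1 = self.up1(d2) + d1               # sum with convolution-stage map
        return torch.sigmoid(self.up2(u1))   # one attention map per step t
```

Feeding a feature map of shape (N, 64, H, W) yields attention maps of shape (N, maxT, H, W).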
Preferably, the network structure of the deep convolutional neural network model
is an input layer, a convolutional layer, and a residual layer.
Preferably, the residual layer is divided into a first convolutional layer, a first
batch normalization layer, a first nonlinear layer, a second convolutional layer, a
second batch normalization layer, a downsampling layer, and a second nonlinear
layer.
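This residual layer ordering can be sketched in PyTorch as follows; the channel counts and stride are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualLayer(nn.Module):
    """First conv -> first BN -> first ReLU -> second conv -> second BN,
    a downsampling branch (1x1 conv + BN) on the shortcut when shapes
    change, and a second ReLU after the addition."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch))
        self.down = None
        if stride != 1 or in_ch != out_ch:
            self.down = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride),
                nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU()

    def forward(self, x):
        shortcut = x if self.down is None else self.down(x)
        return self.relu(self.body(x) + shortcut)
```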
Preferably, a back propagation algorithm is used in the training of the deep
convolutional neural network model in S2 to update all parameters of the network
model by calculating the transfer gradient from the last layer, layer by layer.
Preferably, the deep convolutional neural network model training strategy is supervised: a generic deep network recognition model is trained using the text image data and the corresponding annotation information.
Preferably, the input image of the deep convolutional neural network model is a
handwritten text image and/or a scene text image, and the output is a sequence of
characters in the text image and/or the scene text image.
Preferably, the parameters of the deep convolutional neural network model
training are set as follows:
The deep convolutional neural network iteration count is 1,000,000.
The deep convolutional neural network optimizer is Adadelta.
The learning rate of the deep convolutional neural network is 1.0.
Deep convolutional neural network learning rate update strategy: reduced to
one-tenth of the original at 50% and 75% of the total number of iterations,
respectively.
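These settings map directly onto standard PyTorch components; the helper below is a sketch, with the model and total iteration count as placeholders.

```python
import torch

def make_optimizer(model, total_iters):
    """Adadelta with learning rate 1.0, reduced to one tenth at 50%
    and 75% of the total number of iterations."""
    opt = torch.optim.Adadelta(model.parameters(), lr=1.0)
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=[total_iters // 2, total_iters * 3 // 4], gamma=0.1)
    return opt, sched
```

In the training loop, `sched.step()` is called once per iteration after `opt.step()`; with `total_iters=1_000_000` the drops occur at iterations 500,000 and 750,000.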
Preferably, the specific method of S3 text recognition is:
F_{x,y} represents the feature map and a_{t,x,y} represents the attention map at moment t obtained by convolutional alignment; the semantic vector c_t is calculated by equation (1):

c_t = Σ_{x=1..W} Σ_{y=1..H} a_{t,x,y} F_{x,y}    (1)

where W and H are the width and height of the feature map. At moment t, the output y_t is:

y_t = W h_t + b    (2)

where W and b are parameters and h_t represents the hidden layer state of the gated recursive unit at moment t. The calculation of h_t is expressed as:

h_t = GRU((e_{t-1}, c_t), h_{t-1})    (3)

where e_{t-1} represents the encoding vector of the previous output y_{t-1}. The final loss function is calculated as:

Loss = -Σ_{t=1..T} log P(g_t | I, θ)    (4)

where θ represents all learnable parameters of the deep neural network model and g_t represents the sample label value at moment t.
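Equations (1)-(3) can be sketched as a PyTorch decoding loop; the hidden size, class count and start-token handling are illustrative assumptions not specified in the patent.

```python
import torch
import torch.nn as nn

class TextDecoder(nn.Module):
    """Text decoding module sketch: at each moment t the attention map
    a_t weights the feature map F into the semantic vector c_t (Eq. 1),
    a GRU cell updates h_t from (e_{t-1}, c_t) and h_{t-1} (Eq. 3), and
    a linear layer produces y_t = W h_t + b (Eq. 2)."""
    def __init__(self, feat_ch, num_classes, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(num_classes, hidden)   # e_{t-1}
        self.gru = nn.GRUCell(hidden + feat_ch, hidden)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, feats, att):
        # feats: (N, C, H, W); att: (N, maxT, H, W) from alignment module
        n, c, _, _ = feats.shape
        h = feats.new_zeros(n, self.gru.hidden_size)
        prev = torch.zeros(n, dtype=torch.long)          # assumed start id 0
        logits = []
        for t in range(att.shape[1]):
            ctx = (att[:, t:t + 1] * feats).sum(dim=(2, 3))  # Eq. (1)
            h = self.gru(torch.cat([self.embed(prev), ctx], 1), h)
            y = self.fc(h)                                   # Eq. (2)
            logits.append(y)
            prev = y.argmax(1)
        return torch.stack(logits, 1)                    # (N, maxT, classes)
```

Training would then apply the cross-entropy loss of Eq. (4) between these logits and the label sequence g_t.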
A system for text recognition based on a decoupled attention mechanism,
comprising a feature encoding module, a convolutional alignment module and a text
decoding module.
Feature encoding module extracts visual features from text images based on deep
convolutional neural networks.
The convolutional alignment module extracts multiscale visual features from the
feature encoding module and generates attention maps channel-by-channel via deep
convolutional neural networks.
The text decoding module obtains the final prediction result by combining the
feature map and attention map through the gated recursive unit.
Preferably, the network structure of the deep convolutional neural network unit is
an input layer unit, a convolutional layer unit, and a residual layer unit.
Preferably, the residual layer unit is divided into a first convolutional layer unit, a
first batch normalization layer unit, a first nonlinear layer unit, a second convolutional
layer unit, a second batch normalization layer unit, a downsampling layer unit, and a
second nonlinear layer unit.
Preferably, the nonlinear layer units within the residual layer units all use the ReLU activation function.
Preferably, the downsampling layer unit is implemented through the
convolutional layer unit and the batch normalization layer unit.
Technical effects of the present invention:
(1) The present invention decouples the conventional attention mechanism
module. Compared with traditional attention mechanism techniques, the present
invention does not require the information returned at the decoding stage to be aligned,
avoiding the accumulation and propagation of decoding errors, thus enabling a higher
recognition accuracy.
(2) The invention is simple to use, it can be easily embedded into other models,
and it is also flexible enough to freely convert between one-dimensional text and
two-dimensional text.
(3) A back-propagation algorithm is used to automatically adjust the
convolutional kernel parameters, resulting in a more robust filter that can adapt to a
variety of complex environments.
(4) Compared with the manual method, the present invention can automatically
complete the recognition of scene text and handwritten text, which can save
manpower and material resources.
(5) The present invention can provide more reliable alignment performance in
the attention mechanism by decoupling the attention algorithm, especially when faced
with long text, the present invention has more robust characteristics compared with
the traditional attention mechanism.
BRIEF DESCRIPTION OF THE FIGURES
In order to illustrate more clearly the technical solutions in the embodiments of
the invention or in the prior art, the following is a brief description of the
accompanying figures that need to be used in the embodiments. Obviously, the figures
in the following description are only some embodiments of the present invention, and
other figures may be obtained from these figures for those of ordinary skill in the art
without creative labor.
Figure 1 is the structural block diagram of the deep convolutional network
recognition model of the present invention.
Figure 2 is a flow chart of the text recognition method based on the decoupled
attention mechanism of the present invention.
DESCRIPTION OF THE INVENTION
The technical solutions in the embodiments of the present invention will be
clearly and completely described below in conjunction with the accompanying figures
in the embodiments of the present invention. Obviously, the described embodiments
are only a part of the embodiments in the present invention, and not all of them. Based
on the embodiments in the present invention, all other embodiments obtained by a
person of ordinary skill in the art without making creative labor belong to the scope of
protection of the present invention.
Example 1: A text recognition system based on a decoupled attention mechanism,
as shown in Figure 1, comprising a feature encoding module, a convolutional
alignment module and a text decoding module.
Feature encoding module extracts visual features from text images based on deep convolutional neural networks.
The convolutional alignment module extracts multiscale visual features from the
feature encoding module and generates attention maps channel-by-channel via deep
convolutional neural networks.
The text decoding module obtains the final prediction result by combining the
feature map and attention map through the gated recursive unit.
As shown in Figure 2, the specific steps of the text recognition method based on
the decoupled attention mechanism are:
In the first step, the scene text image and/or handwritten text image is encoded
by feature extraction through the feature encoding module to form a feature map.
Scene text image features include a scene text training data set and a scene text real evaluation data set, both covering many different font styles, light and shadow variations and resolution variations.
Handwritten text image features include a handwritten text real training data set and a handwritten text real evaluation data set, both containing different writing styles.
Scene text image training data with the text portion complete and occupying
more than two-thirds of the image, containing a variety of different font styles,
allowing for some degree of light and shadow variation and resolution variation.
The real evaluation data set of scene text is obtained from camera equipment
such as cell phones and special hardware, and the text in the normalized scene text image should occupy more than two-thirds of the image area during the shooting process, allowing for a certain extent of skewing, blurring, and the captured scene text images should cover a wide range of application scenarios with different font styles.
The real training data of handwritten text and the real evaluation data of
handwritten text are written and collected by different people, the training data and
evaluation data are independent from each other.
In the second step, the scene text images and/or handwritten text images are
convolutionally aligned through the convolutional alignment module, the structure of
which is shown in Table 1.
Stretching and transforming the image data of the scene text training data set and
the scene text real evaluation data set to a uniform size.
Downscale the handwritten text real training data set and the handwritten text real evaluation data set keeping the original image scale, and then fill the surrounding area up to the uniform size.
Table 1
Network layer        Specific operation                                    Down/up sampling ratio
Convolutional layer  [Convolution kernel 3*3, number of channels 64] * 5   2*1 (downsampling)
Deconvolution layer  [Convolution kernel 3*3, number of channels 64] * 4   2*1 (upsampling)
Deconvolution layer  Convolution kernel 3*3, number of channels maxT       2*1 (upsampling)
Nonlinear layer      Sigmoid
The deep convolutional neural network was constructed as shown in Table 2 and trained. The construction is as follows: visual features are extracted from the scene text images and/or handwritten text images by a convolutional neural network; multi-scale visual features from the feature encoding module are taken as input and passed through convolution and deconvolution in a fully convolutional neural network, where each output feature in the deconvolution phase is summed with the corresponding feature map from the convolution phase. The convolution process downsamples and the deconvolution process upsamples; all convolution and deconvolution processes except the last one are followed by a nonlinear layer using the ReLU function. The number of output channels of the last deconvolution layer is maxT, whose value depends on the text type: 25 for scene text and 150 for handwritten text. The final nonlinear layer uses a Sigmoid function to keep the output attention map between 0 and 1. A back-propagation algorithm is used in training the deep neural network model, updating all parameters of the network model by calculating the transfer gradient from the last layer, layer by layer.
Table 2
                                                                            Downsampling ratio
Network layer        Specific operations                                    Scene text      Handwritten
                                                                            1-D     2-D     text
Input layer
Convolutional layer  Convolution kernel 3*3, number of channels 32          2*1     1*1     1*1
Residual layers      [Convolution kernel 1*1, number of channels 32;
                      Convolution kernel 3*3, number of channels 32] * 3    2*2     2*2     2*2
Residual layers      [Convolution kernel 1*1, number of channels 64;
                      Convolution kernel 3*3, number of channels 64] * 4    2*2     2*2     1*1
Residual layers      [Convolution kernel 1*1, number of channels 128;
                      Convolution kernel 3*3, number of channels 128] * 6   2*1     2*1     2*2
Residual layers      [Convolution kernel 1*1, number of channels 256;
                      Convolution kernel 3*3, number of channels 256] * 6   2*2     2*1     1*1
Residual layers      [Convolution kernel 1*1, number of channels 512;
                      Convolution kernel 3*3, number of channels 512] * 3   2*2     2*1     1*1
Table 3
Network layer (residual layer)   Specific operations
Convolutional layer              Convolution kernel 1*1, step size 1*1
Batch normalization layer
Nonlinear layer
Convolutional layer              Convolution kernel 3*3, step size 1*1, padding 1*1
Batch normalization layer
Downsampling layer               Convolution kernel 1*1 with step size; batch normalization layer
Nonlinear layer
As shown in Table 3, the residual layers are divided into a first convolutional layer, a first batch normalization layer, a first nonlinear layer, a second convolutional layer, a second batch normalization layer, a downsampling layer, and a second nonlinear layer.
The ReLU activation function is used for all nonlinear layers within the residual layers.
The downsampling layer is implemented by convolutional and batch normalization layers.
The deep neural network model training strategy is supervised: a generic deep network recognition model is trained using text image data and corresponding annotation information.
The input images to the deep neural network model are handwritten text images
and/or scene text images, and the output is a sequence of characters in the text images
and/or scene text images.
The parameters of the deep neural network model training are set as follows:
The number of iterations of the deep neural network is 1,000,000.
The deep neural network optimizer is Adadelta.
The deep neural network learning rate is 1.0.
The deep neural network learning rate update strategy: reduced to one-tenth of the original at 50% and 75% of the total number of iterations, respectively.
In the third step, text recognition is performed by the feature map and attention
map through the text recognition module, and the images are accurately recognized by
inputting the feature map and attention map, based on a deep network recognition
model with decoupled attention mechanism.
The specific methods for performing text recognition are:
F_{x,y} represents the feature map and a_{t,x,y} represents the attention map at moment t obtained by convolutional alignment; the semantic vector c_t is calculated by equation (1):

c_t = Σ_{x=1..W} Σ_{y=1..H} a_{t,x,y} F_{x,y}    (1)

where W and H are the width and height of the feature map. At moment t, the output y_t is:

y_t = W h_t + b    (2)

where W and b are parameters and h_t represents the hidden layer state of the gated recursive unit at moment t. The calculation of h_t is expressed as follows:

h_t = GRU((e_{t-1}, c_t), h_{t-1})    (3)

where e_{t-1} represents the encoding vector of the last output y_{t-1}. The final loss function is calculated as:

Loss = -Σ_{t=1..T} log P(g_t | I, θ)    (4)

where θ represents all learnable parameters of the deep neural network model and g_t represents the sample label values at moment t.
Input a text image, a deep network recognition model based on decoupled
attention mechanism performs accurate recognition of the image and gets the words in
the text image.
The above described embodiments are only a description of the preferred way of
the present invention, not a limitation on the scope of the present invention. Without departing from the spirit of the design of the present invention, all kinds of deformations and improvements made to the technical solutions of the present invention by a person of ordinary skill in the art shall belong to the scope of protection determined in the claims of the present invention.

Claims (10)

THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:
1. A text recognition method based on a decoupled attention mechanism,
characterized in that it comprises the following steps:
S1, extracting image features based on the text image and encoding to obtain a
feature map;
S2, aligning the feature map to obtain a target image, constructing a deep
convolutional neural network model, processing the target image to obtain an
attention map based on the deep convolutional neural network model and conducting
training;
S3, accurate text recognition of the feature map and the attention map based on
the deep convolutional neural network recognition model.
2. A text recognition method based on a decoupled attention mechanism
according to claim 1, characterized in that:
the text image is a scene text image and/or a handwritten text image;
the scene text image and/or handwritten text image is characterized in:
scene text image features including scene text training data set and scene text
real evaluation data set, scene text training data set and scene text real evaluation data
set covering many different font styles, light and shadow variations and resolution
variations;
handwritten text image features including handwritten text real training data set
and handwritten text real evaluation data set, handwritten text real training data set
and handwritten text real evaluation data set containing different writing styles.
3. A text recognition method based on a decoupled attention mechanism
according to claim 2, characterized in that:
the text portion of the scene text training data set is complete and occupies more
than two-thirds of the image area, contains a variety of different font styles, and
allows for coverage of light and shadow variations as well as resolution variations;
the scene text real evaluation data set is obtained by cell phone, special hardware
camera equipment, during the shooting process, the text in the normalized text image
of the scene should occupy more than two-thirds of the image area. Allow for the
presence of skew, blur, and the captured text images of the scenes should cover a wide
range of application scenes with different font styles;
the real training data of the handwritten text and the real evaluation data of the
handwritten text are written and collected by different people, the training data and
the evaluation data are independent from each other.
4. A text recognition method based on a decoupled attention mechanism
according to claim 2, characterized in that:
the text image alignment processing method is:
stretching and transforming the scene text training data set and the scene text real
evaluation data set image data to a uniform size;
the handwritten text real training data set and the handwritten text real evaluation
data set are downscaled by keeping the original image scale, and then the surrounding
area is filled until the uniform size.
5. A text recognition method based on a decoupled attention mechanism according to claim 1, characterized in that: in the S2, the deep convolutional neural network construction method is: extracting multi-scale visual features based on feature coding; constructing deep convolutional neural network models by convolution and deconvolution with fully convolutional neural networks; the deconvolution phase, where each output feature is summed by the corresponding feature map from the convolution phase; the convolution process is downsampled, the deconvolution process is upsampled, and all convolution and deconvolution processes except the last one are followed by a nonlinear layer at the end, and using the ReLu function; preferably, the network structure of the deep convolutional neural network model is an input layer, a convolutional layer, and a residual layer; preferably, the residual layer is divided into a first convolutional layer, a first batch normalization layer, a first nonlinear layer, a second convolutional layer, a second batch normalization layer, a downsampling layer, and a second nonlinear layer.
6. A text recognition method based on a decoupled attention mechanism
according to claim 1, characterized in that:
a back propagation algorithm is used in the training of the deep convolutional
neural network model in S2 to update all parameters of the network model by
calculating the transfer gradient from the last layer, layer by layer;
the deep convolutional neural network model training strategy is in a supervised manner: a generic the deep network recognition model is trained by using the text image data and the corresponding annotation information; the input image of the deep convolutional neural network model is a handwritten text image and/or a scene text image, and the output is a sequence of characters in the text image and/or the scene text image.
7. A text recognition method based on a decoupled attention mechanism
according to claim 6, characterized in that:
the parameters of the deep convolutional neural network model training are set as
follows:
the iteration count of the deep convolutional neural network is 1,000,000;
the deep convolutional neural network optimizer is Adadelta;
the learning rate of the deep convolutional neural network is 1.0;
the deep convolutional neural network learning rate update strategy: reduced to
one-tenth of the original at 50% and 75% of the total number of iterations,
respectively.
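The claimed hyperparameters map directly onto PyTorch's Adadelta optimizer and a MultiStepLR schedule; `total_iters` is scaled down from the claimed 1,000,000 here only so the sketch runs quickly, and the placeholder model is an assumption:

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the recognizer
opt = torch.optim.Adadelta(model.parameters(), lr=1.0)

# Claimed schedule: learning rate cut to one-tenth at 50% and 75% of training.
total_iters = 1000  # scaled down from 1,000,000 for this demo
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt, milestones=[total_iters // 2, 3 * total_iters // 4], gamma=0.1)

lrs = []
for _ in range(total_iters):
    opt.step()    # the parameter update would follow loss.backward() in training
    sched.step()
    lrs.append(sched.get_last_lr()[0])
```

After the loop, the recorded learning rate is 1.0 for the first half, 0.1 up to the three-quarter mark, and 0.01 thereafter.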
8. A text recognition method based on a decoupled attention mechanism
according to claim 1, characterized in that:
the specific method of S3 text recognition is:
F_{x,y} represents the feature map and a_{t,x,y} represents the attention map at
moment t obtained by convolutional alignment; the semantic vector c_t is
calculated by equation (1):

c_t = \sum_{x=1}^{W} \sum_{y=1}^{H} a_{t,x,y} F_{x,y}    (1)

where W and H are the width and height of the feature map; at moment t, the
output y_t is:

y_t = W h_t + b    (2)

where W and b are parameters and h_t represents the hidden layer state of
the gated recurrent unit at moment t;
the calculation of h_t is expressed as:

h_t = \mathrm{GRU}((e_{t-1}, c_t), h_{t-1})    (3)

where e_{t-1} represents the encoding vector of the previous output y_{t-1};
the final loss function is calculated as:

\mathrm{Loss} = -\sum_{t=1}^{T} \log P(g_t \mid I, \theta)    (4)

where \theta represents all learnable parameters of the deep neural network model
and g_t represents the sample label value at moment t.
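A minimal numerical sketch of equations (1) through (4), assuming small illustrative dimensions and random weights in place of trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W_feat, C, d, n_cls = 4, 8, 16, 32, 10        # illustrative sizes

F = rng.standard_normal((H, W_feat, C))          # feature map F_{x,y}
a_t = rng.random((H, W_feat))
a_t /= a_t.sum()                                 # attention map a_{t,x,y}

# Eq. (1): semantic vector c_t = sum over x, y of a_{t,x,y} * F_{x,y}
c_t = (a_t[..., None] * F).sum(axis=(0, 1))

# Eq. (3): h_t = GRU((e_{t-1}, c_t), h_{t-1}), written out gate by gate
def gru_step(x, h, Wz, Wr, Wn):
    xh = np.concatenate([x, h])
    z = 1 / (1 + np.exp(-Wz @ xh))               # update gate
    r = 1 / (1 + np.exp(-Wr @ xh))               # reset gate
    n = np.tanh(Wn @ np.concatenate([x, r * h])) # candidate state
    return (1 - z) * n + z * h

e_prev = rng.standard_normal(C)                  # embedding of y_{t-1}
x = np.concatenate([e_prev, c_t])                # GRU input (e_{t-1}, c_t)
h_prev = np.zeros(d)
Wz, Wr, Wn = (0.1 * rng.standard_normal((d, x.size + d)) for _ in range(3))
h_t = gru_step(x, h_prev, Wz, Wr, Wn)

# Eq. (2): y_t = W h_t + b
W_out = 0.1 * rng.standard_normal((n_cls, d))
b = np.zeros(n_cls)
y_t = W_out @ h_t + b

# Eq. (4): one step's term of Loss = -sum_t log P(g_t | I, theta),
# with P obtained by a softmax over y_t
p = np.exp(y_t - y_t.max())
p /= p.sum()
g_t = 3                                          # hypothetical label at step t
loss_t = -np.log(p[g_t])
```

Normalizing `a_t` to sum to one reflects the attention maps being probability distributions over spatial positions; the softmax plays the same role for the class probabilities in the loss.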
9. A system for text recognition based on a decoupled attention mechanism,
characterized in that it includes a feature encoding module, a convolutional alignment
module, and a text decoding module;
the feature encoding module extracts visual features from text images based on
deep convolutional neural networks;
the convolutional alignment module extracts multiscale visual features from the
feature encoding module and generates attention maps channel by channel via deep
convolutional neural networks;
the text decoding module obtains the final prediction result by combining the
feature map and the attention maps through the gated recurrent unit.
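One plausible reading of the alignment module in this system, sketched in PyTorch: a single 1x1 convolution stands in for the full convolutional alignment stack, and `maxT` (the maximum decoding length) is an assumed parameter. Each output channel is the attention map for one decoding step, which is what "channel by channel" generation means here:

```python
import torch
import torch.nn as nn

class ConvAlignment(nn.Module):
    """Maps the encoded feature map to maxT channels; each channel,
    softmax-normalized over spatial positions, is one step's attention map."""
    def __init__(self, in_ch, maxT):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, maxT, kernel_size=1)

    def forward(self, feat):                       # feat: (B, C, H, W)
        att = self.conv(feat)                      # (B, T, H, W)
        B, T, Hh, Ww = att.shape
        return att.flatten(2).softmax(-1).view(B, T, Hh, Ww)

feat = torch.randn(2, 16, 4, 8)                    # feature-encoding output
align = ConvAlignment(16, maxT=25)
att = align(feat)
# decoder side: the context vector for step 0 is the attention-weighted sum
c_0 = (att[:, 0:1] * feat).sum(dim=(2, 3))         # (B, C)
```

The context vector `c_0` is what the text decoding module would feed, together with the previous output's embedding, into the gated recurrent unit.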
10. A system for text recognition based on a decoupled attention mechanism of
claim 9, characterized in that:
the network structure of the deep convolutional neural network unit is an input
layer unit, a convolutional layer unit, and a residual layer unit;
the residual layer unit is divided into a first convolutional layer unit, a first batch
normalization layer unit, a first nonlinear layer unit, a second convolutional layer unit,
a second batch normalization layer unit, a downsampling layer unit, and a second
nonlinear layer unit;
the nonlinear layer units within the residual layer units all use the ReLU
activation function;
the downsampling layer unit is implemented through the convolutional layer unit
and the batch normalization layer unit.
AU2021104479A 2021-07-23 2021-07-23 Text recognition method and system based on decoupled attention mechanism Active AU2021104479A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2021104479A AU2021104479A4 (en) 2021-07-23 2021-07-23 Text recognition method and system based on decoupled attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2021104479A AU2021104479A4 (en) 2021-07-23 2021-07-23 Text recognition method and system based on decoupled attention mechanism

Publications (1)

Publication Number Publication Date
AU2021104479A4 true AU2021104479A4 (en) 2021-08-26

Family

ID=77369696

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2021104479A Active AU2021104479A4 (en) 2021-07-23 2021-07-23 Text recognition method and system based on decoupled attention mechanism

Country Status (1)

Country Link
AU (1) AU2021104479A4 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114170468A (en) * 2022-02-14 2022-03-11 阿里巴巴达摩院(杭州)科技有限公司 Text recognition method, storage medium and computer terminal
CN114170468B (en) * 2022-02-14 2022-05-31 阿里巴巴达摩院(杭州)科技有限公司 Text recognition method, storage medium and computer terminal

Similar Documents

Publication Publication Date Title
CN111967470A (en) Text recognition method and system based on decoupling attention mechanism
CN110135366B (en) Shielded pedestrian re-identification method based on multi-scale generation countermeasure network
Gao et al. MLNet: Multichannel feature fusion lozenge network for land segmentation
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN112288011B (en) Image matching method based on self-attention deep neural network
CN114187450A (en) Remote sensing image semantic segmentation method based on deep learning
CN113780149A (en) Method for efficiently extracting building target of remote sensing image based on attention mechanism
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN109635726B (en) Landslide identification method based on combination of symmetric deep network and multi-scale pooling
CN107423747A (en) A kind of conspicuousness object detection method based on depth convolutional network
CN115082675B (en) Transparent object image segmentation method and system
CN111062329B (en) Unsupervised pedestrian re-identification method based on augmented network
CN112862690A (en) Transformers-based low-resolution image super-resolution method and system
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN110969089A (en) Lightweight face recognition system and recognition method under noise environment
CN110633706B (en) Semantic segmentation method based on pyramid network
AU2021104479A4 (en) Text recognition method and system based on decoupled attention mechanism
Huan et al. MAENet: multiple attention encoder–decoder network for farmland segmentation of remote sensing images
Cheng et al. A survey on image semantic segmentation using deep learning techniques
CN117727046A (en) Novel mountain torrent front-end instrument and meter reading automatic identification method and system
CN117726954A (en) Sea-land segmentation method and system for remote sensing image
You et al. Boundary-aware multi-scale learning perception for remote sensing image segmentation
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN113793267B (en) Self-supervision single remote sensing image super-resolution method based on cross-dimension attention mechanism
CN113222016B (en) Change detection method and device based on cross enhancement of high-level and low-level features

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)