AU2021104479A4 - Text recognition method and system based on decoupled attention mechanism - Google Patents
Abstract
The present invention discloses a text recognition method and system based on a decoupled attention mechanism, mainly comprising a feature encoding module, a convolutional alignment module and a text decoding module. The feature encoding module extracts visual features from the input image with a deep convolutional neural network. The convolutional alignment module replaces the traditional score-based recurrent alignment module: it takes multiscale visual features from the feature encoding module as input and generates attention maps channel by channel with a fully convolutional neural network. The text decoding module obtains the final prediction result by combining the feature map and the attention map through a gated recurrent unit. The method is simple to implement and offers high recognition accuracy, effectiveness, flexibility and robustness. It performs strongly in text recognition fields such as scene text recognition and handwritten text recognition, and has good practical application value.
FIGURES
[Figure 1: Feature encoding module, Convolutional alignment module, Text decoding module]
Description
Text recognition method and system based on decoupled attention mechanism
The present invention belongs to the technical field of pattern recognition and
artificial intelligence, and particularly relates to a method for accurate image
recognition associated with deep neural networks.
In recent years, text recognition has attracted broad research interest. Thanks to deep learning and research on sequence problems, many text recognition techniques have achieved remarkable success. Connectionist temporal classification (CTC) techniques and attention mechanism techniques are two popular approaches to sequence problems; among them, attention mechanism techniques have shown more outstanding performance and have been widely studied in recent years.
Attention mechanism techniques were first proposed in solving machine
translation problems and gradually used to deal with scene text recognition problems.
Since then, attention mechanism techniques have dominated a part of the development
in the field of text recognition. Attention mechanism techniques in text recognition are
used to align and recognize characters. In previous work, the alignment operations of
attention mechanism techniques were always combined with decoding operations.
Specifically, the alignment operations of traditional attention mechanism techniques utilize two types of information. One is the feature map, the visual information obtained by encoding the image with the encoder. The other is the historical decoding information, which can be the hidden state during recurrence or the embedding vector of the previous decoding result. The main idea behind the attention mechanism technique is matching: given a portion of features in the feature map, it computes an attention score by measuring how well this portion of features matches the historical decoding information.
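The matching step just described can be sketched numerically. The snippet below is a minimal illustration, not the patent's implementation: attention weights are a softmax over dot-product scores between each feature-map position and the historical decoding vector, which is exactly the coupling the invention later removes.

```python
import numpy as np

def conventional_attention_step(feature_map, history):
    """Score-based alignment used by traditional attention decoders.

    feature_map : (N, C) array, N spatial positions with C channels.
    history     : (C,) array, hidden state / embedding from the
                  previous decoding step.
    Returns softmax attention weights over the N positions.
    """
    scores = feature_map @ history      # how well each position matches
    scores -= scores.max()              # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

# Positions 0 and 2 match the history equally well, so the attention
# mass is split between them: the "similar substring" ambiguity.
feature_map = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
history = np.array([1.0, 0.0])
w = conventional_attention_step(feature_map, history)
```

Because two positions score identically against the same history, the weights tie between them, which is the mechanism behind the substring-jumping failure described next.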
Traditional attention mechanism techniques often face serious alignment problems because coupling the alignment and decoding operations inevitably accumulates and propagates errors. Match-based alignment is easily disturbed by the decoding result: for example, when a string contains two similar substrings, the historical decoding information tends to make the attention jump from one substring to the other. This is why the literature observes that attention mechanism techniques have difficulty aligning long sequences: the longer the sequence, the more likely it is to contain similar substrings. This encourages us to decouple the alignment operation from the historical decoding information and thereby mitigate this negative effect.
The object of the present invention is to provide a method and system for text recognition based on a decoupled attention mechanism, which decouples the conventional attention mechanism module into an alignment module and a text decoding module. By aligning first and then recognizing, it avoids the accumulation and propagation of decoding errors and solves the existing alignment problem.
To achieve the above purpose, the present invention provides the following
solutions:
A text recognition method based on a decoupled attention mechanism,
comprising the following steps:
S1, extracting image features from the text image and encoding them to obtain a feature map.
S2, aligning the feature map to obtain a target image, constructing a deep
convolutional neural network model, processing the target image to obtain an
attention map based on the deep convolutional neural network model and conducting
training.
S3, accurate text recognition of the feature map and the attention map based on
the deep convolutional neural network recognition model.
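As a high-level sketch, the three steps compose as below; the function names are illustrative placeholders, not identifiers from the patent.

```python
def recognize(text_image, encoder, aligner, decoder):
    """Decoupled-attention pipeline: the aligner sees only the
    feature map (S2), never the decoder's history, so decoding
    errors cannot feed back into alignment."""
    feature_map = encoder(text_image)            # S1: encode image
    attention_maps = aligner(feature_map)        # S2: align from features only
    return decoder(feature_map, attention_maps)  # S3: recognize

# Toy stand-ins just to show the data flow between the three modules.
result = recognize(3, lambda x: x + 1, lambda f: f * 2, lambda f, a: (f, a))
```

The key structural point is that `aligner` takes only `feature_map` as input, in contrast to the conventional coupled design.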
Preferably, the text image is a scene text image and/or a handwritten text image.
Preferably, the scene text image and/or handwritten text image is characterized in that:
scene text image features include the scene text training data set and the scene text real evaluation data set, which cover many different font styles, light and shadow variations and resolution variations;
handwritten text image features include the handwritten text real training data set and the handwritten text real evaluation data set, which contain different writing styles.
Preferably, the text portion of the scene text image training data set is complete and occupies more than two-thirds of the image area, contains a variety of different font styles, and allows for light and shadow variations as well as resolution variations.
Preferably, the scene text real evaluation data set is captured by cell phones or special hardware camera equipment. During shooting, the text in the normalized scene text image should occupy more than two-thirds of the image area; skew and blur are allowed, and the captured scene text images should cover a wide range of application scenes with different font styles.
Preferably, the real training data of the handwritten text and the real evaluation data of the handwritten text are written and collected by different people; the training data and the evaluation data are independent of each other.
Preferably, the text image alignment processing method is:
Stretching and transforming the scene text training data set and the scene text
real evaluation data set image data to a uniform size.
The handwritten text real training data set and the handwritten text real
evaluation data set are downscaled by keeping the original image scale, and then the
surrounding area is filled until the uniform size.
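The two preprocessing routes above can be sketched as follows. This is a hedged nearest-neighbour illustration, not the patent's actual resizing code; function names and the rounding choices are assumptions.

```python
import numpy as np

def stretch_resize(img, out_h, out_w):
    """Stretch a grayscale image to a uniform size (scene text route),
    using nearest-neighbour index mapping."""
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

def pad_resize(img, out_h, out_w, fill=0):
    """Scale keeping the original aspect ratio, then fill the
    surrounding area to the uniform size (handwritten text route)."""
    h, w = img.shape
    scale = min(out_h / h, out_w / w)
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))
    canvas = np.full((out_h, out_w), fill, dtype=img.dtype)
    canvas[:new_h, :new_w] = stretch_resize(img, new_h, new_w)
    return canvas
```

Stretching distorts the aspect ratio but fills the whole canvas; padding preserves stroke proportions, which matters more for handwriting.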
Preferably, the deep convolutional neural network is constructed as follows:
Extracting multi-scale visual features based on feature encoding.
Constructing the deep convolutional neural network model by convolution and deconvolution with a fully convolutional neural network.
In the deconvolution phase, each output feature is summed with the corresponding feature map from the convolution phase.
The convolution process downsamples and the deconvolution process upsamples; all convolution and deconvolution processes except the last are followed by a nonlinear layer using the ReLU function.
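The convolution/deconvolution structure with skip sums can be illustrated at the shape level. In this sketch, average pooling stands in for a stride-2 convolution and nearest-neighbour repetition stands in for deconvolution; the real network learns these operators.

```python
import numpy as np

def downsample(x):
    """Stride-2 average pooling along the width (stand-in for a
    stride-2 convolution)."""
    return 0.5 * (x[:, 0::2] + x[:, 1::2])

def upsample(x):
    """Nearest-neighbour 2x upsampling along the width (stand-in
    for a deconvolution)."""
    return np.repeat(x, 2, axis=1)

def conv_align(features):
    """Convolutional alignment sketch: downsample twice, then
    upsample twice, summing each deconvolution output with the
    matching convolution-stage feature map (the skip sums)."""
    d1 = downsample(features)      # convolution stage
    d2 = downsample(d1)
    u1 = upsample(d2) + d1         # deconvolution stage + skip sum
    u2 = upsample(u1) + features
    return u2
```

The skip sums let the upsampling path recover fine spatial detail lost in downsampling, which is what makes per-character attention maps sharp.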
Preferably, the network structure of the deep convolutional neural network model
is an input layer, a convolutional layer, and a residual layer.
Preferably, the residual layer is divided into a first convolutional layer, a first
batch normalization layer, a first nonlinear layer, a second convolutional layer, a
second batch normalization layer, a downsampling layer, and a second nonlinear
layer.
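A minimal numerical sketch of that residual layer follows. Channel-mixing matrix products stand in for the 1*1/3*3 convolutions, and the batch normalization has no learned scale or shift; all names and shapes are illustrative assumptions.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Per-channel normalisation (inference-style, no learned affine)."""
    return (x - x.mean(axis=0, keepdims=True)) / np.sqrt(
        x.var(axis=0, keepdims=True) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def residual_layer(x, w1, w2, w_down):
    """conv1 -> BN1 -> ReLU -> conv2 -> BN2, plus a downsampling
    shortcut (conv + BN), joined before the second ReLU.
    x: (N, C_in); w1/w2/w_down: channel-mixing matrices standing
    in for the convolutions."""
    out = batch_norm(x @ w1)            # first conv + first BN
    out = relu(out)                     # first nonlinear layer
    out = batch_norm(out @ w2)          # second conv + second BN
    shortcut = batch_norm(x @ w_down)   # downsampling layer
    return relu(out + shortcut)         # second nonlinear layer

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))
out = residual_layer(x,
                     w1=rng.normal(size=(2, 3)),
                     w2=rng.normal(size=(3, 3)),
                     w_down=rng.normal(size=(2, 3)))
```

The shortcut uses its own convolution plus batch normalization so that its channel count matches the main branch before the sum.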
Preferably, a back propagation algorithm is used in the training of the deep
convolutional neural network model in S2 to update all parameters of the network
model by calculating the transfer gradient from the last layer, layer by layer.
Preferably, the deep convolutional neural network model is trained in a supervised manner: a generic deep network recognition model is trained using the text image data and the corresponding annotation information.
Preferably, the input image of the deep convolutional neural network model is a
handwritten text image and/or a scene text image, and the output is a sequence of
characters in the text image and/or the scene text image.
Preferably, the parameters of the deep convolutional neural network model
training are set as follows:
Deep convolutional neural network iteration count of 1,000,000.
Deep convolutional neural network optimizer is Adadelta.
The learning rate of the deep convolutional neural network is 1.0.
Deep convolutional neural network learning rate update strategy: reduced to
one-tenth of the original at 50% and 75% of the total number of iterations,
respectively.
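The stated schedule (base learning rate 1.0, divided by 10 at 50% and again at 75% of the 1,000,000 iterations) can be written directly:

```python
def learning_rate(step, total_steps=1_000_000, base_lr=1.0):
    """Learning-rate schedule from the text: reduced to one-tenth
    at 50% and again at 75% of the total iteration count."""
    lr = base_lr
    if step >= total_steps // 2:
        lr /= 10          # 1.0 -> 0.1 at 500,000 iterations
    if step >= total_steps * 3 // 4:
        lr /= 10          # 0.1 -> 0.01 at 750,000 iterations
    return lr
```

Adadelta normally adapts its own effective step size; here the scheduled value is the global multiplier applied on top of it.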
Preferably, the specific method of S3 text recognition is:
$F_{x,y}$ represents the feature map and $a_{t,x,y}$ represents the attention map at moment $t$ obtained by convolutional alignment; the semantic vector $c_t$ is calculated by equation (1):

$c_t = \sum_{x=1}^{W} \sum_{y=1}^{H} a_{t,x,y} F_{x,y}$    (1)

where $W$ and $H$ are the width and height of the feature map. At moment $t$, the output $y_t$ is:

$y_t = W h_t + b$    (2)

where $W$ and $b$ are parameters and $h_t$ represents the hidden state of the gated recurrent unit at moment $t$.

The calculation of $h_t$ is expressed as:

$h_t = \mathrm{GRU}((e_{t-1}, c_t), h_{t-1})$    (3)

where $e_{t-1}$ represents the encoding vector of the previous output $y_{t-1}$; the final loss function is calculated as:

$\mathrm{Loss} = -\sum_{t=1}^{T} \log P(g_t \mid I, \theta)$    (4)

where $\theta$ represents all learnable parameters of the deep neural network model and $g_t$ represents the sample label value at moment $t$.
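Equations (1)-(3) can be exercised with a small NumPy sketch. The GRU cell below is a generic minimal implementation with biases omitted, not the patent's exact parameterization, and all tensor sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Wr, Wh):
    """Minimal GRU cell; each weight acts on the concatenated input
    and hidden state (biases omitted for brevity)."""
    xh = np.concatenate([x, h])
    z = sigmoid(Wz @ xh)                                   # update gate
    r = sigmoid(Wr @ xh)                                   # reset gate
    return (1 - z) * h + z * np.tanh(Wh @ np.concatenate([x, r * h]))

def decode_step(A_t, F, e_prev, h_prev, Wz, Wr, Wh, W_out, b_out):
    """One decoding step:
    c_t = sum over x,y of a_{t,x,y} F_{x,y}   (1)
    h_t = GRU((e_{t-1}, c_t), h_{t-1})        (3)
    y_t = W h_t + b                           (2)
    A_t: (H, W) attention map; F: (H, W, C) feature map."""
    c_t = np.einsum('xy,xyc->c', A_t, F)   # context vector from attention
    h_t = gru_cell(np.concatenate([e_prev, c_t]), h_prev, Wz, Wr, Wh)
    return W_out @ h_t + b_out, h_t

rng = np.random.default_rng(0)
A_t = rng.random((2, 3))               # attention map, H=2, W=3
F = rng.normal(size=(2, 3, 4))         # feature map, C=4 channels
y_t, h_t = decode_step(A_t, F,
                       e_prev=rng.normal(size=2),   # embedding of y_{t-1}
                       h_prev=np.zeros(5),          # hidden size 5
                       Wz=rng.normal(size=(5, 11)), # 11 = 2 + 4 + 5
                       Wr=rng.normal(size=(5, 11)),
                       Wh=rng.normal(size=(5, 11)),
                       W_out=rng.normal(size=(6, 5)),  # 6 character classes
                       b_out=np.zeros(6))
```

Note that the attention map $A_t$ arrives precomputed from the alignment module; the decoder only consumes it, which is the decoupling.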
A system for text recognition based on a decoupled attention mechanism,
comprising a feature encoding module, a convolutional alignment module and a text
decoding module.
Feature encoding module extracts visual features from text images based on deep
convolutional neural networks.
The convolutional alignment module extracts multiscale visual features from the
feature encoding module and generates attention maps channel-by-channel via deep
convolutional neural networks.
The text decoding module obtains the final prediction result by combining the feature map and attention map through the gated recurrent unit.
Preferably, the network structure of the deep convolutional neural network unit is
an input layer unit, a convolutional layer unit, and a residual layer unit.
Preferably, the residual layer unit is divided into a first convolutional layer unit, a
first batch normalization layer unit, a first nonlinear layer unit, a second convolutional
layer unit, a second batch normalization layer unit, a downsampling layer unit, and a
second nonlinear layer unit.
Preferably, the nonlinear layer units within the residual layer unit all use the ReLU activation function.
Preferably, the downsampling layer unit is implemented through the
convolutional layer unit and the batch normalization layer unit.
Technical effects of the present invention:
(1) The present invention decouples the conventional attention mechanism module. Compared with traditional attention mechanism techniques, the present invention does not require information fed back from the decoding stage to perform alignment, avoiding the accumulation and propagation of decoding errors and thus enabling higher recognition accuracy.
(2) The invention is simple to use, it can be easily embedded into other models,
and it is also flexible enough to freely convert between one-dimensional text and
two-dimensional text.
(3) A back-propagation algorithm is used to automatically adjust the
convolutional kernel parameters, resulting in a more robust filter that can adapt to a
variety of complex environments.
(4) Compared with the manual method, the present invention can automatically
complete the recognition of scene text and handwritten text, which can save
manpower and material resources.
(5) The present invention can provide more reliable alignment performance in
the attention mechanism by decoupling the attention algorithm, especially when faced
with long text, the present invention has more robust characteristics compared with
the traditional attention mechanism.
In order to illustrate more clearly the technical solutions in the embodiments of
the invention or in the prior art, the following is a brief description of the
accompanying figures that need to be used in the embodiments. Obviously, the figures
in the following description are only some embodiments of the present invention, and
other figures may be obtained from these figures for those of ordinary skill in the art
without creative labor.
Figure 1 is the structural block diagram of the deep convolutional network
recognition model of the present invention.
Figure 2 is a flow chart of the text recognition method based on the decoupled
attention mechanism of the present invention.
The technical solutions in the embodiments of the present invention will be
clearly and completely described below in conjunction with the accompanying figures
in the embodiments of the present invention. Obviously, the described embodiments
are only a part of the embodiments in the present invention, and not all of them. Based
on the embodiments in the present invention, all other embodiments obtained by a
person of ordinary skill in the art without making creative labor belong to the scope of
protection of the present invention.
Example 1: A text recognition system based on a decoupled attention mechanism,
as shown in Figure 1, comprising a feature encoding module, a convolutional
alignment module and a text decoding module.
Feature encoding module extracts visual features from text images based on deep convolutional neural networks.
The convolutional alignment module extracts multiscale visual features from the
feature encoding module and generates attention maps channel-by-channel via deep
convolutional neural networks.
The text decoding module obtains the final prediction result by combining the feature map and attention map through the gated recurrent unit.
As shown in Figure 2, the specific steps of the text recognition method based on
the decoupled attention mechanism are:
In the first step, the scene text image and/or handwritten text image is encoded
by feature extraction through the feature encoding module to form a feature map.
Scene text image features include the scene text training data set and the scene text real evaluation data set, which cover many different font styles, light and shadow variations and resolution variations.
Handwritten text image features include the handwritten text real training data set and the handwritten text real evaluation data set, which contain different writing styles.
The scene text image training data have a complete text portion occupying more than two-thirds of the image, contain a variety of different font styles, and allow for some degree of light and shadow variation and resolution variation.
The real evaluation data set of scene text is obtained from camera equipment such as cell phones and special hardware. During shooting, the text in the normalized scene text image should occupy more than two-thirds of the image area; a certain degree of skew and blur is allowed, and the captured scene text images should cover a wide range of application scenarios with different font styles.
The real training data of handwritten text and the real evaluation data of handwritten text are written and collected by different people; the training data and evaluation data are independent of each other.
In the second step, the scene text images and/or handwritten text images are
convolutionally aligned through the convolutional alignment module, the structure of
which is shown in Table 1.
Stretching and transforming the image data of the scene text training data set and
the scene text real evaluation data set to a uniform size.
The handwritten text real training data set and the handwritten text real evaluation data set are downscaled while keeping the original image scale, and the surrounding area is then filled to the uniform size.
Table 1

Network layer | Specific operation | Down/up sampling ratio
---|---|---
Convolutional layer | [Convolution kernel 3*3, number of channels 64] * 5 | 2*1
Deconvolution layer (upsampling) | [Convolution kernel 3*3, number of channels 64] * 4 | 2*1
Deconvolution layer (upsampling) | Convolution kernel 3*3, number of channels maxT | 2*1
Nonlinear layer | |
The deep convolutional neural network was constructed as shown in Table 2 and trained. The construction is as follows: visual features are extracted from the scene text images and/or handwritten text images with a convolutional neural network; multi-scale visual features from the feature encoding module are taken as input and processed by convolution and deconvolution through a fully convolutional neural network, where each output feature is summed with the corresponding feature map from the convolution phase. The convolution process downsamples and the deconvolution process upsamples; all convolution and deconvolution processes except the last are followed by a nonlinear layer using the ReLU function. The number of output channels of the last deconvolution layer is maxT, determined by the text type: 25 for scene text and 150 for handwritten text. The final nonlinear layer uses a Sigmoid function to keep the output attention map between 0 and 1. A back-propagation algorithm is used to train the deep neural network model, updating all parameters of the network model by calculating the transfer gradient from the last layer, layer by layer.
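The maxT-channel Sigmoid head reduces to a one-liner; the map dimensions used below (8*32) are illustrative assumptions.

```python
import numpy as np

def attention_maps(head_out):
    """Apply the final Sigmoid so every attention map lies in (0, 1);
    head_out has shape (maxT, H, W), one channel per decoding step."""
    return 1.0 / (1.0 + np.exp(-head_out))

maps = attention_maps(np.zeros((25, 8, 32)))  # maxT = 25 for scene text
```

Each of the maxT channels is the alignment for one output character position, so a longer maximum text length simply means more output channels.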
Table 2

Network layer | Specific operations | Scene text, one-dimensional | Scene text, two-dimensional | Handwritten text
---|---|---|---|---
(columns give the downsampling ratio per network layer)
Input layer | | | |
Convolutional layer | Convolution kernel 3*3, number of channels 32 | 2*1 | 1*1 | 1*1
Residual layers | [Convolution kernel 1*1, channels 32; Convolution kernel 3*3, channels 32] * 3 | 2*2 | 2*2 | 2*2
Residual layers | [Convolution kernel 1*1, channels 64; Convolution kernel 3*3, channels 64] * 4 | 2*2 | 2*2 | 1*1
Residual layers | [Convolution kernel 1*1, channels 128; Convolution kernel 3*3, channels 128] * 6 | 2*1 | 2*1 | 2*2
Residual layers | [Convolution kernel 1*1, channels 256; Convolution kernel 3*3, channels 256] * 6 | 2*2 | 2*1 | 1*1
Residual layers | [Convolution kernel 1*1, channels 512; Convolution kernel 3*3, channels 512] * 3 | 2*2 | 2*1 | 1*1
Table 3

Network layer (residual layer) | Specific operations
---|---
Convolutional layer | Convolution kernel 1*1, step size 1*1
Batch normalization layer |
Nonlinear layer |
Convolutional layer | Convolution kernel 3*3, step size 1*1, padding 1*1
Batch normalization layer |
Downsampling layer | Convolution kernel 1*1 + batch normalization layer
Nonlinear layer |
As shown in Table 3, the residual layer is divided into a first convolutional layer, a first batch normalization layer, a first nonlinear layer, a second convolutional layer, a second batch normalization layer, a downsampling layer, and a second nonlinear layer.
ReLU activation function is used for all nonlinear layers within the residual
layer.
Downsampling layer implemented by convolutional and batch normalization
layers.
The deep neural network model is trained in a supervised manner: a generic deep network recognition model is trained using text image data and the corresponding annotation information.
The input images to the deep neural network model are handwritten text images
and/or scene text images, and the output is a sequence of characters in the text images
and/or scene text images.
The parameters of the deep neural network model training are set as follows:
The number of iterations of the deep neural network is 1,000,000.
The deep neural network optimizer is Adadelta.
The learning rate of the deep neural network is 1.0.
Deep neural network learning rate update strategy: reduced to one-tenth of the
original at 50% and 75% of the total number of iterations, respectively.
In the third step, text recognition is performed by the feature map and attention
map through the text recognition module, and the images are accurately recognized by
inputting the feature map and attention map, based on a deep network recognition
model with decoupled attention mechanism.
The specific methods for performing text recognition are:
$F_{x,y}$ represents the feature map and $a_{t,x,y}$ represents the attention map at moment $t$ obtained by convolutional alignment; the semantic vector $c_t$ is calculated by equation (1):

$c_t = \sum_{x=1}^{W} \sum_{y=1}^{H} a_{t,x,y} F_{x,y}$    (1)

where $W$ and $H$ are the width and height of the feature map. At moment $t$, the output $y_t$ is:

$y_t = W h_t + b$    (2)

where $W$ and $b$ are parameters and $h_t$ represents the hidden state of the gated recurrent unit at moment $t$.

The calculation of $h_t$ is expressed as:

$h_t = \mathrm{GRU}((e_{t-1}, c_t), h_{t-1})$    (3)

where $e_{t-1}$ represents the encoding vector of the previous output $y_{t-1}$; the final loss function is calculated as:

$\mathrm{Loss} = -\sum_{t=1}^{T} \log P(g_t \mid I, \theta)$    (4)

where $\theta$ represents all learnable parameters of the deep neural network model and $g_t$ represents the sample label value at moment $t$.
Given an input text image, the deep network recognition model based on the decoupled attention mechanism accurately recognizes the image and obtains the words in the text image.
The above described embodiments are only a description of the preferred way of
the present invention, not a limitation on the scope of the present invention. Without departing from the spirit of the design of the present invention, all kinds of deformations and improvements made to the technical solutions of the present invention by a person of ordinary skill in the art shall belong to the scope of protection determined in the claims of the present invention.
Claims (10)
1. A text recognition method based on a decoupled attention mechanism,
characterized in that it comprises the following steps:
S1, extracting image features based on the text image and encoding to obtain a
feature map;
S2, aligning the feature map to obtain a target image, constructing a deep
convolutional neural network model, processing the target image to obtain an
attention map based on the deep convolutional neural network model and conducting
training;
S3, accurate text recognition of the feature map and the attention map based on
the deep convolutional neural network recognition model.
2. A text recognition method based on a decoupled attention mechanism
according to claim 1, characterized in that:
the text image is a scene text image and/or a handwritten text image;
the scene text image and/or handwritten text image is characterized in:
scene text image features including scene text training data set and scene text
real evaluation data set, scene text training data set and scene text real evaluation data
set covering many different font styles, light and shadow variations and resolution
variations;
handwritten text image features including handwritten text real training data set
and handwritten text real evaluation data set, handwritten text real training data set
and handwritten text real evaluation data set containing different writing styles.
3. A text recognition method based on a decoupled attention mechanism
according to claim 2, characterized in that:
the text portion of the scene text training data set is complete and occupies more
than two-thirds of the image area, contains a variety of different font styles, and
allows for coverage of light and shadow variations as well as resolution variations;
the scene text real evaluation data set is obtained by cell phones or special hardware camera equipment; during the shooting process, the text in the normalized scene text image should occupy more than two-thirds of the image area; skew and blur are allowed, and the captured scene text images should cover a wide range of application scenes with different font styles;
the real training data of the handwritten text and the real evaluation data of the handwritten text are written and collected by different people; the training data and the evaluation data are independent of each other.
4. A text recognition method based on a decoupled attention mechanism
according to claim 2, characterized in that:
the text image alignment processing method is:
stretching and transforming the scene text training data set and the scene text real
evaluation data set image data to a uniform size;
the handwritten text real training data set and the handwritten text real evaluation
data set are downscaled by keeping the original image scale, and then the surrounding
area is filled until the uniform size.
5. A text recognition method based on a decoupled attention mechanism according to claim 1, characterized in that: in S2, the deep convolutional neural network construction method is:
extracting multi-scale visual features based on feature encoding;
constructing the deep convolutional neural network model by convolution and deconvolution with a fully convolutional neural network;
in the deconvolution phase, each output feature is summed with the corresponding feature map from the convolution phase;
the convolution process downsamples, the deconvolution process upsamples, and all convolution and deconvolution processes except the last are followed by a nonlinear layer using the ReLU function;
preferably, the network structure of the deep convolutional neural network model is an input layer, a convolutional layer, and a residual layer;
preferably, the residual layer is divided into a first convolutional layer, a first batch normalization layer, a first nonlinear layer, a second convolutional layer, a second batch normalization layer, a downsampling layer, and a second nonlinear layer.
6. A text recognition method based on a decoupled attention mechanism
according to claim 1, characterized in that:
a back propagation algorithm is used in the training of the deep convolutional
neural network model in S2 to update all parameters of the network model by
calculating the transfer gradient from the last layer, layer by layer;
the deep convolutional neural network model is trained in a supervised manner: a generic deep network recognition model is trained using the text image data and the corresponding annotation information; the input image of the deep convolutional neural network model is a handwritten text image and/or a scene text image, and the output is a sequence of characters in the text image and/or the scene text image.
7. A text recognition method based on a decoupled attention mechanism
according to claim 6, characterized in that:
the parameters of the deep convolutional neural network model training are set as
follows:
the deep convolutional neural network iteration count of 1,000,000;
the deep convolutional neural network optimizer is Adadelta;
the learning rate of the deep convolutional neural network is 1.0;
the deep convolutional neural network learning rate update strategy: reduced to
one-tenth of the original at 50% and 75% of the total number of iterations,
respectively.
8. A text recognition method based on a decoupled attention mechanism
according to claim 1, characterized in that:
the specific method of S3 text recognition is:
F_{x,y} represents the feature map, α_{t,x,y} represents the attention map at
moment t obtained by convolutional alignment, and the semantic vector c_t is
calculated by equation (1):

c_t = Σ_{x=1}^{W} Σ_{y=1}^{H} α_{t,x,y} · F_{x,y}    (1)

where W and H are the width and height of the feature map; at moment t, the
output y_t is:

y_t = W·h_t + b    (2)

where W and b are parameters and h_t represents the hidden-layer state of the
gated recurrent unit at moment t; the calculation of h_t is expressed as:

h_t = GRU((e_{t-1}, c_t), h_{t-1})    (3)

where e_{t-1} represents the encoding vector of the previous output y_{t-1};
the final loss function is calculated as:

Loss = −Σ_{t=1}^{T} log P(g_t | I, θ)    (4)

where θ represents all learnable parameters of the deep neural network model
and g_t represents the sample label value at moment t.
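The decoding equations of claim 8 can be checked numerically. The NumPy sketch below uses illustrative array sizes; the hidden state h_t is a stand-in for the GRU state of equation (3), not a real recurrent computation:

```python
import numpy as np

rng = np.random.default_rng(0)
W_dim, H_dim, C = 4, 2, 8                 # feature-map width, height, channels
F = rng.normal(size=(W_dim, H_dim, C))    # feature map F_{x,y}
alpha = rng.random((W_dim, H_dim))
alpha /= alpha.sum()                      # attention map normalized over x, y

# eq. (1): semantic vector c_t = sum over x, y of alpha_{t,x,y} * F_{x,y}
c_t = np.einsum('whc,wh->c', F, alpha)

# eq. (2): y_t = W h_t + b, with a placeholder hidden state h_t
K = 5                                     # number of character classes
W_p = rng.normal(size=(K, C))
b = rng.normal(size=K)
h_t = c_t                                 # stand-in for the GRU hidden state
y_t = W_p @ h_t + b

# eq. (4): one term of the loss, -log P(g_t | I, theta), via a softmax
p = np.exp(y_t - y_t.max())
p /= p.sum()
g_t = 2                                   # illustrative ground-truth label
loss_t = -np.log(p[g_t])
```

The semantic vector collapses the spatial dimensions of the feature map into a single C-dimensional vector per decoding step, which is what decouples alignment from decoding.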
9. A system for text recognition based on a decoupled attention mechanism,
characterized in that it includes a feature encoding module, a convolutional alignment
module, and a text decoding module;
the feature encoding module extracts visual features from text images based on
deep convolutional neural networks;
the convolutional alignment module extracts multiscale visual features from the
feature encoding module and generates attention maps channel by channel via
deep convolutional neural networks;
the text decoding module obtains the final prediction result by combining the
feature map and the attention map through the gated recurrent unit.
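A minimal sketch of how the three claimed modules could be wired together, assuming PyTorch; every layer size, class name, and parameter below is an illustrative assumption, not part of the claim:

```python
import torch
import torch.nn as nn

class DecoupledAttentionRecognizer(nn.Module):
    """Toy wiring of the claimed feature encoding, convolutional
    alignment, and text decoding modules."""
    def __init__(self, channels=32, max_steps=8, num_classes=37):
        super().__init__()
        # feature encoding module: CNN visual features
        self.encoder = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        # convolutional alignment module: one attention map per decoding
        # step, generated channel by channel and normalized spatially
        self.align = nn.Conv2d(channels, max_steps, 1)
        # text decoding module: GRU cell plus output projection
        self.gru = nn.GRUCell(channels + num_classes, channels)
        self.fc = nn.Linear(channels, num_classes)
        self.num_classes = num_classes

    def forward(self, img):
        B = img.size(0)
        F = self.encoder(img)                        # (B, C, H, W)
        A = self.align(F).flatten(2).softmax(-1)     # (B, T, H*W)
        Ff = F.flatten(2)                            # (B, C, H*W)
        h = F.new_zeros(B, F.size(1))
        e_prev = F.new_zeros(B, self.num_classes)
        outputs = []
        for t in range(A.size(1)):
            c_t = (Ff * A[:, t:t + 1]).sum(-1)       # semantic vector, eq. (1)
            h = self.gru(torch.cat([e_prev, c_t], -1), h)  # eq. (3)
            y_t = self.fc(h)                         # output, eq. (2)
            e_prev = y_t.softmax(-1)
            outputs.append(y_t)
        return torch.stack(outputs, 1)               # (B, T, num_classes)
```

The alignment branch depends only on the feature map, so attention generation is decoupled from the decoder's history, which is the core of the claimed mechanism.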
10. A system for text recognition based on a decoupled attention mechanism of
claim 9, characterized in that:
the network structure of the deep convolutional neural network unit is an input
layer unit, a convolutional layer unit, and a residual layer unit;
the residual layer unit is divided into a first convolutional layer unit, a first batch
normalization layer unit, a first nonlinear layer unit, a second convolutional layer unit,
a second batch normalization layer unit, a downsampling layer unit, and a second
nonlinear layer unit;
the nonlinear layer units within the residual layer unit all use the ReLU
activation function;
the downsampling layer unit is implemented through the convolutional layer unit
and the batch normalization layer unit.
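The residual layer unit of claim 10 can be sketched in PyTorch as follows; channel counts, kernel sizes, and stride are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Sketch of the claimed residual layer unit: first conv -> first BN ->
    first ReLU -> second conv -> second BN, added to a downsampling branch
    (itself a conv plus BN, per the claim), then the second ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # downsampling layer unit implemented through a convolutional
        # layer unit and a batch normalization layer unit
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu2 = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu1(self.bn1(self.conv1(x)))))
        return self.relu2(out + self.down(x))
```

With a stride of 2 the downsampling branch halves the spatial resolution so the shortcut shape matches the main branch before the addition.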
Figure 1 (drawings, sheet 1 of 2)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021104479A AU2021104479A4 (en) | 2021-07-23 | 2021-07-23 | Text recognition method and system based on decoupled attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2021104479A4 true AU2021104479A4 (en) | 2021-08-26 |
Family
ID=77369696
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2021104479A Active AU2021104479A4 (en) | 2021-07-23 | 2021-07-23 | Text recognition method and system based on decoupled attention mechanism |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2021104479A4 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114170468A (en) * | 2022-02-14 | 2022-03-11 | 阿里巴巴达摩院(杭州)科技有限公司 | Text recognition method, storage medium and computer terminal |
CN114170468B (en) * | 2022-02-14 | 2022-05-31 | 阿里巴巴达摩院(杭州)科技有限公司 | Text recognition method, storage medium and computer terminal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111967470A (en) | Text recognition method and system based on decoupling attention mechanism | |
CN110135366B (en) | Shielded pedestrian re-identification method based on multi-scale generation countermeasure network | |
Gao et al. | MLNet: Multichannel feature fusion lozenge network for land segmentation | |
CN113158862B (en) | Multitasking-based lightweight real-time face detection method | |
CN112288011B (en) | Image matching method based on self-attention deep neural network | |
CN114187450A (en) | Remote sensing image semantic segmentation method based on deep learning | |
CN113780149A (en) | Method for efficiently extracting building target of remote sensing image based on attention mechanism | |
CN112150493A (en) | Semantic guidance-based screen area detection method in natural scene | |
CN109635726B (en) | Landslide identification method based on combination of symmetric deep network and multi-scale pooling | |
CN107423747A (en) | A kind of conspicuousness object detection method based on depth convolutional network | |
CN115082675B (en) | Transparent object image segmentation method and system | |
CN111062329B (en) | Unsupervised pedestrian re-identification method based on augmented network | |
CN112862690A (en) | Transformers-based low-resolution image super-resolution method and system | |
CN114724155A (en) | Scene text detection method, system and equipment based on deep convolutional neural network | |
CN110969089A (en) | Lightweight face recognition system and recognition method under noise environment | |
CN110633706B (en) | Semantic segmentation method based on pyramid network | |
AU2021104479A4 (en) | Text recognition method and system based on decoupled attention mechanism | |
Huan et al. | MAENet: multiple attention encoder–decoder network for farmland segmentation of remote sensing images | |
Cheng et al. | A survey on image semantic segmentation using deep learning techniques | |
CN117727046A (en) | Novel mountain torrent front-end instrument and meter reading automatic identification method and system | |
CN117726954A (en) | Sea-land segmentation method and system for remote sensing image | |
You et al. | Boundary-aware multi-scale learning perception for remote sensing image segmentation | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN113793267B (en) | Self-supervision single remote sensing image super-resolution method based on cross-dimension attention mechanism | |
CN113222016B (en) | Change detection method and device based on cross enhancement of high-level and low-level features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) |