CN112257716A - Scene character recognition method based on a scale-adaptive and direction attention network - Google Patents

Scene character recognition method based on a scale-adaptive and direction attention network

Info

Publication number
CN112257716A
CN112257716A (application number CN202011424315.0A)
Authority
CN
China
Prior art keywords
network
polar coordinate
feature
characteristic
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011424315.0A
Other languages
Chinese (zh)
Inventor
鲍虎军
李特
操晓春
代朋纹
张华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202011424315.0A
Publication of CN112257716A
Legal status: Pending

Classifications

    • G06V 20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Learning methods
    • G06V 30/10: Character recognition


Abstract

The invention relates to a scene character recognition method based on a scale-adaptive and direction attention network. The method maps the input picture into polar coordinate space to obtain a polar coordinate image and extracts a feature J of the polar coordinate image with a convolutional network; converts the feature expression of the picture in polar coordinate space into a high-order semantic feature F with a deep convolutional network; for the converted high-order semantic features, encodes the features of the regions most relevant to each character with a character receptive-field attention mechanism, obtaining a robust feature expression that is discretized into a feature sequence Q; captures the context among the elements of the feature sequence Q with a bidirectional long short-term memory network to obtain a feature sequence H; and feeds the feature sequence H into a decoding network to be parsed into a character string in correct semantic order. The method can effectively recognize scene characters in any semantic direction, encodes more effective feature expressions for characters of different scales, and significantly improves recognition performance.

Description

Scene character recognition method based on a scale-adaptive and direction attention network
Technical Field
The invention belongs to the technical field of computer vision and relates to a method for recognizing characters of any semantic direction in natural scene images, in particular to a scene character recognition method based on a scale-adaptive and direction attention network.
Background
With the development of information technology, images have become a popular information carrier and play an indispensable role in daily life. Characters in images are high-level visual elements that carry rich and precise semantic information and are very helpful for understanding scene content. Recognizing the character information in images therefore has wide application value, mainly reflected in four aspects. First, content-based image retrieval: character information in an image can effectively resolve ambiguity in the image content, and combining it with the scene content allows a deeper understanding of the image, so that more accurate images can be retrieved from key information. Second, human-computer interaction systems: when shopping or traveling, people often encounter billboards, posters, shop signs, menus and the like that contain textual information in different languages; capturing such images with a mobile device and recognizing the character elements in them brings convenience to daily life. Third, purifying cyberspace: many lawbreakers use images as carriers and embed vulgar or pornographic text in them for dissemination online; recognizing such harmful character information helps block its spread and protects the physical and mental health of minors. Fourth, intelligent transportation systems: in outdoor environments, accurately recognizing license plates and traffic signs contributes to the intelligent management of traffic.
Compared with conventional Optical Character Recognition (OCR), natural Scene Text Recognition (STR) poses many challenges, mainly in the following respects. First, OCR targets scanned documents, whose images are clear and whose backgrounds are simple, whereas STR targets natural scene images: camera shake, illumination or shooting angle easily cause blur, low resolution and text occlusion. Second, the characters processed by OCR are generally uniform in size, consistent in color and neatly arranged, whereas the characters targeted by STR vary in font, color and layout, which increases the difficulty of recognition.
Scene character recognition based on deep neural networks falls into two main categories: regular and irregular scene character recognition. Regular scene character recognition targets horizontal, front-facing characters, and its methods can be divided into three types: character-based, word-based and sequence-based. Character-based methods first detect character positions, then classify each character with a deep neural network, and finally aggregate the per-character results into the final result with heuristic algorithms and language rules. Word-based methods directly classify whole words with a deep neural network. Sequence-based methods first encode the input image into sequence features and then parse them into a text string with an attention-based sequence decoder or Connectionist Temporal Classification (CTC). Irregular scene character recognition targets characters that are multi-oriented, perspective-distorted, curved and so on. Its methods can be divided into three categories: rectification-based, two-dimensional-space-based and direction-feature-encoding-based. Rectification-based methods first rectify irregular characters into horizontal or nearly horizontal ones with a rectification network and then recognize them with a regular-text recognizer; the rectification and recognition networks are trained jointly end to end, the rectification network needs no supervision information, and its learning is driven by gradients fed back from the recognition network. Two-dimensional-space-based methods extract features of the input image with a fully convolutional network to preserve the spatial information of the characters, and then recognize them with a two-dimensional attention mechanism or by class segmentation at each position in the two-dimensional space. Direction-feature-encoding methods first map the input image into one-dimensional features in multiple directions, then learn a weight for each direction and each position within it, fuse all directional features into a more expressive feature with the learned weights, and finally parse the result with a one-dimensional attention decoder.
At present, scene character recognition mainly addresses characters with irregular geometric layout and ignores the arbitrariness of the semantic direction of the text; in practical applications, however, scene text of arbitrary semantic orientation often appears. In addition, because the characters in scene text vary in scale, existing methods do not consider precise feature encoding for individual characters. Scene character recognition at arbitrary scale and in arbitrary semantic direction is therefore a research hotspot for practical applications.
Disclosure of Invention
Aiming at scene characters of arbitrary semantic direction whose individual characters differ in scale, the invention provides a scene character recognition method based on a scale-adaptive and direction attention network. Since both the scale and the orientation of the text must be considered, the original image is mapped into polar coordinate space. To perceive the scale of each character accurately, several receptive fields of moderate size are combined by adaptive selection according to receptive-field theory.
The technical scheme of the invention is as follows:
A scene character recognition method based on a scale-adaptive and direction attention network comprises the following steps:
(1) mapping the input picture into polar coordinate space to obtain a polar coordinate image, and extracting a feature J of the polar coordinate image with a convolutional network;
(2) converting the feature expression of the picture in polar coordinate space into a high-order semantic feature F with a deep convolutional network;
(3) for the high-order semantic feature F obtained in step (2), encoding the features of the regions most relevant to each character with a character receptive-field attention mechanism, obtaining a robust feature expression and discretizing it into a feature sequence Q;
(4) capturing the context among the elements of the feature sequence Q with a bidirectional long short-term memory network to obtain a feature sequence H;
(5) feeding the feature sequence H into a decoding network to be parsed into a character string in correct semantic order.
Further, before step (1), the method further comprises converting the input picture: a color input picture of arbitrary size is converted into a grayscale picture of fixed size, expressed as H × W.
Further, the step (1) specifically comprises the following sub-steps:
(1.1) learning a polar-origin response map with a shallow small network, and then obtaining the polar coordinate origin by weighting the spatial positions with the response map; the shallow small network consists of three convolutional layers, with a rectification unit and a batch normalization layer following the convolutional layers;
(1.2) mapping the coordinate positions in polar coordinate space to positions in Cartesian space according to the transformation between Cartesian and polar coordinates; the value at each position in polar space is obtained by bilinear interpolation over the four positions adjacent to the corresponding Cartesian position, yielding the polar coordinate image;
(1.3) extracting the feature J of the polar coordinate image with a convolutional network; during convolution padding, the polar coordinate image is cyclically padded in the vertical direction, i.e., the topmost row is padded from the bottommost row and vice versa.
Further, the step (2) is specifically:
downsampling the feature J with a convolutional network, the vertical dimension to 1 and the horizontal dimension to L, to obtain the high-order semantic feature F; the feature dimension is 1 × L × D, where D is the number of feature channels.
Further, the step (3) specifically comprises the following sub-steps:
(3.1) feeding the high-order semantic feature F into one standard convolution and K−1 dilated convolutions with different dilation rates to obtain multi-scale features F_1, F_2, …, F_K, each of feature dimension 1 × L × D;
(3.2) concatenating the multi-scale features F_1, F_2, …, F_K to learn the association weight between each character region and the features at each scale;
(3.3) fusing the multi-scale features F_1, F_2, …, F_K with the learned weights and discretizing the result into the feature sequence Q = {q_1, q_2, …, q_L}, where the feature dimension of each q_j is D.
Further, in the step (4), the bidirectional long short-term memory network contains D neurons.
Further, the step (5) is specifically:
the decoding network is a recurrent neural network based on gated recurrent units; for each decoding time t, a multilayer perceptron learns the degree of association between the hidden state s_{t−1} of the gated-recurrent-unit network and the feature sequence H; the network then adaptively attends to the appropriate sequence features according to the learned associations; finally, each GRU unit outputs the distribution y_t over the C character classes at the current time, where C is the number of characters. During network learning, the cumulative probability over all times is maximized; during network inference, either a greedy algorithm selects the class with the largest response as the output at each time, or beam search keeps the β classes with the largest responses at each time for parsing at the next time, and inference ends when a sequence-end symbol is produced or the maximum preset time step T is exceeded.
The invention has the following beneficial effects:
1. The invention applies the polar coordinate transformation to sequential character recognition and can effectively perceive characters of any direction and any scale, thereby significantly improving the recognition effect.
2. The invention proposes a character receptive-field attention mechanism that encodes more relevant features for characters of different scales, thereby significantly improving the recognition effect; the mechanism is simple and effective and can easily be embedded into existing sequence recognition models (e.g., scene character recognition, handwriting recognition, speech recognition) to improve recognition performance.
In summary, the scene character recognition method based on a scale-adaptive and direction attention network provided by the invention can effectively recognize scene characters in any direction, and can learn better feature expressions for characters of different scales within the text, thereby improving overall recognition performance.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of a polar transformation process;
FIG. 3 is a diagram of a scene character recognition network structure in any semantic direction;
FIG. 4 is a schematic diagram of the character receptive-field attention mechanism.
Detailed Description
The technical solution in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The invention provides a scene character recognition method based on a scale-adaptive and direction attention network; as shown in FIG. 1, the steps are as follows:
(1) mapping the input picture into polar coordinate space to obtain a polar coordinate image, and extracting a feature J of the polar coordinate image with a convolutional network;
(2) converting the feature expression of the picture in polar coordinate space into a high-order semantic feature F with a deep convolutional network;
(3) for the high-order semantic feature F obtained in step (2), encoding the features of the regions most relevant to each character with a character receptive-field attention mechanism, obtaining a robust feature expression and discretizing it into a feature sequence Q;
(4) capturing the context among the elements of the feature sequence Q with a bidirectional long short-term memory network to obtain a feature sequence H;
(5) feeding the feature sequence H into a decoding network to be parsed into a character string in correct semantic order.
The network constructed by the method for recognizing scene characters of any semantic direction is shown in FIG. 3; it comprises a polar coordinate feature conversion module, a feature encoding module and a character sequence decoding module.
In an embodiment of the present invention, before step (1), the method further includes converting the input picture: a scene character image of any size and any semantic direction is converted into an H × W grayscale image I, where H and W denote the height and width of the grayscale image.
In an embodiment of the present invention, in step (1), the input image is converted into a feature expression in polar coordinate space by the polar coordinate feature conversion module, whose structure and flow are shown in FIG. 2. The specific steps are as follows:
(1.1) A shallow small network serves as the polar-origin prediction network and learns a polar-origin response map; the polar coordinate origin is then obtained by weighting the spatial positions with the response map. Specifically, this comprises the following sub-steps:
(1.1.1) The polar-origin response map is learned with a small network of four convolutional layers, the first three each followed by a linear rectification unit (ReLU) and batch normalization (BN).
(1.1.2) From the polar-origin response map O and the horizontal and vertical position-coordinate matrices E_x and E_y over O (both normalized to [−1, 1]), the polar origin (x_0, y_0) is obtained as follows:

x_0 = Σ_k (O ⊙ E_x)_k / Σ_k O_k    (1)

y_0 = Σ_k (O ⊙ E_y)_k / Σ_k O_k    (2)

where k is the position index in the response map O and ⊙ denotes element-wise multiplication of matrices.
The shallow small network is trained by weak supervision, driven only by the feedback from the character-string recognition.
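As an illustration of sub-steps (1.1.1) and (1.1.2), the following PyTorch sketch shows one way the origin-prediction network and Eqs. (1)-(2) could be realized; the layer widths and the softmax normalization of the response map are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class PolarOriginPredictor(nn.Module):
    """Sketch of the shallow origin-prediction network of steps (1.1.1)-(1.1.2).
    Layer widths and the softmax normalization of the response map are assumptions."""
    def __init__(self, channels=32):
        super().__init__()
        layers, in_c = [], 1
        for _ in range(3):                      # first three convs, each with ReLU + BN
            layers += [nn.Conv2d(in_c, channels, 3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.BatchNorm2d(channels)]
            in_c = channels
        layers.append(nn.Conv2d(channels, 1, 3, padding=1))   # fourth conv -> response map O
        self.net = nn.Sequential(*layers)

    def forward(self, img):                     # img: (B, 1, H, W) grayscale input
        O = self.net(img).squeeze(1)            # response map O, shape (B, H, W)
        O = torch.softmax(O.flatten(1), dim=1).view_as(O)    # responses sum to 1 (assumption)
        B, H, W = O.shape
        ys = torch.linspace(-1, 1, H, device=img.device)     # E_y, E_x normalized to [-1, 1]
        xs = torch.linspace(-1, 1, W, device=img.device)
        E_y, E_x = torch.meshgrid(ys, xs, indexing="ij")
        x0 = (O * E_x).flatten(1).sum(dim=1)    # Eq. (1): response-weighted average position
        y0 = (O * E_y).flatten(1).sum(dim=1)    # Eq. (2)
        return x0, y0                           # polar origin (x_0, y_0), each of shape (B,)
```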
(1.2) The coordinate positions in polar coordinate space are mapped to positions in Cartesian space according to the transformation between Cartesian and polar coordinates, as shown in FIG. 2; the value at each position in polar space is obtained by bilinear interpolation over the four positions adjacent to the corresponding Cartesian position, yielding the polar coordinate image. Specifically, this comprises the following sub-steps:
(1.2.1) Construct a polar image P of the same size as the input image I. The coordinates of the images I and P are normalized to [−1, 1], and the coordinate mapping between I and P is computed as follows:

x_i^s = x_0 + ρ_i^t ∙ cos(θ_i^t)    (3)

y_i^s = y_0 + ρ_i^t ∙ sin(θ_i^t)    (4)

ρ_i^t = γ ∙ (x_i^t + 1) / 2    (5)

θ_i^t = (y_i^t + 1) ∙ π    (6)

where (x_i^s, y_i^s) are the coordinates of the i-th position on the input image I, (x_i^t, y_i^t) are the coordinates of the i-th position on the polar image P, and ρ_i^t and θ_i^t are the polar radius and angle of the i-th position on the polar image P; γ denotes the maximum distance from the polar coordinate origin.
(1.2.2) With the obtained coordinate mapping (x_i^s, y_i^s), the value at each position (x_i^t, y_i^t) on the polar image P is computed by bilinear interpolation.
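A sketch of the polar warping of Eqs. (3)-(6), using F.grid_sample for the bilinear interpolation of sub-step (1.2.2); taking γ as the distance from the origin to the farthest image corner is an assumption.

```python
import math
import torch
import torch.nn.functional as F

def polar_warp(img, x0, y0):
    """Sketch of Eqs. (3)-(6): build the polar sampling grid around the predicted
    origin and fill each polar position by bilinear interpolation (grid_sample).
    Taking gamma as the distance to the farthest image corner is an assumption."""
    B, C, H, W = img.shape
    dev = img.device
    # target polar coordinates (x^t, y^t), normalized to [-1, 1]
    y_t, x_t = torch.meshgrid(torch.linspace(-1, 1, H, device=dev),
                              torch.linspace(-1, 1, W, device=dev), indexing="ij")
    x_t, y_t = x_t.expand(B, H, W), y_t.expand(B, H, W)
    x0v, y0v = x0.view(B, 1, 1), y0.view(B, 1, 1)
    corners = torch.tensor([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]], device=dev)
    gamma = torch.sqrt((corners[:, 0] - x0.view(B, 1)) ** 2 +
                       (corners[:, 1] - y0.view(B, 1)) ** 2).max(dim=1).values.view(B, 1, 1)
    rho = gamma * (x_t + 1) / 2        # Eq. (5): radius grows along the horizontal axis
    theta = (y_t + 1) * math.pi        # Eq. (6): angle spans [0, 2*pi] along the vertical axis
    x_s = x0v + rho * torch.cos(theta)          # Eq. (3): source x-coordinate on I
    y_s = y0v + rho * torch.sin(theta)          # Eq. (4): source y-coordinate on I
    grid = torch.stack([x_s, y_s], dim=-1)      # (B, H, W, 2) sampling grid in [-1, 1]
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)   # polar image P
```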
(1.3) The feature J of the polar coordinate image is extracted with a convolutional network. Specifically: the polar image P is cyclically padded in the vertical direction, i.e., the topmost row of P is padded from the bottommost row and, conversely, the bottommost row from the topmost row; then M 3 × 3 convolution kernels (each followed by a rectification unit and a batch normalization layer) are applied to obtain the feature expression J of the polar image in polar coordinate space.
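A minimal sketch of the cyclic vertical padding in step (1.3) follows; the channel count M and the single-layer structure are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class PolarConv(nn.Module):
    """Sketch of step (1.3): 3x3 convolution on the polar image with cyclic padding
    along the vertical (angular) axis, since theta = 0 and theta = 2*pi are adjacent.
    The channel count M and the single-layer structure are assumptions."""
    def __init__(self, in_c=1, M=64):
        super().__init__()
        self.conv = nn.Conv2d(in_c, M, kernel_size=3, padding=0)
        self.post = nn.Sequential(nn.ReLU(inplace=True), nn.BatchNorm2d(M))

    def forward(self, p):                             # p: (B, C, H, W) polar image
        p = F.pad(p, (0, 0, 1, 1), mode="circular")   # top row filled from bottom row and vice versa
        p = F.pad(p, (1, 1, 0, 0))                    # ordinary zero padding on the radial axis
        return self.post(self.conv(p))                # feature expression J
```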
In an embodiment of the present invention, the feature J is downsampled with a convolutional network composed of several convolutional and pooling layers, the vertical dimension to 1 and the horizontal dimension to L, to obtain the high-order semantic feature F; the feature dimension is 1 × L × D, where D is the number of feature channels.
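The patent fixes only the output shape 1 × L × D of this step (L = 23 and D = 256 in the experiments); the following is a plausible sketch in which every intermediate layer is an assumption.

```python
import torch.nn as nn

# Plausible sketch of the step (2) encoder; all intermediate layers are assumptions,
# only the output shape (B, D, 1, L) = (B, 256, 1, 23) is taken from the text.
encoder = nn.Sequential(
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.BatchNorm2d(128),
    nn.MaxPool2d(2),                     # halve both axes
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True), nn.BatchNorm2d(256),
    nn.MaxPool2d((2, 1)),                # shrink the vertical axis faster than the horizontal one
    nn.AdaptiveAvgPool2d((1, 23)),       # collapse to 1 x L with L = 23
)
# F = encoder(J) has shape (B, 256, 1, 23)
```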
In an embodiment of the present invention, the character receptive-field attention mechanism learns a more effective feature expression for characters of different scales and discretizes it into the feature sequence Q; its principle is sketched in FIG. 4, and it is implemented in the feature encoding module, whose structure and flow are shown in FIG. 3. The steps are as follows:
(3.1) From the high-order semantic feature expression F, a depth feature extractor, i.e., a 1 × 1 convolutional layer, generates the feature expression F_1, and K−1 3 × 3 dilated convolutional layers generate the multi-scale features F_2, F_3, …, F_K with dilation rates 2^1, 2^2, …, 2^(K−1), respectively; the feature dimension of each feature is 1 × L × D, so that each position in the feature maps is associated with a different region of the input image.
(3.2) F_1, F_2, …, F_K are concatenated and fed into two convolutional layers to learn the association weights W between each character region and the features at the different scales.
(3.3) The multi-scale features F_1, F_2, …, F_K are combined with the learned weights, so that the model learns a better feature expression for characters of different scales and discretely encodes them into the richer feature sequence Q = {q_1, q_2, …, q_L}, where the feature dimension of each q_j is D. The fusion is computed as follows:

q_j = Σ_{i=1}^{K} W_j^i ∙ F_j^i    (7)

where W_j^i is the association weight (a scalar) between the j-th position and the i-th scale feature, and F_j^i is the feature vector of the i-th scale feature at the j-th position, with feature dimension D.
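Putting sub-steps (3.1)-(3.3) together, a sketch of the CRFA module follows; the softmax normalization of the weights over the K scales and the width of the two weight-learning convolutions are assumptions.

```python
import torch
import torch.nn as nn

class CharacterReceptiveFieldAttention(nn.Module):
    """Sketch of CRFA, sub-steps (3.1)-(3.3): one standard convolution plus K-1
    dilated convolutions produce F_1..F_K, two convolutional layers predict a
    per-position weight for each scale, and Eq. (7) fuses the scales."""
    def __init__(self, D=256, K=4):
        super().__init__()
        branches = [nn.Conv2d(D, D, kernel_size=1)]      # F_1: 1x1 standard convolution
        for i in range(1, K):                            # F_2..F_K: 3x3 dilated convolutions
            d = 2 ** i                                   # dilation rates 2^1, ..., 2^(K-1)
            branches.append(nn.Conv2d(D, D, 3, padding=d, dilation=d))
        self.branches = nn.ModuleList(branches)
        self.weight_net = nn.Sequential(                 # two conv layers -> K weights per position
            nn.Conv2d(K * D, D, 1), nn.ReLU(inplace=True), nn.Conv2d(D, K, 1))

    def forward(self, feat):                             # feat: (B, D, 1, L)
        Fs = [b(feat) for b in self.branches]            # K tensors, each (B, D, 1, L)
        W = torch.softmax(self.weight_net(torch.cat(Fs, 1)), dim=1)   # (B, K, 1, L)
        stacked = torch.stack(Fs, dim=1)                 # (B, K, D, 1, L)
        Q = (W.unsqueeze(2) * stacked).sum(dim=1)        # Eq. (7): q_j = sum_i W_j^i F_j^i
        return Q.squeeze(2).permute(0, 2, 1)             # feature sequence Q, (B, L, D)
```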
In an embodiment of the invention, for the adaptively enhanced feature sequence Q = {q_1, q_2, …, q_L}, the dependencies between different positions are modeled with a bidirectional long short-term memory network containing D neurons, yielding the better sequence features H = {h_1, h_2, …, h_L}.
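A sketch of step (4) follows; since a bidirectional LSTM with D hidden units per direction outputs 2D channels, the linear projection back to D is an assumption made here.

```python
import torch
import torch.nn as nn

# Sketch of step (4): a bidirectional LSTM with D hidden units per direction;
# projecting its 2*D-channel output back to D is an assumption.
D = 256
bilstm = nn.LSTM(input_size=D, hidden_size=D, bidirectional=True, batch_first=True)
proj = nn.Linear(2 * D, D)

Q = torch.randn(8, 23, D)     # (batch, L, D) feature sequence from the CRFA module
H, _ = bilstm(Q)              # (batch, L, 2*D)
H = proj(H)                   # (batch, L, D) context-enhanced sequence features
```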
In an embodiment of the present invention, for scene text of any semantic direction, the character sequence decoding module (shown in FIG. 3) serves as the decoding network and generates a character string with correct semantic order and an accurate recognition result. The decoding network is a recurrent neural network in which the choice of recurrent unit is free; the example here uses the Gated Recurrent Unit (GRU). At each parsing time, a sequence attention mechanism automatically learns the alignment between the character string and the sequence features H. The specific steps are as follows:
(5.1) The association between the hidden state of the GRU and the sequence features is learned with a multilayer perceptron, computed as follows:

e_tj = W_e tanh(W_s s_{t−1} + W_h h_j + b)    (8)

α_tj = exp(e_tj) / Σ_j exp(e_tj)    (9)

where s_{t−1} is the hidden state of the GRU at time t−1 and h_j is the feature vector of the sequence features H at the j-th position; α_tj is the degree of association between the t-th parsing time and the j-th position in H; W_e, W_s, W_h and b are learnable parameters of the perceptron.
(5.2) The relevant features at the t-th parsing time are obtained by weighted combination, computed as follows:

c_t = Σ_j α_tj ∙ h_j    (10)
(5.3) The hidden state of the GRU at time t is updated as follows:

s_t = GRU(s_{t−1}, c_t, y_{t−1})    (11)

where y_{t−1} denotes the label y*_{t−1} at time t−1 during training, and the prediction ŷ_{t−1} at time t−1 during testing.
(5.4) The output probability distribution at each time t is obtained as follows:

y_t = softmax(V s_t)    (12)

where V is a learnable weight parameter.
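The following sketch assembles Eqs. (8)-(12) into a single decoding step; embedding the previous symbol y_{t−1}, its dimension, and the exact class count are assumptions.

```python
import torch
import torch.nn as nn

class AttentionGRUDecoder(nn.Module):
    """Sketch of one decoding step, Eqs. (8)-(12). Embedding the previous symbol
    y_{t-1} is an assumption; C = 38 assumes 36 characters plus start/end symbols."""
    def __init__(self, D=256, C=38, emb=64):
        super().__init__()
        self.W_s = nn.Linear(D, D, bias=False)
        self.W_h = nn.Linear(D, D, bias=True)      # the bias here plays the role of b in Eq. (8)
        self.W_e = nn.Linear(D, 1, bias=False)
        self.embed = nn.Embedding(C, emb)
        self.gru = nn.GRUCell(D + emb, D)          # input is [c_t; embed(y_{t-1})]
        self.V = nn.Linear(D, C)

    def step(self, s_prev, H, y_prev):             # s_prev: (B, D), H: (B, L, D), y_prev: (B,)
        # Eq. (8): e_tj = W_e tanh(W_s s_{t-1} + W_h h_j + b)
        e = self.W_e(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(H))).squeeze(-1)
        alpha = torch.softmax(e, dim=1)            # Eq. (9): attention weights alpha_tj
        c = (alpha.unsqueeze(-1) * H).sum(dim=1)   # Eq. (10): context vector c_t
        s = self.gru(torch.cat([c, self.embed(y_prev)], dim=1), s_prev)   # Eq. (11)
        return s, torch.log_softmax(self.V(s), dim=1)   # Eq. (12), log-probabilities over C classes
```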
During learning, the objective loss function is expressed as follows:

L(θ) = − Σ_{t=1}^{T} log p(y_t* | I; θ)    (13)

where y_t* is the label at time t, θ denotes all learnable parameters of the network, I is the input image, and p(∙) is the probability assigned to the label in the output distribution at time t. The whole network is trained end to end; only the images and the corresponding text strings need to be input, and no extra supervision information is required. During inference, the recognition result with the largest response at each time is selected as output, or beam search keeps the β classes with the largest responses for the next time.
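A sketch of training with the loss of Eq. (13) (teacher forcing) and of greedy inference for the decoder above; the START symbol index and the zero hidden-state initialization are assumptions, and early stopping at the end-of-sequence symbol is omitted for brevity.

```python
import torch

START = 0   # assumed index of the start symbol; EOS handling is omitted for brevity

def train_loss(decoder, H, targets):        # targets: (B, T) ground-truth labels y_t*
    """Sketch of Eq. (13) with teacher forcing: feed the true label y*_{t-1} back."""
    B, T = targets.shape
    s = H.new_zeros(B, 256)                 # zero initial hidden state (assumption, D = 256)
    y_prev = torch.full((B,), START, dtype=torch.long, device=H.device)
    loss = 0.0
    for t in range(T):
        s, log_probs = decoder.step(s, H, y_prev)
        loss = loss - log_probs.gather(1, targets[:, t:t + 1]).sum()  # -sum_t log p(y_t* | I; theta)
        y_prev = targets[:, t]              # teacher forcing
    return loss / B

def greedy_decode(decoder, H, T_max=100):
    """Greedy inference: take the class with the largest response at every time."""
    B = H.size(0)
    s = H.new_zeros(B, 256)
    y_prev = torch.full((B,), START, dtype=torch.long, device=H.device)
    outputs = []
    for _ in range(T_max):                  # a full decoder would also stop at the EOS symbol
        s, log_probs = decoder.step(s, H, y_prev)
        y_prev = log_probs.argmax(dim=1)
        outputs.append(y_prev)
    return torch.stack(outputs, dim=1)      # (B, T_max) predicted label sequence
```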
The method of the present invention will be further described with reference to the following specific examples.
The test environment and experimental results of the scene character recognition method based on a scale-adaptive and direction attention network provided by the invention are as follows:
(1) and (3) testing environment:
the system environment is as follows: ubuntu 16.04;
hardware environment: memory: 128GB, GPU: NVIDIA GTX 1080Ti, CPU 1.70 GHz Intel (R) Xeon (R) E5-2609, hard disk: 4 TB;
(2) experimental data:
the model constructed by the method of the invention is trained on a synthetic data set Synth90k (about nine million word pictures) and SynthText (about four million word pictures). The invention evaluated on five data sets, respectively IIIT5K (3000 training pictures, 2000 test pictures); SVT (647 test pictures); ICDAR03 (1007 test pictures); ICDAR13 (1095 test pictures); ICDAR15 (2077 test pictures). The evaluation criterion utilized is case insensitive word accuracy. In the evaluation, in order to obtain different semantic direction characters, the original image is rotated by 0 degree, 90 degrees, 180 degrees and 270 degrees. The number of characters is 36, including 26 english letters +10 numbers.
(3) The optimization method comprises the following steps:
the ADAELTA optimization method was used, where the size H × W of the image was set to 100 × 100, L was set to 23, K was set to 4 in the convolutional network, i.e., 3 dilation convolutional layers were included, the number of eigen-channels D was set to 256, and T was set to 100. The size of the training mini-batch (minipatch) is set to 128.
(4) The experimental results are as follows:
1) ablation experiment:
the evaluation of the experiment was performed on the IIIT5K test set, and for fair comparison, training was performed only on the Synth90k dataset; in model inference, a greedy selection strategy is used for obtaining a recognition result, and a dictionary is not used for correcting a final prediction result. The Baseline-A firstly trains a semantic direction classification network, namely a 0-degree, 90-degree, 180-degree and 270-degree four-classification network; then using the popular horizontal character recognizer CRNN (B, Shi, X, Bai, and C, Yao, "An end-to-end reliable network for image-based sequence and its application to scene text recognition"IEEE Trans. Pattern Anal. Mach. Intell.Vol. 39, No. 11, pp. 2298 and 2304, 2017). Baseline-B refers to performing 0-degree, 90-degree, 180-degree and 270-degree rotation on any input image; then, pictures in the four directions are identified, and finally, one prediction with the highest summation probability is selected from the four results to serve as a final result. AON (Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou, "AON: towardarbitraryoriented text recognition," in CVPR, 2018, pp. 5571-5579 ") refers to a popular multidirectional coding network that performs weighted combination of features by learning the weights of different positions in different directions. As shown in the ablation experiments in table 1 below, the effectiveness of the Polar Transformation (PT) mechanism proposed by the present invention for sequence recognition and based on word-level received Field Attention (CRFA) was found.
Table 1. Ablation experiment
2) And (3) comparing the performances:
when compared to other methods, the model was trained using the synthetic dataset Synth90k and synthttext. At model inference, the width β of the bundle search is set to 5, and the prediction is rectified using the largest dictionary set provided by the data set, i.e., the character string in the dictionary with the smallest edit distance from the prediction result is selected as the final result. If the data set does not provide a dictionary, then all of the truth values in the test set are placed in a set to form the dictionary. The performance is shown in table 2 below, which shows that the case-insensitive word accuracy is the average performance in four semantic directions (0 degrees, 90 degrees, 180 degrees and 270 degrees), and the results show the robustness and superiority of our method for semantic direction character recognition.
Table 2. Performance comparison
In the table:
The Tesseract-OCR method is described in "Tesseract-OCR v4.0," https://github.com/tesseract-ocr/tesseract/releases.
The GRCNN method is described in J. Wang and X. Hu, "Gated recurrent convolution neural network for OCR," in NeurIPS, 2017, pp. 334–343.
The ALE method is described in S. Fang, H. Xie, Z. Zha, N. Sun, J. Tan, and Y. Zhang, "Attention and language ensemble for scene text recognition with convolutional sequence modeling," in ACM-MM, 2018, pp. 248–256.
The ASTER method is described in B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, "ASTER: An attentional scene text recognizer with flexible rectification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 9, pp. 2035–2048, 2019.
The MORAN-v2 method is described in C. Luo, L. Jin, and Z. Sun, "MORAN: A multi-object rectified attention network for scene text recognition," Pattern Recognition, vol. 90, pp. 109–118, 2019.
The SAR method is described in H. Li, P. Wang, C. Shen, and G. Zhang, "Show, Attend and Read: A simple and strong baseline for irregular text recognition," in AAAI, 2019, pp. 8610–8617.
The above experiments make clear that both the polar coordinate transformation and the character receptive-field attention mechanism of the present invention are effective. Using both for scene character recognition in any semantic direction achieves good performance and robustness.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them; a person skilled in the art may modify or equivalently substitute the technical solutions of the present invention without departing from its scope, which shall be determined by the claims.

Claims (7)

1. A scene character recognition method based on a scale-adaptive and direction attention network, characterized by comprising the following steps:
(1) mapping the input picture into polar coordinate space to obtain a polar coordinate image, and extracting a feature J of the polar coordinate image with a convolutional network;
(2) converting the feature expression of the picture in polar coordinate space into a high-order semantic feature F with a deep convolutional network;
(3) for the high-order semantic feature F obtained in step (2), encoding the features of the regions most relevant to each character with a character receptive-field attention mechanism, obtaining a robust feature expression and discretizing it into a feature sequence Q;
(4) capturing the context among the elements of the feature sequence Q with a bidirectional long short-term memory network to obtain a feature sequence H;
(5) feeding the feature sequence H into a decoding network to be parsed into a character string in correct semantic order.
2. The scene character recognition method based on a scale-adaptive and direction attention network as claimed in claim 1, characterized by further comprising, before step (1), converting the input picture: a color input picture of arbitrary size is converted into a grayscale picture of fixed size, expressed as H × W.
3. The scene character recognition method based on a scale-adaptive and direction attention network as claimed in claim 1, characterized in that the step (1) specifically comprises the following sub-steps:
(1.1) learning a polar-origin response map with a shallow small network, and then obtaining the polar coordinate origin by weighting the spatial positions with the response map; the shallow small network consists of three convolutional layers, with a rectification unit and a batch normalization layer following the convolutional layers;
(1.2) mapping the coordinate positions in polar coordinate space to positions in Cartesian space according to the transformation between Cartesian and polar coordinates; the value at each position in polar space is obtained by bilinear interpolation over the four positions adjacent to the corresponding Cartesian position, yielding the polar coordinate image;
(1.3) extracting the feature J of the polar coordinate image with a convolutional network; during convolution padding, the polar coordinate image is cyclically padded in the vertical direction, i.e., the topmost row is padded from the bottommost row and vice versa.
4. The scene character recognition method based on a scale-adaptive and direction attention network as claimed in claim 1, characterized in that the step (2) is specifically:
downsampling the feature J with a convolutional network, the vertical dimension to 1 and the horizontal dimension to L, to obtain the high-order semantic feature F; the feature dimension is 1 × L × D, where D is the number of feature channels.
5. The scene character recognition method based on a scale-adaptive and direction attention network as claimed in claim 1, characterized in that the step (3) specifically comprises the following sub-steps:
(3.1) feeding the high-order semantic feature F into one standard convolution and K−1 dilated convolutions with different dilation rates to obtain multi-scale features F_1, F_2, …, F_K, each of feature dimension 1 × L × D;
(3.2) concatenating the multi-scale features F_1, F_2, …, F_K to learn the association weight between each character region and the features at each scale;
(3.3) fusing the multi-scale features F_1, F_2, …, F_K with the learned weights and discretizing the result into the feature sequence Q = {q_1, q_2, …, q_L}, where the feature dimension of each q_j is D.
6. The scene character recognition method based on a scale-adaptive and direction attention network as claimed in claim 1, characterized in that in the step (4), the bidirectional long short-term memory network contains D neurons.
7. The scene character recognition method based on a scale-adaptive and direction attention network as claimed in claim 1, characterized in that the step (5) is specifically:
the decoding network is a recurrent neural network based on gated recurrent units; for each decoding time t, a multilayer perceptron learns the degree of association between the hidden state s_{t−1} of the gated-recurrent-unit network and the feature sequence H; the network then adaptively attends to the appropriate sequence features according to the learned associations; finally, each gated recurrent unit outputs the distribution y_t over the C character classes at the current time, where C is the number of characters; during network learning, the cumulative probability over all times is maximized; during network inference, either a greedy algorithm selects the class with the largest response as output at each time, or beam search keeps the β classes with the largest responses at each time for parsing at the next time, until a sequence-end symbol is produced or the maximum preset time step T is exceeded.
CN202011424315.0A 2020-12-08 2020-12-08 Scene character recognition method based on a scale-adaptive and direction attention network Pending CN112257716A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011424315.0A CN112257716A (en) Scene character recognition method based on a scale-adaptive and direction attention network

Publications (1)

Publication Number Publication Date
CN112257716A 2021-01-22

Family

ID=74224958


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165697A (en) * 2018-10-12 2019-01-08 福州大学 Natural scene text detection method based on an attention-mechanism convolutional neural network
CN110097049A (en) * 2019-04-03 2019-08-06 中国科学院计算技术研究所 Natural scene text detection method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CARLOS et al.: "Polar Transformer Networks", arXiv *
FISHER et al.: "Multi-Scale Context Aggregation by Dilated Convolutions", arXiv *
RONGHUA et al.: "Multi-scale Adaptive Feature Fusion Network for Semantic Segmentation in Remote Sensing Images", Remote Sensing *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111975A (en) * 2021-05-12 2021-07-13 合肥工业大学 SAR image target classification method based on multi-kernel scale convolutional neural network
CN113762050A (en) * 2021-05-12 2021-12-07 腾讯云计算(北京)有限责任公司 Image data processing method, apparatus, device and medium
CN113762050B (en) * 2021-05-12 2024-05-24 腾讯云计算(北京)有限责任公司 Image data processing method, device, equipment and medium
CN113297986A (en) * 2021-05-27 2021-08-24 新东方教育科技集团有限公司 Handwritten character recognition method, device, medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210122)