CN115797952A - Handwritten English line recognition method and system based on deep learning

Handwritten English line recognition method and system based on deep learning

Info

Publication number
CN115797952A
Authority
CN
China
Prior art keywords: input end, module, handwritten english, branch, handwritten
Prior art date
Legal status
Granted
Application number
CN202310084850.3A
Other languages
Chinese (zh)
Other versions
CN115797952B (en)
Inventor
许信顺
初宛晴
马磊
陈义学
李溢欢
Current Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Original Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority date
Filing date
Publication date
Application filed by SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority to CN202310084850.3A
Publication of CN115797952A
Application granted
Publication of CN115797952B
Current legal status: Active

Landscapes

  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to the technical field of image processing, in particular to a handwritten English line recognition method and system based on deep learning. The method comprises: acquiring a handwritten English image to be recognized; preprocessing the handwritten English image to be recognized; and processing the preprocessed image with a trained handwritten English line recognition model to obtain the recognition result of the handwritten English line. The trained handwritten English line recognition model performs: feature extraction on the preprocessed image to obtain preliminary visual features; depth feature extraction on the preliminary visual features to obtain depth visual features; and decoding of the depth visual features to obtain the recognition result of the handwritten English lines. The invention achieves accurate recognition results for handwritten English lines.

Description

Handwritten English line recognition method and system based on deep learning
Technical Field
The invention relates to the technical field of image processing, in particular to a handwritten English line recognition method and system based on deep learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Text, as a visual encoding of language and a form of expression for transferring information between people, is self-evidently necessary to, and pervasive in, human daily life. The diversity and ever-increasing visual forms of text, coupled with the great variety of human handwriting styles, make handwritten text inherently complex.
In this situation, intelligent processing of text is becoming increasingly important. Automatically transcribing and storing text saves time, manpower and material resources, and facilitates subsequent processing and applications.
Text recognition is an important computer vision task that has arisen in response to this need. Currently, segmentation-free single-stage recognition methods are mostly adopted in text recognition: a text picture is regarded as a whole, and a sequence transcription scheme is sought that performs fine-grained alignment between the source sequence in the original picture and the output target sequence.
One main trend of such methods is to use an "encoder-decoder" architecture with an attention mechanism: the text picture is first mapped in its entirety by the encoder into a representation vector, and then transcribed by the decoder into a sequence of consecutive characters based on this representation. A weight matrix is learned by the neural network, whose values represent the importance of the corresponding context information to the prediction at the current time step, thereby realizing selective alignment between the encoded sequence and the decoded sequence. However, this architecture has several problems:
(1) Missing or redundant characters may exist in the text picture, which causes an accumulation of misalignments between the label sequence and the attention predictions and further misleads the training process; the model is therefore difficult to learn from scratch, and the effect is poor in long-text-line scenarios;
(2) The attention mechanism relies on a complex attention module, which results in additional network parameters and runtime as well as a large amount of storage requirements.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a handwritten English line recognition method and system based on deep learning.
In a first aspect, the present invention provides a handwritten English line recognition method based on deep learning, which enables fast and accurate recognition of handwritten English lines; the method includes:
acquiring a handwritten English image to be recognized;
preprocessing a handwritten English image to be recognized;
processing the preprocessed image by adopting a trained handwritten English line recognition model to obtain a recognition result of the handwritten English line;
wherein, the handwritten English line recognition model after training includes: extracting the features of the preprocessed image to obtain a primary visual feature; performing depth feature extraction on the preliminary visual features to obtain depth visual features; and decoding the depth visual features to obtain the recognition result of the handwritten English lines.
In a second aspect, the invention provides a handwritten English line recognition system based on deep learning, which enables fast and accurate recognition of handwritten English lines; the system comprises:
an acquisition module configured to: acquiring a handwritten English image to be recognized;
a pre-processing module configured to: preprocessing a handwritten English image to be recognized;
an identification module configured to: processing the preprocessed image by adopting a trained handwritten English line recognition model to obtain a recognition result of the handwritten English line;
wherein, the handwritten English line recognition model after training includes: extracting the features of the preprocessed image to obtain a primary visual feature; performing depth feature extraction on the preliminary visual features to obtain depth visual features; and decoding the depth visual features to obtain the recognition result of the handwritten English lines.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention provides a simple fully convolutional neural network architecture, which uses efficient separable convolutions in place of conventional regular convolutions and has only feedforward connections without recurrent connections, achieving high data and computational efficiency. The method can be trained on text images of variable size with line-level transcription labels of variable length, without preprocessing such as character segmentation and horizontal normalization;
(2) The invention provides an effective iterative estimation method for the target sequence and a novel gate unit; features are fully extracted by stacking such units while exploiting the advantages of the gate mechanism in feature processing and fusion, which helps the model obtain more accurate recognition results;
(3) The model provided by the invention adopts a plurality of feature transformation modes and can extract informative feature representations from the training data;
(4) The model provided by the invention regularizes the network with multiple normalization modes and Dropout, which accelerates model convergence and effectively alleviates overfitting;
(5) The model provided by the invention does not use a fully connected layer; its main computing block is the separable convolution, which achieves performance comparable to conventional regular convolution while greatly reducing the amount of parameters and computation, accelerating training and convergence of the model, and saving storage space;
(6) The invention provides a novel statistical loss function, which provides more supervision information for the model and assists CTC (Connectionist Temporal Classification) in jointly optimizing the objective function, thereby facilitating correct recognition.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention rather than limit it.
FIG. 1 is a flowchart of a method according to a first embodiment;
FIG. 2 is a diagram of an internal network structure of a trained handwritten English line recognition model according to the first embodiment;
FIG. 3 is a diagram of an internal structure of an encoding module according to the first embodiment;
FIG. 4 is a diagram of the internal structure of the stacked gate module according to the first embodiment;
FIG. 5 is a diagram illustrating an internal structure of a decoding module according to the first embodiment;
FIG. 6 is a diagram of the internal structure of the first depth separable convolutional network of the first embodiment;
FIG. 7 is a diagram of the internal structure of the first gate unit according to the first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
All data in the embodiments are obtained and used legally, on the basis of compliance with laws and regulations and with user consent.
A CTC (Connectionist Temporal Classification)-based architecture computes the probability distribution over all possible output sequences by considering all possible alignment paths of each input frame in the prediction process, and then greedily finds the most probable output sequence from them. As a result, the CTC alignment is more accurate, training converges faster, and the method is suitable for one-dimensional sequence prediction tasks without additional processing. Considering that the application scenario of this embodiment is directed to long text lines, this embodiment adopts CTC as the decoder of the model.
At present, many deep learning-based techniques have been proposed to solve the text recognition task, but methods based on the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) have long dominated and become a general processing architecture. Nevertheless, this structure has a drawback: the sequential processing nature of RNNs introduces some delay into model prediction, so RNNs are not a good choice in some cases. In summary, this embodiment uses only a fully convolutional CNN architecture as the feature extractor of the model.
According to the above two paragraphs, the embodiment uses the CNN + CTC architecture, but unlike the existing work, the embodiment uses a novel CNN architecture, which can process sequences of any length, does not require preprocessing such as character segmentation and horizontal standardization, and can achieve advanced performance on handwritten english line pictures.
Example one
The embodiment provides a handwritten English line recognition method based on deep learning;
as shown in fig. 1, the method for recognizing handwritten english lines based on deep learning includes:
s101: acquiring a handwritten English image to be recognized;
s102: preprocessing a handwritten English image to be recognized;
s103: processing the preprocessed image by adopting a trained handwritten English line recognition model to obtain a recognition result of the handwritten English line;
wherein, the handwritten English line recognition model after training includes:
extracting the features of the preprocessed image to obtain a primary visual feature;
performing depth feature extraction on the preliminary visual features to obtain depth visual features;
and decoding the depth visual features to obtain the recognition result of the handwritten English lines.
Further, in S101, the handwritten English image to be recognized is acquired by photographing the handwritten English text with a camera.
Further, the S102: preprocessing a handwritten English image to be recognized, which specifically comprises the following steps:
s102-1: carrying out size normalization processing on the handwritten English image to be recognized;
s102-2: and carrying out gray processing on the handwritten English image after the size normalization processing.
It should be understood that, since English lines differ in length, the length and width of each image differ. In order to enable batch processing of the text images, this embodiment first unifies and normalizes the lengths and widths of all text images. Meanwhile, the handwritten text image is a three-channel color image in which the color of each pixel is determined by the three components R, G and B; given the particularity of handwritten text images, the values of the three components are equal at each pixel of a scanned image, so the image is converted into grayscale form, where each pixel has only one component. This reduces the subsequent image computation without affecting the overall effect.
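As a minimal illustration of this preprocessing step (a sketch only: the target size of 64×1024 and the use of OpenCV are assumptions for illustration, not values fixed by this embodiment), the image can be resized and converted to grayscale as follows:

```python
import cv2
import numpy as np

def preprocess_line_image(path, target_h=64, target_w=1024):
    """Resize a handwritten English line image to a fixed size and convert it to grayscale."""
    img = cv2.imread(path, cv2.IMREAD_COLOR)        # H x W x 3 (B, G, R)
    img = cv2.resize(img, (target_w, target_h))     # unify length and width for batch processing
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)    # one component per pixel
    gray = gray.astype(np.float32) / 255.0          # scale to [0, 1]
    return gray[None, :, :]                         # 1 x H x W, channel-first
```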
Further, as shown in fig. 2, the network structure of the trained handwritten English line recognition model includes:
the coding module, the stacked gate module and the decoding module, which are connected in sequence;
as shown in fig. 3, the encoding module includes: a first depth separable convolutional network (Depthwisediable convolutional network), a first layer standardized module LN (layer normalization), and a connector concat connected in sequence; the input end of the depth separable convolution network is connected with the input end of the connector concat in a residual error mode; the input end of the first depth separable convolutional network is used as the input end of the coding module; the output end of the connector concat is used as the output end of the coding module; and the coding module is used for extracting the features of the preprocessed image to obtain the preliminary visual features.
Further, as shown in fig. 4, the stacked gate module includes: a plurality of gate units connected in sequence and a second layer normalization module; the input end of the first gate unit is connected with the output end of the connector concat, and the output end of the last gate unit is connected with the input end of the second layer normalization module; the input end of the first gate unit serves as the input end of the stacked gate module, and the output end of the second layer normalization module serves as the output end of the stacked gate module; the stacked gate module is used for performing depth feature extraction on the preliminary visual features to obtain the depth visual features.
Further, as shown in fig. 5, the decoding module includes:
a second depthwise separable convolutional network, an exponential linear unit (ELU), a third layer normalization module and a decoder, which are connected in sequence; the input end of the second depthwise separable convolutional network serves as the input end of the decoding module and is connected with the output end of the second layer normalization module; the output end of the decoder serves as the output end of the decoding module; and the decoding module is used for decoding the depth visual features to obtain the recognition result of the handwritten English lines.
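The greedy collapse that a CTC decoder performs at inference time can be sketched as follows (the blank index 0 and the per-line logits shape are assumptions for illustration):

```python
import torch

def ctc_greedy_decode(logits, blank=0):
    """Greedy CTC decoding: take the best class per time step, collapse repeats, drop blanks.

    logits: tensor of shape (T, num_classes) with per-time-step class scores for one line.
    Returns the list of character indices of the recognized text line.
    """
    best = logits.argmax(dim=-1).tolist()    # most probable class at each time step
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:     # collapse repeated predictions and remove the blank
            out.append(idx)
        prev = idx
    return out
```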
Further, the internal network structures of the first and second depthwise separable convolutional networks are identical.
Further, as shown in fig. 6, the network structure of the first deep separable convolutional network includes:
a first channel-wise Convolution layer (Depth-wise Convolution) and a first Point-wise Convolution layer (Point-wise Convolution) connected in series.
Further, the internal structures of all the gate units are identical.
Further, as shown in fig. 7, the internal structure of the first gate unit includes:
the input end of the first gate unit is used for inputting the characteristic diagram output by the coding module; except the first gate unit, the input ends of the other gate units are used for inputting the characteristic diagram output by the previous gate unit;
the input end of the first gate unit is connected with three parallel branches, and the three parallel branches are a first branch, a second branch and a third branch in sequence;
the first branch, comprising: a second channel-wise Convolution layer (Depth-wise Convolution), a second Point-wise Convolution layer (Point-wise Convolution), an Exponential Linear Unit (ELU), and a first multiplier, which are connected in sequence; the input end of the second channel-by-channel convolution layer is connected with the input end of the first gate unit;
The second branch includes a sigmoid activation function layer, the input end of which is connected with the input end of the first gate unit;
the third branch, comprising: the input end of the second multiplier is connected with the input end of the first gate unit;
the input end of the first multiplier is connected with the output end of the sigmoid activation function layer; the output end of the first multiplier is connected with the input end of the first adder; outputting a weighted value a by an output layer of the sigmoid activation function layer;
processing the weighted value to obtain 1-a; inputting 1-a to an input of a second multiplier;
the output end of the second multiplier is connected with the input end of the first adder;
the output end of the first adder is connected with the input end of the second adder;
the input end of the second adder is also connected with the input end of the first gate unit;
the output of the second adder serves as the output of the first gate unit.
Further, the first gate unit includes:
a first branch for performing a nonlinear transformation on the input, yielding the transformed feature; the second channel-by-channel convolution layer, the second point-by-point convolution layer and the exponential linear unit can be regarded as a combination that together forms the transformation H, producing the transformed feature H(x), which serves as the input of the first multiplier; here the transformed feature H(x) refers to the feature obtained after the input feature x passes through the nonlinear transformation H of the first branch, and the "original feature" refers to the input feature x;
a third branch for retaining the input x, i.e. the original feature, as the input of the second multiplier;
a second branch defining two gate signals that sum to one, for modeling the relationship between the input x and its transformation H(x) and exploring their degree of matching. The activation function layer learns the weight a, i.e. the conversion gate T; it is used as the weight value of the transformed feature obtained by the first branch, represents the importance of the transformed feature to the output feature, and controls the degree to which the transformed feature information is carried into the output feature information. Then a is subtracted from 1 to obtain the weight 1-a, i.e. the retention gate C; it is used as the weight value of the original feature obtained by the third branch, represents the importance of the original feature to the output feature, and controls the degree to which the input feature information is carried into the output feature information. The larger the weight value, the greater the effect of the corresponding feature on the output of the gate unit.
The output of each gate unit is the output feature y, which is passed to the next gate unit as its input.
Furthermore, the first multiplier and the second multiplier are both used for multiplying elements at corresponding positions of two matrices; the first multiplier multiplies the transformed feature of the first branch by the conversion gate of the second branch to realize weighting, and the second multiplier multiplies the original feature of the third branch by the retention gate of the second branch to realize weighting.
Furthermore, the first adder and the second adder are both used for adding elements at corresponding positions of two matrices; the first adder sums the weighted transformed feature of the first branch and the weighted original feature of the third branch to obtain a fused feature, and the second adder can be seen as a kind of residual connection, adding the original input to the output result as well.
It should be understood that the encoding module is used to extract the initial visual features of the picture. The coding module is based on a fully convolutional network and consists of depthwise convolution, point-wise convolution and inter-layer residual connections. The training process is regularized using batch normalization, layer normalization and Dropout, and the ELU is used as the activation function.
Generally, the more network parameters, the better the performance; however, more parameters lead to a large amount of computation, high memory requirements, often unsatisfactory latency, high training cost, and difficulty in efficient deployment to practical application scenarios. Therefore, this embodiment uses lightweight separable convolutions instead of conventional regular convolutions, making a trade-off between recognition accuracy and recognition latency and alleviating the above problems.
The depth separable convolution (separable convolution) is a decomposable convolution that can be decomposed into two atomic operations: channel-by-channel convolution and point-by-point convolution.
Channel-wise convolution (Depth-wise convolution), which is a convolution in the Depth dimension (Depth-wise). Different from the traditional convolution, the convolution kernel of the traditional convolution acts on all input channels, the deep convolution adopts different convolution kernels for each input channel to carry out operation, and the convolution kernels among the channels are not shared.
Point-wise convolution (Point-wise convolution), similar to conventional convolution, differs in that the height and width of the convolution kernel are 1x1, and the number of channels is equal to the number of channels of the input feature map.
In this way, the depthwise separable convolution first applies the channel-by-channel convolution to perform a spatial convolution on each input channel separately, and then applies the point-by-point convolution to combine the results across channels. The overall effect obtained is almost the same as that of a conventional convolution, but the amount of computation and the number of model parameters are greatly reduced.
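As an illustrative example of this saving (the channel sizes below are assumptions chosen for the arithmetic, not values fixed by this embodiment): for a 3×3 convolution with 64 input channels and 128 output channels, a conventional convolution needs 3·3·64·128 = 73,728 weights, while the depthwise separable version needs 3·3·64 = 576 weights for the channel-by-channel part plus 1·1·64·128 = 8,192 weights for the point-by-point part, i.e. 8,768 weights in total, roughly 8.4 times fewer.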
It should be appreciated that the stacked gate module uses separable convolutions instead of conventional convolutions and uses residual connections, which reduces the amount of parameters and computation, speeds up convergence, and reduces memory requirements. Meanwhile, strong features are extracted using multiple feature transformation functions; the network is regularized with both batch normalization and layer normalization, and Dropout is applied at the end of the encoding module and at the end of the stacked gate module, which effectively alleviates overfitting and accelerates training and convergence of the model.
The entire model of the present embodiment uses a variety of feature transfer functions. In particular, the encoding module uses an ELU activation function; the stack gate module uses ELU and Sigmoid activation functions; the decoding module uses a Softmax activation function.
The stacked gate module is a main calculation block of the model of the embodiment, and sends the visual features extracted by the encoder into a stacked structure, and a series of gate units perform deep feature processing to extract strong feature representation so as to complete the modeling task of the input sequence. This is an iterative estimation method performed on the target sequence and can be regarded as a feature information perception filter.
For a feedforward neural network, each layer can be considered as applying a transformation H to an input x, finally obtaining an output y. Thus, the network can be understood as modeling the relationship between the input x and its transformation H(x). To explore this degree of matching, the present embodiment designs a novel gate unit as the basic computing block of the stacked gate module.
Unlike the GRU, which establishes a temporal relationship between the historical step and the current step in order to learn the importance of the historical features and the current features to the output features, this embodiment establishes a relationship between the original layer and the transformed layer to learn the importance of the original features and the transformed features to the output features. The original feature is then combined with the transformed feature using two gates that sum to one, and the resulting fused feature is passed on to the next gate unit. This series of repeated operations can be viewed as a deep extraction of features, with different responses to different inputs.
For this reason, the present embodiment defines two kinds of gate signals: a retention gate and a conversion gate. The retention gate controls the degree to which the input feature information is carried into the output feature information, and the conversion gate controls the degree to which the transformed feature information is carried into the output feature information; the larger the gate value, the more of the corresponding feature information is carried in, and the greater its influence on the output.
The operation of each gate unit can be expressed as:

y = x + T ⊙ H(x) + C ⊙ x,  with T = a and C = 1 − a    (2.1)

wherein y is the output of the gate unit, x is its original input, C and T respectively represent the retention gate and the conversion gate, ⊙ denotes element-wise multiplication, and a is the adaptively learned weight between the input features and the transformed features. This embodiment adopts the combination of channel-by-channel convolution, point-by-point convolution and the ELU activation function as the nonlinear transformation H of the gate unit.
Just as a neural network consists of a plurality of neurons, the stacked gate module is made up of a plurality of gate units. The gate mechanism provided by this embodiment helps control the flow of information between layers, so that the model automatically learns the importance of the features before and after the transformation; it can filter out unimportant feature signals while strengthening important ones, and mine the best matching relationship among the feature signals.
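A minimal PyTorch sketch of one gate unit following the structure and formula (2.1) above; the channel count and kernel size are illustrative assumptions, and, as described, the sigmoid of the second branch is applied directly to the input feature map (a variant could insert a learned projection before the sigmoid):

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    """y = x + a * H(x) + (1 - a) * x, with H = channel-wise conv -> point-wise conv -> ELU and a = sigmoid(x)."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.transform = nn.Sequential(                                          # first branch: transformation H
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels),   # channel-by-channel conv
            nn.Conv2d(channels, channels, 1),                                    # point-by-point conv
            nn.ELU(),
        )
        self.gate = nn.Sigmoid()                                                 # second branch: conversion gate

    def forward(self, x):
        h = self.transform(x)           # transformed feature H(x)
        a = self.gate(x)                # conversion gate a; (1 - a) is the retention gate
        fused = a * h + (1.0 - a) * x   # first adder: weighted fusion of transformed and original features
        return x + fused                # second adder: residual connection to the original input
```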
It should be understood that the exponential linear unit ELU, used as the activation function, has the following advantages. On one hand, the ELU is zero-centered: it pushes the average activation of network units closer to zero, reduces the inter-layer bias accumulated as the network deepens, makes gradient propagation more stable, and accelerates convergence, playing a role similar to batch normalization but with lower computational complexity. On the other hand, for smaller inputs the ELU saturates to a small negative value determined by its parameter; this soft-saturation property reduces the influence of inactive units on the feature information in forward propagation, weakens their correlation with the next layer, and strengthens the feature importance of active units, making the dependency between units easier to model and interpret. As a result, the model is more robust to noise and the network can learn a more stable representation.
The ELU and its derivative are formulated as follows:

ELU(x) = x,             if x > 0
ELU(x) = α(exp(x) − 1), if x ≤ 0       (1.1)

ELU'(x) = 1,            if x > 0
ELU'(x) = ELU(x) + α,   if x ≤ 0       (1.2)

wherein the hyper-parameter α controls the value to which the ELU saturates for negative inputs, i.e. the saturation value of the negative part of the ELU. It can be seen that when the input is positive, the ELU mitigates gradient vanishing through an identity mapping with a constant gradient of 1; when the input is negative, the ELU saturates to a negative value, which reduces the variation of inactive units and the information they propagate to the next layer, confining it to fluctuations within a small region and thereby improving robustness to noise.
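For illustration only, the ELU with α = 1 can be evaluated with the corresponding library function (the sample inputs below are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
y = F.elu(x, alpha=1.0)   # negative inputs saturate toward -alpha; positive inputs pass through unchanged
```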
It will be appreciated that the decoder module performs a subsequent decoding process on the visual features obtained by the stacked gate module, using CTCs as the decoder for that module, outputting the final text sequence.
Further, the training process of the trained handwritten English line recognition model comprises the following steps:
s103-1: constructing a training set, wherein the training set is a handwritten English image of a known handwritten English line recognition result;
s103-2: inputting the training set into a handwritten English line recognition model, training the model, and stopping training when the total loss function value of the model does not decrease any more to obtain the trained handwritten English line recognition model.
Further, the total loss function of the model is a weighted sum of the loss function of the decoder and the statistical loss function.
Further, the total loss function of the model is:

L = λ₁ · L_ctc + λ₂ · L_sta    (4.1)

The total loss function is composed of two parts, wherein L_ctc denotes the CTC loss of the decoder, L_sta denotes the loss of the "statistical loss function", and λ₁ and λ₂ are the corresponding weights.
L_ctc is specifically formulated as follows:

p(π | x) = ∏_{t=1}^{T} p(π_t | x)           (3.1)

p(y | x) = Σ_{π ∈ B⁻¹(y)} p(π | x)          (3.2)

L_ctc = − log p(y | x)                      (3.3)

wherein p(π_t | x) denotes the probability of character π_t at time step t, obtained from the per-frame prediction results and the dictionary set; multiplying the probabilities of all time steps along a path gives the probability p(π | x) of the whole path π. B⁻¹(y) denotes the set of paths π that are transformed into the sequence y by the mapping B. Since multiple paths π can produce the same sequence, the probability of the final sequence y is obtained by summing the probabilities of all such paths π, giving the total probability p(y | x). The final objective function takes the negative log-likelihood of the conditional probability p(y | x) of the tag sequence.
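A hedged sketch of how this CTC objective can be computed with PyTorch's built-in nn.CTCLoss; the blank index 0, the (T, N, C) logit layout and the use of log-softmax follow the library convention and are assumptions for illustration:

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def compute_ctc_loss(logits, targets, input_lengths, target_lengths):
    """logits: (T, N, C) per-time-step class scores from the decoding module.
    targets: concatenated label indices; *_lengths: per-sample sequence lengths."""
    log_probs = logits.log_softmax(dim=-1)   # CTC expects log-probabilities
    return ctc_loss(log_probs, targets, input_lengths, target_lengths)
```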
"statistical loss function"
Figure SMS_52
The calculation process of (2) is shown as the following formula:
Figure SMS_53
(3.4)
Figure SMS_54
(3.5)
Figure SMS_55
(3.6)
Figure SMS_56
(3.7)
wherein the content of the first and second substances,
Figure SMS_64
and
Figure SMS_61
respectively represent tag sequences
Figure SMS_69
And a prediction sequence
Figure SMS_63
Statistical probability distribution of all character categories;
Figure SMS_68
which represents the input image, is,
Figure SMS_70
a training set is represented that is,
Figure SMS_73
a representation character dictionary;
Figure SMS_60
is shown as
Figure SMS_72
Character classification
Figure SMS_57
In the tag sequence
Figure SMS_71
The number of occurrences in (a);
Figure SMS_59
is shown as
Figure SMS_66
Character classification
Figure SMS_62
In predicting sequences
Figure SMS_67
By aggregating the predicted probability of each character along the time dimension, accumulating all time steps
Figure SMS_58
Is predicted with probability of
Figure SMS_65
And obtaining the product.
CTC loss function: in the prediction process, the probability distribution over all possible output sequences is computed by considering all possible alignment paths of each input frame; the output sequence with the highest probability for the input sequence is then found greedily; and finally the mapping function B is used to remove redundant characters and blanks, thereby mapping a path π to the final sequence y.
In the training stage, the whole model is trained end to end by minimizing the CTC loss between the model predictions and the text labels corresponding to the original pictures; in the inference stage, the final visual features obtained by the visual feature extraction module are transcribed and decoded, and the recognized text sequence is output.
"statistical-based loss function" (statistical-based loss), the data set of this embodiment has the following problems: many special characters and spaces exist in a text line, the number of the spaces marked manually and the number of the real spaces at the image pixel level are possibly inconsistent, so that the division among different words is unclear, the phenomenon that the word interval of a label is different from the predicted word interval is caused in a recognition result, the number of the characters of the label and the prediction result is inconsistent, and the sequence lengths of the label and the prediction result are inconsistent. Therefore, the embodiment provides statistical information based on the number of characters as additional supervision information for the model, and considers that the network learns the number of all characters in the text line, and by accurately predicting the number of characters of each class in the label, the generation of a prediction sequence can be better constrained, and more supervision information can be helpful for correct recognition.
Furthermore, because the alignment between the label characters and the model predictions may be unclear, it is challenging for CTC to directly and accurately estimate the conditional probability, and using CTC alone as loss supervision may introduce errors. Therefore, this embodiment adds an additional supervision mode to assist CTC in jointly supervising the model and jointly optimizing the objective function, thereby facilitating correct recognition.
For the above two points, the present embodiment proposes a novel "statistical loss function". Unlike CTC, which makes the network predict a probability distribution over all classes at each time step, the "statistical loss function" considers the number of occurrences of each character class: it learns the cumulative probability of each character class over all time steps, so that the network essentially predicts the number of occurrences of each class, without considering the character order information in the label sequence. For example, in the word "hello", the character "l" appears twice, so its cumulative prediction probability over all time steps should be exactly 2, and the corresponding two predictions should both be close to 1.
Therefore, two new distributions are formed by the prediction result and the label after considering the statistical information, and in order to measure the distance between the two new probability distributions, the cross entropy is used as a calculation method in the embodiment.
It can be seen that the statistical loss function provided by this embodiment involves only simple operations, introduces no additional parameters, has negligible computation and memory consumption, and can be implemented with only minor changes to the original model.
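One possible realization of this statistical loss is sketched below (a sketch only: the exact normalization of the count vectors into distributions is an assumption for illustration, cf. formulas (3.4)-(3.7)):

```python
import torch

def statistical_loss(logits, label_indices, num_classes, eps=1e-8):
    """Cross-entropy between the label's character-count distribution and the
    prediction's per-class probability accumulated over all time steps.

    logits: (T, num_classes) per-time-step scores for one line image.
    label_indices: 1-D integer tensor with the character indices of the line's text label.
    """
    probs = logits.softmax(dim=-1)                                    # per-frame prediction
    pred_counts = probs.sum(dim=0)                                    # accumulate each class over all time steps
    label_counts = torch.bincount(label_indices, minlength=num_classes).float()
    p_label = label_counts / (label_counts.sum() + eps)               # statistical distribution of the label
    p_pred = pred_counts / (pred_counts.sum() + eps)                  # statistical distribution of the prediction
    return -(p_label * torch.log(p_pred + eps)).sum()                 # cross-entropy between the two distributions

# Total objective, cf. formula (4.1): loss = lambda1 * L_ctc + lambda2 * statistical_loss(...)
```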
Further, the constructing of the training set specifically includes:
s103-11: carrying out size normalization processing on the handwritten English line image;
s103-12: and processing the label of the handwritten English line image.
Further, processing the label of the handwritten English line image specifically includes:
s103-121: constructing a character dictionary; the character dictionary includes: a one-to-one correspondence between characters and indices; the character, comprising: capital letters, lowercase letters, numbers, punctuation marks;
s103-122: constructing a statistical dictionary; the statistical dictionary includes: the number of all characters in the text label and the number of each type of characters; the statistical dictionary includes: one-to-one correspondence between characters and quantity;
s103-123: mapping the text labels according to the character dictionary, and establishing a corresponding relation between characters and indexes;
s103-124: and after S103-121 to S103-123, obtaining labels corresponding to all handwritten English line images.
It should be understood that a character dictionary is constructed: in the embodiment, a novel dictionary establishing mode is adopted, all characters appearing in a data set are counted and repeated characters are filtered, so that the model can identify English characters containing capital and small letters, numbers and punctuations, can identify Chinese characters and special symbols under a Chinese input method, and is more universal;
it should be understood that a statistical dictionary is constructed: the embodiment counts the number of all characters in the text label and the number of each type of characters, and establishes a statistical correspondence between the characters and the number;
it should be understood that the text labels are mapped according to a character dictionary, and a correspondence relationship is established between the characters and the indexes.
Example two
The embodiment provides a handwritten English line recognition system based on deep learning;
handwritten English line recognition system based on deep learning comprises:
an acquisition module configured to: acquiring a handwritten English image to be recognized;
a pre-processing module configured to: preprocessing a handwritten English image to be recognized;
an identification module configured to: processing the preprocessed image by adopting a trained handwritten English line recognition model to obtain a recognition result of the handwritten English line;
wherein, the handwritten English line recognition model after training includes:
extracting the characteristics of the preprocessed image to obtain a preliminary visual characteristic;
performing depth feature extraction on the preliminary visual features to obtain depth visual features;
and decoding the depth visual features to obtain the recognition result of the handwritten English lines.
It should be noted here that the acquiring module, the preprocessing module and the identifying module correspond to steps S101 to S103 in the first embodiment, and the modules are the same as the examples and application scenarios realized by the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The handwritten English line recognition method based on deep learning is characterized by comprising the following steps:
acquiring a handwritten English image to be recognized;
preprocessing a handwritten English image to be recognized;
processing the preprocessed image by adopting a trained handwritten English line recognition model to obtain a recognition result of the handwritten English line;
wherein, the handwritten English line recognition model after training includes: extracting the features of the preprocessed image to obtain a primary visual feature; performing depth feature extraction on the preliminary visual features to obtain depth visual features; and decoding the depth visual features to obtain the recognition result of the handwritten English lines.
2. The method as claimed in claim 1, wherein the network structure of the trained handwritten english line recognition model comprises:
the coding module, the stacking door module and the decoding module are connected in sequence;
the encoding module includes: the first depth separable convolution network, the first layer standardization module and the connector are connected in sequence; the input end of the first depth separable convolutional network is connected with the input end of the connector in a residual error mode; the input end of the first depth separable convolutional network is used as the input end of the coding module; the output end of the connector is used as the output end of the coding module; and the coding module is used for extracting the features of the preprocessed image to obtain the preliminary visual features.
3. The method for deep learning-based handwritten english line recognition according to claim 2, wherein the stacking gate module comprises: the input end of the first gate unit is connected with the output end of the connector, and the output end of the last gate unit is connected with the input end of the second-layer standardized module; the input end of the first gate unit is used as the input end of the stacked gate module, and the output end of the second-layer standardized module is used as the output end of the stacked gate module; the stacking door module is used for carrying out depth feature extraction on the preliminary visual features to obtain depth visual features.
4. The method for recognizing handwritten English lines based on deep learning of claim 3, wherein said decoding module comprises:
the second depth separable convolutional network, the exponential linear unit, the third layer of standardization module and the decoder are connected in sequence; the input end of the second depth separable convolutional network is used as the input end of the decoding module, the input end of the second depth separable convolutional network is connected with the output end of the second layer standardization module, the output end of the decoder is used as the output end of the decoding module, and the decoding module is used for decoding the depth visual features to obtain the recognition result of the handwritten English lines.
5. The method as claimed in claim 3, wherein the first gate unit has an internal structure comprising:
the input end of the first gate unit is used for inputting the characteristic diagram output by the coding module; except the first gate unit, the input ends of the other gate units are used for inputting the characteristic diagram output by the previous gate unit; the input end of the first gate unit is connected with three parallel branches, and the three parallel branches are a first branch, a second branch and a third branch in sequence;
the first branch, comprising: the second channel-by-channel convolution layer, the second point-by-point convolution layer, the exponential linear unit and the first multiplier are connected in sequence; the input end of the second channel-by-channel convolution layer is connected with the input end of the first gate unit;
the second branch circuit includes: the input end of the activation function layer is connected with the input end of the first gate unit;
the third branch, comprising: the input end of the second multiplier is connected with the input end of the first gate unit;
the input end of the first multiplier is connected with the output end of the activation function layer; the output end of the first multiplier is connected with the input end of the first adder; activating an output layer output weighted value a of the function layer; processing the weighted value to obtain 1-a; inputting 1-a to an input of a second multiplier; the output end of the second multiplier is connected with the input end of the first adder; the output end of the first adder is connected with the input end of the second adder; the input end of the second adder is also connected with the input end of the first gate unit; the output of the second adder serves as the output of the first gate unit.
6. The method as claimed in claim 5, wherein the first branch is used for performing a nonlinear transformation on the input, and obtaining the transformation characteristics; the second channel-by-channel convolution layer, the second point-by-point convolution layer and the exponential linear unit can be regarded as a combination, and jointly form conversion to obtain conversion characteristics which are used as the input of the first multiplier; wherein, the conversion characteristic refers to the characteristic obtained after the input characteristic is subjected to the nonlinear conversion of the first branch.
7. The method as claimed in claim 5, wherein the second branch defines two gate signals that sum to one, for modeling the relationship between the input x and its transformation H(x) and exploring their degree of matching; a weight a is learned through the activation function layer and used as the weight value of the conversion feature obtained by the first branch, representing the importance of the conversion feature to the output feature and controlling the degree to which the converted feature information is carried into the output feature information; then a is subtracted from 1 to obtain the weight 1-a, which is used as the weight value of the original feature obtained by the third branch, representing the importance of the original feature to the output feature and controlling the degree to which the input feature information is carried into the output feature information; the weight a learned through the activation function layer is regarded as the conversion gate, and the weight 1-a obtained as the difference between 1 and a is regarded as the retention gate.
8. The method for recognizing handwritten English lines based on deep learning of claim 7, wherein the first multiplier and the second multiplier are both used for multiplying elements at corresponding positions of two matrices; the first multiplier multiplies the conversion feature of the first branch by the conversion gate of the second branch to realize weighting, and the second multiplier multiplies the original feature of the third branch by the retention gate of the second branch to realize weighting;
the first adder and the second adder are both used for adding elements at corresponding positions of two matrices; the first adder sums the weighted conversion feature of the first branch and the weighted original feature of the third branch to obtain a fused feature, and the second adder is regarded as a kind of residual connection, adding the original input to the output result as well.
9. The method as claimed in claim 4, wherein the training process of the trained handwritten English line recognition model comprises:
constructing a training set, wherein the training set is a handwritten English image of a known handwritten English line recognition result;
inputting the training set into a handwritten English line recognition model, training the model, and stopping training when the total loss function value of the model does not decrease any more to obtain a trained handwritten English line recognition model;
the total loss function of the model is the weighted sum of the loss function of the decoder and the statistical loss function.
10. Handwritten English line recognition system based on deep learning is characterized by comprising:
an acquisition module configured to: acquiring a handwritten English image to be recognized;
a pre-processing module configured to: preprocessing a handwritten English image to be recognized;
an identification module configured to: processing the preprocessed image by adopting a trained handwritten English line recognition model to obtain a recognition result of the handwritten English line;
wherein, the handwritten English line recognition model after training includes: extracting the features of the preprocessed image to obtain a primary visual feature; performing depth feature extraction on the preliminary visual features to obtain depth visual features; and decoding the depth visual features to obtain the recognition result of the handwritten English lines.
CN202310084850.3A 2023-02-09 2023-02-09 Deep learning-based handwriting English line recognition method and system Active CN115797952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310084850.3A CN115797952B (en) 2023-02-09 2023-02-09 Deep learning-based handwriting English line recognition method and system


Publications (2)

Publication Number Publication Date
CN115797952A true CN115797952A (en) 2023-03-14
CN115797952B CN115797952B (en) 2023-05-05

Family

ID=85430576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310084850.3A Active CN115797952B (en) 2023-02-09 2023-02-09 Deep learning-based handwriting English line recognition method and system

Country Status (1)

Country Link
CN (1) CN115797952B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137349A1 (en) * 2016-11-14 2018-05-17 Kodak Alaris Inc. System and method of character recognition using fully convolutional neural networks
CN107704859A (en) * 2017-11-01 2018-02-16 哈尔滨工业大学深圳研究生院 A kind of character recognition method based on deep learning training framework
CN110738090A (en) * 2018-07-19 2020-01-31 塔塔咨询服务公司 System and method for end-to-end handwritten text recognition using neural networks
CN111652332A (en) * 2020-06-09 2020-09-11 山东大学 Deep learning handwritten Chinese character recognition method and system based on two classifications
CN112686345A (en) * 2020-12-31 2021-04-20 江南大学 Off-line English handwriting recognition method based on attention mechanism
CN113592045A (en) * 2021-09-30 2021-11-02 杭州一知智能科技有限公司 Model adaptive text recognition method and system from printed form to handwritten form

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KHA CONG NGUYEN et al.: "A Semantic Segmentation-based Method for Handwritten Japanese Text Recognition" *
李金广: "Analysis of handwritten digit recognition with TensorFlow" *
闵锋; 叶显一; 张彦铎: "Handwritten digit recognition method based on improved principal component analysis network" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824597A (en) * 2023-07-03 2023-09-29 金陵科技学院 Dynamic image segmentation and parallel learning hand-written identity card number and identity recognition method
CN116824597B (en) * 2023-07-03 2024-05-24 金陵科技学院 Dynamic image segmentation and parallel learning hand-written identity card number and identity recognition method

Also Published As

Publication number Publication date
CN115797952B (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN110442707B (en) Seq2 seq-based multi-label text classification method
WO2021135254A1 (en) License plate number recognition method and apparatus, electronic device, and storage medium
CN112733768B (en) Natural scene text recognition method and device based on bidirectional characteristic language model
WO2016197381A1 (en) Methods and apparatus for recognizing text in an image
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN111783767B (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN113673482B (en) Cell antinuclear antibody fluorescence recognition method and system based on dynamic label distribution
CN114037930A (en) Video action recognition method based on space-time enhanced network
CN115797952A (en) Handwritten English line recognition method and system based on deep learning
CN114255456A (en) Natural scene text detection method and system based on attention mechanism feature fusion and enhancement
Wu et al. STR transformer: a cross-domain transformer for scene text recognition
CN116954113B (en) Intelligent robot driving sensing intelligent control system and method thereof
CN115601744B (en) License plate detection method for vehicle body and license plate with similar colors
CN111242114A (en) Character recognition method and device
CN116595167A (en) Intent recognition method based on integrated knowledge distillation network
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN113592045B (en) Model adaptive text recognition method and system from printed form to handwritten form
CN115862015A (en) Training method and device of character recognition system, and character recognition method and device
CN115797642A (en) Self-adaptive image semantic segmentation algorithm based on consistency regularization and semi-supervision field
CN114529908A (en) Offline handwritten chemical reaction type image recognition technology
Chen et al. CaptchaGG: A linear graphical CAPTCHA recognition model based on CNN and RNN
Li et al. Channel attention convolutional recurrent neural network on street view symbol recognition
CN117456191B (en) Semantic segmentation method based on three-branch network structure under complex environment
CN113095335B (en) Image recognition method based on category consistency deep learning

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant