CN115797952A - Handwritten English line recognition method and system based on deep learning

Handwritten English line recognition method and system based on deep learning

Info

Publication number
CN115797952A
Authority
CN
China
Prior art keywords: input end, module, handwritten english, branch, handwritten
Prior art date
Legal status
Granted
Application number
CN202310084850.3A
Other languages
Chinese (zh)
Other versions
CN115797952B (en)
Inventor
许信顺
初宛晴
马磊
陈义学
李溢欢
Current Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Original Assignee
SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority date
Filing date
Publication date
Application filed by SHANDONG SHANDA OUMA SOFTWARE CO Ltd
Priority to CN202310084850.3A
Publication of CN115797952A
Application granted
Publication of CN115797952B
Current legal status: Active

Landscapes

  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to the technical field of image processing, in particular to a handwritten English line recognition method and system based on deep learning. The method comprises: acquiring a handwritten English image to be recognized; preprocessing the handwritten English image to be recognized; and processing the preprocessed image with a trained handwritten English line recognition model to obtain the recognition result of the handwritten English line. The trained handwritten English line recognition model performs: feature extraction on the preprocessed image to obtain preliminary visual features; depth feature extraction on the preliminary visual features to obtain depth visual features; and decoding of the depth visual features to obtain the recognition result of the handwritten English lines. The invention achieves accurate recognition results for handwritten English lines.

Description

Handwritten English line recognition method and system based on deep learning
Technical Field
The invention relates to the technical field of image processing, in particular to a handwritten English line recognition method and system based on deep learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Text, as a visual encoding of language and a form of expression for transferring information between people, is self-evidently necessary to, and pervasive in, human daily life. The diversity and ever-increasing visual forms of text, coupled with the great variety of human handwriting styles, make handwritten text inherently complex.
In this situation, intelligent processing of text is becoming increasingly important. Automatically transcribing and storing text saves time, manpower and material resources, and facilitates subsequent processing and applications.
Text recognition is an important computer vision task that has arisen in response to this need. Currently, segmentation-free single-stage recognition methods are mostly adopted in text recognition: a text picture is regarded as a whole, and a sequence transcription scheme is sought that performs fine-grained alignment between the source sequence in the original picture and the output target sequence.
One main trend of such methods is to use an "encoder-decoder" architecture with an attention mechanism: the text picture is first mapped in its entirety by the encoder into a representation vector, and then transcribed by the decoder into a sequence of consecutive characters based on this representation. A weight matrix is learned by the neural network, whose values represent the importance of the corresponding context information to the prediction at the current time step, thereby realizing selective alignment between the encoded sequence and the decoded sequence. However, this architecture has several problems:
(1) Missing or redundant characters may exist in the text picture, which causes an accumulation of misalignments between the label sequence and the attention predictions and further misleads the training process; the model is therefore difficult to learn from scratch, and the effect is poor in long-text-line scenarios;
(2) The attention mechanism relies on a complex attention module, which results in additional network parameters and runtime as well as a large amount of storage requirements.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a handwritten English line recognition method and system based on deep learning.
In a first aspect, the present invention provides a handwritten English line recognition method based on deep learning, which enables fast and accurate recognition of handwritten English lines; the method includes:
acquiring a handwritten English image to be recognized;
preprocessing a handwritten English image to be recognized;
processing the preprocessed image by adopting a trained handwritten English line recognition model to obtain a recognition result of the handwritten English line;
wherein, the handwritten English line recognition model after training includes: extracting the features of the preprocessed image to obtain a primary visual feature; performing depth feature extraction on the preliminary visual features to obtain depth visual features; and decoding the depth visual features to obtain the recognition result of the handwritten English lines.
In a second aspect, the invention provides a handwritten English line recognition system based on deep learning, which enables fast and accurate recognition of handwritten English lines; the system comprises:
an acquisition module configured to: acquiring a handwritten English image to be recognized;
a pre-processing module configured to: preprocessing a handwritten English image to be recognized;
an identification module configured to: processing the preprocessed image by adopting a trained handwritten English line recognition model to obtain a recognition result of the handwritten English line;
wherein, the handwritten English line recognition model after training includes: extracting the features of the preprocessed image to obtain a primary visual feature; performing depth feature extraction on the preliminary visual features to obtain depth visual features; and decoding the depth visual features to obtain the recognition result of the handwritten English lines.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention provides a simple fully convolutional neural network architecture, which uses efficient separable convolutions in place of conventional regular convolutions and has only feedforward connections without recurrent connections, achieving high data and computational efficiency. The method can be trained on text images of variable size with line-level transcription labels of variable length, without preprocessing such as character segmentation and horizontal normalization;
(2) The invention provides an effective iterative estimation method for the target sequence and a novel gate unit; features are fully extracted by stacking such units while exploiting the advantages of the gate mechanism in feature processing and fusion, which helps the model obtain more accurate recognition results;
(3) The model provided by the invention adopts a plurality of feature transformation modes and can extract informative feature representations from the training data;
(4) The model provided by the invention regularizes the network with multiple normalization modes and Dropout, which accelerates model convergence and effectively alleviates overfitting;
(5) The model provided by the invention does not use a fully connected layer; its main computing block is the separable convolution, which achieves performance comparable to conventional regular convolution while greatly reducing the amount of parameters and computation, accelerating training and convergence of the model, and saving storage space;
(6) The invention provides a novel statistical loss function, which provides more supervision information for the model and assists CTC (Connectionist Temporal Classification) in jointly optimizing the objective function, thereby facilitating correct recognition.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention rather than limit it.
FIG. 1 is a flowchart of a method according to a first embodiment;
FIG. 2 is a diagram of an internal network structure of a trained handwritten English line recognition model according to the first embodiment;
FIG. 3 is a diagram of an internal structure of an encoding module according to the first embodiment;
FIG. 4 is a diagram of the internal structure of the stacked gate module according to the first embodiment;
FIG. 5 is a diagram illustrating an internal structure of a decoding module according to the first embodiment;
FIG. 6 is a diagram of the internal structure of the first depth separable convolutional network of the first embodiment;
FIG. 7 is a diagram of the internal structure of the first gate unit according to the first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
All data in the embodiments are obtained and used legally, on the basis of compliance with laws and regulations and with user consent.
A CTC (Connectionist Temporal Classification)-based architecture computes the probability distribution over all possible output sequences by considering all possible alignment paths of each input frame in the prediction process, and then greedily finds the most probable output sequence from them. As a result, the CTC alignment is more accurate, training converges faster, and the method is suitable for one-dimensional sequence prediction tasks without additional processing. Considering that the application scenario of this embodiment is directed to long text lines, this embodiment adopts CTC as the decoder of the model.
At present, many deep learning-based techniques have been proposed to solve the text recognition task, but methods based on the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) have long dominated and become a general processing architecture. Nevertheless, this structure has a drawback: the sequential processing nature of RNNs introduces some delay into model prediction, so RNNs are not a good choice in some cases. In summary, this embodiment uses only a fully convolutional CNN architecture as the feature extractor of the model.
According to the above two paragraphs, the embodiment uses the CNN + CTC architecture, but unlike the existing work, the embodiment uses a novel CNN architecture, which can process sequences of any length, does not require preprocessing such as character segmentation and horizontal standardization, and can achieve advanced performance on handwritten english line pictures.
Example one
The embodiment provides a handwritten English line recognition method based on deep learning;
as shown in fig. 1, the method for recognizing handwritten english lines based on deep learning includes:
s101: acquiring a handwritten English image to be recognized;
s102: preprocessing a handwritten English image to be recognized;
s103: processing the preprocessed image by adopting a trained handwritten English line recognition model to obtain a recognition result of the handwritten English line;
wherein, the handwritten English line recognition model after training includes:
extracting the features of the preprocessed image to obtain a primary visual feature;
performing depth feature extraction on the preliminary visual features to obtain depth visual features;
and decoding the depth visual features to obtain the recognition result of the handwritten English lines.
Further, in S101, the handwritten English image to be recognized is acquired by photographing the handwritten English text with a camera.
Further, the S102: preprocessing a handwritten English image to be recognized, which specifically comprises the following steps:
s102-1: carrying out size normalization processing on the handwritten English image to be recognized;
s102-2: and carrying out gray processing on the handwritten English image after the size normalization processing.
It should be understood that, since English lines differ in length, the length and width of each image differ. In order to enable batch processing of the text images, this embodiment first unifies and normalizes the lengths and widths of all text images. Meanwhile, the handwritten text image is a three-channel color image in which the color of each pixel is determined by the three components R, G and B; given the particularity of handwritten text images, the values of the three components are equal at each pixel of a scanned image, so the image is converted into grayscale form, where each pixel has only one component. This reduces the subsequent image computation without affecting the overall effect.
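As a minimal illustration of this preprocessing step (a sketch only: the target size of 64×1024 and the use of OpenCV are assumptions for illustration, not values fixed by this embodiment), the image can be resized and converted to grayscale as follows:

```python
import cv2
import numpy as np

def preprocess_line_image(path, target_h=64, target_w=1024):
    """Resize a handwritten English line image to a fixed size and convert it to grayscale."""
    img = cv2.imread(path, cv2.IMREAD_COLOR)        # H x W x 3 (B, G, R)
    img = cv2.resize(img, (target_w, target_h))     # unify length and width for batch processing
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)    # one component per pixel
    gray = gray.astype(np.float32) / 255.0          # scale to [0, 1]
    return gray[None, :, :]                         # 1 x H x W, channel-first
```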
Further, as shown in fig. 2, the network structure of the trained handwritten English line recognition model includes:
the coding module, the stacked gate module and the decoding module, which are connected in sequence;
as shown in fig. 3, the encoding module includes: a first depth separable convolutional network (Depthwisediable convolutional network), a first layer standardized module LN (layer normalization), and a connector concat connected in sequence; the input end of the depth separable convolution network is connected with the input end of the connector concat in a residual error mode; the input end of the first depth separable convolutional network is used as the input end of the coding module; the output end of the connector concat is used as the output end of the coding module; and the coding module is used for extracting the features of the preprocessed image to obtain the preliminary visual features.
Further, as shown in fig. 4, the stacked gate module includes: a plurality of gate units connected in sequence and a second layer normalization module; the input end of the first gate unit is connected with the output end of the connector concat, and the output end of the last gate unit is connected with the input end of the second layer normalization module; the input end of the first gate unit serves as the input end of the stacked gate module, and the output end of the second layer normalization module serves as the output end of the stacked gate module; the stacked gate module is used for performing depth feature extraction on the preliminary visual features to obtain the depth visual features.
Further, as shown in fig. 5, the decoding module includes:
a second depthwise separable convolutional network, an exponential linear unit (ELU), a third layer normalization module and a decoder, which are connected in sequence; the input end of the second depthwise separable convolutional network serves as the input end of the decoding module and is connected with the output end of the second layer normalization module; the output end of the decoder serves as the output end of the decoding module; and the decoding module is used for decoding the depth visual features to obtain the recognition result of the handwritten English lines.
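The greedy collapse that a CTC decoder performs at inference time can be sketched as follows (the blank index 0 and the per-line logits shape are assumptions for illustration):

```python
import torch

def ctc_greedy_decode(logits, blank=0):
    """Greedy CTC decoding: take the best class per time step, collapse repeats, drop blanks.

    logits: tensor of shape (T, num_classes) with per-time-step class scores for one line.
    Returns the list of character indices of the recognized text line.
    """
    best = logits.argmax(dim=-1).tolist()    # most probable class at each time step
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:     # collapse repeated predictions and remove the blank
            out.append(idx)
        prev = idx
    return out
```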
Further, the internal network structures of the first and second depthwise separable convolutional networks are identical.
Further, as shown in fig. 6, the network structure of the first deep separable convolutional network includes:
a first channel-wise Convolution layer (Depth-wise Convolution) and a first Point-wise Convolution layer (Point-wise Convolution) connected in series.
Further, the internal structures of all the gate units are identical.
Further, as shown in fig. 7, the internal structure of the first gate unit includes:
the input end of the first gate unit is used for inputting the characteristic diagram output by the coding module; except the first gate unit, the input ends of the other gate units are used for inputting the characteristic diagram output by the previous gate unit;
the input end of the first gate unit is connected with three parallel branches, and the three parallel branches are a first branch, a second branch and a third branch in sequence;
the first branch, comprising: a second channel-wise Convolution layer (Depth-wise Convolution), a second Point-wise Convolution layer (Point-wise Convolution), an Exponential Linear Unit (ELU), and a first multiplier, which are connected in sequence; the input end of the second channel-by-channel convolution layer is connected with the input end of the first gate unit;
The second branch includes a sigmoid activation function layer, the input end of which is connected with the input end of the first gate unit;
the third branch, comprising: the input end of the second multiplier is connected with the input end of the first gate unit;
the input end of the first multiplier is connected with the output end of the sigmoid activation function layer; the output end of the first multiplier is connected with the input end of the first adder; outputting a weighted value a by an output layer of the sigmoid activation function layer;
processing the weighted value to obtain 1-a; inputting 1-a to an input of a second multiplier;
the output end of the second multiplier is connected with the input end of the first adder;
the output end of the first adder is connected with the input end of the second adder;
the input end of the second adder is also connected with the input end of the first gate unit;
the output of the second adder serves as the output of the first gate unit.
Further, the first gate unit includes:
a first branch for performing a nonlinear transformation on the input, yielding the transformed feature; the second channel-by-channel convolution layer, the second point-by-point convolution layer and the exponential linear unit can be regarded as a combination that together forms the transformation H, producing the transformed feature H(x), which serves as the input of the first multiplier; here the transformed feature H(x) refers to the feature obtained after the input feature x passes through the nonlinear transformation H of the first branch, and the "original feature" refers to the input feature x;
a third branch for retaining the input x, i.e. the original feature, as the input of the second multiplier;
a second branch defining two gate signals that sum to one, for modeling the relationship between the input x and its transformation H(x) and exploring their degree of matching. The activation function layer learns the weight a, i.e. the conversion gate T; it is used as the weight value of the transformed feature obtained by the first branch, represents the importance of the transformed feature to the output feature, and controls the degree to which the transformed feature information is carried into the output feature information. Then a is subtracted from 1 to obtain the weight 1-a, i.e. the retention gate C; it is used as the weight value of the original feature obtained by the third branch, represents the importance of the original feature to the output feature, and controls the degree to which the input feature information is carried into the output feature information. The larger the weight value, the greater the effect of the corresponding feature on the output of the gate unit.
The output of each gate unit is the output feature y, which is passed to the next gate unit as its input.
Furthermore, the first multiplier and the second multiplier are both used for multiplying elements at corresponding positions of two matrices; the first multiplier multiplies the transformed feature of the first branch by the conversion gate of the second branch to realize weighting, and the second multiplier multiplies the original feature of the third branch by the retention gate of the second branch to realize weighting.
Furthermore, the first adder and the second adder are both used for adding elements at corresponding positions of two matrices; the first adder sums the weighted transformed feature of the first branch and the weighted original feature of the third branch to obtain a fused feature, and the second adder can be seen as a kind of residual connection, adding the original input to the output result as well.
It should be understood that the encoding module is used to extract the initial visual features of the picture. The coding module is based on a fully convolutional network and consists of depthwise convolution, point-wise convolution and inter-layer residual connections. The training process is regularized using batch normalization, layer normalization and Dropout, and the ELU is used as the activation function.
Generally, the more network parameters, the better the performance; however, more parameters lead to a large amount of computation, high memory requirements, often unsatisfactory latency, high training cost, and difficulty in efficient deployment to practical application scenarios. Therefore, this embodiment uses lightweight separable convolutions instead of conventional regular convolutions, making a trade-off between recognition accuracy and recognition latency and alleviating the above problems.
The depth separable convolution (separable convolution) is a decomposable convolution that can be decomposed into two atomic operations: channel-by-channel convolution and point-by-point convolution.
Channel-wise convolution (Depth-wise convolution), which is a convolution in the Depth dimension (Depth-wise). Different from the traditional convolution, the convolution kernel of the traditional convolution acts on all input channels, the deep convolution adopts different convolution kernels for each input channel to carry out operation, and the convolution kernels among the channels are not shared.
Point-wise convolution (Point-wise convolution), similar to conventional convolution, differs in that the height and width of the convolution kernel are 1x1, and the number of channels is equal to the number of channels of the input feature map.
In this way, the depthwise separable convolution first applies the channel-by-channel convolution to perform a spatial convolution on each input channel separately, and then applies the point-by-point convolution to combine the results across channels. The overall effect obtained is almost the same as that of a conventional convolution, but the amount of computation and the number of model parameters are greatly reduced.
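As an illustrative example of this saving (the channel sizes below are assumptions chosen for the arithmetic, not values fixed by this embodiment): for a 3×3 convolution with 64 input channels and 128 output channels, a conventional convolution needs 3·3·64·128 = 73,728 weights, while the depthwise separable version needs 3·3·64 = 576 weights for the channel-by-channel part plus 1·1·64·128 = 8,192 weights for the point-by-point part, i.e. 8,768 weights in total, roughly 8.4 times fewer.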
It should be appreciated that the stacked gate module uses separable convolutions instead of conventional convolutions and uses residual connections, which reduces the amount of parameters and computation, speeds up convergence, and reduces memory requirements. Meanwhile, strong features are extracted using multiple feature transformation functions; the network is regularized with both batch normalization and layer normalization, and Dropout is applied at the end of the encoding module and at the end of the stacked gate module, which effectively alleviates overfitting and accelerates training and convergence of the model.
The entire model of the present embodiment uses a variety of feature transfer functions. In particular, the encoding module uses an ELU activation function; the stack gate module uses ELU and Sigmoid activation functions; the decoding module uses a Softmax activation function.
The stacked gate module is a main calculation block of the model of the embodiment, and sends the visual features extracted by the encoder into a stacked structure, and a series of gate units perform deep feature processing to extract strong feature representation so as to complete the modeling task of the input sequence. This is an iterative estimation method performed on the target sequence and can be regarded as a feature information perception filter.
For a feedforward neural network, each layer can be considered as applying a transformation H to an input x, finally obtaining an output y. Thus, the network can be understood as modeling the relationship between the input x and its transformation H(x). To explore this degree of matching, the present embodiment designs a novel gate unit as the basic computing block of the stacked gate module.
Unlike the GRU, which establishes a temporal relationship between the historical step and the current step in order to learn the importance of the historical features and the current features to the output features, this embodiment establishes a relationship between the original layer and the transformed layer to learn the importance of the original features and the transformed features to the output features. The original feature is then combined with the transformed feature using two gates that sum to one, and the resulting fused feature is passed on to the next gate unit. This series of repeated operations can be viewed as a deep extraction of features, with different responses to different inputs.
For this reason, the present embodiment defines two kinds of gate signals: a retention gate and a conversion gate. The retention gate controls the degree to which the input feature information is carried into the output feature information, and the conversion gate controls the degree to which the transformed feature information is carried into the output feature information; the larger the gate value, the more of the corresponding feature information is carried in, and the greater its influence on the output.
The operation of each gate unit can be expressed as:

y = x + T ⊙ H(x) + C ⊙ x,  with T = a and C = 1 − a    (2.1)

wherein y is the output of the gate unit, x is its original input, C and T respectively represent the retention gate and the conversion gate, ⊙ denotes element-wise multiplication, and a is the adaptively learned weight between the input features and the transformed features. This embodiment adopts the combination of channel-by-channel convolution, point-by-point convolution and the ELU activation function as the nonlinear transformation H of the gate unit.
Just as a neural network consists of a plurality of neurons, the stacked gate module is made up of a plurality of gate units. The gate mechanism provided by this embodiment helps control the flow of information between layers, so that the model automatically learns the importance of the features before and after the transformation; it can filter out unimportant feature signals while strengthening important ones, and mine the best matching relationship among the feature signals.
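A minimal PyTorch sketch of one gate unit following the structure and formula (2.1) above; the channel count and kernel size are illustrative assumptions, and, as described, the sigmoid of the second branch is applied directly to the input feature map (a variant could insert a learned projection before the sigmoid):

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    """y = x + a * H(x) + (1 - a) * x, with H = channel-wise conv -> point-wise conv -> ELU and a = sigmoid(x)."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.transform = nn.Sequential(                                          # first branch: transformation H
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels),   # channel-by-channel conv
            nn.Conv2d(channels, channels, 1),                                    # point-by-point conv
            nn.ELU(),
        )
        self.gate = nn.Sigmoid()                                                 # second branch: conversion gate

    def forward(self, x):
        h = self.transform(x)           # transformed feature H(x)
        a = self.gate(x)                # conversion gate a; (1 - a) is the retention gate
        fused = a * h + (1.0 - a) * x   # first adder: weighted fusion of transformed and original features
        return x + fused                # second adder: residual connection to the original input
```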
It should be understood that the exponential linear unit ELU, used as the activation function, has the following advantages. On one hand, the ELU is zero-centered: it pushes the average activation of network units closer to zero, reduces the inter-layer bias accumulated as the network deepens, makes gradient propagation more stable, and accelerates convergence, playing a role similar to batch normalization but with lower computational complexity. On the other hand, for smaller inputs the ELU saturates to a small negative value determined by its parameter; this soft-saturation property reduces the influence of inactive units on the feature information in forward propagation, weakens their correlation with the next layer, and strengthens the feature importance of active units, making the dependency between units easier to model and interpret. As a result, the model is more robust to noise and the network can learn a more stable representation.
The ELU and its derivative are formulated as follows:

ELU(x) = x,             if x > 0
ELU(x) = α(exp(x) − 1), if x ≤ 0       (1.1)

ELU'(x) = 1,            if x > 0
ELU'(x) = ELU(x) + α,   if x ≤ 0       (1.2)

wherein the hyper-parameter α controls the value to which the ELU saturates for negative inputs, i.e. the saturation value of the negative part of the ELU. It can be seen that when the input is positive, the ELU mitigates gradient vanishing through an identity mapping with a constant gradient of 1; when the input is negative, the ELU saturates to a negative value, which reduces the variation of inactive units and the information they propagate to the next layer, confining it to fluctuations within a small region and thereby improving robustness to noise.
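For illustration only, the ELU with α = 1 can be evaluated with the corresponding library function (the sample inputs below are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])
y = F.elu(x, alpha=1.0)   # negative inputs saturate toward -alpha; positive inputs pass through unchanged
```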
It will be appreciated that the decoder module performs a subsequent decoding process on the visual features obtained by the stacked gate module, using CTCs as the decoder for that module, outputting the final text sequence.
Further, the training process of the trained handwritten English line recognition model comprises the following steps:
s103-1: constructing a training set, wherein the training set is a handwritten English image of a known handwritten English line recognition result;
s103-2: inputting the training set into a handwritten English line recognition model, training the model, and stopping training when the total loss function value of the model does not decrease any more to obtain the trained handwritten English line recognition model.
Further, the total loss function of the model is a weighted sum of the loss function of the decoder and the statistical loss function.
Further, the total loss function of the model is:

L = λ₁ · L_ctc + λ₂ · L_sta    (4.1)

The total loss function is composed of two parts, wherein L_ctc denotes the CTC loss of the decoder, L_sta denotes the loss of the "statistical loss function", and λ₁ and λ₂ are the corresponding weights.
L_ctc is specifically formulated as follows:

p(π | x) = ∏_{t=1}^{T} p(π_t | x)           (3.1)

p(y | x) = Σ_{π ∈ B⁻¹(y)} p(π | x)          (3.2)

L_ctc = − log p(y | x)                      (3.3)

wherein p(π_t | x) denotes the probability of character π_t at time step t, obtained from the per-frame prediction results and the dictionary set; multiplying the probabilities of all time steps along a path gives the probability p(π | x) of the whole path π. B⁻¹(y) denotes the set of paths π that are transformed into the sequence y by the mapping B. Since multiple paths π can produce the same sequence, the probability of the final sequence y is obtained by summing the probabilities of all such paths π, giving the total probability p(y | x). The final objective function takes the negative log-likelihood of the conditional probability p(y | x) of the tag sequence.
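A hedged sketch of how this CTC objective can be computed with PyTorch's built-in nn.CTCLoss; the blank index 0, the (T, N, C) logit layout and the use of log-softmax follow the library convention and are assumptions for illustration:

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def compute_ctc_loss(logits, targets, input_lengths, target_lengths):
    """logits: (T, N, C) per-time-step class scores from the decoding module.
    targets: concatenated label indices; *_lengths: per-sample sequence lengths."""
    log_probs = logits.log_softmax(dim=-1)   # CTC expects log-probabilities
    return ctc_loss(log_probs, targets, input_lengths, target_lengths)
```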
"statistical loss function"
Figure SMS_52
The calculation process of (2) is shown as the following formula:
Figure SMS_53
(3.4)
Figure SMS_54
(3.5)
Figure SMS_55
(3.6)
Figure SMS_56
(3.7)
wherein the content of the first and second substances,
Figure SMS_64
and
Figure SMS_61
respectively represent tag sequences
Figure SMS_69
And a prediction sequence
Figure SMS_63
Statistical probability distribution of all character categories;
Figure SMS_68
which represents the input image, is,
Figure SMS_70
a training set is represented that is,
Figure SMS_73
a representation character dictionary;
Figure SMS_60
is shown as
Figure SMS_72
Character classification
Figure SMS_57
In the tag sequence
Figure SMS_71
The number of occurrences in (a);
Figure SMS_59
is shown as
Figure SMS_66
Character classification
Figure SMS_62
In predicting sequences
Figure SMS_67
By aggregating the predicted probability of each character along the time dimension, accumulating all time steps
Figure SMS_58
Is predicted with probability of
Figure SMS_65
And obtaining the product.
CTC loss function: in the prediction process, the probability distribution over all possible output sequences is computed by considering all possible alignment paths of each input frame; the output sequence with the highest probability for the input sequence is then found greedily; and finally the mapping function B is used to remove redundant characters and blanks, thereby mapping a path π to the final sequence y.
In the training stage, the whole model is trained end to end by minimizing the CTC loss between the model predictions and the text labels corresponding to the original pictures; in the inference stage, the final visual features obtained by the visual feature extraction module are transcribed and decoded, and the recognized text sequence is output.
"statistical-based loss function" (statistical-based loss), the data set of this embodiment has the following problems: many special characters and spaces exist in a text line, the number of the spaces marked manually and the number of the real spaces at the image pixel level are possibly inconsistent, so that the division among different words is unclear, the phenomenon that the word interval of a label is different from the predicted word interval is caused in a recognition result, the number of the characters of the label and the prediction result is inconsistent, and the sequence lengths of the label and the prediction result are inconsistent. Therefore, the embodiment provides statistical information based on the number of characters as additional supervision information for the model, and considers that the network learns the number of all characters in the text line, and by accurately predicting the number of characters of each class in the label, the generation of a prediction sequence can be better constrained, and more supervision information can be helpful for correct recognition.
Furthermore, because the alignment between the label characters and the model predictions may be unclear, it is challenging for CTC to directly and accurately estimate the conditional probability, and using CTC alone as loss supervision may introduce errors. Therefore, this embodiment adds an additional supervision mode to assist CTC in jointly supervising the model and jointly optimizing the objective function, thereby facilitating correct recognition.
For the above two points, the present embodiment proposes a novel "statistical loss function". Unlike CTC, which makes the network predict a probability distribution over all classes at each time step, the "statistical loss function" considers the number of occurrences of each character class: it learns the cumulative probability of each character class over all time steps, so that the network essentially predicts the number of occurrences of each class, without considering the character order information in the label sequence. For example, in the word "hello", the character "l" appears twice, so its cumulative prediction probability over all time steps should be exactly 2, and the corresponding two predictions should both be close to 1.
Therefore, two new distributions are formed by the prediction result and the label after considering the statistical information, and in order to measure the distance between the two new probability distributions, the cross entropy is used as a calculation method in the embodiment.
It can be seen that the statistical loss function provided by this embodiment involves only simple operations, introduces no additional parameters, has negligible computation and memory consumption, and can be implemented with only minor changes to the original model.
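One possible realization of this statistical loss is sketched below (a sketch only: the exact normalization of the count vectors into distributions is an assumption for illustration, cf. formulas (3.4)-(3.7)):

```python
import torch

def statistical_loss(logits, label_indices, num_classes, eps=1e-8):
    """Cross-entropy between the label's character-count distribution and the
    prediction's per-class probability accumulated over all time steps.

    logits: (T, num_classes) per-time-step scores for one line image.
    label_indices: 1-D integer tensor with the character indices of the line's text label.
    """
    probs = logits.softmax(dim=-1)                                    # per-frame prediction
    pred_counts = probs.sum(dim=0)                                    # accumulate each class over all time steps
    label_counts = torch.bincount(label_indices, minlength=num_classes).float()
    p_label = label_counts / (label_counts.sum() + eps)               # statistical distribution of the label
    p_pred = pred_counts / (pred_counts.sum() + eps)                  # statistical distribution of the prediction
    return -(p_label * torch.log(p_pred + eps)).sum()                 # cross-entropy between the two distributions

# Total objective, cf. formula (4.1): loss = lambda1 * L_ctc + lambda2 * statistical_loss(...)
```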
Further, the constructing of the training set specifically includes:
s103-11: carrying out size normalization processing on the handwritten English line image;
s103-12: and processing the label of the handwritten English line image.
Further, processing the label of the handwritten English line image specifically includes:
s103-121: constructing a character dictionary; the character dictionary includes: a one-to-one correspondence between characters and indices; the character, comprising: capital letters, lowercase letters, numbers, punctuation marks;
s103-122: constructing a statistical dictionary; the statistical dictionary includes: the number of all characters in the text label and the number of each type of characters; the statistical dictionary includes: one-to-one correspondence between characters and quantity;
s103-123: mapping the text labels according to the character dictionary, and establishing a corresponding relation between characters and indexes;
s103-124: and after S103-121 to S103-123, obtaining labels corresponding to all handwritten English line images.
It should be understood that a character dictionary is constructed: in the embodiment, a novel dictionary establishing mode is adopted, all characters appearing in a data set are counted and repeated characters are filtered, so that the model can identify English characters containing capital and small letters, numbers and punctuations, can identify Chinese characters and special symbols under a Chinese input method, and is more universal;
it should be understood that a statistical dictionary is constructed: the embodiment counts the number of all characters in the text label and the number of each type of characters, and establishes a statistical correspondence between the characters and the number;
it should be understood that the text labels are mapped according to a character dictionary, and a correspondence relationship is established between the characters and the indexes.
Example two
The embodiment provides a handwritten English line recognition system based on deep learning;
handwritten English line recognition system based on deep learning comprises:
an acquisition module configured to: acquiring a handwritten English image to be recognized;
a pre-processing module configured to: preprocessing a handwritten English image to be recognized;
an identification module configured to: processing the preprocessed image by adopting a trained handwritten English line recognition model to obtain a recognition result of the handwritten English line;
wherein, the handwritten English line recognition model after training includes:
extracting the characteristics of the preprocessed image to obtain a preliminary visual characteristic;
performing depth feature extraction on the preliminary visual features to obtain depth visual features;
and decoding the depth visual features to obtain the recognition result of the handwritten English lines.
It should be noted here that the acquiring module, the preprocessing module and the identifying module correspond to steps S101 to S103 in the first embodiment, and the modules are the same as the examples and application scenarios realized by the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The handwritten English line recognition method based on deep learning is characterized by comprising the following steps:
acquiring a handwritten English image to be recognized;
preprocessing a handwritten English image to be recognized;
processing the preprocessed image by adopting a trained handwritten English line recognition model to obtain a recognition result of the handwritten English line;
wherein, the handwritten English line recognition model after training includes: extracting the features of the preprocessed image to obtain a primary visual feature; performing depth feature extraction on the preliminary visual features to obtain depth visual features; and decoding the depth visual features to obtain the recognition result of the handwritten English lines.
2. The method as claimed in claim 1, wherein the network structure of the trained handwritten english line recognition model comprises:
the coding module, the stacking door module and the decoding module are connected in sequence;
the encoding module includes: the first depth separable convolution network, the first layer standardization module and the connector are connected in sequence; the input end of the first depth separable convolutional network is connected with the input end of the connector in a residual error mode; the input end of the first depth separable convolutional network is used as the input end of the coding module; the output end of the connector is used as the output end of the coding module; and the coding module is used for extracting the features of the preprocessed image to obtain the preliminary visual features.
3. The method for deep learning-based handwritten english line recognition according to claim 2, wherein the stacking gate module comprises: the input end of the first gate unit is connected with the output end of the connector, and the output end of the last gate unit is connected with the input end of the second-layer standardized module; the input end of the first gate unit is used as the input end of the stacked gate module, and the output end of the second-layer standardized module is used as the output end of the stacked gate module; the stacking door module is used for carrying out depth feature extraction on the preliminary visual features to obtain depth visual features.
4. The method for recognizing handwritten English lines based on deep learning of claim 3, wherein said decoding module comprises:
the second depth separable convolutional network, the exponential linear unit, the third layer of standardization module and the decoder are connected in sequence; the input end of the second depth separable convolutional network is used as the input end of the decoding module, the input end of the second depth separable convolutional network is connected with the output end of the second layer standardization module, the output end of the decoder is used as the output end of the decoding module, and the decoding module is used for decoding the depth visual features to obtain the recognition result of the handwritten English lines.
5. The method as claimed in claim 3, wherein the first gate unit has an internal structure comprising:
the input end of the first gate unit is used for inputting the characteristic diagram output by the coding module; except the first gate unit, the input ends of the other gate units are used for inputting the characteristic diagram output by the previous gate unit; the input end of the first gate unit is connected with three parallel branches, and the three parallel branches are a first branch, a second branch and a third branch in sequence;
the first branch, comprising: the second channel-by-channel convolution layer, the second point-by-point convolution layer, the exponential linear unit and the first multiplier are connected in sequence; the input end of the second channel-by-channel convolution layer is connected with the input end of the first gate unit;
the second branch circuit includes: the input end of the activation function layer is connected with the input end of the first gate unit;
the third branch, comprising: the input end of the second multiplier is connected with the input end of the first gate unit;
the input end of the first multiplier is connected with the output end of the activation function layer; the output end of the first multiplier is connected with the input end of the first adder; activating an output layer output weighted value a of the function layer; processing the weighted value to obtain 1-a; inputting 1-a to an input of a second multiplier; the output end of the second multiplier is connected with the input end of the first adder; the output end of the first adder is connected with the input end of the second adder; the input end of the second adder is also connected with the input end of the first gate unit; the output of the second adder serves as the output of the first gate unit.
6. The method as claimed in claim 5, wherein the first branch is used for performing a nonlinear transformation on the input, and obtaining the transformation characteristics; the second channel-by-channel convolution layer, the second point-by-point convolution layer and the exponential linear unit can be regarded as a combination, and jointly form conversion to obtain conversion characteristics which are used as the input of the first multiplier; wherein, the conversion characteristic refers to the characteristic obtained after the input characteristic is subjected to the nonlinear conversion of the first branch.
7. The method as claimed in claim 5, wherein the second branch defines two gate signals that sum to one, for modeling the relationship between the input x and its transformation H(x) and exploring their degree of matching; a weight a is learned through the activation function layer and used as the weight value of the conversion feature obtained by the first branch, representing the importance of the conversion feature to the output feature and controlling the degree to which the converted feature information is carried into the output feature information; then a is subtracted from 1 to obtain the weight 1-a, which is used as the weight value of the original feature obtained by the third branch, representing the importance of the original feature to the output feature and controlling the degree to which the input feature information is carried into the output feature information; the weight a learned through the activation function layer is regarded as the conversion gate, and the weight 1-a obtained as the difference between 1 and a is regarded as the retention gate.
8. The method for recognizing handwritten English lines based on deep learning of claim 7, wherein the first multiplier and the second multiplier are both used for multiplying elements at corresponding positions of two matrices; the first multiplier multiplies the conversion feature of the first branch by the conversion gate of the second branch to realize weighting, and the second multiplier multiplies the original feature of the third branch by the retention gate of the second branch to realize weighting;
the first adder and the second adder are both used for adding elements at corresponding positions of two matrices; the first adder sums the weighted conversion feature of the first branch and the weighted original feature of the third branch to obtain a fused feature, and the second adder is regarded as a kind of residual connection, adding the original input to the output result as well.
9. The method as claimed in claim 4, wherein the training process of the trained handwritten English line recognition model comprises:
constructing a training set, wherein the training set is a handwritten English image of a known handwritten English line recognition result;
inputting the training set into a handwritten English line recognition model, training the model, and stopping training when the total loss function value of the model does not decrease any more to obtain a trained handwritten English line recognition model;
the total loss function of the model is the weighted sum of the loss function of the decoder and the statistical loss function.
10. Handwritten English line recognition system based on deep learning is characterized by comprising:
an acquisition module configured to: acquiring a handwritten English image to be recognized;
a pre-processing module configured to: preprocessing a handwritten English image to be recognized;
an identification module configured to: processing the preprocessed image by adopting a trained handwritten English line recognition model to obtain a recognition result of the handwritten English line;
wherein, the handwritten English line recognition model after training includes: extracting the features of the preprocessed image to obtain a primary visual feature; performing depth feature extraction on the preliminary visual features to obtain depth visual features; and decoding the depth visual features to obtain the recognition result of the handwritten English lines.
CN202310084850.3A 2023-02-09 2023-02-09 Deep learning-based handwriting English line recognition method and system Active CN115797952B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310084850.3A CN115797952B (en) 2023-02-09 2023-02-09 Deep learning-based handwriting English line recognition method and system


Publications (2)

Publication Number Publication Date
CN115797952A true CN115797952A (en) 2023-03-14
CN115797952B CN115797952B (en) 2023-05-05

Family

ID=85430576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310084850.3A Active CN115797952B (en) 2023-02-09 2023-02-09 Deep learning-based handwriting English line recognition method and system

Country Status (1)

Country Link
CN (1) CN115797952B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137349A1 (en) * 2016-11-14 2018-05-17 Kodak Alaris Inc. System and method of character recognition using fully convolutional neural networks
CN107704859A (en) * 2017-11-01 2018-02-16 哈尔滨工业大学深圳研究生院 A kind of character recognition method based on deep learning training framework
CN110738090A (en) * 2018-07-19 2020-01-31 塔塔咨询服务公司 System and method for end-to-end handwritten text recognition using neural networks
CN111652332A (en) * 2020-06-09 2020-09-11 山东大学 Deep learning handwritten Chinese character recognition method and system based on two classifications
CN112686345A (en) * 2020-12-31 2021-04-20 江南大学 Off-line English handwriting recognition method based on attention mechanism
CN113592045A (en) * 2021-09-30 2021-11-02 杭州一知智能科技有限公司 Model adaptive text recognition method and system from printed form to handwritten form

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KHA CONG NGUYEN et al.: "A Semantic Segmentation-based Method for Handwritten Japanese Text Recognition" *
李金广: "Analysis of handwritten digit recognition with TensorFlow" *
闵锋; 叶显一; 张彦铎: "Handwritten digit recognition method based on improved principal component analysis network" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824597A (en) * 2023-07-03 2023-09-29 金陵科技学院 Dynamic image segmentation and parallel learning hand-written identity card number and identity recognition method
CN116824597B (en) * 2023-07-03 2024-05-24 金陵科技学院 Dynamic image segmentation and parallel learning hand-written identity card number and identity recognition method

Also Published As

Publication number Publication date
CN115797952B (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN110442707B (en) Seq2 seq-based multi-label text classification method
WO2021135254A1 (en) License plate number recognition method and apparatus, electronic device, and storage medium
CN112733768B (en) Natural scene text recognition method and device based on bidirectional characteristic language model
WO2016197381A1 (en) Methods and apparatus for recognizing text in an image
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN111783767B (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN113673482B (en) Cell antinuclear antibody fluorescence recognition method and system based on dynamic label distribution
CN114037930A (en) Video action recognition method based on space-time enhanced network
CN115797952A (en) Handwritten English line recognition method and system based on deep learning
CN114255456A (en) Natural scene text detection method and system based on attention mechanism feature fusion and enhancement
Wu et al. STR transformer: a cross-domain transformer for scene text recognition
CN116954113B (en) Intelligent robot driving sensing intelligent control system and method thereof
CN115601744B (en) License plate detection method for vehicle body and license plate with similar colors
CN111242114A (en) Character recognition method and device
CN116595167A (en) Intent recognition method based on integrated knowledge distillation network
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN113592045B (en) Model adaptive text recognition method and system from printed form to handwritten form
CN115862015A (en) Training method and device of character recognition system, and character recognition method and device
CN115797642A (en) Self-adaptive image semantic segmentation algorithm based on consistency regularization and semi-supervision field
CN114529908A (en) Offline handwritten chemical reaction type image recognition technology
Chen et al. CaptchaGG: A linear graphical CAPTCHA recognition model based on CNN and RNN
Li et al. Channel attention convolutional recurrent neural network on street view symbol recognition
CN117456191B (en) Semantic segmentation method based on three-branch network structure under complex environment
CN113095335B (en) Image recognition method based on category consistency deep learning

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant