CN113705730B - Handwriting equation image recognition method based on convolution attention and label sampling - Google Patents

Handwriting equation image recognition method based on convolution attention and label sampling

Info

Publication number
CN113705730B
Authority
CN
China
Prior art keywords
image
attention
matrix
convolution
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111120578.7A
Other languages
Chinese (zh)
Other versions
CN113705730A (en
Inventor
季爽
顾志文
王慧萍
李剑
许磊磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Yuanhong Ecology Technology Co ltd
Original Assignee
Jiangsu Urban and Rural Construction College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Urban and Rural Construction College filed Critical Jiangsu Urban and Rural Construction College
Priority to CN202111120578.7A priority Critical patent/CN113705730B/en
Publication of CN113705730A publication Critical patent/CN113705730A/en
Application granted granted Critical
Publication of CN113705730B publication Critical patent/CN113705730B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a handwritten equation image recognition method based on convolution attention and label sampling, which comprises the following steps: preprocessing an input image so that the image size and channel are fixed; performing feature extraction on the input image with an image feature extraction module and outputting a feature matrix; generating an attention matrix from the feature matrix through convolution and deconvolution operations in an attention feature extraction module; and decoding with a text feature decoding module that is based on a recurrent neural network combined with a label sampling technique and takes the feature matrix and the attention matrix as input, obtaining the output character at each sequence position and finally the whole output sequence as the recognition result of the mathematical equation. The invention combines convolution attention with label sampling to effectively recognize complex offline handwritten mathematical equations, including corner marks carrying spatial information and special identifiers, with higher accuracy and better robustness.

Description

Handwriting equation image recognition method based on convolution attention and label sampling
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a handwritten equation image recognition method based on convolution attention and label sampling.
Background
With the development of information technology and the impact of recent epidemics, online education has become increasingly popular, and scenarios requiring recognition of handwritten mathematical formulas are increasingly common. Traditional keyboard-based input is inefficient because mathematical formulas usually contain complex spatial relationships, such as the two-dimensional structures of the upper and lower marks of integral symbols and angle symbols. These complex spatial symbols also reduce the accuracy of conventional text recognition.
At present, offline mathematical formula recognition has two main directions: standard printed formulas and complex offline handwritten formulas. Compared with printed formulas, offline handwritten formulas exhibit diverse handwriting styles, stroke adhesion, and a large number of corner marks with spatial relationships, so the recognition difficulty of offline handwritten mathematical formulas is far greater than that of printed ones.
Handwritten mathematical formula recognition is divided into online and offline recognition; compared with online recognition, offline recognition lacks the stroke order in which the formula was written, which increases the difficulty. Traditional formula recognition methods first perform character segmentation, splitting the whole formula into single characters, then extract image features from each character image for recognition, and finally apply geometric and semantic constraints according to the character categories and their positional relationships before combining all characters to reconstruct the formula. The current mainstream method is based on an encoder-decoder framework with an attention mechanism: in the encoding stage, a CNN (convolutional neural network) component such as ResNet50 or Inception extracts features from the input image; in the decoding stage, feature weights are computed from the encoded features and the decoder output at the previous time step to obtain the attention over each feature, and an RNN decoder then outputs the current character according to the current attention values and the encoded features; the final prediction is obtained by concatenating all output characters.
However, the main idea behind the traditional attention framework is matching: given a feature in the feature map, the attention score is computed by evaluating how well that feature matches the historical output information. This causes serious error accumulation in current attention models. Because the attention computation is coupled to the RNN output at historical time steps, once the current output is wrong the attention computation also goes wrong, and the error accumulates and propagates. First, in the attention generation process, the attention for the current feature depends on the historical decoding information; if that information is wrong, the current attention values are affected, the current output becomes wrong, and error accumulation is formed. Second, in the RNN decoding stage, the decoder re-encodes the current output as the input vector of the next time step; once the output at the current time step is wrong, the subsequent inputs are also wrong, again accumulating errors. Therefore, mathematical formula recognition based directly on attention models is always accompanied by serious error accumulation, which degrades recognition accuracy, and an improved model is needed to address this problem.
Disclosure of Invention
In order to solve the technical problem that complex offline handwritten mathematical equations are difficult to recognize in the prior art, the invention aims to provide a handwritten equation image recognition method based on convolution attention and label sampling, which is a handwritten mathematical equation image recognition technology based on the combination of convolution attention and label sampling technology, can effectively recognize complex offline handwritten mathematical equations and has higher accuracy and better robustness.
In order to achieve the purpose and achieve the technical effect, the invention adopts the technical scheme that:
the handwriting equation image recognition method based on convolution attention and label sampling comprises the following steps:
s1: preprocessing an input image to ensure that the size and the channel of the image are fixed;
s2: using an image feature extraction module to perform feature extraction on an input image and outputting a corresponding feature matrix;
s3: using an attention feature extraction module, taking the feature matrix as input, and generating a corresponding attention matrix through convolution and inverse convolution operations;
s4: using a text feature decoding module based on a recurrent neural network combined with a label sampling technique, taking the feature matrix and the attention matrix as input, obtaining the output character at each sequence position, and finally obtaining the whole output sequence as the final recognition result of the mathematical equation.
Further, in step S1, preprocessing the input image to ensure that the size and the channel of the image are fixed, which specifically includes the following steps:
inputting a mathematical equation image in a training set, converting the image into a single channel with the size of (w, h, 1); then, the input image is scaled in equal proportion to ensure that the length of the input image is less than 2048 or the width of the input image is less than 192, and the blank area after scaling is filled with ground color to ensure that the sizes of all the images are (192, 2048, 1); and randomly splitting the image into a training sample and a testing sample, and labeling corresponding label values on the two samples.
Further, in step S2, the image feature extraction module includes a CNN convolution module and a residual block module; the method comprises the following steps of using an image feature extraction module to perform feature extraction on an input image and outputting a corresponding feature matrix, wherein the method specifically comprises the following steps:
firstly, the training sample obtained in the step S1 is input into a CNN convolution module for coding, the size of the input image is (192, 2048, 1), the size of the finally output feature matrix is (3, 128, 512) through the processing of a residual block module, wherein (3, 128) is the size of the feature matrix, 512 is the number of channels, and the residual block module comprises 48 layers of residual blocks.
Further, in step S3, the attention feature extraction module comprises an upsampling deconvolution layer with a Sigmoid activation function and a symmetric network structure. The symmetric network structure comprises several downsampling convolution layers with ReLU activation functions and several upsampling deconvolution layers with ReLU activation functions, each downsampling convolution layer being arranged symmetrically with one upsampling deconvolution layer. The downsampling convolution layers, the upsampling deconvolution layers with ReLU activation functions, and the upsampling deconvolution layer with the Sigmoid activation function are arranged in sequence, the Sigmoid deconvolution layer being the last layer of the attention feature extraction module.
Further, in step S3, an attention feature extraction module is used, and the feature matrix is used as an input, and a corresponding attention matrix is generated through convolution and deconvolution operations, which specifically includes the following steps:
finally, an attention matrix of size (3, 128, maxT) is obtained through the deconvolution layer with the Sigmoid activation function, where maxT refers to the length of the text in the label of the current input image.
Further, in step S4, a text feature decoding module based on a recurrent neural network combined with a label sampling technique takes the feature matrix and the attention matrix as input, obtains the output character at each sequence position, and finally obtains the whole output sequence as the final recognition result of the mathematical equation, specifically comprising the following steps:
combining the feature matrix obtained in step S2 with the attention matrix obtained in step S3 to obtain image features under different attention, the corresponding context function c_t being:

c_t = Σ_{x,y} A_{t,x,y} · F_{x,y}

where t is the index of the attention weights used to recognize the t-th character of the text, F_{x,y} denotes the image feature matrix in step S2, and A_{t,x,y} denotes the attention matrix in step S3;
then the image feature information under different attention is input into the recurrent neural network. Using the label sampling technique, in the training stage the true label value is selected with a continuously decaying probability ε and the output of the previous time step is selected with probability 1-ε; the selected value is encoded and combined by inner-product summation with the feature matrix and the attention matrix to obtain the intermediate image feature vector under different attention, which serves as the input of the current time step. The hidden state vector of the recurrent neural network is updated and fed into a fully connected neural network, which outputs a probability value for each character; the character with the maximum probability is taken as the current output, and all character outputs are concatenated to obtain the recognition result of the final mathematical equation.
Further, the hidden state vector is input into the fully connected neural network, and the probability of each character is output through a softmax function; the predicted probability is:

p_k = exp(x_k) / Σ_{j=1}^{n} exp(x_j)

where p_k denotes the output probability of the current classification category k, n is the number of categories covering all uppercase and lowercase English characters and special symbols, x_k is the score output by the fully connected network for category k, exp(·) denotes exponentiation of the term in brackets, and the denominator is the sum of the exponentiated scores of all classification categories output by the fully connected network.
Further, the method also comprises the following steps:
training all training samples according to the steps S1-S4, inputting test samples after all training samples are trained, calculating average identification accuracy, repeating the steps S1-S4, continuously repeating training and test verification until the identification rate meets the requirement, and storing current model parameters and settings after the accuracy of the test samples is stable to complete model construction.
Compared with the prior art, the invention has the beneficial effects that:
1. the network model applies an attention mechanism, in a text feature decoding module, the character prediction not only considers the image features, but also gives different weights to the feature information of the model by calculating the attention vector, thereby not only accelerating the convergence speed of the model, but also improving the overall recognition accuracy;
2. by adding a convolution decoupling mechanism and a label sampling technique, the error accumulation in the attention generation process and the RNN decoding process is greatly reduced, further improving recognition accuracy and model robustness; in this respect the core of the invention differs from similar technologies, and its recognition performance is superior to them;
3. the invention discloses an end-to-end text recognition scheme which can accurately recognize a corner mark with spatial information and some special identifiers in a mathematical equation and is not available in other similar technologies.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a block diagram of an image feature module according to the present invention;
FIG. 3 is a block diagram of an attention extraction module according to the present invention;
FIG. 4 is a block diagram of a text feature decoding module according to the present invention.
Detailed Description
The present invention is described in detail below so that the advantages and features of the present invention can be more easily understood by those skilled in the art, and the scope of the present invention can be clearly and clearly defined.
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
As shown in fig. 1-4, the method for recognizing handwriting equation image based on convolution attention and label sampling comprises the following steps:
s1: preprocessing an input image to ensure that the size and the channel of the image are fixed;
s2: using an image feature extraction module to perform feature extraction on an input image and outputting a corresponding feature matrix;
s3: using an attention feature extraction module, taking the feature matrix as input, and generating a corresponding attention matrix through convolution and inverse convolution operations;
s4: using a text feature decoding module based on a recurrent neural network combined with a label sampling technique, taking the feature matrix and the attention matrix as input, obtaining the output character at each sequence position, and finally obtaining the whole output sequence as the final recognition result of the mathematical equation.
In the step S1, an input image is preprocessed to ensure that the size of the image and a channel are fixed, and the method specifically includes the following steps:
inputting a mathematical equation image in a training set, converting the image into a single channel with the size of (w, h, 1); then, the input image is scaled in equal proportion to ensure that the length of the input image is less than 2048 or the width of the input image is less than 192, and the blank area after scaling is filled with ground color to ensure that the sizes of all the images are (192, 2048, 1); and randomly splitting the image into a training sample and a testing sample, and labeling corresponding label values on the two samples.
In the step S2, the image feature extraction module comprises a CNN convolution module and a residual block module; the method comprises the following steps of using an image feature extraction module to perform feature extraction on an input image and outputting a corresponding feature matrix, wherein the method specifically comprises the following steps:
firstly, the training sample obtained in the step S1 is input into a CNN convolution module for coding, the size of the input image is (192, 2048, 1), the size of the finally output feature matrix is (3, 128, 512) through the processing of a residual block module, wherein (3, 128) is the size of the feature matrix, 512 is the number of channels, and the residual block module comprises 48 layers of residual blocks.
In step S3, the attention feature extraction module comprises an upsampling deconvolution layer with a Sigmoid activation function and a symmetric network structure. The symmetric network structure comprises several downsampling convolution layers with ReLU activation functions and several upsampling deconvolution layers with ReLU activation functions, each downsampling convolution layer being arranged symmetrically with one upsampling deconvolution layer. The downsampling convolution layers, the upsampling deconvolution layers with ReLU activation functions, and the upsampling deconvolution layer with the Sigmoid activation function are arranged in sequence, the Sigmoid deconvolution layer being the last layer of the attention feature extraction module.
In step S3, an attention feature extraction module is used, and the feature matrix is used as an input, and a corresponding attention matrix is generated through convolution and deconvolution operations, which specifically includes the following steps:
and finally, obtaining an attention matrix with the size of (3, 128, maxT) by applying a Sigmoid activation function and adopting an inverse convolutional layer, wherein the maxT refers to the length of a text in a current input image label.
In step S4, a text feature decoding module based on a recurrent neural network combined with a label sampling technique takes the feature matrix and the attention matrix as input, obtains the output character at each sequence position, and finally obtains the whole output sequence as the final recognition result of the mathematical equation, which specifically comprises the following steps:
combining the feature matrix obtained in step S2 with the attention matrix obtained in step S3 to obtain image features under different attention, the corresponding context function c_t being:

c_t = Σ_{x,y} A_{t,x,y} · F_{x,y}

where t denotes the index of the attention weights used to recognize the t-th character of the text, F_{x,y} represents the image feature matrix in step S2, and A_{t,x,y} represents the attention matrix in step S3;
the image feature information under different attention is then input into the recurrent neural network. Using the label sampling technique, in the training stage the true label value is selected with a continuously decaying probability ε and the output of the previous time step is selected with probability 1-ε; the selected value is encoded and combined by inner-product summation with the feature matrix and the attention matrix to obtain the intermediate image feature vector under different attention, which serves as the input of the current time step. The hidden state vector of the recurrent neural network is updated and fed into a fully connected neural network, which outputs a probability value for each character; the character with the maximum probability is taken as the current output, and all character outputs are concatenated as the recognition result of the final mathematical equation;
in step S4, the method further includes the following steps:
training all training samples according to the steps S1-S4, inputting test samples after all training samples are trained, calculating average recognition accuracy, repeating the steps S1-S4, continuously repeating training and test verification until the recognition rate meets the requirement, and storing current model parameters and settings after the accuracy of the test samples is stable to complete model construction.
The invention has at least the following technical effects:
compared with the prior art, the method can be used for identifying the text lines of the complicated handwritten mathematical formula without independently cutting and identifying text characters, has better identification accuracy rate on the conditions of adhesion, deformation and complex background, and has good identification capability on the corner marks and special symbols with space information in the mathematical formula;
according to the invention, an attention mechanism is applied, and the recognition accuracy is improved by adding additional context semantic information with different weights to the decoding unit;
in the network structure, a convolution attention mechanism is adopted: the coupling between the output information of the previous time step and the image features produced by the CNN network is broken, and the attention feature extraction module generates the attention matrix from the image features alone, so that even if the output at the current RNN time step is wrong, the computation of the attention matrix is not affected and neither is the prediction of the next character;
at present, the output of the previous time step of a recurrent neural network must be used as the input of the next time step, and this repeats until the whole text line is recognized, so that once a character is recognized incorrectly and fed into the next time step, the outputs of all subsequent time steps are wrong, again forming error accumulation. The invention avoids this problem: by adopting the label sampling technique during training, either the output of the previous time step or the true value (namely the label) of the current moment is selected as input according to a certain probability, so that the network learns to correct errors when the previous time step has output a wrong character, which both accelerates training and improves training accuracy;
the method combines a residual network structure, a cyclic neural network structure, an attention mechanism and a label sampling technology, is applied to the identification process of a mathematical equation sequence, utilizes a convolution module to extract the characteristics of an original image, does not need to perform complex image processing and artificial characteristic extraction on a character area, only needs to scale the image to a fixed size and perform gray level normalization, and unifies the characteristic extraction and identification processes into the whole network frame, thereby really realizing the end-to-end identification of an offline handwritten mathematical equation.
Example 1
The handwritten equation image recognition system based on convolution attention and label sampling comprises an image feature extraction module, an attention feature extraction module and a text feature decoding module, wherein the image feature extraction module, the attention feature extraction module and the text feature decoding module respectively correspond to image feature extraction, feature attention weight distribution and feature-based sequence recognition.
The handwritten equation image recognition method based on convolution attention and label sampling treats the recognition of a mathematical equation as a text sequence recognition problem: a CNN convolutional neural network extracts features from the image, an RNN recurrent neural network captures the context semantics of the text sequence, and an attention mechanism assigns different weight coefficients to the image features, improving recognition accuracy.
In specific implementation, in order to express the corner marks and other special symbols with spatial relationships that exist in a mathematical equation, the invention represents them with common plain-text symbols, for example: an integral with lower limit 1 and upper limit 2 is represented as "∫1^2"; the angle symbol "30°" is denoted as "30^0"; the logarithm of 4 to base 2 is expressed as "log_2 4"; and the power operation "2 squared" is expressed as "2^2".
The handwriting equation image recognition method based on convolution attention and label sampling comprises the following steps:
s1: preprocessing an input image to ensure that the size of the image and a channel are fixed; the method specifically comprises the following steps:
s1.1: inputting a mathematical equation image in a training set, preprocessing the input image, converting the image into a single channel with the size of (w, h, 1);
s1.2: scaling the input image equally to ensure that its length is less than 2048, or its width is less than 192; filling the blank area after zooming with ground color to ensure that all the pictures are (192, 2048, 1);
s1.3: and randomly splitting the image into a training sample and a testing sample, wherein the two samples are marked with corresponding label values.
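By way of illustration only, a minimal Python sketch of the preprocessing in steps S1.1-S1.3 is given below, assuming OpenCV and NumPy; the function name, the white background fill and the normalisation to [0, 1] are assumptions of the sketch, while the target size (192, 2048, 1) follows the description above.

    import cv2
    import numpy as np

    TARGET_H, TARGET_W = 192, 2048

    def preprocess(image_path):
        # S1.1: read the equation image as a single grayscale channel
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

        # S1.2: scale proportionally so that height <= 192 and width <= 2048
        h, w = img.shape
        scale = min(TARGET_H / h, TARGET_W / w)
        img = cv2.resize(img, (int(w * scale), int(h * scale)))

        # pad the remaining blank area with the background colour (white)
        canvas = np.full((TARGET_H, TARGET_W), 255, dtype=np.uint8)
        canvas[: img.shape[0], : img.shape[1]] = img

        # normalise to [0, 1] and add the channel axis -> (192, 2048, 1)
        return canvas[..., None].astype(np.float32) / 255.0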
S2: scanning an original off-line image by using an image feature extraction module, extracting features of an input image, and outputting a corresponding image feature matrix; the method specifically comprises the following steps:
inputting a training sample image into the image feature extraction module for encoding; as shown in fig. 2, the whole network is constructed from a residual block module with 48 layers of residual blocks. Compared with the currently popular ResNet50, experiments show that this network has no obvious difference in recognition accuracy but has a simpler structure, fewer parameters, and faster convergence. The input and output sizes of the network are fixed: the input is the single-channel image processed in step S1.2, of size (192, 2048, 1), and the output is the image feature matrix extracted by the network, of size (3, 128, 512), where (3, 128) is the feature matrix size and 512 is the number of channels;
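By way of illustration only, a minimal PyTorch sketch of such a residual encoder is given below; the depth, channel widths and stride schedule are assumptions of the sketch (the module described above uses 48 layers of residual blocks), and only the input size (1, 192, 2048) and output size (512, 3, 128) in channels-first layout follow the description.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, in_ch, out_ch, stride=(1, 1)):
            super().__init__()
            self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
            self.bn1 = nn.BatchNorm2d(out_ch)
            self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
            self.bn2 = nn.BatchNorm2d(out_ch)
            self.relu = nn.ReLU(inplace=True)
            self.down = None
            if stride != (1, 1) or in_ch != out_ch:
                # 1x1 projection so the shortcut matches the block output
                self.down = nn.Sequential(
                    nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                    nn.BatchNorm2d(out_ch))

        def forward(self, x):
            identity = x if self.down is None else self.down(x)
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + identity)

    class FeatureEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            # six stages halve the height (192 -> 3); the first four also
            # halve the width (2048 -> 128); channels grow to 512
            cfg = [(1, 64, (2, 2)), (64, 128, (2, 2)), (128, 256, (2, 2)),
                   (256, 256, (2, 2)), (256, 512, (2, 1)), (512, 512, (2, 1))]
            stages = []
            for in_ch, out_ch, stride in cfg:
                stages.append(ResidualBlock(in_ch, out_ch, stride))
                stages.append(ResidualBlock(out_ch, out_ch))
            self.body = nn.Sequential(*stages)

        def forward(self, x):        # x: (N, 1, 192, 2048)
            return self.body(x)      # -> (N, 512, 3, 128)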
s3: using an attention feature extraction module, taking the feature matrix as input, and generating a corresponding attention matrix through convolution and inverse convolution operations, which is equivalent to distributing different importance coefficients to different feature matrices; the method specifically comprises the following steps:
sending the image features extracted in step S2 to the attention feature extraction module; as shown in fig. 3, it comprises several downsampling convolution layers with ReLU activation functions and several upsampling deconvolution layers with ReLU activation functions, each downsampling convolution layer being arranged symmetrically with one upsampling deconvolution layer; the downsampling convolution layers, the upsampling deconvolution layers with ReLU activation functions, and an upsampling deconvolution layer with a Sigmoid activation function are arranged in sequence, the Sigmoid deconvolution layer being the last layer of the module. In the first half of the network, the input features are further encoded by the downsampling convolution layers; in the second half, the output of the earlier layer with the same size is combined with the current input before the deconvolution operation. Finally, the deconvolution layer with the Sigmoid activation function produces the attention matrix of size (3, 128, maxT), where maxT is the length of the text in the label of the current input image, i.e. the maximum number of time steps of the RNN module;
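By way of illustration only, a minimal PyTorch sketch of such a symmetric convolution/deconvolution attention module is given below; the number of layers, the kernel sizes and the concatenation of same-sized feature maps are assumptions of the sketch, while the ReLU layers, the final Sigmoid deconvolution and the (maxT, 3, 128) output follow the description above (maxT is fixed here to a maximum decoding length).

    import torch
    import torch.nn as nn

    class ConvAttention(nn.Module):
        def __init__(self, feat_ch=512, max_t=48):
            super().__init__()
            # downsampling convolutions with ReLU, halving the width only
            self.down1 = nn.Sequential(nn.Conv2d(feat_ch, 256, 3, (1, 2), 1), nn.ReLU())
            self.down2 = nn.Sequential(nn.Conv2d(256, 256, 3, (1, 2), 1), nn.ReLU())
            # symmetric upsampling deconvolutions with ReLU; their inputs are
            # concatenated with the same-sized outputs of the first half
            self.up2 = nn.Sequential(
                nn.ConvTranspose2d(256, 256, (3, 4), (1, 2), 1), nn.ReLU())
            self.up1 = nn.Sequential(
                nn.ConvTranspose2d(256 + 256, feat_ch, (3, 4), (1, 2), 1), nn.ReLU())
            # last layer: deconvolution with Sigmoid, one attention map per step
            self.head = nn.Sequential(
                nn.ConvTranspose2d(feat_ch + feat_ch, max_t, 3, 1, 1), nn.Sigmoid())

        def forward(self, f):                        # f: (N, 512, 3, 128)
            d1 = self.down1(f)                       # (N, 256, 3, 64)
            d2 = self.down2(d1)                      # (N, 256, 3, 32)
            u2 = self.up2(d2)                        # (N, 256, 3, 64)
            u1 = self.up1(torch.cat([u2, d1], 1))    # (N, 512, 3, 128)
            return self.head(torch.cat([u1, f], 1))  # (N, max_t, 3, 128)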
s4: a text feature decoding module is used, a feature matrix and an attention matrix are used as input based on a recurrent neural network combined label sampling technology, each sequence bit character output is obtained, and finally the whole sequence output is obtained; the method specifically comprises the following steps:
combining the image feature matrix obtained in step S2 with the attention matrix obtained in step S3 to obtain image features under different attention, the corresponding context function c_t being:

c_t = Σ_{x,y} A_{t,x,y} · F_{x,y}

where t denotes the time step of the RNN, i.e. the index of the attention weights used to recognize the t-th character of the text, F_{x,y} denotes the image feature matrix in step S2, and A_{t,x,y} denotes the attention matrix in step S3;
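By way of illustration only, the formula above can be transcribed directly into the following PyTorch sketch (a batch dimension is added; the tensor names are assumptions of the sketch):

    import torch

    def context_vectors(F, A):
        # F: (N, C, H, W) image features, A: (N, T, H, W) attention maps
        # returns (N, T, C); row t is the context vector c_t
        return torch.einsum('nthw,nchw->ntc', A, F)

    F = torch.randn(2, 512, 3, 128)   # encoder output of step S2
    A = torch.rand(2, 48, 3, 128)     # attention maps of step S3
    C = context_vectors(F, A)         # shape (2, 48, 512)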
The image feature information weighted by attention is then sent to the RNN recurrent-network unit. Specifically, a GRU (Gated Recurrent Unit) is used as the feature decoder; compared with the traditional RNN structure, the GRU effectively alleviates the gradient vanishing and gradient explosion problems, so that longer-range context semantic information can be retained, and compared with the LSTM (Long Short-Term Memory unit) it has a simpler structure and faster convergence.
As shown in fig. 4, the GRU unit has two inputs. One input is the re-encoding of the output of the GRU unit at the previous time step or of the true label, denoted e_{t-1}; since the character categories faced by the invention are a fixed, modest number of English characters and special symbols, the character output at the previous time step is encoded with the one-hot encoding technique. The other input is the image feature under different attention obtained above, denoted c_t.
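By way of illustration only, a minimal PyTorch sketch of one such GRU decoding step is given below; the hidden size and class count are assumptions of the sketch, while the concatenation of the one-hot re-encoding e_{t-1} with the context c_t follows the description above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GRUDecoderCell(nn.Module):
        def __init__(self, num_classes, feat_ch=512, hidden=256):
            super().__init__()
            self.gru = nn.GRUCell(num_classes + feat_ch, hidden)
            self.fc = nn.Linear(hidden, num_classes)

        def forward(self, prev_char, c_t, h):
            # prev_char: (N,) indices of e_{t-1}; c_t: (N, feat_ch); h: (N, hidden)
            e_prev = F.one_hot(prev_char, self.fc.out_features).float()
            h = self.gru(torch.cat([e_prev, c_t], dim=1), h)
            logits = self.fc(h)      # per-character scores, fed to softmax below
            return logits, h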
Specifically, the invention adopts a label sampling technique to decide, at the current step, between the true label and the output of the GRU unit at the previous time step. In a conventional RNN recurrent network, the output of the previous time step is used as the input of the current time step, so once the output of a time step is wrong, the inputs of all subsequent units are wrong as well. In order to enhance the robustness of the network, the current time step must learn, during training, the ability to correct errors when the previous time step has output a wrong character. With the label sampling technique, in the training stage the true label value is selected with a continuously decaying probability ε and the output of the previous time step is selected with probability 1-ε; the selected value is encoded, combined by inner-product summation with the feature matrix and the attention matrix to obtain the intermediate image feature vector under different attention, and used as the input of the current time step. The hidden state vector of the recurrent neural network is updated and fed into the fully connected neural network, which outputs the probability of each character; the character with the maximum probability is taken as the current output, and all character outputs are concatenated as the recognition result of the final mathematical equation. This enhances the robustness of the network and eliminates part of the error accumulation. The decay of ε in the invention follows a linear decay function:
ε_i = max(ϵ, k − c·i)

where ϵ is a value between 0 and 1 representing the minimum probability of selecting the true label, k represents the intercept, c represents the decay rate of the function, and i represents the number of model iterations. Other decay functions may also be used depending on the actual application requirements.
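By way of illustration only, the sampling rule and the linear decay above can be sketched in Python as follows; the numeric values of ϵ, k and c are placeholders, not values prescribed by the invention.

    import random

    def epsilon(i, eps_min=0.2, k=1.0, c=1e-4):
        # linear decay of the probability of feeding the true label at iteration i
        return max(eps_min, k - c * i)

    def next_decoder_input(true_label, prev_prediction, iteration):
        # with probability epsilon use the ground-truth character,
        # otherwise feed back the character predicted at the previous time step
        if random.random() < epsilon(iteration):
            return true_label
        return prev_prediction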
The hidden state vector of the GRU unit at this step is input into the fully connected neural network, and the prediction probability of each character is output through a softmax function; the model prediction probability is:

p_k = exp(x_k) / Σ_{j=1}^{n} exp(x_j)

where p_k denotes the output probability of the current classification category k, n denotes the total number of categories, x_k is the score output by the fully connected network for category k, exp(·) denotes exponentiation of the term in parentheses, and the denominator is the sum of the exponentiated scores of all classification categories output by the fully connected network.
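By way of illustration only, the softmax prediction and the greedy choice of the output character can be sketched as follows (the class count 130 is a placeholder):

    import torch

    logits = torch.randn(1, 130)            # x: fully connected scores for n categories
    probs = torch.softmax(logits, dim=1)    # p_k for every category k
    char_id = probs.argmax(dim=1)           # the maximum probability gives the output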
After the softmax probabilities for the current character are obtained, the category with the maximum probability is selected as the output for the current character; the model prediction error is then measured with a loss function, propagated back to the preceding network layers through the backpropagation algorithm, and used to update the weight parameters of the network units.
Specifically, the invention uses the log loss function (log_loss) to measure the distance between the true result and the predicted result of the network. Its mathematical expression is:

L(θ) = − Σ_{t=1}^{T} log p(g_t | I, θ)

where θ represents all trainable network parameters, g_t represents the true label of the t-th character, T represents the number of characters in the current text, and I represents the given current input feature. Since the log loss function is differentiable, it can be minimized with gradient descent; the smaller the loss value, the closer the predicted sequence is to the true sequence. During training, the Adam gradient method is used to continuously adjust the weights and biases of each neuron so that the loss function converges quickly to its minimum.
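By way of illustration only, a minimal PyTorch sketch of this objective and of an Adam update step is given below; "model", "all_logits" and "target" are placeholder names standing for the assembled network, its per-step scores and the ground-truth character indices.

    import torch
    import torch.nn.functional as F

    def sequence_log_loss(all_logits, target):
        # all_logits: (N, T, n) scores per time step, target: (N, T) true labels;
        # cross_entropy applies log-softmax and averages -log p(g_t)
        return F.cross_entropy(all_logits.transpose(1, 2), target)

    # optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
    # loss = sequence_log_loss(all_logits, target)
    # optimiser.zero_grad(); loss.backward(); optimiser.step()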
After all training samples have been trained, the test set samples are input and the average recognition accuracy is calculated; the above steps are repeated, alternating training and test verification, until the recognition rate meets the requirement. Once the accuracy on the test samples is stable, the current model parameters and settings are saved, completing the model construction.
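By way of illustration only, the average recognition accuracy on the test set can be computed as in the following sketch, assuming it is measured at the expression level (a prediction counts as correct only if the whole decoded sequence equals the label):

    def average_accuracy(predictions, labels):
        # predictions and labels are lists of decoded character sequences
        correct = sum(1 for p, g in zip(predictions, labels) if p == g)
        return correct / max(len(labels), 1)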
The parts or structures of the invention which are not described in detail can be the same as those in the prior art or the existing products, and are not described in detail herein.
The above description is only an embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are also included in the scope of the present invention.

Claims (7)

1. The handwritten equation image recognition method based on convolution attention and label sampling is characterized by comprising the following steps of:
s1: preprocessing an input image to ensure that the size and the channel of the image are fixed;
s2: using an image feature extraction module to perform feature extraction on an input image and outputting a corresponding feature matrix;
s3: using an attention feature extraction module, taking the feature matrix as input, and generating a corresponding attention matrix through convolution and inverse convolution operations;
s4: a text feature decoding module is used, a feature matrix and an attention matrix are used as input based on a recurrent neural network combined label sampling technology, each sequence bit character output is obtained, and finally the whole sequence output is obtained to obtain the final identification result of a mathematical equation;
in step S4, a text feature decoding module is used, a cyclic neural network is combined with a tag sampling technology, a feature matrix and an attention matrix are used as inputs, each sequence bit character output is obtained, and finally, the whole sequence output is obtained, so as to obtain a final identification result of a mathematical equation, which specifically includes the following steps:
combining the feature matrix obtained in step S2 with the attention matrix obtained in step S3 to obtain image features under different attention, the corresponding function being expressed as c_t:

c_t = Σ_{x,y} A_{t,x,y} · F_{x,y}

where t is the index of the attention weights used to recognize the t-th character of the text, F_{x,y} represents the image feature matrix in step S2, and A_{t,x,y} represents the attention matrix in step S3;
then inputting image characteristic information with different attentions into a recurrent neural network, selecting a real label value according to a continuously decaying probability value epsilon by using a label sampling technology in a training stage, selecting one of an output value of a previous time step or a real label value according to a 1-epsilon probability value for coding, performing inner product summation with a characteristic matrix and an attention matrix to obtain image characteristic intermediate vectors with different attentions, using the image characteristic intermediate vectors as the input of the current time step, updating a hidden state vector in the recurrent neural network, inputting the hidden state vector into a fully connected neural network, outputting the probability value of each character, selecting the maximum probability as the output of the current character, and outputting and connecting all the characters to obtain the recognition result of a final mathematical equation.
2. The method for recognizing the handwritten equation image based on convolution attention and label sampling as claimed in claim 1, wherein in step S1, the input image is preprocessed to ensure that the size and the channel of the image are fixed, and specifically comprising the following steps:
inputting a mathematical equation image in a training set, converting the image into a single channel with the size of (w, h, 1); then, the input image is scaled in equal proportion to ensure that the length of the input image is less than 2048 or the width of the input image is less than 192, and the blank area after scaling is filled with ground color to ensure that the sizes of all the images are (192, 2048, 1); and randomly splitting the image into a training sample and a testing sample, and labeling corresponding label values on the two samples.
3. The handwriting equation image recognition method based on convolution attention and label sampling according to claim 1, wherein in step S2, the image feature extraction module comprises a CNN convolution module and a residual block module; the method comprises the following steps of using an image feature extraction module to extract features of an input image and output a corresponding feature matrix, and specifically comprises the following steps:
firstly, the training sample obtained in the step S1 is input into a CNN convolution module for coding, the size of the input image is (192, 2048, 1), the size of the finally output feature matrix is (3, 128, 512) through the processing of a residual block module, wherein (3, 128) is the size of the feature matrix, 512 is the number of channels, and the residual block module comprises 48 layers of residual blocks.
4. The method for recognizing handwritten equation images based on convolutional attention and tag sampling as claimed in claim 1, wherein in step S3, said attention feature extraction module includes an upper sampling deconvolution layer applying Sigmoid activation function and a symmetric network structure, said symmetric network structure includes several layers of lower sampling convolution layers using ReLU activation function and several layers of upper sampling deconvolution layers using ReLU activation function, each lower sampling convolution layer using ReLU activation function is symmetrically arranged with one upper sampling deconvolution layer using ReLU activation function, the lower sampling convolution layer using ReLU activation function, the upper sampling deconvolution layer using ReLU activation function and the upper sampling deconvolution layer applying Sigmoid activation function are sequentially arranged, and the upper sampling deconvolution layer applying Sigmoid activation function is located at the last layer in the attention feature extraction module.
5. The method for recognizing handwritten equation image based on convolution attention and label sampling as claimed in claim 1, wherein in step S3, the attention feature extraction module is used to generate the corresponding attention matrix through convolution and deconvolution operations with the feature matrix as input, specifically including the following steps:
and finally, obtaining an attention matrix with the size of (3, 128, maxT) by applying a Sigmoid activation function and adopting an inverse convolutional layer, wherein the maxT refers to the length of a text in a current input image label.
6. The handwriting equation image recognition method based on convolution attention and label sampling as claimed in claim 1, wherein the hidden state vector is input into the fully connected neural network and the probability of each character is output through a softmax function, the predicted probability being:

p_k = exp(x_k) / Σ_{j=1}^{n} exp(x_j)

wherein p_k denotes the output probability of the current classification category k, n is the number of categories covering all uppercase and lowercase English characters and special symbols, x_k is the score output by the fully connected network for category k, exp(·) denotes exponentiation of the term in brackets, and the denominator is the sum of the exponentiated scores of all classification categories output by the fully connected network.
7. The method for recognizing a handwritten equation image based on convolutional attention and tag sampling as claimed in claim 1, further comprising the steps of:
training all training samples according to the steps S1-S4, inputting test samples after all training samples are trained, calculating average recognition accuracy, repeating the steps S1-S4, continuously repeating training and test verification until the recognition rate meets the requirement, and storing current model parameters and settings after the accuracy of the test samples is stable to complete model construction.
CN202111120578.7A 2021-09-24 2021-09-24 Handwriting equation image recognition method based on convolution attention and label sampling Active CN113705730B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111120578.7A CN113705730B (en) 2021-09-24 2021-09-24 Handwriting equation image recognition method based on convolution attention and label sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111120578.7A CN113705730B (en) 2021-09-24 2021-09-24 Handwriting equation image recognition method based on convolution attention and label sampling

Publications (2)

Publication Number Publication Date
CN113705730A CN113705730A (en) 2021-11-26
CN113705730B true CN113705730B (en) 2023-04-14

Family

ID=78661771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111120578.7A Active CN113705730B (en) 2021-09-24 2021-09-24 Handwriting equation image recognition method based on convolution attention and label sampling

Country Status (1)

Country Link
CN (1) CN113705730B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117132997B (en) * 2023-10-26 2024-03-12 国网江西省电力有限公司电力科学研究院 Handwriting form recognition method based on multi-head attention mechanism and knowledge graph

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686345A (en) * 2020-12-31 2021-04-20 江南大学 Off-line English handwriting recognition method based on attention mechanism

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190018933A1 (en) * 2016-01-15 2019-01-17 Preferred Networks, Inc. Systems and methods for multimodal generative machine learning
US10373610B2 (en) * 2017-02-24 2019-08-06 Baidu Usa Llc Systems and methods for automatic unit selection and target decomposition for sequence labelling
CN108170736B (en) * 2017-12-15 2020-05-05 南瑞集团有限公司 Document rapid scanning qualitative method based on cyclic attention mechanism
CN111967470A (en) * 2020-08-20 2020-11-20 华南理工大学 Text recognition method and system based on decoupling attention mechanism

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686345A (en) * 2020-12-31 2021-04-20 江南大学 Off-line English handwriting recognition method based on attention mechanism

Also Published As

Publication number Publication date
CN113705730A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN110738090B (en) System and method for end-to-end handwritten text recognition using neural networks
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
Zhang et al. Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN111428718B (en) Natural scene text recognition method based on image enhancement
Li et al. Improving attention-based handwritten mathematical expression recognition with scale augmentation and drop attention
CN111160343B (en) Off-line mathematical formula symbol identification method based on Self-Attention
CN112329760B (en) Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
Chandio et al. Cursive text recognition in natural scene images using deep convolutional recurrent neural network
CN112733768B (en) Natural scene text recognition method and device based on bidirectional characteristic language model
CN110428424B (en) Radar echo image high-voltage line segmentation method based on deep learning
CN115512368B (en) Cross-modal semantic generation image model and method
CN115471851A (en) Burma language image text recognition method and device fused with double attention mechanism
CN113221852B (en) Target identification method and device
CN111738169A (en) Handwriting formula recognition method based on end-to-end network model
CN114092930B (en) Character recognition method and system
US11568140B2 (en) Optical character recognition using a combination of neural network models
CN111428750A (en) Text recognition model training and text recognition method, device and medium
Hong et al. Residual BiRNN based Seq2Seq model with transition probability matrix for online handwritten mathematical expression recognition
CN113705730B (en) Handwriting equation image recognition method based on convolution attention and label sampling
CN114417872A (en) Contract text named entity recognition method and system
CN115731453B (en) Chinese character click type identifying code identifying method and system
CN114529908A (en) Offline handwritten chemical reaction type image recognition technology
CN115862015A (en) Training method and device of character recognition system, and character recognition method and device
CN115937567A (en) Image classification method based on wavelet scattering network and ViT

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231226

Address after: 213000 west side of floor 2, No. 164, Wuyi Road, Lucheng street, Changzhou Economic Development Zone, Jiangsu Province

Patentee after: Jiangsu Yuanhong Ecology Technology Co.,Ltd.

Address before: 213147 No.1 Heyu Road, Yincun Vocational Education Park, Changzhou City, Jiangsu Province

Patentee before: JIANGSU URBAN AND RURAL CONSTRUCTION College

TR01 Transfer of patent right