CN112329766A

CN112329766A - Character recognition method and device, electronic equipment and storage medium

Info

Publication number: CN112329766A
Application number: CN202011098938.3A
Authority: CN
Inventors: 李楠; 姜仟艺; 宋祺; 张睿
Original assignee: Beijing Sankuai Online Technology Co Ltd
Current assignee: Beijing Sankuai Online Technology Co Ltd
Priority date: 2020-10-14
Filing date: 2020-10-14
Publication date: 2021-02-05

Abstract

The embodiment of the application discloses a character recognition method, a character recognition device, electronic equipment and a storage medium, wherein the method comprises the following steps: extracting image features of an image to be identified through standard convolution and expanding the number of channels to generate a first feature map; performing feature extraction processing on the first feature map through a modular structure comprising point-by-point group convolution and depth convolution to obtain a processed feature map; performing pooling treatment on the treated characteristic diagram to obtain a pooled characteristic diagram; taking the pooling characteristic diagram as the input of the modular structure, and circularly executing the modular structure treatment and the pooling treatment until the obtained pooling characteristic diagram meets the preset condition; performing standard convolution processing on the pooling characteristic graph to obtain a coding result; and decoding the coding result to obtain an identification result. According to the embodiment of the application, on the premise of ensuring the identification precision, the calculation complexity is reduced, the identification speed is increased, and the consumption of calculation resources is reduced.

Description

Character recognition method and device, electronic equipment and storage medium

Technical Field

The embodiment of the application relates to the technical field of image recognition, in particular to a character recognition method and device, electronic equipment and a storage medium.

Background

The traditional character recognition firstly preprocesses the picture, reduces the noise interference on the picture by adopting modes of binarization, edge smoothing and the like, then divides the picture into each character sub-block by using a division algorithm, and splices the character sub-blocks into the whole character recognition result after character recognition.

When the character image with strong interference noise is identified, the accuracy of the traditional method can not meet the requirement of identification. Compared with the traditional character recognition method, the Deep Convolutional Neural Network (DCNN) has better accuracy in character recognition. However, at present, the verification code identification research based on the DCNN mostly adopts a standard convolutional network, the calculation complexity is high, the prediction time consumption is large, the calculation resource consumption is large, the algorithm operation depends on a GPU machine, and the application requirements of real-time performance and high concurrency cannot be met.

Disclosure of Invention

The embodiment of the application provides a character recognition method, a character recognition device, an electronic device and a storage medium, which are beneficial to reducing the calculation complexity, improving the recognition speed and reducing the consumption of calculation resources.

In order to solve the above problem, in a first aspect, an embodiment of the present application provides a text recognition method, including:

extracting image features of an image to be identified through standard convolution and expanding the number of channels to generate a first feature map;

performing feature extraction processing on the first feature map through a modular structure comprising point-by-point group convolution and depth convolution to obtain a processed feature map;

performing pooling treatment on the treated characteristic diagram to obtain a pooled characteristic diagram;

taking the pooling characteristic diagram as the input of the modular structure, and circularly executing the modular structure treatment and the pooling treatment until the obtained pooling characteristic diagram meets the preset condition;

performing standard convolution processing on the pooling characteristic graph to obtain a coding result;

and decoding the coding result to obtain an identification result.

In a second aspect, an embodiment of the present application provides a text recognition apparatus, including:

the channel expansion module is used for extracting the image characteristics of the image to be identified through standard convolution and expanding the number of channels to generate a first characteristic diagram;

the characteristic extraction module is used for carrying out characteristic extraction processing on the first characteristic diagram through a modular structure comprising point-by-point group convolution and depth convolution to obtain a processed characteristic diagram;

the pooling processing module is used for pooling the processed characteristic map to obtain a pooled characteristic map;

the circulation control module is used for taking the pooling characteristic diagram as the input of the modular structure and circularly executing the modular structure processing and the pooling processing until the obtained pooling characteristic diagram meets the preset condition;

the standard convolution processing module is used for carrying out standard convolution processing on the pooling characteristic graph to obtain a coding result;

and the decoding module is used for decoding the coding result to obtain an identification result.

In a third aspect, an embodiment of the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the text recognition method according to the embodiment of the present application.

In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the text recognition method disclosed in the present application.

The character recognition method, the character recognition device, the electronic device and the storage medium provided by the embodiment of the application extract image features of an image to be recognized through standard convolution and expand the number of channels to generate a first feature map, perform feature extraction processing on the first feature map through a modular structure comprising point-by-point group convolution and depth convolution to obtain a processed feature map, perform pooling processing on the processed feature map to obtain a pooled feature map, use the pooled feature map as input of the modular structure, and perform the modular structure processing and the pooling processing in a circulating manner until the obtained pooled feature map meets preset conditions, perform standard convolution processing on the pooled feature map to obtain a coding result, and decode the coding result to obtain a recognition result. Because the features are extracted by adopting the point-by-point group convolution and the deep convolution instead of the standard convolution, the calculation complexity is reduced, the identification speed is improved and the consumption of calculation resources is reduced on the premise of ensuring the identification precision.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.

Fig. 1 is a flowchart of a text recognition method according to a first embodiment of the present application;

fig. 2 is a schematic diagram of a codec structure in an embodiment of the present application;

FIG. 3 is a schematic diagram of a modular structure in an embodiment of the present application;

FIG. 4 is a schematic diagram of a convolution block attention model in an embodiment of the present application;

fig. 5 is a schematic structural diagram of a character recognition apparatus according to a second embodiment of the present application;

fig. 6 is a schematic structural diagram of an electronic device according to a third embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Example one

Fig. 1 is a flowchart of a text recognition method provided in this embodiment, which is suitable for the text recognition fields of identifying a verification code, identifying an identity card, identifying a passport, identifying a signboard name, and the like. Among them, Captcha (Captcha) is an important issue in word recognition research, and Captcha (Captcha) is an automated test to distinguish computers from people, and is a common and effective security mechanism used in many websites and applications. Although the security of the system can be improved by setting the verification code, in actual work, due to the fact that automatic testing or other requirements need to be carried out, each testing is achieved through manual input, the working efficiency is greatly reduced, actual requirements cannot be met, and therefore automatic identification is achieved through an identification algorithm.

The character recognition method is realized based on an encoder-decoder (encoder-decoder) structure, and the encoder adopts the same modular (Block) structure and is combined with a pooling layer. Fig. 2 is a schematic diagram of a codec structure in an embodiment of the present application, and as shown in fig. 2, the codec structure includes an encoder and a decoder, and the encoder includes a standard convolutional layer, a repeated modular structure and a pooling layer, and the standard convolutional layer.

As shown in fig. 1, the method includes: step 110 to step 170.

And step 110, extracting the image characteristics of the image to be identified through standard convolution and expanding the number of channels to generate a first characteristic diagram.

The image to be recognized is an image with characters to be recognized (such as verification code characters) and interference noise, and is generally an image of three channels of RGB.

Firstly, extracting the image characteristics of an image to be identified through a layer of standard convolution and expanding the number of channels to generate a first characteristic diagram, and inputting the generated first characteristic diagram into a subsequent modular structure to continue to perform characteristic extraction.

And 120, performing feature extraction processing on the first feature map through a modular structure comprising point-by-point group convolution and depth convolution to obtain a processed feature map.

Wherein the modular structure is used for further extracting character features in the first feature map.

By performing the feature extraction processing on the first feature map using the point-by-point group convolution and the depth convolution, the amount of calculation can be reduced and the processing speed can be increased compared to performing the feature extraction using the standard convolution.

The modular structure may include point-by-point group convolution, deep convolution, channel aliasing, and convolution block attention. Firstly, performing point-by-point group convolution processing on a first feature map through a modular structure, wherein the point-by-point group convolution can be understood as combination of point-by-point convolution and group convolution, the point-by-point convolution uses a convolution kernel with the size of 1 multiplied by 1 to extract features on each feature point so as to further expand the number of channels of the feature map, and each group of convolution can perform parallel calculation when calculation is performed through the group convolution; then, carrying out deep convolution processing on the feature map output by point-by-point group convolution so as to carry out convolution operation on each channel independently and learn the association of features under the same channel so as to extract the features of a single channel; then, carrying out point-by-point convolution processing on the feature map output by the deep convolution so as to carry out inter-channel fusion on the feature map output by the deep convolution, learning the relationship among channels and generating a new feature map; then, carrying out channel confusion on the feature graphs output by the point-by-point group convolution so as to improve the interaction and fusion among the features; and then carrying out convolution block attention processing on the feature map output by the channel confusion so as to distinguish noise and character content and obtain a processed feature map.

In an embodiment of the application, the performing, by a modular structure including a point-by-point group convolution and a depth convolution, a feature extraction process on the first feature map to obtain a processed feature map includes:

performing point-by-point group convolution processing on the first feature map to obtain a second feature map;

carrying out depth convolution processing on the second feature map to obtain a third feature map;

performing inter-channel fusion processing on the third feature map by point-by-point group convolution to obtain a fourth feature map;

performing channel confusion processing on the fourth feature map to obtain a fifth feature map;

performing channel feature extraction and spatial feature extraction on the fifth feature map through a convolution block attention model to obtain a sixth feature map;

and carrying out residual error mapping processing on the sixth characteristic diagram and the first characteristic diagram to obtain the processed characteristic diagram.

Fig. 3 is a schematic view of a modular structure in an embodiment of the present application, and as shown in fig. 3, the modular structure includes: point-wise group convolution, depth convolution, point-wise group convolution, channel aliasing, Convolutional Block Attention Model (CBAM), and shortcut connection (i.e., the output of CBAM is added to the input of the modular structure).

For a first feature map obtained after standard convolution, firstly performing point-by-point group convolution processing on the first feature map, namely grouping the first feature map and 1 multiplied by 1 convolution kernels according to channels, and performing convolution processing on each group in parallel to expand the number of the channels to obtain a second feature map; then, performing deep convolution processing on the second feature map, extracting features of each channel, and obtaining a third feature map, wherein the deep convolution is different from standard convolution, one convolution kernel of the deep convolution is only responsible for one channel, one channel is only convoluted by one convolution kernel, the number of the convolution kernels is the same as that of the last layer of output channels, namely, the channels correspond to the convolution kernels one by one; because the deep convolution only learns a single channel without utilizing the relationship between the channels, the inter-channel fusion of the third feature map output by the deep convolution needs to be carried out again through point-by-point group convolution, and the relationship between the channels is learned to generate a fourth feature map; although the fusion between channels is carried out through the point-by-point group convolution, no information interaction exists between the group convolution products of different groups, the output of each group of point-by-point group convolution only comes from one part of the third feature map, which damages the learning of the neural network and the expression of the feature map, in order to avoid the situation, the fourth feature map output by the point-by-point group convolution processing is subjected to the confusion processing in the channel dimension, so that the interaction and the fusion between the features are improved, and when the fourth feature map is subjected to the channel confusion, the confusion can be carried out according to a preset confusion mode or can be carried out randomly; since most of images to be recognized can include various noises for interference recognition, noise and character contents are distinguished through a convolution block attention model in the embodiment of the application, so that the neural network can be helped to better extract image features, the convolution block attention model comprises a channel attention module and a space attention module, namely channel feature extraction and space feature extraction can be carried out on a fifth feature map, a sixth feature map is obtained, and the accuracy of character recognition can be improved on the premise that extra calculation amount is hardly increased by using the convolution block attention model; with the increase of the network depth, the neural network is prone to degradation (namely, the precision on the training set is saturated and then rapidly decreases), and in order to solve the problem, each modular structure is quickly connected to fit residual mapping, that is, the sixth feature map and the first feature map are subjected to residual mapping, that is, the sixth feature map and the first feature map are added to obtain a processed feature map. Wherein, the formula of the residual mapping is as follows:

F(x)＝H(x)-x

wherein x is the input of the modular structure, i.e. the first feature map, h (x) is the output of the modular structure, i.e. the processed feature map, and f (x) is the output of the convolved block attention model, i.e. the sixth feature map. The processed characteristic diagram can be obtained by the formula as follows:

H(x)＝F(x)+x。

compared with the standard convolution, the method greatly reduces the calculation amount by combining the point-by-point group convolution and the depth convolution. The calculated quantity formula of the standard convolution is as follows:

D_k*D_k*M*N*D_w*D_h

wherein D is_h*D_kThe convolution kernel size of standard convolution, M is the number of channels of the input feature map, N is the number of channels of the output feature map, D_wIs the width of the feature map, D_hIs the height of the feature map;

the amount of computation for the point-by-point set convolution combined with the depth convolution is:

wherein D is_k*D_kThe convolution kernel size for the deep convolution may be the same as the convolution kernel size of the standard convolution, where M is the number of input channels for the point-by-point group convolution, i.e., the number of channels of the first feature map, N is the number of output channels for the point-by-point group convolution, i.e., the number of output channels for the deep convolution and the second point-by-point group convolution, i.e., the number of channels for the second feature map, the third feature map, and the fourth feature map, and D_wWidth of the feature map (including the first feature map, the second feature map, the third feature map and the fourth feature map), D_hIs the height of the feature map. D_k*D_k*N*D_w*D _h1/2M N D, a calculated quantity of depth convolution_w*D _h1/2N D, the calculated quantity of point-by-point group convolution before depth convolution_w*D_hThe calculated amount of point-by-point group convolution after the depth convolution.

According to the calculation formula of the standard convolution and the calculation formula of the combination of the point-by-point group convolution and the depth convolution, the calculation quantity is 1/M +1/2D of the standard convolution by adopting the combination of the point-by-point group convolution and the depth convolution_k ²+N/2D_k ²And M, the calculation amount is greatly reduced.

In an embodiment of the application, the performing a dot-by-dot group convolution process on the first feature map to obtain a second feature map includes: dividing the first characteristic diagram and the convolution kernels of 1 x 1 into preset groups according to channels; and carrying out convolution processing on the data of the group corresponding to the convolution kernel in the first characteristic diagram in parallel through each group of convolution kernels to obtain a second characteristic diagram.

The preset group is a group number preset according to needs, and may be 2 groups or the like.

When performing point-by-point group convolution processing on the first feature map, firstly, dividing the first feature map and 1 × 1 convolution kernels into preset groups according to channels, for example, when dividing the first feature map into 2 groups, dividing a part of the channels in front of the first feature map into one group, dividing a part of the channels in back of the first feature map into one group, and grouping the 1 × 1 convolution kernels in the same way; after the first feature map and the 1 × 1 convolution kernels are divided into preset groups, data of a group corresponding to the convolution kernels in the first feature map are subjected to convolution processing in parallel through each group of convolution kernels, namely, each group of convolution kernels respectively learn the data of the group corresponding to the convolution kernels to obtain a second feature map, for example, when the first feature map and the 1 × 1 convolution kernels are divided into 2 groups according to channels in point-by-point group convolution, feature learning is performed in a group convolution mode, each group of convolution kernels respectively learn half of input features, namely the first feature map, and 2 groups of convolution kernels are calculated in parallel. After the point-by-point group convolution divides the first characteristic diagram into the preset groups, the data of the preset groups can be processed in parallel, so that the processing speed can be increased, and the processing efficiency can be improved.

In an embodiment of the present application, the performing channel aliasing processing on the fourth feature map to obtain a fifth feature map includes: and carrying out random channel confusion processing on the feature points belonging to a preset group in the fourth feature map to obtain the fifth feature map.

When point-by-point group convolution processing is performed, because the fourth feature maps obtained by grouping according to the channels are grouped in the same way, when channel aliasing is performed, feature points belonging to a preset group are subjected to random channel aliasing processing, that is, the feature points in each group are aliased according to the sequence of the channels in each group, so that the feature maps of each channel in the second group, the third group, … and the fourth feature map in the second preset group are respectively and randomly inserted between the channels of the fourth feature map in the first group according to the sequence of the channels, and a fifth feature map is obtained. For example, when the point-by-point group convolution is divided into 2 groups, when channel aliasing is performed, the feature map of each channel may be respectively and randomly inserted between the channels of the fourth feature map in the first group according to the order of the channels in the fourth feature map in the second group, so as to obtain the fifth feature map. By performing random channel confusion processing on the fourth feature map, interaction and fusion among features can be further improved.

In an embodiment of the application, the performing, by using a convolution block attention model, channel feature extraction and spatial feature extraction on the fifth feature map to obtain a sixth feature map includes:

determining a channel attention weight in the fifth feature map by a channel attention mechanism in the convolutional block attention model;

according to the channel attention weight, performing weighting operation on each channel in the fifth feature map respectively to obtain a channel attention feature map;

determining spatial attention weights in the channel attention feature map by a spatial attention mechanism in the convolutional block attention model;

and respectively carrying out weighting operation on each feature point in the channel attention feature map according to the spatial attention weight to obtain the sixth feature map.

Fig. 4 is a schematic diagram of a convolution block attention model in an embodiment of the present application, and as shown in fig. 4, the convolution block attention model includes a channel attention mechanism and a spatial attention mechanism. Inputting the fifth feature map into a convolution block attention model, processing the fifth feature map by a channel attention mechanism, determining the attention weight of each channel in the fifth feature map to obtain a channel attention weight, and respectively performing weighting operation on each channel in the fifth feature map according to the channel attention weight, namely multiplying the channel attention weight by the corresponding channel in the fifth feature map to obtain a channel attention feature map. The output of the channel attention mechanism is used as the input of the spatial attention, namely the channel attention feature map is input into the spatial attention, the channel attention feature map is processed by the spatial attention mechanism, the spatial feature is learned, the spatial attention weight in the channel attention feature map is determined, namely the weight of each feature point in the channel attention feature map is determined, the spatial attention weight is respectively weighted on each feature point in the channel attention feature map, namely the spatial attention weight is multiplied by the corresponding feature point in the channel attention feature map, and a sixth feature map is obtained. Through the processing of the convolution block attention model, noise and character content can be distinguished, and the accuracy of character recognition is improved.

In an embodiment of the present application, the determining the channel attention weight in the fifth feature map by the channel attention mechanism in the convolution block attention model includes:

performing global maximum pooling and global average pooling on the fifth feature map based on width and height respectively to obtain a first maximum pooling feature map and a first average pooling feature map;

inputting the first maximum pooling feature map and the first average pooling feature map into a multilayer perceptron respectively to obtain a maximum pooling multilayer perception feature map and an average pooling multilayer perception feature map;

adding corresponding feature points of the maximum pooling multi-layer perception feature map and the average pooling multi-layer perception feature map to obtain a sum feature map;

and carrying out sigmoid activation processing on the addition characteristic graph to obtain the attention weight of the channel.

Inputting the fifth feature map into a convolution block attention model, respectively performing global maximum pooling processing and global average pooling processing based on width and height on the fifth feature map, compressing the fifth feature map on a spatial dimension to aggregate spatial information of the feature mapping to obtain a first maximum pooling feature map and a first average pooling feature map, respectively inputting the first maximum pooling feature map and the first average pooling feature map into a same multilayer Perceptron (MLP), respectively compressing the spatial dimensions of the first maximum pooling feature map and the first average pooling feature map through a multilayer Perceptron (MLP), summing and combining the feature maps element by element to obtain a maximum pooling multilayer perception feature map and an average pooling multilayer perception feature map, and then adding and processing corresponding feature points of the maximum pooling multilayer perception feature map and the average pooling multilayer perception feature map (i.e. eleminwise adding and adding operation), and obtaining an addition feature map, and then carrying out sigmoid activation processing on the addition feature map to obtain the channel attention weight. The spatial dimensions of the feature map are compressed by using both maximal pooling and average pooling, while taking into account all feature points and the maximal feature points on the feature map, improving the accuracy of the channel attention weights.

In an embodiment of the present application, the determining the spatial attention weight in the channel attention feature map by the spatial attention mechanism in the convolution block attention model includes:

respectively carrying out maximum pooling processing and average pooling processing based on channels on the channel attention feature map to obtain a second maximum pooling feature map and a second average pooling feature map;

performing channel-based splicing processing on the second maximum pooling characteristic map and the second average pooling characteristic map to obtain a spliced characteristic map;

and sequentially carrying out convolution processing and sigmoid activation processing on the spliced characteristic graph to obtain the space attention weight.

The channel attention mechanism in the convolution block attention model outputs a channel attention feature map, the channel attention feature map is processed again through a space attention mechanism, the space attention mechanism mainly compresses the channels, and maximum pooling and average pooling are respectively carried out in channel dimensions to obtain a second maximum pooling feature map and a second average pooling feature map, wherein the number of the channels is 1. The maximum pooling treatment is to extract the maximum value on a channel, and the extraction times are the height multiplied by the width; the average pooling process is to take an average over the channels, the number of times extracted being height times width. And then, performing channel-based splicing processing on the second maximum pooling feature map and the second average pooling feature map, namely combining the second maximum pooling feature map and the second average pooling feature map to obtain a spliced feature map, wherein the number of channels of the spliced feature map is 2. And performing convolution processing on the spliced feature map, reducing the dimension of the channel number to 1, and performing sigmoid activation processing on the spliced feature map after the convolution processing to obtain the spatial attention weight. The channel dimensions of the feature map are compressed by simultaneously using maximum pooling and average pooling, and feature points of all channels on the feature map and the maximum feature point in the channels are considered, so that the accuracy of the spatial attention weight is improved.

And step 130, performing pooling treatment on the treated feature map to obtain a pooled feature map.

And obtaining a processed feature map after the processing by the modular structure, wherein the height of the processed feature map is larger, so that the subsequent decoding processing is inconvenient to perform, and at the moment, the processed feature map is subjected to pooling processing to compress the height of the processed feature map, so that the number of network optimization parameters can be reduced, and the pooled feature map is obtained after the processing.

Step 140, determining whether the pooled feature map satisfies a predetermined condition, if not, performing step 150, and if so, performing step 160.

Wherein the preset condition is that the height of the pooled feature map is less than or equal to the height of a convolution kernel when standard convolution processing is subsequently carried out.

When the pooling characteristic map does not meet the preset condition, executing step 150 to perform modular structure processing and pooling processing, and further compressing the height of the characteristic map; when the pooled feature map satisfies the preset condition, step 160 may be performed for standard convolution processing. For example, if the height of the pooled feature map is 5 and the height of the convolution kernel is 3 when the standard convolution processing is performed subsequently, step 150 needs to be performed; step 160 may be performed if the height of the pooled feature map is 2 and the height of the convolution kernel is 3 when standard convolution processing is subsequently performed.

Step 150, taking the pooled feature map as an input of the modular structure, performing feature extraction processing on the pooled feature map through the modular structure to obtain a processed feature map, and then executing step 130.

Taking the pooling feature map as an input of the modular structure, performing feature extraction processing on the pooling feature map again through the modular structure to obtain a processed feature map, and then executing step 130, namely, performing the modular structure processing and pooling processing in a circulating manner until the obtained pooling feature map meets a preset condition.

The process of performing the feature extraction processing on the pooled feature map through the modular structure is the same as above, and is not described herein again.

And 160, performing standard convolution processing on the pooled feature map to obtain a coding result.

And after the pooling characteristic diagram meets the preset condition, performing standard convolution processing on the pooling characteristic diagram, and further compressing the height of the pooling characteristic diagram to enable the height of the pooling characteristic diagram to be 1, thereby obtaining a coding result.

And 170, decoding the coding result to obtain an identification result.

The image to be recognized with the characters to be recognized and the interference noise is coded through the steps to obtain a coding result, and the coding result is decoded to obtain a recognition result, namely the characters to be recognized (such as verification code characters) are obtained. Decoding may be performed using full concatenation and softmax, LSTM (Long Short-Term Memory network) and softmax, or Concatenation Timing Classification (CTC), for example, for identifying an authentication code, decoding of a fixed-length authentication code may be performed using full concatenation and softmax, decoding of an indefinite-length authentication code may be performed using LSTM and softmax, or decoding of an indefinite-length authentication code may be performed using concatenation timing Classification.

In an embodiment of the present application, the decoding the encoding result to obtain an identification result includes:

carrying out connection time sequence classification processing on the coding result to obtain conditional probability distribution of the character sequence;

and determining the recognition result according to the character probability distribution of the label sequence.

The connection time sequence classification is end-to-end model training, data do not need to be aligned in advance, and only one input sequence and one output sequence are needed for training.

The character recognition needs to convert the extracted features into character sequences, and convert the coding result into conditional probability distributions of a plurality of character sequences by performing connection time sequence classification processing on the coding result, that is, for the coding result, the probabilities of all possible character sequences are output by connecting time sequence classification, so that the character sequence with the maximum conditional probability is selected from the conditional probability distributions to serve as the recognition result. The method has the advantages that the connection time sequence classification is used for decoding, characters with different lengths can be recognized, the universality of character recognition is improved, the connection time sequence classification can be decoded without learning any parameter, and the model and the parameter quantity needing to be learned can be guaranteed to be small enough.

The character recognition method provided by the embodiment of the application extracts image features of an image to be recognized through standard convolution and expands the number of channels to generate a first feature map, performs feature extraction processing on the first feature map through a modular structure comprising point-by-point group convolution and depth convolution to obtain a processed feature map, performs pooling processing on the processed feature map to obtain a pooled feature map, performs the modular structure processing and the pooling processing circularly by taking the pooled feature map as input of the modular structure until the obtained pooled feature map meets a preset condition, performs standard convolution processing on the pooled feature map to obtain a coding result, and decodes the coding result to obtain a recognition result. Because the features are extracted by adopting the point-by-point group convolution and the deep convolution instead of the standard convolution, on the premise of ensuring the identification precision, the calculation complexity is reduced, the identification speed is improved, the consumption of calculation resources is reduced, and the method can operate under various computing platforms.

Example two

In the present embodiment, as shown in fig. 5, the character recognition apparatus 500 includes:

a channel expansion module 510, configured to extract image features of the image to be identified through standard convolution and expand the number of channels to generate a first feature map;

a feature extraction module 520, configured to perform feature extraction processing on the first feature map through a modular structure including point-by-point group convolution and depth convolution to obtain a processed feature map;

a pooling processing module 530, configured to perform pooling processing on the processed feature map to obtain a pooled feature map;

a circulation control module 540, configured to take the pooling characteristic map as an input of the modular structure, and perform the modular structure processing and pooling processing in a circulation manner until the obtained pooling characteristic map meets a preset condition;

a standard convolution processing module 550, configured to perform standard convolution processing on the pooled feature map to obtain a coding result;

and a decoding module 560, configured to decode the encoding result to obtain an identification result.

Optionally, the feature extraction module includes:

the first point-by-point group convolution unit is used for performing point-by-point group convolution processing on the first feature map to obtain a second feature map;

the depth convolution unit is used for performing depth convolution processing on the second feature map to obtain a third feature map;

the second point-by-point group convolution unit is used for carrying out inter-channel fusion processing on the third feature map through point-by-point group convolution to obtain a fourth feature map;

a channel confusion unit, configured to perform channel confusion processing on the fourth feature map to obtain a fifth feature map;

the convolution block attention processing unit is used for carrying out channel feature extraction and space feature extraction on the fifth feature map through a convolution block attention model to obtain a sixth feature map;

and the residual error mapping unit is used for carrying out residual error mapping processing on the sixth characteristic diagram and the first characteristic diagram to obtain the processed characteristic diagram.

Optionally, the first point-by-point group convolution unit is specifically configured to:

dividing the first characteristic diagram and the convolution kernels of 1 x 1 into preset groups according to channels;

and carrying out convolution processing on the data of the group corresponding to the convolution kernel in the first characteristic diagram in parallel through each group of convolution kernels to obtain a second characteristic diagram.

Optionally, the channel obfuscating unit is specifically configured to:

and carrying out random channel confusion processing on the feature points belonging to a preset group in the fourth feature map to obtain the fifth feature map.

Optionally, the convolution block attention processing unit includes:

a channel weight determination subunit, configured to determine a channel attention weight in the fifth feature map through a channel attention mechanism in the convolution block attention model;

the channel weighting subunit is configured to perform weighting operation on each channel in the fifth feature map according to the channel attention weight, so as to obtain a channel attention feature map;

a spatial weight determination subunit, configured to determine a spatial attention weight in the channel attention feature map through a spatial attention mechanism in the convolution block attention model;

and the spatial weighting subunit is configured to perform weighting operation on each feature point in the channel attention feature map respectively according to the spatial attention weight, so as to obtain the sixth feature map.

Optionally, the channel weight determining subunit is specifically configured to:

Optionally, the spatial weight determination subunit is specifically configured to:

Optionally, the decoding module is specifically configured to:

and determining the recognition result according to the conditional probability distribution of the character sequence.

The text recognition device provided in the embodiment of the present application is used to implement the steps of the text recognition method described in the first embodiment of the present application, and the specific implementation of each module of the device refers to the corresponding steps, which are not described herein again.

The character recognition device provided by the embodiment of the application extracts image features of an image to be recognized through a standard convolution by a channel expansion module and expands the number of channels to generate a first feature map, the feature extraction module performs feature extraction processing on the first feature map through a modular structure comprising point-by-point group convolution and depth convolution to obtain a processed feature map, the pooling processing module performs pooling processing on the processed feature map to obtain a pooled feature map, the circulation control module uses the pooled feature map as input of the modular structure to perform the modular structure processing and the pooling processing in a circulation mode until the obtained pooled feature map meets a preset condition, the labeling convolution processing module performs standard convolution processing on the pooled feature map to obtain a coding result, and the decoding module decodes the coding result to obtain a recognition result. Because the features are extracted by adopting the point-by-point group convolution and the deep convolution instead of the standard convolution, on the premise of ensuring the identification precision, the calculation complexity is reduced, the identification speed is improved, the consumption of calculation resources is reduced, and the method can operate under various computing platforms.

EXAMPLE III

Embodiments of the present application also provide an electronic device, as shown in fig. 6, the electronic device 600 may include one or more processors 610 and one or more memories 620 connected to the processors 610. Electronic device 600 may also include input interface 630 and output interface 640 for communicating with another apparatus or system. Program code executed by processor 610 may be stored in memory 620.

The processor 610 in the electronic device 600 calls the program code stored in the memory 620 to perform the text recognition method in the above-described embodiment.

The above elements in the above electronic device may be connected to each other by a bus, such as one of a data bus, an address bus, a control bus, an expansion bus, and a local bus, or any combination thereof.

The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the text recognition method according to the first embodiment of the present application.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.

The text recognition method, the text recognition device, the electronic device and the storage medium provided by the embodiments of the present application are introduced in detail, and a specific example is applied to illustrate the principle and the implementation manner of the present application, and the description of the embodiments is only used to help understanding the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Claims

1. A method for recognizing a character, comprising:

and decoding the coding result to obtain an identification result.

2. The method of claim 1, wherein the performing a feature extraction process on the first feature map through a modular structure comprising a point-by-point group convolution and a depth convolution to obtain a processed feature map comprises:

3. The method according to claim 2, wherein performing a dot-by-dot group convolution process on the first feature map to obtain a second feature map comprises:

4. The method according to claim 3, wherein the performing channel obfuscation processing on the fourth feature map to obtain a fifth feature map comprises:

5. The method according to claim 2, wherein the performing channel feature extraction and spatial feature extraction on the fifth feature map through a rolling block attention model to obtain a sixth feature map comprises:

6. The method of claim 5, wherein determining the channel attention weight in the fifth feature map by a channel attention mechanism in the volume block attention model comprises:

7. The method of claim 5, wherein the determining spatial attention weights in the channel attention feature map by a spatial attention mechanism in the volume block attention model comprises:

8. The method according to any one of claims 1-7, wherein said decoding the encoded result to obtain the recognition result comprises:

9. A character recognition apparatus, comprising:

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of recognizing words according to any one of claims 1 to 8 when executing the computer program.

11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for word recognition according to any one of claims 1 to 8.