CN116168394A - Image text recognition method and device - Google Patents

Image text recognition method and device

Info

Publication number
CN116168394A
CN116168394A
Authority
CN
China
Prior art keywords
matrix
image
decoding
feature
encoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310195882.0A
Other languages
Chinese (zh)
Inventor
刘腾龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Oriental Education Technology Group Co., Ltd.
Original Assignee
New Oriental Education Technology Group Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New Oriental Education Technology Group Co., Ltd.
Priority to CN202310195882.0A
Publication of CN116168394A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image text recognition method, an image text recognition device, a computing device, and a computer readable storage medium, and relates to the field of computer technology. The method performs nonlinear position encoding on a first image feature of an image to be recognized to obtain a second image feature, and scales a first matrix of the second image feature to obtain a second matrix, where the first matrix includes a first key matrix and/or a first value matrix. The second matrix, with its reduced dimensionality, is then used to encode and decode based on a multi-head attention mechanism to obtain the text information in the image to be recognized. The method and device can reduce the computational cost of encoding and decoding, thereby improving the efficiency of image text recognition.

Description

Image text recognition method and device
Technical Field
The present application relates to the field of computer technology, and in particular, to an image text recognition method, apparatus, computing device, and computer readable storage medium.
Background
Optical character recognition (optical character recognition, OCR) is a process of scanning or photographing text data (e.g., test paper) or a natural scene with an electronic device (e.g., a scanner or a digital camera) to obtain an image file, then analyzing the image file, and extracting text and layout information in the image file.
Irregular text recognition is one of the difficulties in the OCR field. In the conventional method, a plurality of feature maps of an image to be recognized can be extracted according to a convolutional neural network (convolutional neural networks, CNN), and the feature maps are encoded in position and then encoded and decoded based on an attention mechanism, so that texts in the image to be recognized are obtained. However, the coding and decoding operation time based on the attention mechanism is long, and the efficiency of image text recognition is reduced.
Disclosure of Invention
The application provides an image text recognition method, an image text recognition device, a computing device and a computer readable storage medium. The method encodes and decodes the key matrix and/or the value matrix with reduced dimensionality based on a multi-head attention mechanism, so that the calculated amount of encoding and decoding can be reduced, and the efficiency of image character recognition is improved.
In a first aspect, there is provided an image text recognition method, including: acquiring first image features of an image to be identified, wherein the first image features comprise a plurality of feature maps; performing nonlinear position coding on the first image feature to obtain a second image feature, wherein the second image feature comprises horizontal position information and vertical position information; acquiring a first matrix according to the second image features, wherein the first matrix comprises a first key matrix and/or a first value matrix, the first key matrix is used for indicating the queried second image features, and the first value matrix is used for indicating values of the queried second image features; scaling the first matrix to obtain a second matrix, wherein the second matrix has a dimension lower than the dimension of the first matrix; encoding and decoding the second image feature based on a multi-head attention mechanism according to the second matrix to obtain text information in the image to be identified; and outputting text information in the image to be recognized.
In the embodiment of the application, when encoding and decoding are performed based on a multi-head attention mechanism, the key matrix and/or the value matrix are scaled, so that the dimension of the key matrix and/or the value matrix is reduced. Thus, the calculation amount of encoding and decoding can be reduced, and the efficiency of image character recognition is improved.
With reference to the first aspect, in certain implementations of the first aspect, scaling the first matrix to obtain a second matrix includes: performing row-column transformation on the first matrix to obtain an intermediate matrix, wherein the number of rows of the intermediate matrix is smaller than that of the first matrix; and carrying out weight transformation on the intermediate matrix to obtain a second matrix, wherein the number of columns of the second matrix is smaller than that of the intermediate matrix.
In the embodiment of the application, row-column transformation and weight transformation are sequentially performed on the first matrix to obtain a second matrix. The dimensions of the second matrix are lower than the dimensions of the first matrix, in particular the number of rows of the second matrix is smaller than the number of rows of the first matrix, that is to say the dimensions of the key matrix and/or the value matrix are reduced. Thus, the calculation amount of encoding and decoding can be reduced, and the efficiency of image character recognition is improved.
With reference to the first aspect, in certain implementations of the first aspect, scaling the first matrix to obtain a second matrix includes: scaling the first matrix to obtain the second matrix according to the following formula:
Ŷ1 = reshape(Y1, (N/G, C·G))
Y2 = linear(Ŷ1)
wherein Y1 is the first matrix, Y2 is the second matrix, Ŷ1 is an intermediate quantity determined according to Y1, reshape is the matrix row-column transformation function, linear is the matrix weight transformation function, Y1 has N rows and C columns, Y2 has N/G rows and C columns, G is a scaling parameter, G is a factor of N, and G is not 1.
According to the embodiment of the application, the reshape function and the linear function are sequentially adopted for processing the first matrix, so that a second matrix is obtained. The dimensions of the second matrix are lower than the dimensions of the first matrix, in particular the number of rows of the second matrix is smaller than the number of rows of the first matrix, that is to say the dimensions of the key matrix and/or the value matrix are reduced. Thus, the time complexity of the multi-head attention mechanism can be reduced, and the efficiency of image character recognition is improved.
With reference to the first aspect, in certain implementations of the first aspect, non-linear position encoding the first image feature to obtain a second image feature includes: acquiring horizontal position information and vertical position information of the first image feature; obtaining a horizontal coding vector according to the horizontal position information, and obtaining a vertical coding vector according to the vertical position information; splicing the horizontal coding vector and the vertical coding vector into a position coding vector; the first image feature is processed according to the position-coding vector to obtain a second image feature.
According to the method and the device for identifying the image text, the nonlinear position coding is used for reserving the position information of the image to be identified, and the image text identification can be effectively carried out.
With reference to the first aspect, in certain implementations of the first aspect, the position-coding vector is obtained according to the following formula:
PE(x, y, 2i) = sin(x·c^(4i/d))
PE(x, y, 2i+1) = cos(x·c^(4i/d))
PE(x, y, d/2+2j) = sin(y·c^(4j/d))
PE(x, y, d/2+2j+1) = cos(y·c^(4j/d))
wherein PE(x, y, ·) is the position encoding vector, c is 10^-4, x indicates the horizontal position of a pixel in the feature map, y indicates the vertical position of the pixel in the feature map, d indicates the dimension of the position encoding vector, i, j ∈ [0, d/4], the pixel belongs to the feature map, and the feature map belongs to the first image feature.
According to the method and the device for identifying the image text, the nonlinear position coding is used for reserving the position information of the image to be identified, and the image text identification can be effectively carried out.
With reference to the first aspect, in certain implementations of the first aspect, encoding and decoding the second image feature based on the multi-head attention mechanism to obtain the recognition text includes: encoding and decoding the second image feature based on the multi-head attention mechanism using a Transformer model to obtain the recognition text, wherein the Transformer model includes an encoding layer and a decoding layer, the encoding layer has at least 2 encoding modules that output third image features with different resolutions, and the decoding layer processes the third image features with different resolutions to obtain the recognition text.
In the embodiment of the application, the encoding layer in the Transformer model can extract features from the second image feature at multiple scales through multiple encoding modules to obtain third image features at multiple scales, so that high-resolution and low-resolution features are captured together. The captured multi-scale third image features are input into the decoding layer together for processing, which avoids the loss of spatial information caused by the gradual reduction of resolution as the network deepens, optimizes the extracted features, and improves the accuracy of the features extracted by the model for prediction.
With reference to the first aspect, in certain implementations of the first aspect, encoding and decoding the second image feature based on the multi-head attention mechanism according to the second matrix to obtain the text information in the image to be identified includes: decoding the second image feature according to a beam search strategy to obtain the text information in the image to be identified.
In the embodiment of the application, the second image feature is decoded according to the beam search strategy, which prevents the decoded output sequence from falling into a locally optimal solution and can improve the decoding accuracy of the model, thereby improving the accuracy of image text recognition.
With reference to the first aspect, in certain implementations of the first aspect, the beam search strategy is specifically configured to: when decoding based on the multi-head attention mechanism, select the optimal two pieces of decoding information in a first decoding step as the output of the first decoding step.
In the embodiment of the application, selecting the optimal two pieces of decoding information in a decoding step as the output of that decoding step prevents the decoded output sequence from falling into a locally optimal solution, and can improve the decoding accuracy of the model, thereby improving the accuracy of image text recognition.
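By way of illustration, a minimal beam search sketch with beam width 2 is given below. This is hypothetical Python, not code from the patent: the decoder interface decode_step, the token conventions, and the scoring are all assumptions.

```python
import torch

def beam_search(decode_step, start_token, end_token, max_len, beam_width=2):
    """Minimal beam search sketch with beam width 2, as described above.

    decode_step(seq) is assumed to return a 1-D tensor of log-probabilities
    over the vocabulary for the next token given the partial sequence seq.
    """
    beams = [([start_token], 0.0)]            # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:          # finished beams are carried over
                candidates.append((seq, score))
                continue
            log_probs = decode_step(seq)
            top_lp, top_idx = torch.topk(log_probs, beam_width)
            for lp, idx in zip(top_lp.tolist(), top_idx.tolist()):
                candidates.append((seq + [idx], score + lp))
        # Keep the best two candidates at each step instead of a single
        # greedy choice, which helps avoid locally optimal output sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return max(beams, key=lambda b: b[1])[0]
```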
With reference to the first aspect, in certain implementation manners of the first aspect, acquiring a first image feature of an image to be identified includes: and acquiring a first image characteristic of the image to be identified according to the convolutional neural network CNN.
In a second aspect, there is provided an image text recognition apparatus including: the acquisition module is used for acquiring first image features of the image to be identified, wherein the first image features comprise a plurality of feature maps; the position coding module is used for carrying out nonlinear position coding on the first image feature so as to obtain a second image feature, wherein the second image feature comprises horizontal position information and vertical position information; the acquisition module is further configured to acquire a first matrix according to the second image feature, where the first matrix includes a first key matrix and/or a first value matrix, the first key matrix is used to indicate the queried second image feature, and the first value matrix is used to indicate a value of the queried second image feature; a scaling module, configured to scale the first matrix to obtain a second matrix, where a dimension of the second matrix is lower than a dimension of the first matrix; the encoding and decoding module is used for encoding and decoding the second image features based on a multi-head attention mechanism according to the second matrix so as to obtain text information in the image to be identified; and the output module is used for outputting text information in the image to be identified.
In the embodiment of the application, when encoding and decoding are performed based on a multi-head attention mechanism, the key matrix and/or the value matrix are scaled, so that the dimension of the key matrix and/or the value matrix is reduced. Thus, the calculation amount of encoding and decoding can be reduced, and the efficiency of image character recognition is improved.
With reference to the second aspect, in certain implementations of the second aspect, the scaling module is specifically configured to: performing row-column transformation on the first matrix to obtain an intermediate matrix, wherein the number of rows of the intermediate matrix is smaller than that of the first matrix; and carrying out weight transformation on the intermediate matrix to obtain a second matrix, wherein the number of columns of the second matrix is smaller than that of the intermediate matrix.
In the embodiment of the application, row-column transformation and weight transformation are sequentially performed on the first matrix to obtain a second matrix. The dimensions of the second matrix are lower than the dimensions of the first matrix, in particular the number of rows of the second matrix is smaller than the number of rows of the first matrix, that is to say the dimensions of the key matrix and/or the value matrix are reduced. Thus, the calculation amount of encoding and decoding can be reduced, and the efficiency of image character recognition is improved.
With reference to the second aspect, in certain implementations of the second aspect, the scaling module is specifically configured to: scaling the first matrix to obtain the second matrix according to the following formula:
Ŷ1 = reshape(Y1, (N/G, C·G))
Y2 = linear(Ŷ1)
wherein Y1 is the first matrix, Y2 is the second matrix, Ŷ1 is an intermediate quantity determined according to Y1, reshape is the matrix row-column transformation function, linear is the matrix weight transformation function, Y1 has N rows and C columns, Y2 has N/G rows and C columns, G is a scaling parameter, G is a factor of N, and G is not 1.
According to the embodiment of the application, the reshape function and the linear function are sequentially adopted for processing the first matrix, so that a second matrix is obtained. The dimensions of the second matrix are lower than the dimensions of the first matrix, in particular the number of rows of the second matrix is smaller than the number of rows of the first matrix, that is to say the dimensions of the key matrix and/or the value matrix are reduced. Thus, the time complexity of the multi-head attention mechanism can be reduced, and the efficiency of image character recognition is improved.
With reference to the second aspect, in certain implementations of the second aspect, the position encoding module is specifically configured to: acquiring horizontal position information and vertical position information of the first image feature; obtaining a horizontal coding vector according to the horizontal position information, and obtaining a vertical coding vector according to the vertical position information; splicing the horizontal coding vector and the vertical coding vector into a position coding vector; the first image feature is processed according to the position-coding vector to obtain a second image feature.
According to the method and the device for identifying the image text, the nonlinear position coding is used for reserving the position information of the image to be identified, and the image text identification can be effectively carried out.
With reference to the second aspect, in certain implementations of the second aspect, the position-coding vector is obtained according to the following formula:
PE(x, y, 2i) = sin(x·c^(4i/d))
PE(x, y, 2i+1) = cos(x·c^(4i/d))
PE(x, y, d/2+2j) = sin(y·c^(4j/d))
PE(x, y, d/2+2j+1) = cos(y·c^(4j/d))
wherein PE(x, y, ·) is the position encoding vector, c is 10^-4, x indicates the horizontal position of a pixel in the feature map, y indicates the vertical position of the pixel in the feature map, d indicates the dimension of the position encoding vector, i, j ∈ [0, d/4], the pixel belongs to the feature map, and the feature map belongs to the first image feature.
According to the method and the device for identifying the image text, the nonlinear position coding is used for reserving the position information of the image to be identified, and the image text identification can be effectively carried out.
With reference to the second aspect, in certain implementations of the second aspect, the codec module is specifically configured to: encode and decode the second image feature based on the multi-head attention mechanism using a Transformer model to obtain the recognition text, wherein the Transformer model includes an encoding layer and a decoding layer, the encoding layer has at least 2 encoding modules that output third image features with different resolutions, and the decoding layer processes the third image features with different resolutions to obtain the recognition text.
In the embodiment of the application, the encoding layer in the Transformer model can extract features from the second image feature at multiple scales through multiple encoding modules to obtain third image features at multiple scales, so that high-resolution and low-resolution features are captured together. The captured multi-scale third image features are input into the decoding layer together for processing, which avoids the loss of spatial information caused by the gradual reduction of resolution as the network deepens, optimizes the extracted features, and improves the accuracy of the features extracted by the model for prediction.
With reference to the second aspect, in certain implementations of the second aspect, the codec module is specifically configured to: decode the second image feature according to a beam search strategy to obtain the text information in the image to be identified.
In the embodiment of the application, the second image feature is decoded according to the beam search strategy, which prevents the decoded output sequence from falling into a locally optimal solution and can improve the decoding accuracy of the model, thereby improving the accuracy of image text recognition.
With reference to the second aspect, in certain implementations of the second aspect, the beam search strategy is specifically configured to: when decoding based on the multi-head attention mechanism, select the optimal two pieces of decoding information in a first decoding step as the output of the first decoding step.
In the embodiment of the application, selecting the optimal two pieces of decoding information in a decoding step as the output of that decoding step prevents the decoded output sequence from falling into a locally optimal solution, and can improve the decoding accuracy of the model, thereby improving the accuracy of image text recognition.
With reference to the second aspect, in certain implementations of the second aspect, the obtaining module is specifically configured to: and acquiring a first image characteristic of the image to be identified according to the convolutional neural network CNN.
In a third aspect, there is provided a computing device comprising a processor and a memory, the processor being operable to execute instructions stored in the memory to cause the computing device to perform the image text recognition method of the first aspect or any implementation thereof.
In a fourth aspect, there is provided a computer program product comprising instructions which, when executed by a computing device, cause the computing device to perform the image text recognition method of the first aspect or any implementation thereof.
In a fifth aspect, a computer readable storage medium is provided, comprising computer program instructions which, when executed by a computing device, perform the image text recognition method of the first aspect or any implementation thereof.
Drawings
Fig. 1 is a schematic block diagram of an image text recognition apparatus provided in an embodiment of the present application.
Fig. 2 is a schematic flow chart of an image text recognition method provided in an embodiment of the present application.
Fig. 3 is a schematic architectural diagram of a Transformer model.
Fig. 4 is a schematic flow chart of a Transformer model provided in an embodiment of the present application.
Fig. 5 is a schematic flow chart of an image text recognition device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the present application will be described below with reference to the accompanying drawings.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation of the embodiments of the present application.
Unless defined otherwise, all technical and scientific terms used in the examples of this application have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items. Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system, apparatus and module may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules is merely a logical function division, and there may be additional divisions of actual implementation, e.g., multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
If the functions are implemented in the form of a software functional module and sold or used as a stand-alone product, they may be stored on a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Optical character recognition (optical character recognition, OCR) is a process of scanning or photographing text data (e.g., test paper) or a natural scene with an electronic device (e.g., a scanner or a digital camera) to obtain an image file, then analyzing the image file, and extracting text and layout information in the image file.
Irregular text recognition is one of the difficulties in the OCR field. In the conventional method, a plurality of feature maps of an image to be recognized can be extracted according to a convolutional neural network (convolutional neural networks, CNN), and the feature maps are encoded in position and then encoded and decoded based on an attention mechanism, so that texts in the image to be recognized are obtained. However, the coding and decoding operation time based on the attention mechanism is long, and the efficiency of image text recognition is reduced.
Based on the above, the embodiments of the present application provide an image text recognition method, apparatus, computing device and computer readable storage medium, which reduce the time complexity of a multi-head attention mechanism by performing a transformation operation on a matrix, thereby improving the text recognition efficiency.
Fig. 1 shows a schematic diagram of an image text recognition device according to an embodiment of the present application. As shown in fig. 1, image text recognition device 110 is communicatively coupled to input device 120. The input device 120 captures an image of a natural scene and inputs the image into the image text recognition device 110 in the form of an electrical or other signal. The image text recognition device 110 may perform image text recognition on the inputted image.
The device architecture of fig. 1 is merely exemplary and is not limiting of the scenarios in which embodiments of the present application may be applied. For example, the image text recognition device 110 and the input device 120 may be different devices, such as the input device 120 being an image capturing and preprocessing apparatus, such as a camera or the like, for capturing images and converting the images into electrical or other signals. The image text recognition device 110 may be a local or remote server for storing and processing signals transmitted by the input device 120. As another example, image-text recognition device 110 and input device 120 may be implemented on the same apparatus that may both capture and recognize images. The processing of the signals may be performed in real time or may be performed at intervals, which is not limited in the embodiment of the present application.
The input device 120 may input one image or may input a plurality of images at the same time, which is not limited in the embodiment of the present application.
The image text recognition device 110 may be a device or system having information processing capabilities, such as a computer or server or the like. Specifically, the image text recognition device 110 may include a processor for implementing information processing, for example, text recognition on an image using the technical solution of the embodiment of the present application. The processor may be any kind of processor, such as a special purpose processor or a general purpose processor, which is not limited by the embodiments of the present application.
The image text recognition device 110 may also include a memory. The memory may be used to store information and instructions, such as computer-executable instructions that implement aspects of embodiments of the present application. The memory may also be used to store images. The memory may be any type of memory, and embodiments of the present application are not limited in this regard.
The image text recognition device 110 may further include a communication interface, which is communicatively connected to the input device 120 through the communication interface, where the communication connection may be wired or wireless, and the embodiment of the present application is not limited thereto.
Fig. 2 illustrates an image text recognition method 200 of one embodiment of the present application. The image text recognition method 200 may be performed by the image text recognition device 110 of fig. 1, for example. The image text recognition method shown in fig. 2 includes the following procedure.
Step 210, obtaining a first image feature of an image to be identified, the first image feature including a plurality of feature maps.
The image to be identified may be a scene picture taken or downloaded, for example, the scene picture may include a license plate, an identification card, a bank card, and a billboard waiting for identification text. The image to be identified may also be a photograph of a test paper, etc., which is not limited in this application.
For example, the embodiment of the present application may acquire the first image feature of the image to be identified through the input device 120 as shown in fig. 1, or may store the first image feature of the image to be identified in the memory of the image text recognition device 110 in advance, and acquire the first image feature of the image to be identified from the memory.
In one possible implementation, the first image feature may be extracted by a convolutional neural network (convolutional neural networks, CNN).
CNN is a deep neural network with a convolution structure. The convolutional neural network comprises a feature extractor consisting of a convolutional layer and a sub-sampling layer, which can be regarded as a filter. The convolution layer refers to a neuron layer in the CNN that performs convolution processing on an input signal, and in the embodiment of the present application, the convolution layer may be used to perform convolution processing on an image to be identified. In the convolutional layer of CNN, one neuron in one neuron layer may be connected with only a part of adjacent layer neurons. In a convolution layer, a number of feature planes are typically included, each feature plane may be composed of a number of neurons arranged in a rectangular shape, and the neurons of the same feature plane share a weight, where the shared weight is the convolution kernel. Sharing weights can be understood as the way image information is extracted is independent of location. The convolution kernel may be initialized in the form of a matrix of random size, and may be given more reasonable weight during the training of the CNN. In addition, sharing weights can reduce the connections between layers in the CNN, while reducing the risk of overfitting.
Specifically, the image to be identified may be input into a filter for preprocessing, and the preprocessed image may then be input into the CNN; after passing through convolution layers (convolution layers) and pooling layers, a plurality of feature maps (feature maps) are obtained, and these feature maps may constitute the first image feature. It will be appreciated that a feature map may be a matrix obtained by digitizing the pixels of a portion of the image to be identified.
For example, an image to be identified may be represented using a plurality of tensors, and the size of the image to be identified may be understood as the size of a tensor, i.e., a tensor of size (H, W), where H and W represent the height and the width of the image to be identified, respectively. The size of the tensor of the image to be identified can also be expressed as H×W. The image to be identified is input into the CNN, which performs feature extraction on the (H, W)-sized image to obtain a first tensor of size (H1, W1). The input sequence formed by stacking a plurality of such tensors can be understood as the first image feature of the image to be identified, i.e., the plurality of feature maps.
For example, the CNN may employ a ResNet31 backbone network. For example, H may be scaled to 32, with the width scaled proportionally, before the image is input into the CNN.
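As an illustrative sketch (not from the patent), the following PyTorch snippet shows how a CNN backbone could extract feature maps from a height-normalized image; the two-stage backbone here is a hypothetical stand-in for ResNet31, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in backbone: two conv/pool stages instead of ResNet31.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # halves H and W
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
)

# Height scaled to 32, width scaled proportionally (a 32x128 crop here).
image = torch.randn(1, 3, 32, 128)                # (batch, channels, H, W)
feature_maps = backbone(image)                    # (1, 128, 8, 32)
# The stacked feature maps form the "first image feature".
```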
Step 220, performing nonlinear position coding on the first image feature to obtain a second image feature, wherein the second image feature includes horizontal position information and vertical position information.
The embodiment of the application is used for processing the image to be identified on the whole so as to obtain text information in the image to be identified. The text information may include at least one text line, and each text line may include a plurality of characters. The position and order of the characters in the text line is important, not only as an integral part of the grammar structure of the text line, but also as an important part of the expression semantics. The variation in the position and arrangement of characters in a text line often results in a deviation in the meaning of the entire text line.
When the image to be identified is a scene picture, even if the scene picture is obtained by photographing paper such as a test paper, the text lines in the scene picture are still distorted. Such distortion is irregular, so that the planar scene picture contains irregularly arranged characters. Moreover, these irregularly arranged characters are not simply arranged in one direction, but are arranged across the plane of the scene picture. In other words, it is difficult for linear position encoding to express the position information of the characters in the scene picture.
The first image feature is therefore position encoded, and the encoding may be nonlinear. The second image feature may include horizontal position information and vertical position information, so that nonlinear position information of the characters can be added to the matrices representing the text lines, which helps the CNN model and the Transformer model learn the position information and then output the characters with the correct position and arrangement order.
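A minimal PyTorch sketch of such a nonlinear 2D position encoding follows. The exact index layout (horizontal encodings in the first d/2 dimensions, vertical encodings in the last d/2) is an assumption consistent with the splicing of horizontal and vertical encoding vectors described above.

```python
import torch

def position_encoding_2d(h, w, d, c=1e-4):
    """Sinusoidal 2D position encoding over an h x w feature map (d % 4 == 0)."""
    pe = torch.zeros(h, w, d)
    i = torch.arange(d // 4)                      # i, j in [0, d/4)
    div = c ** (4 * i / d)                        # c = 1e-4, i.e. 1/10000^(4i/d)
    x = torch.arange(w).float()                   # horizontal pixel positions
    y = torch.arange(h).float()                   # vertical pixel positions
    pe[:, :, 0:d // 2:2] = torch.sin(x[None, :, None] * div)     # horizontal, sin
    pe[:, :, 1:d // 2:2] = torch.cos(x[None, :, None] * div)     # horizontal, cos
    pe[:, :, d // 2::2] = torch.sin(y[:, None, None] * div)      # vertical, sin
    pe[:, :, d // 2 + 1::2] = torch.cos(y[:, None, None] * div)  # vertical, cos
    return pe

pe = position_encoding_2d(h=8, w=32, d=128)       # matches an 8x32 feature map
```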
Step 230, obtaining a first matrix according to the second image feature, where the first matrix includes a first key (key) matrix and/or a first value (value) matrix, where the first key matrix is used to indicate the queried second image feature, and the first value matrix is used to indicate a value of the queried second image feature.
To illustrate the first matrix, the attention mechanism and the self-attention mechanism are first introduced. The attention mechanism mimics the internal process of biological observation behavior, i.e., a mechanism that aligns internal experience with external sensation to increase the observation fineness of a partial region, enabling rapid screening of high-value information from a large amount of information with limited attention resources. The attention mechanism can quickly extract important features of sparse data. The self-attention mechanism (self-attention mechanism) is an improvement of the attention mechanism, which reduces reliance on external information and is more adept at capturing internal dependencies of data or features. The essential idea of the attention mechanism can be written as the following formula:
Attention(Query, Source) = Σ(i=1..Lx) Similarity(Query, Key_i) · Value_i
wherein Lx = |Source| represents the length of the source information (Source). The meaning of the formula is that the constituent elements of the Source are imagined to be composed of a series of key-value data pairs. Given a query matrix, the attention score corresponding to each key matrix is obtained by calculating the similarity or correlation between the query matrix and that key matrix. The attention scores are converted into numerical weight coefficients, and the value matrices are weighted and summed according to the weight coefficients to obtain the final attention value. Conceptually, the attention mechanism can be understood as selectively screening out a small amount of important information from a large amount of information and focusing on it, ignoring most of the unimportant information. The focusing process is embodied in the calculation of the weight coefficients: the larger the weight, the more attention is focused on the corresponding value matrix. That is, the weight represents the importance of the information, and the value matrix is the corresponding information.
The calculation formula of the self-attention mechanism is as follows:
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V
wherein K is the key matrix, Q is the query matrix, V is the value matrix, T denotes the matrix transpose operation, and d_k denotes the dimension of the key matrix. The softmax function, also known as the normalized exponential function, converts the prediction results into non-negative numbers whose probabilities sum to 1.
However, assuming the data input into the module has dimension N×C, that is, the Q, K, and V matrices each have N rows and C columns, the time complexity of the self-attention calculation is O(N²), which is very disadvantageous for large images.
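For concreteness, a minimal PyTorch sketch of scaled dot-product attention with the Q/K/V projections follows; the dimensions are assumed, and the sketch illustrates the O(N²) cost rather than reproducing the patent's implementation.

```python
import math
import torch
import torch.nn as nn

N, C = 256, 64                      # assumed sequence length and channel dim
x = torch.randn(N, C)               # flattened second image feature

# Trainable projection weights (W_q, W_k, W_v).
W_q, W_k, W_v = (nn.Linear(C, C, bias=False) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)    # each N x C

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / math.sqrt(C)     # N x N score matrix: quadratic in N
attn = torch.softmax(scores, dim=-1)
out = attn @ V                      # N x C
```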
A multi-head attention mechanism uses multiple query matrices to compute multiple information selections from the input information in parallel, with each head focusing on a different part of the input information. The multi-head attention mechanism includes multiple self-attention mechanisms. When processing data based on a self-attention mechanism, the data to be processed is first projected into a query matrix, a key matrix, and a value matrix; the query matrix and the key matrix are processed according to the self-attention mechanism to obtain weights, the value matrix is then weighted according to these weights, and a linear transformation is finally applied to output the processing result.
In this embodiment of the present application, a first matrix is obtained according to the second image feature, where the first matrix may be a first key matrix or a first value matrix; there may also be two first matrices, namely a first key matrix and a first value matrix. It should be appreciated that the first matrix may not include the first query matrix.
For example, weight matrices W_q, W_k, and W_v may be pre-trained, and processing the second image feature with these weight matrices yields the first key matrix and the first value matrix.
Specifically, the weight matrices W_q, W_k, and W_v may be trained as part of the Transformer model. The second image feature may include a first feature matrix, and multiplying the first feature matrix by the weight matrices W_q, W_k, and W_v respectively yields the first query matrix, the first key matrix, and the first value matrix. The first query matrix is used to query the second image feature; that is, its dot product with the first key matrix gives the attention score. The first key matrix may be used to indicate the second image feature being queried, and the first value matrix may be used to indicate the value of the queried second image feature.
However, the dimensions of the first key matrix and the first value matrix are larger, so that the calculation amount of subsequent encoding and decoding is increased, and the efficiency of image text recognition is reduced.
Step 240, scaling the first matrix to obtain a second matrix, wherein the second matrix has a dimension lower than the dimension of the first matrix.
The embodiment of the application can scale the first matrix to obtain the second matrix.
The embodiment of the application can perform row-column transformation on the first matrix. For example, if the first key matrix has 4 rows and 8 columns, the second key matrix has 2 rows and 16 columns after row-column transformation. As another example, if the first value matrix has 16 rows and 2 columns, the second value matrix has 4 rows and 8 columns after row-column transformation.
As an example, the first matrix may be subjected to weight transformation. For example, a weight matrix W may be generated, and the first matrix may be multiplied by the weight matrix W to obtain a second matrix. It should be appreciated that the number of rows and columns of the weight matrix W generated is adapted to the number of rows and columns of the first matrix, i.e. the first matrix is multiplicable with the weight matrix W. One skilled in the art can generate a suitable weight matrix W such that the dimensions of the second matrix are lower than the first matrix.
As another example, the first matrix may be subjected to row-column transformation and then to weight transformation to obtain the second matrix. One skilled in the art can generate a suitable weight matrix W such that the dimensions of the second matrix are lower than the first matrix.
It should be understood that the scaling of the first matrix according to the embodiments of the present application is not limited to the manner of weight transformation, for example, the first matrix is regarded as a plurality of vectors, and the plurality of vectors may be combined, so as to scale the first matrix.
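As an illustration of this step, the following PyTorch sketch performs the row-column transformation followed by a weight transformation; it assumes the reshape merges rows in groups of G (in the style of spatial-reduction attention) and is a sketch rather than the patent's exact code.

```python
import torch
import torch.nn as nn

N, C, G = 256, 64, 4                     # G is a factor of N and G != 1
first_matrix = torch.randn(N, C)         # first key matrix or first value matrix

# Row-column transformation: merge every G rows into one, giving (N/G) x (C*G).
intermediate = first_matrix.reshape(N // G, C * G)

# Weight transformation: project the columns from C*G back down to C.
weight = nn.Linear(C * G, C)
second_matrix = weight(intermediate)     # (N/G) x C: fewer rows than first_matrix

print(first_matrix.shape, second_matrix.shape)   # [256, 64] -> [64, 64]
```

With the scaled keys and values, the attention score matrix has N×(N/G) entries instead of N×N, which is where the reduction to O(N²/G) discussed later comes from.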
Step 250, according to the second matrix, encoding and decoding the second image feature based on a multi-head attention mechanism to obtain text information in the image to be recognized.
The encoding and decoding is performed on the basis of the second matrix, that is, during the encoding and decoding process, the first matrix is no longer input into the encoder or decoder based on the multi-head attention mechanism, but the scaled second matrix is input into the encoder or decoder based on the multi-head attention mechanism.
The encoding and decoding methods are not limited in this application; any encoding and decoding method based on a multi-head attention mechanism is within the scope of this application. For example, embodiments of the present application may encode and decode based on a Transformer model. The Transformer model can be divided into an encoder (which may also be referred to as an encoding module) and a decoder (which may also be referred to as a decoding module). For example, as shown in fig. 3, the encoder and the decoder each contain 6 blocks (blocks). The input of the encoder is transferred layer by layer through the blocks: it first passes through the first-layer block, the output of the first-layer block is taken as the input of the second-layer block, the output of the second-layer block is taken as the input of the third-layer block, and so on; finally, the output of the sixth-layer block is taken as the input of the decoder.
The Transformer model is not limited to 6 blocks; for example, it may contain 4 blocks. The encoding and decoding process will be described in detail below with reference to fig. 4, taking a Transformer model with 4 blocks as an example.
In the encoding section, as described above, the first image feature is subjected to nonlinear position encoding to obtain the second image feature. The second image feature is input into the Transformer encoder, passes through a multi-head attention mechanism layer and a feedforward neural network layer repeated 4 times, and the encoder then outputs a coding sequence.
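By way of illustration only, the following PyTorch sketch stacks 4 encoder blocks using the library's built-in encoder layer as a stand-in; the patent's own encoder (for example, its multi-scale outputs) is not reproduced, and all dimensions are assumed.

```python
import torch
import torch.nn as nn

# Stand-in encoder: 4 identical blocks, each with a multi-head attention
# layer and a feedforward layer.
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=8, dim_feedforward=256)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

seq = torch.randn(256, 1, 64)     # (sequence, batch, d_model): second image feature
coding_sequence = encoder(seq)    # output coding sequence, same shape
```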
It should be noted that the multi-head attention mechanism layer and the feedforward neural network layer apply residual connection and layer normalization between their input and output. Thus, the final output of the multi-head attention mechanism layer or the feedforward neural network is LayerNorm(x + Sublayer(x)), where LayerNorm is the layer normalization function and Sublayer(x) is the original output of the multi-head attention mechanism layer or the feedforward neural network. Normalizing each layer can accelerate model convergence, prevent gradient vanishing and gradient explosion, and provide a certain regularization effect.
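A minimal sketch of the residual connection plus layer normalization pattern (illustrative PyTorch; the linear sublayer is a stand-in for the attention layer or the FFN):

```python
import torch
import torch.nn as nn

d_model = 64
layer_norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stands in for attention or the FFN

x = torch.randn(10, d_model)
out = layer_norm(x + sublayer(x))        # LayerNorm(x + Sublayer(x))
```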
The basic processing of the multi-headed attention mechanism may be referred to above. Through a multi-head attention mechanism, the information learned from different heads can be combined, and the feature expression capability of the model is enhanced.
The feed-forward neural network (feed forward neural networks, FFN) is similar to a multilayer perceptron (MLP), but the FFN is composed of multiple consecutive nonlinear layers. In the FFN, different neurons belong to different layers; each layer of neurons receives signals from the previous layer and generates signals for the next layer. Layer 0 is the input layer, the last layer is the output layer, and the middle layers are hidden layers. There is no feedback in the whole network: signals propagate one way from the input layer to the output layer, which can be represented by a directed acyclic graph. The FFN may be fully connected.
The FFN may employ the Gaussian error linear unit (gaussian error linear units, GELU) as an activation function, with the specific formula as follows:
GELU(x) = x·P(X ≤ x) = x·Φ(x)
where Φ(x) is the cumulative distribution function of the standard Gaussian distribution evaluated at x, and P denotes probability.
The FFN may employ a random deactivation (dropout) mechanism. By setting the values of a portion of the hidden layer nodes to 0, dropout reduces interactions between hidden layers and thereby reduces overfitting.
Adopting a fully connected FFN with the GELU activation function and dropout can increase the nonlinear expression capability of the model.
For example, FFN may perform data processing according to the following formula:
x_out = MLP(GELU(Conv3×3(MLP(x_in)))) + x_in
wherein x_in is the input, x_out is the output, MLP is an nn.Linear transformation operation, Conv3×3 is a 3×3 convolution transformation, and the GELU function is as described above.
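The following is a sketch of such an FFN block (illustrative PyTorch; the channel sizes, the dropout placement, and the flattened-sequence interface are assumptions rather than the patent's exact configuration):

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """x_out = MLP(GELU(Conv3x3(MLP(x_in)))) + x_in, with dropout."""
    def __init__(self, dim, hidden_dim, p_drop=0.1):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.conv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, padding=1)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x, h, w):
        # x: (N, dim) token sequence flattened from an h x w feature map.
        y = self.fc1(x)                                # (N, hidden_dim)
        y = y.transpose(0, 1).reshape(1, -1, h, w)     # to (1, hidden_dim, h, w)
        y = self.act(self.conv(y))                     # 3x3 convolution + GELU
        y = y.flatten(2).squeeze(0).transpose(0, 1)    # back to (N, hidden_dim)
        y = self.drop(self.fc2(y))
        return y + x                                   # residual connection

ffn = FFN(dim=64, hidden_dim=256)
out = ffn(torch.randn(8 * 32, 64), h=8, w=32)
```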
The decoding part also comprises a multi-head attention mechanism layer and an FFN layer, and the multi-head attention mechanism layer and the FFN layer are subjected to residual connection and layer normalization processing before input and before output, and are not repeated here.
The decoding section has one more multi-head attention mechanism layer than the encoding section. As shown in fig. 4, Q may come from the previous output of the decoder; K and V may come from the output of the encoder or from the previous output.
The multi-headed attention mechanism layer in the decoder may be provided with a masking mechanism. Since the decoding process is sequential decoding, future data cannot be known when the decoder is predicting the current output. Thus, by introducing a masking mechanism, for example by constructing a lower triangular matrix, the situation at model prediction can be simulated.
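A lower triangular mask of this kind can be sketched as follows (illustrative PyTorch):

```python
import torch

T = 5  # length of the output sequence decoded so far

# Lower triangular matrix: position t may only attend to positions <= t,
# so future characters stay hidden from the decoder.
mask = torch.tril(torch.ones(T, T)).bool()

scores = torch.randn(T, T)                          # raw attention scores
scores = scores.masked_fill(~mask, float("-inf"))   # block future positions
attn = torch.softmax(scores, dim=-1)
```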
Referring to fig. 4, the output sequence may be output from the last decoder, and when the present process is the first decoding, training data may be used as the output sequence. The output sequence may be indicative of predicted characters.
For the output sequence to carry position information, the output sequence may be linear position encoded according to the following formula:
PE(pos, 2i) = sin(pos/10000^(2i/d))
PE(pos, 2i+1) = cos(pos/10000^(2i/d))
wherein PE represents the position encoding, pos indicates the arrangement position of the predicted character, d indicates the dimension of the output sequence, and i ∈ [0, d/2].
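An illustrative PyTorch sketch of this linear position encoding (assuming an even dimension d):

```python
import torch

def position_encoding_1d(length, d):
    """PE(pos, 2i) = sin(pos/10000^(2i/d)), PE(pos, 2i+1) = cos(pos/10000^(2i/d))."""
    pe = torch.zeros(length, d)
    pos = torch.arange(length).float().unsqueeze(1)   # arrangement positions
    k = torch.arange(0, d, 2).float()                 # k = 2i
    div = torch.pow(10000.0, k / d)                   # 10000^(2i/d)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe

pe = position_encoding_1d(length=25, d=64)            # one row per predicted character
```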
In the last decoding step, a linear layer and a normalized exponential (softmax) layer are also needed to obtain the text information of the image to be identified.
The linear layer may be a fully connected neural network that projects the output matrix generated by the decoder into a log-probability (logits) matrix of larger dimension, each dimension of which represents a score of the output matrix.
The normalized exponential layer converts the scores of the output matrix into probabilities that sum to one. The character corresponding to the highest probability is taken as the output at the current time point, thereby obtaining the text information in the image to be recognized.
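A sketch of this output stage (illustrative PyTorch; the model dimension and vocabulary size are assumed):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 64, 6000              # assumed sizes

decoder_output = torch.randn(1, d_model)    # decoder output for one time point

linear = nn.Linear(d_model, vocab_size)     # linear layer: project to logits
logits = linear(decoder_output)             # scores over the character set
probs = torch.softmax(logits, dim=-1)       # normalized exponential layer

next_char_id = probs.argmax(dim=-1)         # highest-probability character
```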
Step 260, outputting the text information in the image to be recognized.
In the embodiment of the application, when encoding and decoding are performed based on a multi-head attention mechanism, the key matrix and/or the value matrix are scaled, so that the dimension of the key matrix and/or the value matrix is reduced. Thus, the calculation amount of encoding and decoding can be reduced, and the efficiency of image character recognition is improved.
Optionally, scaling the first matrix to obtain a second matrix, including:
performing row-column transformation on the first matrix to obtain an intermediate matrix, wherein the number of rows of the intermediate matrix is smaller than that of the first matrix;
and carrying out weight transformation on the intermediate matrix to obtain a second matrix, wherein the number of columns of the second matrix is smaller than that of the intermediate matrix.
In this embodiment of the present application, row-column transformation and weight transformation may be performed on the first matrix to obtain the second matrix. It will be appreciated that the dimensions of the second matrix may be smaller than the first matrix, in particular the number of rows of the second matrix may be smaller than the number of rows of the first matrix.
The embodiment of the application can perform row-column transformation on the first matrix. For example, if the first key matrix has 8 rows and 8 columns, the intermediate matrix has 2 rows and 32 columns after row-column transformation. As another example, if the first value matrix has 16 rows and 8 columns, the intermediate matrix has 8 rows and 16 columns after row-column transformation.
The application can perform weight transformation on the intermediate matrix. For example, a weight matrix W may be generated, and the intermediate matrix may be multiplied by the weight matrix W to obtain a second matrix. It should be appreciated that the number of rows and columns of the weight matrix W generated is adapted to the number of rows and columns of the intermediate matrix, i.e. the intermediate matrix is multiplicable with the weight matrix W. One skilled in the art can generate a suitable weight matrix W such that the dimensions of the second matrix are lower than the first matrix.
The implementation manner of the row-column transformation is not limited, for example, the reshape function can be used for carrying out row-column transformation on the first matrix, and other programming methods can also be used for carrying out row-column transformation on the first matrix, so long as the row-column transformation is carried out on the first matrix, all the methods are within the scope of the application.
The implementation manner of the weight transformation is not limited, for example, the linear function can be adopted to perform the weight transformation on the intermediate matrix, and a method of presetting the weight matrix can also be adopted to perform the weight transformation.
In the embodiment of the application, row-column transformation and weight transformation are sequentially performed on the first matrix to obtain a second matrix. The dimensions of the second matrix are lower than the dimensions of the first matrix, in particular the number of rows of the second matrix is smaller than the number of rows of the first matrix, that is to say the dimensions of the key matrix and/or the value matrix are reduced. Thus, the calculation amount of encoding and decoding can be reduced, and the efficiency of image character recognition is improved.
Optionally, scaling the first matrix to obtain a second matrix, including:
scaling the first matrix to obtain the second matrix according to the following formula:
Ŷ_1 = reshape(Y_1, (N/G, C·G));
Y_2 = linear(Ŷ_1);
wherein Y_1 is the first matrix, Y_2 is the second matrix, Ŷ_1 is an intermediate quantity determined according to Y_1, reshape is a matrix row-column transformation function, linear is a matrix weight transformation function, Y_1 has N rows and C columns, so that Ŷ_1 has N/G rows and C·G columns and Y_2 has N/G rows and C columns, G is a scaling parameter, G is a factor of N, and G is not 1.
It should be understood that the first matrix may be the first key matrix and/or the first value matrix. That is, the scaling with the reshape function and the linear function may be performed only on the first key matrix, leaving the first value matrix unprocessed; only on the first value matrix, leaving the first key matrix unprocessed; or on both the first key matrix and the first value matrix.
When the reshape function and the linear function are adopted to process the first key matrix and the first value matrix, the same scaling parameter G may be selected, or different scaling parameters G may be selected. For example, the first key matrix and the first value matrix may both be processed with a scaling parameter of 2, or the first key matrix may be processed with a scaling parameter of 8 and the first value matrix with a scaling parameter of 4.
It should be understood that G is a factor of N and G is not 1. Generally, G may be 2, 4, or 8, but G in the embodiments of the present application is not limited to the above values.
If the first key matrix and the first value matrix are directly input into the transformer model, the time complexity is O(N²). After processing with the reshape function and the linear function, the time complexity is reduced to O(N²/G).
In the embodiment of the application, the reshape function and the linear function are applied to the first matrix in sequence to obtain the second matrix. The dimensions of the second matrix are lower than those of the first matrix; in particular, the number of rows of the second matrix is smaller than the number of rows of the first matrix, that is to say, the dimensions of the key matrix and/or the value matrix are reduced. Thus, the time complexity of the multi-head attention mechanism can be reduced, and the efficiency of image text recognition is improved.
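The following PyTorch sketch (an editor's illustration under the stated assumptions; the module name is hypothetical, and nn.Linear is used as one possible realization of the weight transformation) makes the dimension bookkeeping concrete:

```python
import torch
import torch.nn as nn

class MatrixScaler(nn.Module):
    """Scales an (N, C) matrix down to (N/G, C) via reshape + linear."""
    def __init__(self, C: int, G: int):
        super().__init__()
        self.G = G
        self.linear = nn.Linear(C * G, C)  # weight transformation: C*G -> C columns

    def forward(self, Y1: torch.Tensor) -> torch.Tensor:
        N, C = Y1.shape
        assert N % self.G == 0, "G must be a factor of N"
        Y_hat = Y1.reshape(N // self.G, C * self.G)  # row-column transformation
        return self.linear(Y_hat)                    # second matrix, (N/G, C)

# Example: scale a 64x128 key matrix with G = 4 down to 16x128.
Y2 = MatrixScaler(C=128, G=4)(torch.randn(64, 128))
print(Y2.shape)  # torch.Size([16, 128])
```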
Optionally, non-linear position encoding the first image feature to obtain a second image feature, comprising:
acquiring horizontal position information and vertical position information of the first image feature;
obtaining a horizontal coding vector according to the horizontal position information, and obtaining a vertical coding vector according to the vertical position information;
splicing the horizontal coding vector and the vertical coding vector into a position coding vector;
the first image feature is processed according to the position-coding vector to obtain a second image feature.
In the embodiment of the present application, the horizontal position information of the first image feature may be used to indicate the horizontal position of the pixel on the feature map. Accordingly, the vertical position information of the first image feature may be used to indicate the vertical position of the pixel on the feature map.
A horizontal encoding vector is obtained from the horizontal position information, i.e. the horizontal position information is encoded in the horizontal dimension. Accordingly, a vertical encoding vector is obtained from the vertical position information, i.e., the vertical position information is encoded in the vertical dimension.
The position coding method is not limited in this application, and for example, the position coding method may be performed by adopting a trigonometric function method, or may be performed by adopting an average pooling and activation function method.
The horizontal coding vector and the vertical coding vector are spliced into the position coding vector: the horizontal coding vector may be used as the first half of the position coding vector and the vertical coding vector as the second half. Alternatively, the horizontal coding vector may be added to the vertical coding vector to obtain the position coding vector.
Those skilled in the art will appreciate that the horizontal code vector, the vertical code vector, and the position code vector can be regarded as a horizontal code matrix, a vertical code matrix, and a position code matrix.
In the embodiments of the present application, nonlinear position encoding is used to preserve the position information of the image to be identified, so that image text recognition can be performed effectively.
Optionally, the position-coding vector is obtained according to the following formula:
PE_{(x,y)}(2i) = sin(x · c^{4i/d});
PE_{(x,y)}(2i+1) = cos(x · c^{4i/d});
PE_{(x,y)}(d/2 + 2j) = sin(y · c^{4j/d});
PE_{(x,y)}(d/2 + 2j + 1) = cos(y · c^{4j/d});
wherein PE_{(x,y)} is the position encoding vector of the pixel at (x, y), c is 10^{-4}, x is used to indicate the horizontal position of the pixel in the feature map, y is used to indicate the vertical position of the pixel in the feature map, d is used to indicate the dimension of the position encoding vector, i, j ∈ [0, d/4], the pixel belongs to the feature map, and the feature map belongs to the first image feature.
It will be appreciated that x may be a specific form of horizontal position information and y may be a specific form of vertical position information.
In the embodiment of the present application, the first half of the position encoding vector is used to indicate horizontal position information and the second half is used to indicate vertical position information. Compared with the average pooling and activation function method, the scheme of the embodiment of the application does not mix horizontal and vertical position information during nonlinear position encoding, but keeps them in different parts of the position encoding vector. This reduces the amount of computation and effectively avoids the problem that root-cause analysis cannot be performed when an operation goes wrong.
In the embodiments of the present application, nonlinear position encoding is used to preserve the position information of the image to be identified, so that image text recognition can be performed effectively.
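A sketch of the nonlinear two-dimensional position encoding above (an editor's illustration; it assumes the half-open index range i, j ∈ [0, d/4) so that the encoding vector has exactly d dimensions, and the function name is hypothetical):

```python
import numpy as np

def position_encoding_2d(H: int, W: int, d: int) -> np.ndarray:
    """First d/2 dimensions encode horizontal position x; last d/2 encode vertical y."""
    c = 1e-4
    i = np.arange(d // 4)                      # i, j in [0, d/4)
    freq = c ** (4 * i / d)                    # c^(4i/d)
    x = np.arange(W)[None, :, None]            # horizontal pixel positions
    y = np.arange(H)[:, None, None]            # vertical pixel positions
    pe = np.zeros((H, W, d))
    pe[..., 0:d // 2:2] = np.sin(x * freq)     # horizontal half, even entries
    pe[..., 1:d // 2:2] = np.cos(x * freq)     # horizontal half, odd entries
    pe[..., d // 2::2] = np.sin(y * freq)      # vertical half, even entries
    pe[..., d // 2 + 1::2] = np.cos(y * freq)  # vertical half, odd entries
    return pe

# Example: an 8x32 feature map with a 256-dimensional position encoding.
print(position_encoding_2d(8, 32, 256).shape)  # (8, 32, 256)
```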
Optionally, encoding and decoding the second image feature based on a multi-head attention mechanism to obtain the recognition text comprises:
encoding and decoding the second image feature based on a multi-head attention mechanism using a transformer model to obtain the recognition text, wherein the transformer model comprises an encoding layer having at least 2 encoding modules that output third image features having different resolutions, and a decoding layer that processes the third image features having different resolutions to obtain the recognition text.
For the specific process of encoding and decoding the second image feature based on the multi-head attention mechanism using the transformer model, please refer to the foregoing; a detailed description is omitted here.
When processing the second image feature, the transformer model does not process the entire second image feature at once, but divides it into a number of small second image sub-features (patches). The kernel (or filter, or feature detector) views only one patch of the second image feature at a time and, after one patch has been processed, moves on to another patch, and so on. This enables the filter to process only small blocks of the second image feature at a time, in order to detect features such as edges, and also provides good regularization properties.
For example, when the second image feature corresponds to a matrix, a second image sub-feature may correspond to a sub-matrix of that matrix.
In the embodiment of the application, the encoding layer in the transformer model can extract features of the second image feature at multiple scales through its multiple encoding modules, obtaining third image features at multiple scales, so that high-resolution and low-resolution features are captured together. The captured multi-scale third image features are then input into the decoding layer together for processing. This avoids the loss of spatial information caused by the resolution gradually decreasing as the network deepens, optimizes the extracted features, and improves the accuracy of the features predicted by the model.
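As a highly simplified, hypothetical sketch of the multi-resolution idea (an editor's illustration, not the disclosed architecture): two encoding stages, the second operating at half the resolution of the first, with both outputs returned together:

```python
import torch
import torch.nn as nn

class TwoScaleEncoder(nn.Module):
    """Toy encoder: stage 1 at full resolution, stage 2 at half resolution."""
    def __init__(self, C: int, nhead: int = 8):
        super().__init__()
        self.stage1 = nn.TransformerEncoderLayer(C, nhead, batch_first=True)
        self.down = nn.Conv2d(C, C, kernel_size=2, stride=2)  # halve H and W
        self.stage2 = nn.TransformerEncoderLayer(C, nhead, batch_first=True)

    def forward(self, feat: torch.Tensor):
        B, C, H, W = feat.shape
        t1 = self.stage1(feat.flatten(2).transpose(1, 2))         # (B, H*W, C)
        half = self.down(t1.transpose(1, 2).reshape(B, C, H, W))  # downsample
        t2 = self.stage2(half.flatten(2).transpose(1, 2))         # (B, H*W/4, C)
        return t1, t2  # third image features at two resolutions

hi_res, lo_res = TwoScaleEncoder(C=256)(torch.randn(1, 256, 8, 32))
print(hi_res.shape, lo_res.shape)  # (1, 256, 256) (1, 64, 256)
```

In a real multi-scale encoder the stages would typically also change the channel width, and both feature sets would be fed to the decoding layer; the sketch only shows how features at two resolutions can be produced side by side.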
Optionally, encoding and decoding the second image feature based on a multi-head attention mechanism according to the second matrix to obtain text information in the image to be recognized, including:
and decoding the second image feature according to a beam search strategy to obtain the text information in the image to be identified.
Beam search is a heuristic graph search algorithm, generally used when the solution space of a graph is relatively large. To reduce the space and time occupied by the search, at each step of depth expansion some nodes of relatively poor quality are pruned and some nodes of relatively high quality are retained. This reduces space consumption and improves time efficiency.
During decoding, beam search can prevent the decoded output sequence from falling into a locally optimal solution.
In the embodiment of the application, the second image feature is decoded according to the beam search strategy, so that the decoded output sequence is prevented from falling into a locally optimal solution; the decoding accuracy of the model can thereby be improved, improving the accuracy of image text recognition.
Optionally, the beam search strategy is specifically used for:
when decoding is performed based on a multi-head attention mechanism, selecting the two best pieces of decoding information in a first decoding step as the output of the first decoding step.
It will be appreciated that a decoding step may also be referred to as a time step, i.e., a division of the decoding process according to a time sequence. The first decoding step is one of a plurality of decoding steps in the decoding process and is not specially limited; it may, in particular, be the initial decoding step.
Each decoding step may output a plurality of pieces of decoding information. Conventional decoding strategies (such as the greedy strategy) retain only one of the pieces of decoding information output by a decoding step, which is prone to falling into a locally optimal solution and reduces decoding accuracy in a global sense.
In the embodiment of the application, the two best pieces of decoding information are selected in a decoding step as the output of that decoding step, so that the decoded output sequence is prevented from falling into a locally optimal solution; the decoding accuracy of the model can thereby be improved, improving the accuracy of image text recognition.
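A minimal beam search sketch with beam width 2 (an editor's illustration; `step_fn`, the token conventions, and all names are assumptions, used only to show how the two best candidates are kept at each decoding step):

```python
import math

def beam_search(step_fn, bos, eos, beam_width=2, max_len=20):
    """Keeps the `beam_width` best partial sequences at each decoding step.

    `step_fn(seq)` must return a list of (token, probability) candidates
    for the next position given the partial sequence `seq`.
    """
    beams = [([bos], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                 # finished beams carry over
                candidates.append((seq, score))
                continue
            for tok, p in step_fn(seq):
                candidates.append((seq + [tok], score + math.log(p)))
        # retain only the best `beam_width` candidates of this decoding step
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]
```

Unlike a greedy strategy, which keeps a single candidate and can lock in a locally optimal prefix, the two retained candidates allow a prefix with a slightly lower score to win later.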
Optionally, acquiring the first image feature of the image to be identified includes:
a first image feature of an image to be identified is acquired from a Convolutional Neural Network (CNN).
For the meaning of the CNN and how to obtain the first image feature of the image to be identified from the CNN, please refer to the foregoing; details are not repeated here.
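One possible way to obtain the first image feature (an editor's sketch; the choice of a torchvision ResNet-18 backbone and the input size are assumptions, not the disclosed implementation):

```python
import torch
import torchvision.models as models

# Keep only the convolutional stages of a standard backbone, dropping the
# average-pooling and classification head, to use it as a feature extractor.
backbone = models.resnet18(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 32, 128)        # a text-line image, (B, C, H, W)
first_image_feature = feature_extractor(image)
print(first_image_feature.shape)          # torch.Size([1, 512, 1, 4])
```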
The method embodiments of the present application are described above in detail with reference to fig. 2, fig. 3 and fig. 4; the apparatus embodiments of the present application are described below with reference to fig. 5. The apparatus embodiments correspond to the method embodiments, so portions not described in detail may refer to the foregoing method embodiments, and any possible implementation of the foregoing method may be implemented by the apparatus.
Fig. 5 illustrates an image text recognition apparatus according to an embodiment of the present application. The apparatus 500 may perform the image text recognition method of the embodiments of the present application described above; for example, the apparatus 500 may be the image text recognition device 110 described above.
As shown in fig. 5, the apparatus 500 may include:
an obtaining module 510, configured to obtain a first image feature of an image to be identified, where the first image feature includes a plurality of feature maps;
a position encoding module 520, configured to perform nonlinear position encoding on the first image feature to obtain a second image feature, where the second image feature includes horizontal position information and vertical position information;
the obtaining module 510 is further configured to obtain a first matrix according to the second image feature, where the first matrix includes a first key matrix and/or a first value matrix, the first key matrix is used to indicate the queried second image feature, and the first value matrix is used to indicate a value of the queried second image feature;
a scaling module 530, configured to scale the first matrix to obtain a second matrix, where a dimension of the second matrix is lower than a dimension of the first matrix;
the encoding and decoding module 540 is configured to encode and decode the second image feature based on a multi-head attention mechanism according to the second matrix, so as to obtain text information in the image to be identified;
and an output module 550, configured to output text information in the image to be identified.
In the embodiment of the application, when encoding and decoding are performed based on a multi-head attention mechanism, the key matrix and/or the value matrix are scaled, so that the dimension of the key matrix and/or the value matrix is reduced. This reduces the amount of encoding and decoding calculation, thereby improving the efficiency of image text recognition.
Optionally, the scaling module is specifically configured to:
performing row-column transformation on the first matrix to obtain an intermediate matrix, wherein the number of rows of the intermediate matrix is smaller than that of the first matrix;
and carrying out weight transformation on the intermediate matrix to obtain a second matrix, wherein the number of columns of the second matrix is smaller than that of the intermediate matrix.
In the embodiment of the application, row-column transformation and weight transformation are sequentially performed on the first matrix to obtain a second matrix. The dimensions of the second matrix are lower than the dimensions of the first matrix, in particular the number of rows of the second matrix is smaller than the number of rows of the first matrix, that is to say the dimensions of the key matrix and/or the value matrix are reduced. This reduces the amount of encoding and decoding calculation, thereby improving the efficiency of image text recognition.
Optionally, the scaling module is specifically configured to:
scaling the first matrix to obtain the second matrix according to the following formula:
Ŷ_1 = reshape(Y_1, (N/G, C·G));
Y_2 = linear(Ŷ_1);
wherein Y_1 is the first matrix, Y_2 is the second matrix, Ŷ_1 is an intermediate quantity determined according to Y_1, reshape is a matrix row-column transformation function, linear is a matrix weight transformation function, Y_1 has N rows and C columns, so that Ŷ_1 has N/G rows and C·G columns and Y_2 has N/G rows and C columns, G is a scaling parameter, G is a factor of N, and G is not 1.
According to the embodiment of the application, the reshape function and the linear function are sequentially adopted for processing the first matrix, so that a second matrix is obtained. The dimensions of the second matrix are lower than the dimensions of the first matrix, in particular the number of rows of the second matrix is smaller than the number of rows of the first matrix, that is to say the dimensions of the key matrix and/or the value matrix are reduced. Thus, the time complexity of the multi-head attention mechanism can be reduced, and the efficiency of image text recognition is improved.
Optionally, the position coding module is specifically configured to:
acquiring horizontal position information and vertical position information of the first image feature;
obtaining a horizontal coding vector according to the horizontal position information, and obtaining a vertical coding vector according to the vertical position information;
splicing the horizontal coding vector and the vertical coding vector into a position coding vector;
the first image feature is processed according to the position-coding vector to obtain a second image feature.
In the embodiments of the present application, the position information of the image to be identified is preserved through nonlinear position encoding, so that image text recognition can be performed effectively.
Optionally, the position-coding vector is obtained according to the following formula:
PE_{(x,y)}(2i) = sin(x · c^{4i/d});
PE_{(x,y)}(2i+1) = cos(x · c^{4i/d});
PE_{(x,y)}(d/2 + 2j) = sin(y · c^{4j/d});
PE_{(x,y)}(d/2 + 2j + 1) = cos(y · c^{4j/d});
wherein PE_{(x,y)} is the position encoding vector of the pixel at (x, y), c is 10^{-4}, x is used to indicate the horizontal position of the pixel in the feature map, y is used to indicate the vertical position of the pixel in the feature map, d is used to indicate the dimension of the position encoding vector, i, j ∈ [0, d/4], the pixel belongs to the feature map, and the feature map belongs to the first image feature.
In the embodiments of the present application, the position information of the image to be identified is preserved through nonlinear position encoding, so that image text recognition can be performed effectively.
Optionally, the codec module is specifically configured to:
encoding and decoding the second image feature based on a multi-head attention mechanism using a transformer model to obtain the recognition text, wherein the transformer model comprises an encoding layer having at least 2 encoding modules that output third image features having different resolutions, and a decoding layer that processes the third image features having different resolutions to obtain the recognition text.
In the embodiment of the application, the encoding layer in the transformer model can extract features of the second image feature at multiple scales through its multiple encoding modules, obtaining third image features at multiple scales, so that high-resolution and low-resolution features are captured together. The captured multi-scale third image features are then input into the decoding layer together for processing. This avoids the loss of spatial information caused by the resolution gradually decreasing as the network deepens, optimizes the extracted features, and improves the accuracy of the features predicted by the model.
Optionally, the codec module is specifically configured to:
and decoding the second image feature according to a beam search strategy to obtain the text information in the image to be identified.
In the embodiment of the application, the second image feature is decoded according to the beam search strategy, so that the decoded output sequence is prevented from falling into a locally optimal solution; the decoding accuracy of the model can thereby be improved, improving the accuracy of image text recognition.
Optionally, the beam search strategy is specifically used for:
when decoding is performed based on a multi-head attention mechanism, selecting the two best pieces of decoding information in a first decoding step as the output of the first decoding step.
In the embodiment of the application, the two best pieces of decoding information are selected in a decoding step as the output of that decoding step, so that the decoded output sequence is prevented from falling into a locally optimal solution; the decoding accuracy of the model can thereby be improved, improving the accuracy of image text recognition.
Optionally, the acquiring module is specifically configured to:
and acquiring a first image characteristic of the image to be identified according to the convolutional neural network CNN.
Embodiments of the present application also provide a computing device including a processor and a memory, the processor configured to execute instructions stored in the memory, to cause the computing device to perform any implementation of the foregoing image text recognition method.
Embodiments of the present application also provide a computer program product comprising instructions. The computer program product may be software or a program product containing instructions capable of running on a computing device or stored in any useful medium. The computer program product, when run on at least one computing device, causes the at least one computing device to perform an image text recognition method.
Embodiments of the present application also provide a computer-readable storage medium. The computer readable storage medium may be any available medium that can be stored by a computing device or a data storage device such as a data center containing one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc. The computer-readable storage medium includes instructions that instruct a computing device to perform an image text recognition method.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system, apparatus and module may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules is merely a logical function division, and there may be additional divisions of actual implementation, e.g., multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated into one module, or each module may exist alone physically, or two or more modules may be integrated into one module.
If the functions are implemented in the form of software functional modules and sold or used as a stand-alone product, they can be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present application and are not limiting thereof. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some technical features thereof can be replaced by equivalents; such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (21)

1. An image text recognition method, comprising:
acquiring first image features of an image to be identified, wherein the first image features comprise a plurality of feature maps;
performing nonlinear position coding on the first image feature to obtain a second image feature, wherein the second image feature comprises horizontal position information and vertical position information;
acquiring a first matrix according to the second image features, wherein the first matrix comprises a first key matrix and/or a first value matrix, the first key matrix is used for indicating the queried second image features, and the first value matrix is used for indicating values of the queried second image features;
scaling the first matrix to obtain a second matrix, wherein the second matrix has a dimension lower than the dimension of the first matrix;
encoding and decoding the second image features based on a multi-head attention mechanism according to the second matrix to obtain text information in the image to be identified;
and outputting text information in the image to be identified.
2. The method of claim 1, wherein scaling the first matrix to obtain a second matrix comprises:
Performing row-column transformation on the first matrix to obtain an intermediate matrix, wherein the number of rows of the intermediate matrix is smaller than that of the first matrix;
and carrying out weight transformation on the intermediate matrix to obtain a second matrix, wherein the number of columns of the second matrix is smaller than that of the intermediate matrix.
3. The method of claim 2, wherein scaling the first matrix to obtain a second matrix comprises:
scaling the first matrix to obtain the second matrix according to the following formula:
Ŷ_1 = reshape(Y_1, (N/G, C·G));
Y_2 = linear(Ŷ_1);
wherein Y_1 is the first matrix, Y_2 is the second matrix, Ŷ_1 is an intermediate quantity determined according to Y_1, reshape is a matrix row-column transformation function, linear is a matrix weight transformation function, Y_1 has N rows and C columns, so that Ŷ_1 has N/G rows and C·G columns and Y_2 has N/G rows and C columns, G is a scaling parameter, G is a factor of N, and G is not 1.
4. A method according to any of claims 1-3, wherein non-linear position encoding the first image feature to obtain a second image feature comprises:
acquiring horizontal position information and vertical position information of the first image feature;
obtaining a horizontal coding vector according to the horizontal position information and obtaining a vertical coding vector according to the vertical position information;
Splicing the horizontal coding vector and the vertical coding vector into a position coding vector;
and processing the first image feature according to the position coding vector to obtain a second image feature.
5. The method of claim 4, wherein the position-coding vector is obtained according to the following formula:
PE_{(x,y)}(2i) = sin(x · c^{4i/d});
PE_{(x,y)}(2i+1) = cos(x · c^{4i/d});
PE_{(x,y)}(d/2 + 2j) = sin(y · c^{4j/d});
PE_{(x,y)}(d/2 + 2j + 1) = cos(y · c^{4j/d});
wherein PE_{(x,y)} is the position-coding vector of the pixel at (x, y), c is 10^{-4}, x is used to indicate the horizontal position of a pixel in a feature map, y is used to indicate the vertical position of the pixel in the feature map, d is used to indicate the dimension of the position-coding vector, i, j ∈ [0, d/4], the pixel belongs to the feature map, and the feature map belongs to the first image feature.
6. The method of any of claims 1-5, wherein encoding and decoding the second image feature based on a multi-head attention mechanism to obtain the recognition text comprises:
encoding and decoding the second image features based on a multi-head attention mechanism using a transformer model to obtain a recognition text, wherein the transformer model comprises an encoding layer having at least 2 encoding modules that output third image features having different resolutions, and a decoding layer that processes the third image features having different resolutions to obtain the recognition text.
7. The method according to any of claims 1-6, wherein encoding and decoding the second image features based on a multi-headed attention mechanism according to the second matrix to obtain text information in the image to be identified comprises:
and decoding the second image features according to a beam search strategy to obtain text information in the image to be identified.
8. The method of claim 7, wherein the beam search strategy is specifically for:
when decoding is performed based on a multi-head attention mechanism, selecting the two best pieces of decoding information in a first decoding step as the output of the first decoding step.
9. The method according to any one of claims 1-8, wherein acquiring a first image feature of the image to be identified comprises:
and acquiring a first image characteristic of the image to be identified according to the convolutional neural network CNN.
10. An image text recognition apparatus, comprising:
the acquisition module is used for acquiring first image features of the image to be identified, wherein the first image features comprise a plurality of feature maps;
the position coding module is used for carrying out nonlinear position coding on the first image feature so as to obtain a second image feature, wherein the second image feature comprises horizontal position information and vertical position information;
The acquisition module is further configured to acquire a first matrix according to the second image feature, where the first matrix includes a first key matrix and/or a first value matrix, the first key matrix is used to indicate the queried second image feature, and the first value matrix is used to indicate a value of the queried second image feature;
a scaling module, configured to scale the first matrix to obtain a second matrix, where a dimension of the second matrix is lower than a dimension of the first matrix;
the encoding and decoding module is used for encoding and decoding the second image features based on a multi-head attention mechanism according to the second matrix so as to obtain text information in the image to be identified;
and the output module is used for outputting text information in the image to be identified.
11. The apparatus of claim 10, wherein the scaling module is specifically configured to:
performing row-column transformation on the first matrix to obtain an intermediate matrix, wherein the number of rows of the intermediate matrix is smaller than that of the first matrix;
and carrying out weight transformation on the intermediate matrix to obtain a second matrix, wherein the number of columns of the second matrix is smaller than that of the intermediate matrix.
12. The apparatus of claim 11, wherein the scaling module is specifically configured to:
scaling the first matrix to obtain the second matrix according to the following formula:
Ŷ_1 = reshape(Y_1, (N/G, C·G));
Y_2 = linear(Ŷ_1);
wherein Y_1 is the first matrix, Y_2 is the second matrix, Ŷ_1 is an intermediate quantity determined according to Y_1, reshape is a matrix row-column transformation function, linear is a matrix weight transformation function, Y_1 has N rows and C columns, so that Ŷ_1 has N/G rows and C·G columns and Y_2 has N/G rows and C columns, G is a scaling parameter, G is a factor of N, and G is not 1.
13. The apparatus according to any one of claims 10-12, wherein the position coding module is specifically configured to:
acquiring horizontal position information and vertical position information of the first image feature;
obtaining a horizontal coding vector according to the horizontal position information and obtaining a vertical coding vector according to the vertical position information;
splicing the horizontal coding vector and the vertical coding vector into a position coding vector;
and processing the first image feature according to the position coding vector to obtain a second image feature.
14. The apparatus of claim 13, wherein the position-coding vector is obtained according to the following formula:
PE_{(x,y)}(2i) = sin(x · c^{4i/d});
PE_{(x,y)}(2i+1) = cos(x · c^{4i/d});
PE_{(x,y)}(d/2 + 2j) = sin(y · c^{4j/d});
PE_{(x,y)}(d/2 + 2j + 1) = cos(y · c^{4j/d});
wherein PE_{(x,y)} is the position-coding vector of the pixel at (x, y), c is 10^{-4}, x is used to indicate the horizontal position of a pixel in a feature map, y is used to indicate the vertical position of the pixel in the feature map, d is used to indicate the dimension of the position-coding vector, i, j ∈ [0, d/4], the pixel belongs to the feature map, and the feature map belongs to the first image feature.
15. The apparatus according to any of claims 10-14, wherein the codec module is specifically configured to:
encoding and decoding the second image features based on a multi-head attention mechanism using a transformer model to obtain a recognition text, wherein the transformer model comprises an encoding layer having at least 2 encoding modules that output third image features having different resolutions, and a decoding layer that processes the third image features having different resolutions to obtain the recognition text.
16. The apparatus according to any of claims 10-15, wherein the codec module is specifically configured to:
and decoding the second image features according to a beam search strategy to obtain text information in the image to be identified.
17. The apparatus of claim 16, wherein the beam search strategy is specifically configured to:
when decoding is performed based on a multi-head attention mechanism, select the two best pieces of decoding information in a first decoding step as the output of the first decoding step.
18. The apparatus according to any one of claims 10-17, wherein the acquisition module is specifically configured to:
and acquiring a first image characteristic of the image to be identified according to the convolutional neural network CNN.
19. A computing device comprising a processor and a memory, the processor to execute instructions stored in the memory to cause the computing device to perform the method of any of claims 1-9.
20. A computer program product containing instructions that, when executed by a computing device, cause the computing device to perform the method of any of claims 1-9.
21. A computer readable storage medium comprising computer program instructions which, when executed by a computing device, perform the method of any of claims 1-9.
CN202310195882.0A 2023-02-28 2023-02-28 Image text recognition method and device Pending CN116168394A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310195882.0A CN116168394A (en) 2023-02-28 2023-02-28 Image text recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310195882.0A CN116168394A (en) 2023-02-28 2023-02-28 Image text recognition method and device

Publications (1)

Publication Number Publication Date
CN116168394A true CN116168394A (en) 2023-05-26

Family

ID=86419921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310195882.0A Pending CN116168394A (en) 2023-02-28 2023-02-28 Image text recognition method and device

Country Status (1)

Country Link
CN (1) CN116168394A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117640947A (en) * 2024-01-24 2024-03-01 羚客(杭州)网络技术有限公司 Video image encoding method, article searching method, electronic device, and medium
CN117640947B (en) * 2024-01-24 2024-05-10 羚客(杭州)网络技术有限公司 Video image encoding method, article searching method, electronic device, and medium

Similar Documents

Publication Publication Date Title
CN110738090B (en) System and method for end-to-end handwritten text recognition using neural networks
CN108615036B (en) Natural scene text recognition method based on convolution attention network
CN111160343B (en) Off-line mathematical formula symbol identification method based on Self-Attention
KR20220050758A (en) Multi-directional scene text recognition method and system based on multidimensional attention mechanism
CN113221641B (en) Video pedestrian re-identification method based on generation of antagonism network and attention mechanism
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN112257572B (en) Behavior identification method based on self-attention mechanism
US11568140B2 (en) Optical character recognition using a combination of neural network models
CN115937655A (en) Target detection model of multi-order feature interaction, and construction method, device and application thereof
CN116168394A (en) Image text recognition method and device
CN115187456A (en) Text recognition method, device, equipment and medium based on image enhancement processing
CN116189265A (en) Sketch face recognition method, device and equipment based on lightweight semantic transducer model
Li et al. Image operation chain detection with machine translation framework
Liu et al. Residual recurrent CRNN for end-to-end optical music recognition on monophonic scores
CN113283514B (en) Unknown class classification method, device and medium based on deep learning
Lai et al. Generative focused feedback residual networks for image steganalysis and hidden information reconstruction
CN111967408B (en) Low-resolution pedestrian re-identification method and system based on prediction-recovery-identification
CN115909336A (en) Text recognition method and device, computer equipment and computer-readable storage medium
Zheng et al. Transformer-based hierarchical dynamic decoders for salient object detection
CN115862015A (en) Training method and device of character recognition system, and character recognition method and device
CN113961736B (en) Method, apparatus, computer device and storage medium for text generation image
CN115631115B (en) Dynamic image restoration method based on recursion transform
Bai et al. Parallel global convolutional network for semantic image segmentation
Yuan et al. LR-ProtoNet: Meta-Learning for Low-Resolution Few-Shot Recognition and Classification
Wang et al. Super-resolution of text image based on conditional generative adversarial network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination