CN110516541B - Text positioning method and device, computer readable storage medium and computer equipment

Text positioning method and device, computer readable storage medium and computer equipment

Info

Publication number
CN110516541B
CN110516541B
Authority
CN
China
Prior art keywords
text
boundary
image
feature
invoice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910653482.3A
Other languages
Chinese (zh)
Other versions
CN110516541A (en)
Inventor
胡志成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kingdee Software China Co Ltd
Original Assignee
Kingdee Software China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kingdee Software China Co Ltd
Priority to CN201910653482.3A
Publication of CN110516541A
Application granted
Publication of CN110516541B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition
    • G06V 30/41 Analysis of document content
    • G06V 30/413 Classification of content, e.g. text, photographs or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Input (AREA)

Abstract

The application relates to a text positioning method and device, a computer readable storage medium, and computer equipment. The method comprises the following steps: acquiring an invoice image; extracting text features from the invoice image through a multitask network model, wherein the multitask network model comprises a classification network and a boundary determination network; determining the text box boundaries of the text features through the boundary determination network; classifying the text features according to the classification network to obtain the text features belonging to each text type; generating a position distribution map representing the text features in the invoice image according to the presentation mode corresponding to each text type, and generating a boundary image representing the text features according to the text box boundaries; and determining the position of the text in the invoice image according to the position distribution map and the boundary image. The scheme provided by the application can accurately position the text in an invoice even when the invoice background is complex.

Description

Text positioning method and device, computer readable storage medium and computer equipment
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a text positioning method, a text positioning device, a computer-readable storage medium, and a computer device.
Background
When invoices are entered into a system, the entered invoice information generally needs to be verified. Manual verification requires substantial labor cost, while machine verification requires the invoice to be recognized first. However, invoices often have complex backgrounds, such as textures, background patterns, and seal characters that cover or lie close to the invoice text; this interferes with locating the text in the invoice and in turn affects invoice recognition.
In a traditional text positioning scheme, a threshold is first computed for the invoice image and binarization is performed, rectangular boxes of characters are then obtained using connected domains, and the rectangular boxes are merged repeatedly to obtain the text box of a line of text. However, when the background of the invoice is complex, the accuracy of text positioning suffers.
Disclosure of Invention
Based on this, it is necessary to provide a text positioning method and apparatus, a computer readable storage medium, and a computer device to address the technical problem that text positioning accuracy suffers when the invoice image has a textured or patterned background.
A text localization method, comprising:
acquiring an invoice image;
extracting text features in the invoice image through a multitask network model, wherein the multitask network model comprises a classification network and a boundary determination network;
determining a text box boundary of the text feature through the boundary determination network;
classifying the text features according to the classification network to obtain text features belonging to each text type;
generating a position distribution map representing the text features in the invoice image according to the presentation mode corresponding to each text type, and generating a boundary image representing the text features according to the text box boundaries;
and determining the position of the text in the invoice image according to the position distribution map and the boundary image.
In one embodiment, before determining the position of the text in the invoice image according to the position distribution map and the boundary image, the method further comprises:
adjusting the size of the feature map corresponding to the text features according to a preset size to obtain an adjusted feature map;
performing convolution and pooling on the adjusted feature map in sequence;
flattening the feature map obtained after pooling to obtain a one-dimensional feature vector;
and inputting the one-dimensional feature vector into a fully connected layer of the multitask network model, and processing it through an activation function to obtain a direction vector representing the rotation direction of the invoice image.
In one embodiment, the text features include a plurality of text features obtained from the processing of a plurality of specified convolutional layers; the extracting text features in the invoice image through a multitask network model comprises:
performing up-sampling and at least two convolution operations on the text features obtained by the last specified convolutional layer to obtain intermediate text features;
fusing the intermediate text features with the text features obtained by the previous specified convolutional layer to obtain intermediate fusion features;
and performing up-sampling and at least two convolution operations on the intermediate fusion feature map, and repeating the step of fusing the intermediate text features with the text features obtained by the previous specified convolutional layer, until the intermediate text features obtained by processing are fused with the text features obtained by the first specified convolutional layer, yielding fused text features.
In one embodiment, the method further comprises:
when an invoice image sample and a corresponding reference label are acquired, extracting training text features of the invoice image sample through the multitask network model;
determining a training text box boundary of the training text feature through the boundary determination network;
classifying the training text features according to the classification network to obtain training text features belonging to each text type;
generating a training position distribution map representing the training text features in the invoice image sample according to the presentation mode corresponding to each text type, and generating a training boundary image representing the training text features according to the training text box boundaries;
determining a predicted position of a text in the invoice image sample according to the training position distribution map and the training boundary image;
and calculating a loss value between the predicted position and the reference label, and adjusting parameters in the multitask network model through the loss value until the predicted position output by the multitask network model after the parameters are adjusted meets the position condition.
In one embodiment, the method further comprises: and performing at least one of filtering, image enhancement, gray scale adjustment, erosion operation, random cropping or random rotation operation on the invoice image samples to expand the number of invoice image samples.
A text positioning device, the device comprising:
the image acquisition module is used for acquiring an invoice image;
the feature extraction module is used for extracting text features in the invoice image through a multitask network model; the multitask network model comprises a classification network and a boundary determination network;
the boundary determining module is used for determining the text box boundary of the text feature through the boundary determining network;
the feature classification module is used for classifying the text features according to the classification network to obtain the text features belonging to each text type;
the image generation module is used for generating a position distribution map representing the text features in the invoice image according to the presentation mode corresponding to each text type, and generating a boundary image representing the text features according to the text box boundaries;
and the position determining module is used for determining the position of the text in the invoice image according to the position distribution map and the boundary image.
In one embodiment, the apparatus further comprises:
the direction prediction module is used for adjusting the size of the feature map corresponding to the text features according to a preset size to obtain an adjusted feature map; performing convolution and pooling on the adjusted feature map in sequence; flattening the feature map obtained after pooling to obtain a one-dimensional feature vector; and inputting the one-dimensional feature vector into a fully connected layer of the multitask network model, and processing it through an activation function to obtain a direction vector representing the rotation direction of the invoice image.
In one embodiment, the feature extraction module is further configured to:
performing up-sampling and at least two convolution operations on the text features obtained by the last specified convolutional layer to obtain intermediate text features;
fusing the intermediate text features with the text features obtained by the previous specified convolutional layer to obtain intermediate fusion features;
and performing up-sampling and at least two convolution operations on the intermediate fusion feature map, and repeating the step of fusing the intermediate text features with the text features obtained by the previous specified convolutional layer, until the processed intermediate text features are fused with the text features obtained by the first specified convolutional layer, yielding the fused text features.
In one embodiment, the apparatus further comprises:
the feature extraction module is also used for extracting the training text features of the invoice image sample through the multitask network model when the invoice image sample and the corresponding reference label are acquired;
the boundary determining module is further used for determining the boundary of the training text box of the training text feature through the boundary determining network;
the feature classification module is further used for classifying the training text features according to the classification network to obtain training text features belonging to each text type;
the image generation module is further used for generating a training position distribution map representing the training text features in the invoice image sample according to the presentation mode corresponding to each text type, and generating a training boundary image representing the training text features according to the training text box boundaries;
the position determining module is used for determining the predicted position of the text in the invoice image sample according to the training position distribution map and the training boundary image;
and the parameter adjusting module is used for calculating a loss value between the predicted position and the reference label, and adjusting parameters in the multitask network model through the loss value until the predicted position output by the multitask network model after the parameters are adjusted meets the position condition.
In one embodiment, the apparatus further comprises:
and the processing module is used for performing at least one of filtering, image enhancement, gray scale adjustment, erosion operation, random cropping, or random rotation on the invoice image samples so as to expand the number of the invoice image samples.
A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the text localization method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the text localization method.
According to the above text positioning method and device, computer readable storage medium, and computer equipment, the text features in the invoice image are extracted through the multitask network model, the text box boundaries of the text features are determined through the boundary determination network, and a position distribution map representing the text features in the invoice image is generated according to the presentation mode corresponding to each text type, so that the positions of text in the invoice can be located according to the position distribution map. In addition, a boundary image representing the text features is generated according to the text box boundaries, and the position of the text in the invoice image is determined according to the position distribution map and the boundary image. The boundary image of the text features avoids adjacent text boxes sticking together, so that even if the invoice has textures, background patterns, or seal characters that cover or lie close to the invoice text, the positioning of the text in the invoice is not affected, and the accuracy of text positioning is effectively improved.
Drawings
FIG. 1 is a diagram of an application environment of a text positioning method in one embodiment;
FIG. 2 is a flowchart illustrating a text positioning method according to an embodiment;
FIG. 3 is a diagram illustrating the structure of a multitask network model in one embodiment;
FIG. 4 is a schematic diagram of a sample invoice image in one embodiment;
FIG. 5 is a schematic illustration of a position distribution map and a boundary image in one embodiment;
FIG. 6 is a flowchart illustrating a text positioning method according to another embodiment;
FIG. 7 is a flowchart illustrating the training steps for the multitask network model in one embodiment;
FIG. 8 is a block diagram of a text positioning device in one embodiment;
FIG. 9 is a block diagram showing the structure of a text positioning device in another embodiment;
FIG. 10 is a block diagram showing a configuration of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to illustrate the present application and are not intended to limit it.
FIG. 1 is a diagram of an application environment for the text positioning method in one embodiment. Referring to fig. 1, the text positioning method is applied to a text positioning system. The text positioning system includes a terminal 110 and a server 120, which are connected through a network. The server 120 acquires an invoice image photographed or scanned by the terminal 110; extracts text features in the invoice image through a multitask network model, where the multitask network model comprises a classification network and a boundary determination network; determines the text box boundaries of the text features through the boundary determination network; classifies the text features according to the classification network to obtain the text features belonging to each text type; generates a position distribution map representing the text features in the invoice image according to the presentation mode corresponding to each text type, and generates a boundary image representing the text features according to the text box boundaries; and determines the position of the text in the invoice image according to the position distribution map and the boundary image.
The terminal 110 may specifically be a scanner, a camera, or another terminal with a camera, such as at least one of a mobile phone, a tablet computer, and a notebook computer. The server 120 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
As shown in FIG. 2, in one embodiment, a text localization method is provided. The embodiment is mainly illustrated by applying the method to the server 120 in fig. 1. Referring to fig. 2, the text positioning method specifically includes the following steps:
and S202, acquiring an invoice image.
The invoice image is an image obtained by shooting (or scanning) an invoice, and the invoice image contains invoice information.
In one embodiment, the server obtains an invoice image captured by the terminal. Specifically, when an invoice image is obtained by photographing the invoice, the terminal uploads the captured invoice image to the server in real time, and the server receives the invoice image sent by the terminal. Alternatively, when an invoice image is obtained by photographing the invoice, the terminal stores the captured invoice image and, when an upload instruction is received, sends the stored invoice image to the server, so that the server obtains the invoice image. The terminal can obtain the invoice image either by photographing the invoice or by scanning it.
S204, extracting text features in the invoice image through a multitask network model; the multitask network model comprises a classification network and a boundary determination network.
The multitask network model may be a neural network model capable of completing multiple tasks, such as determining the rotation direction of the invoice image, determining the distribution of text in the invoice image, and determining the boundary of each line of text. The multitask network model comprises a plurality of convolutional layers; each convolutional layer convolves the feature map output by the previous layer (or the input invoice image) with a convolution kernel, and the convolution result is input into an activation function to obtain a new feature map. Convolution enables parameter sharing, which reduces the amount of computation while improving the generalization capability of the multitask network model. Text refers to the written form of language, usually a word, a sentence, or a combination of sentences with a complete, systematic meaning.
In one embodiment, the multitask network model is trained according to invoice image samples and corresponding reference labels. An invoice image sample may be as shown in fig. 4, and the reference label may be the text content in fig. 4 together with the corresponding position information, such as { "XX industrial park national tax bureau general quota invoice", "120.7694318343, 77.6124962491", "741.334197955, 81.7459398897", "741.1408126866,124.4490705736", "120.4274402846,118.034681087" }, where the text part is the text content and the numeric parts are the coordinate information of the four corners of the text content. Thus, in the prediction process, the server extracts features of the invoice through the multitask network, and the text positions in the invoice image can finally be located.
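For illustration, a reference label of this form can be split into its text content and a four-corner quadrilateral. The sketch below is hypothetical: the helper name and the assumption that the text comes first, followed by four "x, y" corner strings, are taken from the example label above, not from the patent.

```python
import numpy as np

def parse_label(fields):
    """Split one reference label into (text, 4x2 corner array).

    Assumes fields[0] is the text content and fields[1:] are four
    "x, y" corner coordinates, as in the example label above.
    """
    text = fields[0]
    corners = np.array(
        [[float(v) for v in f.split(",")] for f in fields[1:]],
        dtype=np.float32,
    )  # shape (4, 2): one (x, y) row per corner
    return text, corners

text, quad = parse_label([
    "XX industrial park national tax bureau general quota invoice",
    "120.7694318343, 77.6124962491",
    "741.334197955, 81.7459398897",
    "741.1408126866,124.4490705736",
    "120.4274402846,118.034681087",
])
```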
In one embodiment, the text features include a plurality of text features obtained from the processing of a plurality of specified convolutional layers; S204 may specifically include: performing up-sampling and at least two convolution operations on the text features obtained by the last specified convolutional layer to obtain intermediate text features; fusing the intermediate text features with the text features obtained by the previous specified convolutional layer to obtain intermediate fusion features; and performing up-sampling and at least two convolution operations on the intermediate fusion feature map, and repeating the step of fusing the intermediate text features with the text features obtained by the previous specified convolutional layer, until the intermediate text features obtained by processing are fused with the text features obtained by the first specified convolutional layer, yielding fused text features. Fig. 3 may be referred to for the specified convolutional layers.
In one embodiment, the server extracts text features in the invoice image through a shared network layer of the multitask network model. For example, as shown in fig. 3, the shared network layer may be the network layers from the input layer to the feature concatenation (depth concat) layer. The shared network layer has a series of convolutional layers (Conv) and pooling layers (MaxPool), and two dilated convolutional layers are introduced after feature3 in order to expand the receptive field of the neurons; the dilation rate of the first dilated convolutional layer is 2, and that of the second is 3. Through the shared network layer, the server a) performs at least two convolution operations (kernel size 3 × 3) on the input invoice image, then a pooling operation (window size 3 × 3, stride 2), then at least two convolution operations (kernel size 3 × 3), then another pooling operation (window size 3 × 3, stride 2), and then at least one convolution operation (kernel size 3 × 3) and one convolution operation (kernel size 1 × 1) to obtain feature1. b) feature1 is subjected to at least one pooling operation (window size 3 × 3, stride 2), at least one convolution operation (kernel size 3 × 3), and at least one convolution operation (kernel size 1 × 1), resulting in feature2. c) At least two pooling operations (window size 2 × 2, stride 2) and at least two convolution operations (kernel size 3 × 3) are then performed on feature2, resulting in feature3. d) At least one convolution operation (kernel size 3 × 3) with dilation rate 2 and at least one with dilation rate 3 are performed in sequence on feature3, and the convolved features are then concatenated with feature3 to obtain feature4. feature4 is convolved (kernel size 1 × 1), then upsampled (by a factor of 2) and convolved (kernel size 3 × 3), and the obtained features are fused with feature3 to obtain the fused feature de_feature. feature4 and/or de_feature are the text features in the invoice image from the multitask network model.
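For illustration, the shared layers just described can be sketched in PyTorch. This is a minimal sketch, not the patent's exact network: the channel widths, paddings, bilinear interpolation for upsampling, and concatenation for fusion are assumptions; only the layer ordering (the convolution and pooling stages producing feature1 to feature3, the two dilated convolutions with rates 2 and 3 producing feature4, and the upsample-and-fuse step producing de_feature) follows the description above. conv_block is reused by the later sketches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout, k=3, dilation=1):
    # conv + ReLU; padding keeps the spatial size unchanged
    pad = dilation * (k // 2)
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=pad, dilation=dilation),
        nn.ReLU(inplace=True),
    )

class SharedBackbone(nn.Module):
    """Shared layers producing feature1..feature4 and the fused de_feature."""

    def __init__(self):
        super().__init__()
        # a) two convs, pool, two convs, pool, 3x3 conv, 1x1 conv -> feature1
        self.stage1 = nn.Sequential(
            conv_block(3, 32), conv_block(32, 32),
            nn.MaxPool2d(3, stride=2, padding=1),
            conv_block(32, 64), conv_block(64, 64),
            nn.MaxPool2d(3, stride=2, padding=1),
            conv_block(64, 64), conv_block(64, 64, k=1),
        )
        # b) pool, 3x3 conv, 1x1 conv -> feature2
        self.stage2 = nn.Sequential(
            nn.MaxPool2d(3, stride=2, padding=1),
            conv_block(64, 128), conv_block(128, 128, k=1),
        )
        # c) two pools and two convs -> feature3
        self.stage3 = nn.Sequential(
            nn.MaxPool2d(2, stride=2), conv_block(128, 256),
            nn.MaxPool2d(2, stride=2), conv_block(256, 256),
        )
        # d) dilated convolutions (rates 2 and 3) widen the receptive field
        self.dilated = nn.Sequential(
            conv_block(256, 256, dilation=2),
            conv_block(256, 256, dilation=3),
        )
        self.reduce = conv_block(512, 256, k=1)   # 1x1 conv after concatenation
        self.post_up = conv_block(256, 256)       # 3x3 conv after upsampling
        self.fuse = conv_block(256 + 256, 256, k=1)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        # feature4: dilated features concatenated with feature3
        f4 = self.reduce(torch.cat([self.dilated(f3), f3], dim=1))
        # de_feature: upsample feature4 to feature3's size, conv, then fuse
        up = F.interpolate(f4, size=f3.shape[2:], mode="bilinear",
                           align_corners=False)
        de_feature = self.fuse(torch.cat([self.post_up(up), f3], dim=1))
        return f1, f2, f3, f4, de_feature
```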
S206, determining the text box boundary of the text feature through the boundary determination network.
The boundary determination network, which determines the boundaries of text boxes, has a network structure similar to that of the classification network.
For example, as shown in fig. 3, the server processes de_feature through the boundary determination network and draws the text box boundaries according to where the text features are located. Specifically, de_feature is convolved, then upsampled (by a factor of 2) and convolved (kernel size 3 × 3), and the obtained features are fused with feature2; these operations are repeated until the obtained features are fused with feature1, yielding the text box boundaries for locating the text features. The activation function in the boundary determination network adopts softmax.
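Continuing the sketch, such a head can be written as follows, under the same assumptions as the backbone sketch above; it repeatedly upsamples, convolves, and fuses with the shallower features feature2 and then feature1, and applies softmax per pixel. The classification network of S208 below shares this structure, differing only in the number of output classes.

```python
class DecoderHead(nn.Module):
    """Upsample-and-fuse head; the boundary determination network and the
    classification network share this structure."""

    def __init__(self, num_classes, de_ch=256, skip_chs=(128, 64)):
        super().__init__()
        self.pre = conv_block(de_ch, de_ch)  # initial conv on de_feature
        self.post_ups = nn.ModuleList(
            [conv_block(de_ch, de_ch) for _ in skip_chs])
        self.fuses = nn.ModuleList(
            [conv_block(de_ch + s, de_ch, k=1) for s in skip_chs])
        self.classify = nn.Conv2d(de_ch, num_classes, 1)

    def forward(self, de_feature, skips):
        # skips: (feature2, feature1), shallower features from the backbone
        x = self.pre(de_feature)
        for post_up, fuse, skip in zip(self.post_ups, self.fuses, skips):
            x = F.interpolate(x, size=skip.shape[2:], mode="bilinear",
                              align_corners=False)       # upsample to skip size
            x = fuse(torch.cat([post_up(x), skip], dim=1))  # fuse with skip
        return F.softmax(self.classify(x), dim=1)         # per-pixel scores
```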
And S208, classifying the text features according to the classification network to obtain the text features belonging to each text type.
The text type refers to the type to which the text belongs, such as Chinese-type text, numeric-type text, and other types of text. In the invoice shown in fig. 4, the Chinese-type text includes: the invoice title, the invoice number and invoice code labels, and the amount in capitalized Chinese characters. The numeric-type text includes: the specific invoice number, the specific invoice code, numeric amounts, and so on.
In one embodiment, the server classifies the text features according to different text types through a classification network to obtain the text features belonging to each text type.
For example, as shown in fig. 3, the server classifies de_feature through the classification network: Chinese-type text is divided into one class, which may be labeled 1; numeric-type text into another class, labeled 2; and other types into a third class, labeled 3. Specifically, de_feature is convolved, then upsampled (by a factor of 2) and convolved (kernel size 3 × 3), and the obtained features are fused with feature2; these operations are repeated until the obtained features are fused with feature1, yielding the classified text features. The activation function in the classification network adopts softmax.
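Continuing the sketch, the two branches can then be instantiated from the shared pieces. The class counts (two classes for the boundary branch, and four for the classification branch: background plus the labels 1, 2, and 3 above) are illustrative assumptions, as is the dummy input size.

```python
backbone = SharedBackbone()
boundary_head = DecoderHead(num_classes=2)  # background vs. text box boundary
class_head = DecoderHead(num_classes=4)     # background, Chinese, numeric, other

image = torch.randn(1, 3, 512, 512)         # a dummy invoice image
f1, f2, f3, f4, de_feature = backbone(image)
boundary_map = boundary_head(de_feature, (f2, f1))
class_map = class_head(de_feature, (f2, f1))
```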
S210, generating a position distribution map representing the text features in the invoice image according to the presentation mode corresponding to each text type, and generating a boundary image representing the text features according to the text box boundaries.
The presentation mode may assign different graphic marks to different text types, such as different fill colors, or fills of dotted lines, straight lines, or oblique lines. As shown in fig. 5(a), the marks of Chinese-type text are filled in gray, and the marks of numeric-type text are filled in black. As shown in fig. 5(b), each text line in the invoice image is framed to determine the boundary of the text, which avoids closely spaced text lines being located as a single line of text and thereby improves the recognition rate of text in the invoice image.
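As a rough illustration only (the patent does not spell out a rendering procedure), a per-pixel class map from the classification branch can be turned into such a position distribution map by giving each text type its own fill shade; the shade values below are arbitrary choices echoing the gray and black fills of fig. 5(a).

```python
import numpy as np

# Hypothetical shades per class index: 0 background, 1 Chinese, 2 numeric, 3 other
SHADES = {0: 255, 1: 128, 2: 0, 3: 64}  # white, gray, black, dark gray

def render_position_map(class_map):
    """class_map: NumPy array of shape (4, H, W) of per-pixel class scores.

    Returns an (H, W) uint8 position distribution map.
    """
    labels = class_map.argmax(axis=0)           # per-pixel text type
    out = np.full(labels.shape, 255, np.uint8)  # start from a white background
    for cls, shade in SHADES.items():
        out[labels == cls] = shade
    return out
```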
And S212, determining the position of the text in the invoice image according to the position distribution map and the boundary image.
In one embodiment, before determining the position of the text in the invoice image according to the position distribution map and the boundary image, the method may further comprise: the server adjusts the size of the feature map corresponding to the text features according to a preset size to obtain an adjusted feature map; performs convolution and pooling on the adjusted feature map in sequence; flattens the feature map obtained after pooling to obtain a one-dimensional feature vector; and inputs the one-dimensional feature vector into a fully connected layer of the multitask network model and processes it through an activation function to obtain a direction vector representing the rotation direction of the invoice image.
For example, as shown in fig. 3, the server resizes feature4 to a fixed size, e.g., 14 × 14, for the subsequent fully connected layer operations on feature4. To further expand the receptive field of the neurons, a convolution operation (kernel size 3 × 3) is performed on feature4; to reduce the amount of computation, feature4 may be pooled (stride 2). The pooled feature4 is then flattened to obtain a one-dimensional vector, a fully connected (FC) operation is performed on the one-dimensional vector, and the softmax activation function is applied to obtain the final predicted direction vector.
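Continuing the sketch, the direction branch may look as follows; the assumption of four rotation classes (0, 90, 180, 270 degrees) and the input channel count are illustrative, since the patent does not enumerate the directions.

```python
class DirectionHead(nn.Module):
    """Predicts the invoice rotation direction from feature4."""

    def __init__(self, in_ch=256, num_directions=4):
        super().__init__()
        self.conv = conv_block(in_ch, in_ch)   # widen the receptive field
        self.pool = nn.MaxPool2d(2, stride=2)  # reduce computation
        self.fc = nn.Linear(in_ch * 7 * 7, num_directions)

    def forward(self, feature4):
        x = F.interpolate(feature4, size=(14, 14), mode="bilinear",
                          align_corners=False)  # resize to a fixed 14 x 14
        x = self.pool(self.conv(x))             # -> (N, in_ch, 7, 7)
        x = torch.flatten(x, start_dim=1)       # flatten to a 1-D vector
        return F.softmax(self.fc(x), dim=1)     # predicted direction vector
```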
As an example, as shown in fig. 6, the server acquires an invoice image and preprocesses it, for example by cropping the invoice image, resizing it, or adjusting its brightness or gray scale; the invoice image may also be normalized. The invoice image is then input into the multitask network model for processing to obtain the classified text features, the boundaries of the text features, and the corresponding direction vector; the position distribution map and the boundary image are rotation-corrected according to the direction vector; and the text positioning result in the invoice image is output. The result may be the text framed in the invoice image, or the coordinate information of the text in the invoice image.
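A minimal sketch of this rotation-correction step, under the same assumption of four 90-degree rotation classes as the direction head above:

```python
import numpy as np

def correct_rotation(position_map, boundary_map, direction_vector):
    """Rotate both maps back according to the predicted direction.

    direction_vector: scores over rotations of (0, 90, 180, 270) degrees.
    """
    k = int(np.argmax(direction_vector))  # number of 90-degree turns detected
    # rotating by -k * 90 degrees undoes the detected rotation
    return np.rot90(position_map, -k), np.rot90(boundary_map, -k)
```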
In the above embodiment, the text features in the invoice image are extracted through the multitask network model, the text box boundaries of the text features are determined through the boundary determination network, and a position distribution map representing the text features in the invoice image is generated according to the presentation mode corresponding to each text type, so that the positions of text in the invoice can be located according to the position distribution map. In addition, a boundary image representing the text features is generated according to the text box boundaries, and the position of the text in the invoice image is determined according to the position distribution map and the boundary image. The boundary image of the text features avoids adjacent text boxes sticking together, so that even if the invoice has textures, background patterns, or seal characters that cover or lie close to the invoice text, the positioning of the text in the invoice is not affected, and the accuracy of text positioning is effectively improved.
In one embodiment, as shown in fig. 7, the method further comprises:
s702, when the invoice image sample and the corresponding reference label are obtained, extracting the training text characteristics of the invoice image sample through a multitask network model.
The invoice image sample is an image obtained by shooting (or scanning) an invoice, and the invoice image sample contains invoice information.
In one embodiment, after S702, the method further comprises: the server performs at least one of filtering, image enhancement, gray scale adjustment, erosion operation, random cropping, or random rotation on the invoice image samples to expand the number of invoice image samples.
For example, the server filters the invoice image sample to eliminate noise, obtaining a filtered invoice image sample; and/or performs a brightness transformation on the invoice image sample to increase or decrease its brightness, obtaining a brightness-transformed invoice image sample; and/or adjusts the gray scale of the invoice image sample to change its gray values, obtaining an invoice image sample with changed gray values; and/or performs erosion and dilation operations on the invoice image sample so that text strokes in the sample are thinned or thickened, obtaining an eroded-and-dilated invoice image sample; and/or randomly crops the invoice image sample so that it becomes an invoice image sample of fixed size; and/or randomly rotates the invoice image sample to obtain invoice image samples at various rotation angles.
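For illustration, these augmentations can be sketched with OpenCV and NumPy; the kernel sizes, brightness range, and crop ratio below are arbitrary illustrative choices, not values from the patent.

```python
import cv2
import numpy as np

def augment(sample, rng=None):
    """Return a list of augmented variants of one invoice image sample."""
    rng = rng if rng is not None else np.random.default_rng()
    out = [cv2.GaussianBlur(sample, (3, 3), 0)]  # filtering (denoise)
    out.append(cv2.convertScaleAbs(sample, alpha=1.0,
                                   beta=rng.integers(-40, 40)))  # brightness
    kernel = np.ones((2, 2), np.uint8)
    out.append(cv2.erode(sample, kernel))   # thickens dark strokes on white
    out.append(cv2.dilate(sample, kernel))  # thins dark strokes on white
    h, w = sample.shape[:2]
    y, x = rng.integers(0, h // 4), rng.integers(0, w // 4)
    crop = sample[y:y + 3 * h // 4, x:x + 3 * w // 4]
    out.append(cv2.resize(crop, (w, h)))    # random crop back to a fixed size
    out.append(np.rot90(sample, k=int(rng.integers(1, 4))))  # random rotation
    return out
```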
S704, determining the training text box boundary of the training text feature through the boundary determination network.
S706, classifying the training text features according to the classification network to obtain the training text features belonging to each text type.
And S708, generating a training position distribution map representing the training text features in the invoice image sample according to the presentation mode corresponding to each text type, and generating a training boundary image representing the training text features according to the training text box boundaries.
And S710, determining the predicted position of the text in the invoice image sample according to the training position distribution map and the training boundary image.
For S704 to S710, refer to the methods of S204 to S210.
And S712, calculating a loss value between the predicted position and the reference label, and adjusting parameters in the multitask network model through the loss value until the predicted position output by the multitask network model after the parameters are adjusted meets the position condition.
In one embodiment, the server calculates a loss value between the predicted position and the reference label based on a loss function. The loss function may be any of: mean squared error, sparse categorical cross entropy, the L2 loss function, and the focal loss function.
In one embodiment, the server propagates the loss values back to each layer of the multitask network model, obtaining gradients for each layer parameter; and adjusting parameters of each layer in the multitask network model according to the gradient.
For example, in the multitask network model training process, the loss functions of the three branch networks (direction, text feature classification, and text boundary) may all adopt sparse categorical cross entropy, with a loss weight ratio of 1:1:0.5. The text feature classification network is the classification network described above, and the text boundary network is the boundary determination network. All output neurons of the multitask network model are back-propagated at first; after the networks converge, the loss functions of the text feature classification and text boundary networks are modified so that positive samples (i.e., samples whose label values are greater than 0) and negative samples (samples whose label values equal 0) are computed separately, and back propagation is performed with a positive-to-negative sample ratio of 1:3. If the number of negative samples is less than 3 times the number of positive samples, all negative samples are back-propagated.
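For illustration, the weighted three-branch loss and the 1:3 positive-to-negative ratio can be sketched as follows, continuing the PyTorch sketch. Treating "samples" as per-pixel label entries and picking the hardest negatives by loss value are assumptions about how the ratio is enforced; cross entropy here takes raw logits, i.e., the branch outputs before softmax.

```python
def multitask_loss(dir_logits, cls_logits, bnd_logits,
                   dir_label, cls_label, bnd_label):
    """Weighted sum of the three branch losses (weights 1 : 1 : 0.5)."""
    ce = nn.CrossEntropyLoss()
    return (ce(dir_logits, dir_label)
            + ce(cls_logits, cls_label)
            + 0.5 * ce(bnd_logits, bnd_label))

def balanced_pixel_loss(logits, labels, neg_ratio=3):
    """Per-pixel CE keeping all positives and up to 3x hardest negatives."""
    per_pixel = F.cross_entropy(logits, labels, reduction="none").flatten()
    labels = labels.flatten()
    pos = per_pixel[labels > 0]   # positive samples: label values above 0
    neg = per_pixel[labels == 0]  # negative samples: label values equal to 0
    k = min(neg.numel(), neg_ratio * max(pos.numel(), 1))
    hard_neg, _ = torch.topk(neg, k)  # negatives with the largest loss
    return (pos.sum() + hard_neg.sum()) / max(pos.numel() + k, 1)
```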
In the above embodiment, the training text features of the invoice image sample are extracted through the multitask network model, and the training text box boundaries of the training text features are determined through the boundary determination network. A training position distribution map representing the training text features in the invoice image sample is generated according to the presentation mode corresponding to each text type, and a training boundary image representing the training text features is generated according to the training text box boundaries. The predicted position of the text in the invoice image sample is determined according to the training position distribution map and the training boundary image, a loss value between the predicted position and the reference label is calculated, and the parameters in the multitask network model are adjusted through the loss value until the predicted position output by the multitask network model after parameter adjustment meets the position condition. A multitask network model capable of positioning text in invoices is thereby obtained. Positioning invoice images through this multitask network model avoids adjacent text boxes sticking together, so that even if the invoice has textures, background patterns, or seal characters that cover or lie close to the invoice text, the positioning of the text in the invoice is not affected, and the accuracy of text positioning is effectively improved.
Fig. 2 and 7 are schematic flowcharts of a text positioning method in one embodiment. It should be understood that although the steps in the flowcharts of fig. 2 and 7 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2 and 7 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turns or alternately with other steps or with at least some of the sub-steps or stages of other steps.
As shown in fig. 8, in one embodiment, a text positioning device is provided, which specifically includes: an image acquisition module 802, a feature extraction module 804, a boundary determination module 806, a feature classification module 808, an image generation module 810, and a position determination module 812; wherein:
an image acquisition module 802 for acquiring an invoice image;
the feature extraction module 804 is used for extracting text features in the invoice image through the multitask network model; the multitask network model comprises a classification network and a boundary determination network;
a boundary determining module 806, configured to determine a text box boundary of the text feature through the boundary determining network;
the feature classification module 808 is configured to classify the text features according to a classification network to obtain text features belonging to each text type;
the image generation module 810 is configured to generate a position distribution map representing the text features in the invoice image according to the presentation mode corresponding to each text type, and to generate a boundary image representing the text features according to the text box boundaries;
and the position determination module 812 is configured to determine the position of the text in the invoice image according to the position distribution map and the boundary image.
In one embodiment, as shown in fig. 9, the apparatus further comprises: a direction prediction module 814; wherein:
the direction prediction module 814 is configured to, before determining the position of the text in the invoice image according to the position distribution map and the boundary image, adjust the size of a feature map corresponding to the text feature according to a preset size to obtain an adjusted feature map; carrying out convolution and pooling treatment on the adjusted feature map in sequence; stretching the characteristic diagram obtained after the pooling treatment to obtain a one-dimensional characteristic vector; and inputting the one-dimensional feature vector into a full connection layer of the multitask network model, and processing the one-dimensional feature vector through an activation function to obtain a direction vector for expressing the rotation direction of the invoice image.
In one embodiment, the feature extraction module 804 is further configured to:
performing up-sampling and at least two convolution operations on the text features obtained by the last specified convolutional layer to obtain intermediate text features;
fusing the intermediate text features with the text features obtained by the previous specified convolutional layer to obtain intermediate fusion features;
and performing up-sampling and at least two convolution operations on the intermediate fusion feature map, and repeating the step of fusing the intermediate text features with the text features obtained by the previous specified convolutional layer, until the intermediate text features obtained by processing are fused with the text features obtained by the first specified convolutional layer, yielding fused text features.
In the above embodiment, the text features in the invoice image are extracted through the multitask network model, the text box boundaries of the text features are determined through the boundary determination network, and a position distribution map representing the text features in the invoice image is generated according to the presentation mode corresponding to each text type, so that the positions of text in the invoice can be located according to the position distribution map. In addition, a boundary image representing the text features is generated according to the text box boundaries, and the position of the text in the invoice image is determined according to the position distribution map and the boundary image. The boundary image of the text features avoids adjacent text boxes sticking together, so that even if the invoice has textures, background patterns, or seal characters that cover or lie close to the invoice text, the positioning of the text in the invoice is not affected, and the accuracy of text positioning is effectively improved.
In one embodiment, as shown in fig. 9, the apparatus further comprises: a parameter adjustment module 816; wherein:
the feature extraction module 804 is further configured to extract training text features of the invoice image samples through the multitask network model when the invoice image samples and the corresponding reference labels are obtained;
the boundary determining module 806 is further configured to determine a boundary of a training text box of the training text feature through the boundary determining network;
the feature classification module 808 is further configured to classify the training text features according to a classification network to obtain training text features belonging to each text type;
the image generation module 810 is further configured to generate a training position distribution map representing the training text features in the invoice image sample according to the presentation mode corresponding to each text type, and to generate a training boundary image representing the training text features according to the training text box boundaries;
the position determination module 812 is configured to determine a predicted position of the text in the invoice image sample according to the training position distribution map and the training boundary image;
and the parameter adjusting module 816 is configured to calculate a loss value between the predicted position and the reference tag, and adjust a parameter in the multitask network model according to the loss value until the predicted position output by the multitask network model after the parameter adjustment meets the position condition.
In one embodiment, as shown in fig. 9, the apparatus further comprises: a processing module 818; wherein:
a processing module 818 configured to perform at least one of the following steps to expand the number of samples: filtering the invoice image sample; performing a brightness transformation on the invoice image sample; adjusting the gray scale of the invoice image sample; performing erosion and dilation operations on the invoice image sample; randomly cropping the invoice image sample; and randomly rotating the invoice image sample.
In the above embodiment, the training text features of the invoice image sample are extracted through the multitask network model, and the training text box boundaries of the training text features are determined through the boundary determination network. A training position distribution map representing the training text features in the invoice image sample is generated according to the presentation mode corresponding to each text type, and a training boundary image representing the training text features is generated according to the training text box boundaries. The predicted position of the text in the invoice image sample is determined according to the training position distribution map and the training boundary image, a loss value between the predicted position and the reference label is calculated, and the parameters in the multitask network model are adjusted through the loss value until the predicted position output by the multitask network model after parameter adjustment meets the position condition. A multitask network model capable of positioning text in invoices is thereby obtained. Positioning invoice images through this multitask network model avoids adjacent text boxes sticking together, so that even if the invoice has textures, background patterns, or seal characters that cover or lie close to the invoice text, the positioning of the text in the invoice is not affected, and the accuracy of text positioning is effectively improved.
FIG. 10 is a diagram illustrating the internal structure of a computer device in one embodiment. The computer device may specifically be the server 120 in fig. 1. As shown in fig. 10, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the text positioning method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the text positioning method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the text positioning device provided herein may be implemented in the form of a computer program, and the computer program may run on a computer device such as that shown in fig. 10. The memory of the computer device may store the various program modules that make up the text positioning device, such as the image acquisition module 802, the feature extraction module 804, the feature classification module 808, the boundary determination module 806, the image generation module 810, and the position determination module 812 shown in fig. 8. The computer program constituted by these program modules causes the processor to execute the steps of the text positioning method of the embodiments of the present application described in this specification.
For example, the computer device shown in fig. 10 may execute S202 through the image acquisition module 802 in the text positioning device shown in fig. 8, S204 through the feature extraction module 804, S206 through the boundary determination module 806, S208 through the feature classification module 808, S210 through the image generation module 810, and S212 through the position determination module 812.
In one embodiment, there is provided a computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform: acquiring an invoice image; extracting text features in the invoice image through a multitask network model, the multitask network model comprising a classification network and a boundary determination network; determining the text box boundaries of the text features through the boundary determination network; classifying the text features according to the classification network to obtain the text features belonging to each text type; generating a position distribution map representing the text features in the invoice image according to the presentation mode corresponding to each text type, and generating a boundary image representing the text features according to the text box boundaries; and determining the position of the text in the invoice image according to the position distribution map and the boundary image.
In one embodiment, the computer program, when executed by the processor, causes the processor to further perform: before determining the position of the text in the invoice image according to the position distribution map and the boundary image, adjusting the size of the feature map corresponding to the text features according to a preset size to obtain an adjusted feature map; performing convolution and pooling on the adjusted feature map in sequence; flattening the feature map obtained after pooling to obtain a one-dimensional feature vector; and inputting the one-dimensional feature vector into a fully connected layer of the multitask network model, and processing it through an activation function to obtain a direction vector representing the rotation direction of the invoice image.
In one embodiment, the text features include a plurality of text features obtained from the processing of a plurality of specified convolutional layers, and the computer program, when executed by the processor, causes the processor to perform the step of extracting text features in the invoice image through a multitask network model specifically as: performing up-sampling and at least two convolution operations on the text features obtained by the last specified convolutional layer to obtain intermediate text features; fusing the intermediate text features with the text features obtained by the previous specified convolutional layer to obtain intermediate fusion features; and performing up-sampling and at least two convolution operations on the intermediate fusion feature map, and repeating the step of fusing the intermediate text features with the text features obtained by the previous specified convolutional layer, until the intermediate text features obtained by processing are fused with the text features obtained by the first specified convolutional layer, yielding fused text features.
In one embodiment, the computer program, when executed by the processor, causes the processor to further perform: when an invoice image sample and a corresponding reference label are acquired, extracting training text features of the invoice image sample through the multitask network model; determining the training text box boundaries of the training text features through the boundary determination network; classifying the training text features according to the classification network to obtain the training text features belonging to each text type; generating a training position distribution map representing the training text features in the invoice image sample according to the presentation mode corresponding to each text type, and generating a training boundary image representing the training text features according to the training text box boundaries; determining the predicted position of the text in the invoice image sample according to the training position distribution map and the training boundary image; and calculating a loss value between the predicted position and the reference label, and adjusting parameters in the multitask network model through the loss value until the predicted position output by the multitask network model after parameter adjustment meets the position condition.
In one embodiment, the computer program, when executed by the processor, causes the processor to further perform at least one of the following steps to expand the number of samples: filtering the invoice image sample; performing a brightness transformation on the invoice image sample; adjusting the gray scale of the invoice image sample; performing erosion and dilation operations on the invoice image sample; randomly cropping the invoice image sample; and randomly rotating the invoice image sample.
In one embodiment, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform: acquiring an invoice image; extracting text features in the invoice image through a multitask network model; the multitask network model comprises a classification network and a boundary determination network; determining the text box boundary of the text characteristic through the boundary determination network; classifying the text features according to a classification network to obtain text features belonging to each text type; generating a position distribution diagram for representing the text features in the invoice image according to the presentation mode corresponding to each text type, and generating a boundary image for representing the text features according to the text frame boundary; and determining the position of the text in the invoice image according to the position distribution map and the boundary image.
In one embodiment, the computer program, when executed by the processor, causes the processor to further perform: before determining the position of the text in the invoice image according to the position distribution map and the boundary image, adjusting the size of the feature map corresponding to the text features according to a preset size to obtain an adjusted feature map; performing convolution and pooling processing on the adjusted feature map in sequence; flattening the feature map obtained after the pooling processing to obtain a one-dimensional feature vector; and inputting the one-dimensional feature vector into a fully connected layer of the multitask network model, and processing it through an activation function to obtain a direction vector representing the rotation direction of the invoice image.
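For illustration, a PyTorch sketch of such a direction-prediction head; the preset size, channel counts, and a four-way softmax over 0/90/180/270 degrees are assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DirectionHead(nn.Module):
        """Predicts the invoice rotation direction from the text feature map."""

        def __init__(self, in_channels=64, preset_size=(32, 32), directions=4):
            super().__init__()
            self.preset_size = preset_size
            self.conv = nn.Conv2d(in_channels, 32, 3, padding=1)  # convolution
            self.pool = nn.MaxPool2d(2)                           # pooling
            flat = 32 * (preset_size[0] // 2) * (preset_size[1] // 2)
            self.fc = nn.Linear(flat, directions)                 # fully connected layer

        def forward(self, feature_map):
            x = F.interpolate(feature_map, size=self.preset_size,
                              mode="bilinear", align_corners=False)  # resize to preset size
            x = self.pool(F.relu(self.conv(x)))
            x = torch.flatten(x, 1)                   # one-dimensional feature vector
            return torch.softmax(self.fc(x), dim=1)   # activation -> direction vector

The component with the largest score can then drive the rotation correction of the position distribution map and the boundary image.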
In one embodiment, the text features comprise a plurality of text features obtained from a plurality of specified convolutional layers, and the computer program, when executed by the processor, causes the processor to perform the step of extracting text features in the invoice image through the multitask network model as follows: performing up-sampling and at least two convolution operations on the text features obtained by the last specified convolutional layer to obtain intermediate text features; fusing the intermediate text features with the text features obtained by the preceding specified convolutional layer to obtain intermediate fused features; and performing up-sampling and at least two convolution operations on the intermediate fused feature map, repeating the fusion step with the text features of each preceding specified convolutional layer in turn, until the resulting intermediate text features are fused with the text features obtained by the first specified convolutional layer, yielding the fused text features.
In one embodiment, the computer program, when executed by the processor, causes the processor to further perform: when an invoice image sample and a corresponding reference label are acquired, extracting training text features of the invoice image sample through the multitask network model; determining the training text box boundary of the training text features through the boundary determination network; classifying the training text features according to the classification network to obtain training text features belonging to each text type; generating a training position distribution map representing the training text features in the invoice image sample according to the presentation mode corresponding to each text type, and generating a training boundary image representing the training text features according to the training text box boundary; determining the predicted position of the text in the invoice image sample according to the training position distribution map and the training boundary image; and calculating a loss value between the predicted position and the reference label, and adjusting parameters of the multitask network model using the loss value until the predicted position output by the adjusted model satisfies the position condition.
In one embodiment, the computer program, when executed by the processor, causes the processor to further perform at least one of the following steps to expand the number of samples: filtering the invoice image sample; performing a brightness transformation on the invoice image sample; adjusting the gray scale of the invoice image sample; performing erosion and dilation operations on the invoice image sample; randomly cropping the invoice image sample; and randomly rotating the invoice image sample.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program instructing relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any such combination that involves no contradiction should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and while their description is specific and detailed, they should not be construed as limiting the scope of the application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A text localization method, comprising:
acquiring an invoice image;
extracting text features in the invoice image through a multitask network model, and determining a direction vector representing the rotation direction of the invoice image, wherein the multitask network model comprises a classification network and a boundary determination network;
drawing, through the boundary determination network, text box boundaries based on the respective positions of the text features;
classifying the text features according to the classification network to obtain text features belonging to each text type;
generating a position distribution map representing the text features in the invoice image according to the presentation mode corresponding to each text type, and generating a boundary image representing the text features according to the text box boundaries, wherein the presentation mode indicates that different text types correspond to different graphical marks in the position distribution map;
and performing rotation correction on the position distribution map and the boundary image according to the direction vector, and determining the position of the text in the invoice image according to the corrected position distribution map and the corrected boundary image.
2. The method of claim 1, wherein prior to determining the location of text in the invoice image from the location distribution map and the boundary image, the method further comprises:
adjusting the size of the feature map corresponding to the text features according to a preset size to obtain an adjusted feature map;
performing convolution and pooling processing on the adjusted feature map in sequence;
flattening the feature map obtained after the pooling processing to obtain a one-dimensional feature vector;
and inputting the one-dimensional feature vector into a fully connected layer of the multitask network model, and processing it through an activation function to obtain the direction vector representing the rotation direction of the invoice image.
3. The method of claim 1, wherein the text features comprise a plurality of text features obtained from a plurality of specified convolutional layers; and the extracting text features in the invoice image through a multitask network model comprises:
performing up-sampling and at least two convolution operations on the text features obtained by the last specified convolutional layer to obtain intermediate text features;
fusing the intermediate text features with the text features obtained by the preceding specified convolutional layer to obtain intermediate fused features;
and performing up-sampling and at least two convolution operations on the intermediate fused feature map, repeating the fusion step with the text features of each preceding specified convolutional layer in turn, until the resulting intermediate text features are fused with the text features obtained by the first specified convolutional layer, yielding the fused text features.
4. The method according to any one of claims 1 to 3, further comprising:
when an invoice image sample and a corresponding reference label are obtained, extracting training text features of the invoice image sample through the multitask network model;
determining a training text box boundary of the training text feature through the boundary determination network;
classifying the training text features according to the classification network to obtain training text features belonging to each text type;
generating a training position distribution map representing the training text features in the invoice image sample according to the presentation mode corresponding to each text type, and generating a training boundary image representing the training text features according to the training text box boundary;
determining a predicted position of a text in the invoice image sample according to the training position distribution map and the training boundary image;
and calculating a loss value between the predicted position and the reference label, and adjusting parameters in the multitask network model through the loss value until the predicted position output by the multitask network model after the parameters are adjusted meets the position condition.
5. The method of claim 4, further comprising: performing at least one of a filtering, image enhancement, gray-scale adjustment, erosion, random cropping, or random rotation operation on the invoice image samples to expand the number of invoice image samples.
6. A text-locating device, the device comprising:
the image acquisition module is used for acquiring an invoice image;
the feature extraction module is used for extracting text features in the invoice image through a multitask network model and determining a direction vector representing the rotation direction of the invoice image, wherein the multitask network model comprises a classification network and a boundary determination network;
the boundary determining module is used for drawing, through the boundary determination network, text box boundaries based on the respective positions of the text features;
the feature classification module is used for classifying the text features according to the classification network to obtain the text features belonging to each text type;
the image generation module is used for generating a position distribution map representing the text features in the invoice image according to the presentation mode corresponding to each text type, and generating a boundary image representing the text features according to the text box boundaries, wherein the presentation mode indicates that different text types correspond to different graphical marks in the position distribution map;
and the position determining module is used for performing rotation correction on the position distribution map and the boundary image according to the direction vector, and determining the position of the text in the invoice image according to the corrected position distribution map and the corrected boundary image.
7. The apparatus of claim 6, further comprising:
the direction prediction module is used for adjusting the size of the feature map corresponding to the text features according to a preset size to obtain an adjusted feature map; performing convolution and pooling processing on the adjusted feature map in sequence; flattening the feature map obtained after the pooling processing to obtain a one-dimensional feature vector; and inputting the one-dimensional feature vector into a fully connected layer of the multitask network model, and processing it through an activation function to obtain the direction vector representing the rotation direction of the invoice image.
8. The apparatus of claim 6, wherein the feature extraction module is further configured to:
performing up-sampling and at least two convolution operations on the text features obtained by the last specified convolutional layer to obtain intermediate text features;
fusing the intermediate text features with the text features obtained by the preceding specified convolutional layer to obtain intermediate fused features;
and performing up-sampling and at least two convolution operations on the intermediate fused feature map, repeating the fusion step with the text features of each preceding specified convolutional layer in turn, until the resulting intermediate text features are fused with the text features obtained by the first specified convolutional layer, yielding the fused text features.
9. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 5.
10. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 5.
CN201910653482.3A 2019-07-19 2019-07-19 Text positioning method and device, computer readable storage medium and computer equipment Active CN110516541B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910653482.3A CN110516541B (en) 2019-07-19 2019-07-19 Text positioning method and device, computer readable storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN110516541A CN110516541A (en) 2019-11-29
CN110516541B true CN110516541B (en) 2022-06-10

Family

ID=68623069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910653482.3A Active CN110516541B (en) 2019-07-19 2019-07-19 Text positioning method and device, computer readable storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN110516541B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126319A (en) * 2019-12-27 2020-05-08 山东旗帜信息有限公司 Invoice identification method and device
CN111241974B (en) * 2020-01-07 2023-10-27 深圳追一科技有限公司 Bill information acquisition method, device, computer equipment and storage medium
CN111325715A (en) * 2020-01-21 2020-06-23 上海悦易网络信息技术有限公司 Camera color spot detection method and device
CN111950356B (en) * 2020-06-30 2024-04-19 深圳市雄帝科技股份有限公司 Seal text positioning method and device and electronic equipment
CN111950353B (en) * 2020-06-30 2024-04-19 深圳市雄帝科技股份有限公司 Seal text recognition method and device and electronic equipment
CN111737478B (en) * 2020-08-07 2021-06-01 北京易真学思教育科技有限公司 Text detection method, electronic device and computer readable medium
CN113139625B (en) * 2021-05-18 2023-12-15 北京世纪好未来教育科技有限公司 Model training method, electronic equipment and storage medium thereof
CN115497106B (en) * 2022-11-14 2023-01-24 合肥中科类脑智能技术有限公司 Battery laser code-spraying identification method based on data enhancement and multitask model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086756A (en) * 2018-06-15 2018-12-25 众安信息技术服务有限公司 A kind of text detection analysis method, device and equipment based on deep neural network
CN109308476A (en) * 2018-09-06 2019-02-05 邬国锐 Billing information processing method, system and computer readable storage medium
CN109858414A (en) * 2019-01-21 2019-06-07 南京邮电大学 A kind of invoice piecemeal detection method
CN110020676A (en) * 2019-03-18 2019-07-16 华南理工大学 Method for text detection, system, equipment and medium based on more receptive field depth characteristics
CN109977949A (en) * 2019-03-20 2019-07-05 深圳市华付信息技术有限公司 Text positioning method, device, computer equipment and the storage medium of frame fine tuning
CN110008956A (en) * 2019-04-01 2019-07-12 深圳市华付信息技术有限公司 Invoice key message localization method, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
EAST: An Efficient and Accurate Scene Text Detector; Xinyu Zhou, et al.; arXiv; 2017-07-10; Section 3 *

Similar Documents

Publication Publication Date Title
CN110516541B (en) Text positioning method and device, computer readable storage medium and computer equipment
CN111444881B (en) Fake face video detection method and device
US20210366127A1 (en) Image processing method and apparatus, computer-readable storage medium
CN109492643B (en) Certificate identification method and device based on OCR, computer equipment and storage medium
CN110135406B (en) Image recognition method and device, computer equipment and storage medium
CN111079632A (en) Training method and device of text detection model, computer equipment and storage medium
EP3388978B1 (en) Image classification method, electronic device, and storage medium
CN111814794B (en) Text detection method and device, electronic equipment and storage medium
CN110674804A (en) Text image detection method and device, computer equipment and storage medium
CN111291637A (en) Face detection method, device and equipment based on convolutional neural network
CN112418278A (en) Multi-class object detection method, terminal device and storage medium
CN111914654B (en) Text layout analysis method, device, equipment and medium
CN110807362A (en) Image detection method and device and computer readable storage medium
CN111723841A (en) Text detection method and device, electronic equipment and storage medium
CN111666931B (en) Mixed convolution text image recognition method, device, equipment and storage medium
CN111291794A (en) Character recognition method, character recognition device, computer equipment and computer-readable storage medium
CN112001399A (en) Image scene classification method and device based on local feature saliency
CN112818821A (en) Human face acquisition source detection method and device based on visible light and infrared light
CN113159013A (en) Paragraph identification method and device based on machine learning, computer equipment and medium
CN115731220A (en) Grey cloth defect positioning and classifying method, system, equipment and storage medium
CN112115860A (en) Face key point positioning method and device, computer equipment and storage medium
CN111860582A (en) Image classification model construction method and device, computer equipment and storage medium
CN111666932A (en) Document auditing method and device, computer equipment and storage medium
CN112418033A (en) Landslide slope surface segmentation and identification method based on mask rcnn neural network
CN112464860A (en) Gesture recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant