CN112257708A - Character-level text detection method and device, computer equipment and storage medium

Info

Publication number: CN112257708A
Application number: CN202011141227.XA
Authority: CN (China)
Prior art keywords: residual, convolution, inputting, module, size
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 刘雨桐, 石强, 熊娇, 张健, 王国勋
Current Assignee: Runlian Software System Shenzhen Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Runlian Software System Shenzhen Co Ltd
Application filed by Runlian Software System Shenzhen Co Ltd; priority to CN202011141227.XA

Classifications

    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a character-level text detection method and device, computer equipment and a storage medium. The method extracts down-sampling features based on residual learning, connects them to the up-sampling path in the manner of U-Net, decodes the feature map, outputs through a series of convolutional layers the probability that each pixel lies at the center of a character and the probability that it lies in the gap between adjacent characters, and trains the network with a weakly supervised learning method to generate text boxes. The method only needs to attend to character-level content rather than whole text instances, alleviates the degradation problem caused by increasing network depth, generalizes well, and improves both the accuracy of text-region segmentation and the computational efficiency.

Description

Character-level text detection method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of text detection, and in particular to a character-level text detection method and apparatus, a computer device, and a storage medium.
Background
With the development of the internet and the popularization of mobile terminals such as smartphones and digital cameras, huge numbers of images are produced continuously, especially images of natural scenes. The text in these images is not only an important supplement to the scene but also a crucial clue for understanding it. Text detection in natural scene images has therefore become a research hotspot in recent years, with very wide applications such as human-computer interaction, image search, industrial automation and license plate recognition.
Scene text detection is an important step in scene text recognition and a challenging problem in its own right. Unlike general object detection, its main difficulty is that text in natural scene images appears in arbitrary orientations, at small sizes, and with widely varying aspect ratios; varying text lengths, large pattern variations and complex backgrounds all stand in the way of accurate detection. Because text detection generally requires higher localization accuracy, general object detection systems are difficult to apply to scene text detection directly.
In general object detection, each object has a well-defined closed boundary, whereas text may lack such a boundary, because a text line or word is composed of many individual characters or strokes. Meanwhile, folding and curling of documents, newspapers and bills make later character detection and recognition in the captured images harder. Text detection is therefore a fine-grained recognition task: it must overcome the influence of complex backgrounds to detect correctly and to cover the whole area of a text line or word. Conventional optical character recognition already has fairly mature solutions and achieves dramatic results on document text; however, text detection and recognition in natural scene images still cannot meet the requirements of practical applications.
Against the background of big-data-driven applications and improving computer hardware, object detection and image segmentation algorithms based on deep learning have broken through the bottlenecks of traditional algorithms and become the mainstream in computer vision; scene text detection has benefited from their development and made great progress in recent years. An object detection task must both identify the category of an object in an image and locate it. Deep-learning object detection methods fall into two major categories: algorithms based on object candidate regions, such as R-CNN (Region-CNN, the first algorithm to successfully apply deep learning to object detection); and algorithms based on regression, which obtain class probabilities and position coordinates directly without generating candidate regions, such as YOLO (a deep-neural-network-based object recognition and localization algorithm). However, existing detection methods still leave room for improvement in generalization ability, efficiency and accuracy.
Disclosure of Invention
The invention aims to provide a character-level text detection method and device, computer equipment and a storage medium, so as to solve the problems that existing text detection methods generalize poorly and fall short in efficiency and accuracy.
In a first aspect, an embodiment of the present invention provides a character-level natural scene text detection method based on residual learning, including:
inputting an original image of size w × h into a convolution layer for convolution, and outputting a feature map of size w/2 × h/2;
inputting the feature map of size w/2 × h/2 into a max pooling layer for dimensionality reduction, and outputting a feature map of size w/4 × h/4;
inputting the feature map of size w/4 × h/4 into a first bottleneck residual module for convolution and feature extraction, inputting its output into a second bottleneck residual module for convolution and feature extraction, inputting that output into a third bottleneck residual module for convolution and feature extraction, inputting that output into a fourth bottleneck residual module for convolution and feature extraction, and finally outputting a feature map of size w/32 × h/32 from the fourth bottleneck residual module;
inputting the feature map output by the fourth bottleneck residual module into a first up-sampling module for up-sampling, and outputting a feature map of size w/16 × h/16;
stacking the feature maps output by the first up-sampling module and the third bottleneck residual module, inputting the result into a second up-sampling module for up-sampling, and outputting a feature map of size w/8 × h/8;
stacking the feature maps output by the second up-sampling module and the second bottleneck residual module, inputting the result into a third up-sampling module for up-sampling, and outputting a feature map of size w/4 × h/4;
stacking the feature maps output by the third up-sampling module and the first bottleneck residual module, inputting the result into a fourth up-sampling module for up-sampling, and outputting a feature map of size w/2 × h/2;
convolving the feature map output by the fourth up-sampling module through a series of convolution layers, and outputting two branch results: the first branch is the probability that each pixel is at the center of a character; the second branch is the probability that each pixel is in the gap between characters;
obtaining the single-character center-region probability and the adjacent-character center probability from the two branch results respectively, performing network training with a weakly supervised learning method, outputting text boxes, and thereby constructing a text detection model;
and inputting an image to be detected into the text detection model, and outputting text boxes.
In a second aspect, an embodiment of the present invention provides a character-level natural scene text detection device based on residual learning, including:
a first convolution unit, configured to input an original image of size w × h into a convolution layer for convolution and output a feature map of size w/2 × h/2;
a dimension reduction unit, configured to input the feature map of size w/2 × h/2 into a max pooling layer for dimensionality reduction and output a feature map of size w/4 × h/4;
a residual extraction unit, configured to input the feature map of size w/4 × h/4 into a first bottleneck residual module for convolution and feature extraction, input its output into a second bottleneck residual module for convolution and feature extraction, input that output into a third bottleneck residual module for convolution and feature extraction, input that output into a fourth bottleneck residual module for convolution and feature extraction, and finally output a feature map of size w/32 × h/32 from the fourth bottleneck residual module;
a first up-sampling unit, configured to input the feature map output by the fourth bottleneck residual module into a first up-sampling module for up-sampling and output a feature map of size w/16 × h/16;
a second up-sampling unit, configured to stack the feature maps output by the first up-sampling module and the third bottleneck residual module, input the result into a second up-sampling module for up-sampling, and output a feature map of size w/8 × h/8;
a third up-sampling unit, configured to stack the feature maps output by the second up-sampling module and the second bottleneck residual module, input the result into a third up-sampling module for up-sampling, and output a feature map of size w/4 × h/4;
a fourth up-sampling unit, configured to stack the feature maps output by the third up-sampling module and the first bottleneck residual module, input the result into a fourth up-sampling module for up-sampling, and output a feature map of size w/2 × h/2;
a second convolution unit, configured to convolve the feature map output by the fourth up-sampling module through a series of convolution layers and output two branch results: the first branch is the probability that each pixel is at the center of a character; the second branch is the probability that each pixel is in the gap between characters;
a training unit, configured to obtain the single-character center-region probability and the adjacent-character center probability from the two branch results respectively, perform network training with a weakly supervised learning method, output text boxes, and thereby construct a text detection model;
and a prediction unit, configured to input an image to be detected into the text detection model and output text boxes.
In a third aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the character-level natural scene text detection method based on residual learning described above.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the character-level natural scene text detection method based on residual learning described above.
The embodiments of the invention disclose a character-level text detection method and device, computer equipment and a storage medium. The method extracts down-sampling features based on residual learning, connects them to the up-sampling path in the manner of U-Net, decodes the feature map, outputs through a series of convolutional layers the probability that each pixel lies at the center of a character and the probability that it lies in the gap between adjacent characters, and trains the network with a weakly supervised learning method to generate text boxes. The embodiments only need to attend to character-level content rather than whole text instances, alleviate the degradation problem caused by increasing network depth, generalize well, and improve both the accuracy of text-region segmentation and the computational efficiency.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a character-level natural scene text detection method based on residual learning according to an embodiment of the present invention;
fig. 2 is a schematic diagram of the principle of a character-level natural scene text detection method based on residual learning according to an embodiment of the present invention;
fig. 3 is a schematic block diagram of a character-level natural scene text detection device based on residual learning according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, a character-level natural scene text detection method based on residual learning according to an embodiment of the present invention includes steps S101 to S110:
s101, inputting an original image with the size of w multiplied by h into a convolution layer for convolution, and outputting a characteristic diagram with the size of w/2 multiplied by h/2;
s102, inputting the feature map with the size of w/2 xh/2 into a maximum pooling layer for dimensionality reduction, and outputting the feature map with the size of w/4 xh/4;
s103, inputting the feature map with the size of w/4 × h/4 into a first bottleneck residual error module for convolution and feature extraction, inputting the output into a second bottleneck residual error module for convolution and feature extraction, inputting the output into a third bottleneck residual error module for convolution and feature extraction, inputting the output into a fourth bottleneck residual error module for convolution and feature extraction, and finally outputting the feature map with the size of w/32 × h/32 through the fourth bottleneck residual error module;
s104, inputting the feature map output by the fourth bottleneck residual error module into a first up-sampling module for up-sampling, and outputting a feature map with the size of w/16 multiplied by h/16;
s105, stacking the feature maps output by the first upsampling module and the third bottleneck residual error module, inputting the feature maps into a second upsampling module for upsampling, and outputting a feature map with the size of w/8 multiplied by h/8;
s106, stacking the feature maps output by the second upsampling module and the second bottleneck residual error module, inputting the feature maps to a third upsampling module for upsampling, and outputting a feature map with the size of w/4 multiplied by h/4;
s107, stacking the feature maps output by the third upsampling module and the first bottleneck residual error module, inputting the feature maps into a fourth upsampling module for upsampling, and outputting a feature map with the size of w/2 x h/2;
s108, carrying out convolution on the feature map output by the fourth up-sampling module through a series of convolution layers, and outputting two branch results: the first branch is the probability that each pixel point is in the character center; the second branch is the probability that each pixel point is in the gap between characters;
s109, respectively obtaining the central probability of a single character region and the central probability of an adjacent character region based on two branch results, performing network training by adopting a weak supervised learning method, outputting a text box, and constructing to obtain a text detection model;
and S110, inputting the image to be detected into the text detection model, and outputting a text box.
Referring to fig. 2, in step S101 the original image has size w × h × 3 (a channel count of 3 is taken as an example; other channel counts can likewise be handled by the method of this embodiment), and a feature map of size w/2 × h/2 × 32 is output through the convolution layer. The convolution kernel is 7 × 7 × 32, with kernel size k = 7 and stride s = 2.
In step S102, the feature map of size w/2 × h/2 × 32 obtained in step S101 is reduced in dimension through a max pooling layer, yielding a feature map of size w/4 × h/4 × 32. The max pooling layer has kernel size k = 3 and stride s = 2.
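As an illustration, the stem of steps S101 and S102 can be sketched in PyTorch as follows (a minimal sketch; the batch normalization after the first convolution and the padding used to preserve the stated output sizes are assumptions, since the patent does not specify them):

```python
import torch
import torch.nn as nn

class Stem(nn.Module):
    """Steps S101-S102: 7x7/2 convolution to 32 channels, then 3x3/2 max pooling."""

    def __init__(self, in_channels: int = 3):
        super().__init__()
        # 7 x 7 convolution, stride 2: w x h x in_channels -> w/2 x h/2 x 32
        self.conv = nn.Conv2d(in_channels, 32, kernel_size=7, stride=2, padding=3)
        self.bn = nn.BatchNorm2d(32)  # assumed; the patent names only the convolution
        self.relu = nn.ReLU(inplace=True)
        # 3 x 3 max pooling, stride 2: w/2 x h/2 x 32 -> w/4 x h/4 x 32
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.relu(self.bn(self.conv(x))))
```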
In step S103, convolution and feature extraction are performed on the feature map of size w/4 × h/4 × 32 obtained in step S102.
In one embodiment, the step S103 includes:
inputting the feature map of size w/4 × h/4 × 32 into the first bottleneck residual module for convolution and feature extraction, and outputting a feature map of size w/4 × h/4 × 64;
inputting the feature map of size w/4 × h/4 × 64 into the second bottleneck residual module for convolution and feature extraction, and outputting a feature map of size w/8 × h/8 × 128;
inputting the feature map of size w/8 × h/8 × 128 into the third bottleneck residual module for convolution and feature extraction, and outputting a feature map of size w/16 × h/16 × 256;
and inputting the feature map of size w/16 × h/16 × 256 into the fourth bottleneck residual module for convolution and feature extraction, and outputting a feature map of size w/32 × h/32 × 512.
In this embodiment, the first bottleneck residual module comprises 3 sequentially arranged first bottleneck residual structures; the second bottleneck residual module comprises 4 sequentially arranged second bottleneck residual structures; the third bottleneck residual module comprises 6 sequentially arranged third bottleneck residual structures; and the fourth bottleneck residual module comprises 3 sequentially arranged fourth bottleneck residual structures.
Each bottleneck residual structure comprises two paths. One path (the first path) comprises 3 sequentially arranged convolution layers with kernels of 1 × 1, 3 × 3 and 1 × 1; the other path (the second path) comprises a convolution layer with a 1 × 1 kernel.
Specifically, the first path of a first bottleneck residual structure comprises 3 sequentially arranged convolution layers with kernels of 1 × 1, 3 × 3 and 1 × 1. The 1 × 1 convolution layers raise or lower the channel dimension, so that the 3 × 3 convolution operates on a relatively low-dimensional input, which improves computational efficiency. A bottleneck residual structure thus contains two paths, or two mappings: one is the identity x itself; the other branch is the part F(x), called the residual mapping. The final output is F(x) + x, followed by ReLU activation. When the channel counts of the feature maps at the joined input and output are unequal, F(x) + Wx is computed instead, where W is a convolution that adjusts the channel dimension of the input x. Each bottleneck structure performs one step of residual learning. The final output feature map of the first module has size w/4 × h/4 × 64.
For example, the input to the first of the first bottleneck residual structures is a w/4 × h/4 × 32 feature map. One path applies, in order, a 1 × 1 × 32 convolution (k = 1, s = 1), batch normalization, a ReLU activation layer, a 3 × 3 × 32 convolution (k = 3, s = 1), batch normalization, a ReLU activation layer, a 1 × 1 × 64 convolution (k = 1, s = 1) and batch normalization, giving a w/4 × h/4 × 64 feature map. The other path directly applies a 1 × 1 × 64 convolution with stride 1 (k = 1, s = 1) and batch normalization to the input feature map, giving a w/4 × h/4 × 64 feature map. The feature maps from the two paths are added, passed through a ReLU activation layer, and output. On this basis, the feature map output by the first of these structures is fed into the second first bottleneck residual structure for processing, and its output into the third.
The second bottleneck residual structure is similar to the first. For example, the input to the first of the second bottleneck residual structures is a w/4 × h/4 × 64 feature map. One path applies, in order, a 1 × 1 × 64 convolution (k = 1, s = 1), batch normalization, a ReLU activation layer, a 3 × 3 × 64 convolution with stride 2 (k = 3, s = 2), batch normalization, a ReLU activation layer, a 1 × 1 × 128 convolution (k = 1, s = 1) and batch normalization, giving a w/8 × h/8 × 128 feature map. The other path directly applies a 1 × 1 × 128 convolution with stride 2 (k = 1, s = 2) and batch normalization to the input feature map, giving a w/8 × h/8 × 128 feature map. The results of the two paths are added, passed through a ReLU activation layer, and output. On this basis, three further second bottleneck residual structures of identical structure are connected in series; their convolution strides may all be 1.
The third bottleneck residual structure is similar to the second. For example, the input to the first of the third bottleneck residual structures is a w/8 × h/8 × 128 feature map. One path applies, in order, a 1 × 1 × 128 convolution (k = 1, s = 1), batch normalization, a ReLU activation layer, a 3 × 3 × 128 convolution with stride 2 (k = 3, s = 2), batch normalization, a ReLU activation layer, a 1 × 1 × 256 convolution (k = 1, s = 1) and batch normalization, giving a w/16 × h/16 × 256 feature map. The other path directly applies a 1 × 1 × 256 convolution with stride 2 (k = 1, s = 2) and batch normalization to the input feature map, giving a w/16 × h/16 × 256 feature map. The results of the two paths are added, passed through a ReLU activation layer, and output. On this basis, five further third bottleneck residual structures of identical structure are connected in series; their convolution strides may all be 1.
The fourth bottleneck residual structure is similar to the third. For example, the input to the first of the fourth bottleneck residual structures is a w/16 × h/16 × 256 feature map. One path applies, in order, a 1 × 1 × 256 convolution (k = 1, s = 1), batch normalization, a ReLU activation layer, a 3 × 3 × 256 convolution with stride 2 (k = 3, s = 2), batch normalization, a ReLU activation layer, a 1 × 1 × 512 convolution (k = 1, s = 1) and batch normalization, giving a w/32 × h/32 × 512 feature map. The other path directly applies a 1 × 1 × 512 convolution with stride 2 (k = 1, s = 2) and batch normalization to the input feature map, giving a w/32 × h/32 × 512 feature map. The results of the two paths are added, passed through a ReLU activation layer, and output. On this basis, two further fourth bottleneck residual structures of identical structure are connected in series; their convolution strides may all be 1.
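For concreteness, one bottleneck residual structure of this kind can be sketched in PyTorch as follows (a minimal sketch under the conv-BN-ReLU ordering described above; per the F(x) + x case, the identity shortcut is used when shapes match, and a 1 × 1 projection W otherwise):

```python
import torch
import torch.nn as nn

class BottleneckResidual(nn.Module):
    """Two-path bottleneck: F(x) on the first path, identity or a 1x1 projection (Wx) on the second."""

    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # First path F(x): 1x1 -> 3x3 (stride s) -> 1x1, each convolution followed by batch normalization
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Second path: 1x1 convolution W when channel count or resolution differs, identity otherwise
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F(x) + x (or F(x) + Wx), then ReLU, as described in the text
        return self.relu(self.residual(x) + self.shortcut(x))
```

Under this sketch, the first bottleneck residual module of step S103 would start with BottleneckResidual(32, 32, 64) and the second with BottleneckResidual(64, 64, 128, stride=2), following the 1 × 1 × 32 and 3 × 3 × 64 kernels given above; the mid-channel counts of the remaining blocks are not spelled out in the patent and would have to be chosen accordingly.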
In step S104, the feature map of size w/32 × h/32 × 512 output by the fourth bottleneck residual module is input into the first up-sampling module for up-sampling, and a feature map of size w/16 × h/16 × 256 is output.
In step S105, the feature map of size w/16 × h/16 × 256 output by the first up-sampling module and the feature map of size w/16 × h/16 × 256 output by the third bottleneck residual module are stacked and then input into the second up-sampling module for up-sampling, which outputs a feature map of size w/8 × h/8 × 128.
In step S106, the feature map of size w/8 × h/8 × 128 output by the second up-sampling module and the feature map of size w/8 × h/8 × 128 output by the second bottleneck residual module are stacked and then input into the third up-sampling module for up-sampling, which outputs a feature map of size w/4 × h/4 × 64.
In step S107, the feature map of size w/4 × h/4 × 64 output by the third up-sampling module and the feature map of size w/4 × h/4 × 64 output by the first bottleneck residual module are stacked and then input into the fourth up-sampling module for up-sampling, which outputs a feature map of size w/2 × h/2 × 32.
Steps S104 to S107 perform up-sampling in the manner of U-Net. U-Net is a network structure characterized mainly by its U shape and its skip connections. The decoder part mainly restores resolution; skip connections are introduced to reduce the loss of spatial information caused by down-sampling, and through Concat (stacking) the feature maps restored by up-sampling carry more low-level semantic information, which makes the result finer. Each up-sampling module completes its up-sampling by deconvolution.
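One possible shape of such a decoder block is sketched below (the patent fixes the deconvolution-based 2x up-sampling and the Concat wiring but not how the stacked channels are fused, so the 1 × 1 fusion layer here is an assumption):

```python
import torch
import torch.nn as nn

class UpsampleModule(nn.Module):
    """Decoder block: fuses the (already stacked) input, then up-samples it 2x by deconvolution."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Assumed 1x1 fusion of the concatenated channels before up-sampling
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # 2x up-sampling by deconvolution (transposed convolution)
        self.deconv = nn.ConvTranspose2d(out_ch, out_ch, kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.deconv(self.fuse(x))

# Decoder wiring of steps S104-S107; f2..f5 are the outputs of the four bottleneck residual modules:
#   u1 = UpsampleModule(512, 256)(f5)                      # w/32 x h/32 x 512 -> w/16 x h/16 x 256
#   u2 = UpsampleModule(512, 128)(torch.cat([u1, f4], 1))  # stack, then -> w/8 x h/8 x 128
#   u3 = UpsampleModule(256, 64)(torch.cat([u2, f3], 1))   # stack, then -> w/4 x h/4 x 64
#   u4 = UpsampleModule(128, 32)(torch.cat([u3, f2], 1))   # stack, then -> w/2 x h/2 x 32
```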
In step S108, the feature map of size w/2 × h/2 × 32 output by the fourth up-sampling module is convolved through a series of convolution layers, and two branch results are output: the first branch is the probability that each pixel is at the center of a character; the second branch is the probability that each pixel is in the gap between characters.
The series of convolution layers may comprise sequentially arranged convolution layers with kernels of 3 × 3 × 32, 3 × 3 × 16 and 1 × 1 × 16.
That is, in one embodiment, convolving the feature map output by the fourth up-sampling module through a series of convolution layers and outputting the two branch results comprises:
inputting the feature map output by the fourth up-sampling module into the 3 × 3 × 32 convolution layer for convolution, inputting the output into the 3 × 3 × 16 convolution layer for convolution, then inputting that output into the 1 × 1 × 16 convolution layer for convolution, and outputting the two branch results: the first branch is the probability that each pixel is at the center of a character; the second branch is the probability that each pixel is in the gap between characters.
The first branch result and the second branch result each have size w/2 × h/2 × 1.
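A sketch of this output head follows (the final projection from 16 channels to the two one-channel branch maps and the sigmoid activation are assumptions; the patent states only the three kernel sizes and the two w/2 × h/2 × 1 outputs):

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Step S108: series of convolutions producing the two branch probability maps."""

    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 2, kernel_size=1),  # assumed projection to the two branches
        )

    def forward(self, x: torch.Tensor):
        out = torch.sigmoid(self.convs(x))  # probabilities in [0, 1]
        region = out[:, 0:1]    # branch 1: probability each pixel is at a character center
        affinity = out[:, 1:2]  # branch 2: probability each pixel is in a gap between characters
        return region, affinity
```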
In step S109, the single-character center-region probability and the adjacent-character center probability are obtained from the two branch results, network training is performed with a weakly supervised learning method, text boxes are output, and a text detection model is thereby constructed.
From the first branch result of step S108, the probability that each pixel is at a character center, the single-character center-region probability (the character positions) is obtained; from the second branch result of step S108, the probability that each pixel is in a gap between characters, the adjacent-character center probability (the connections between characters) is obtained; these results are then integrated into text boxes.
During training, the single-character center-region probability (character position) and the adjacent-character center probability (character connection) must be annotated for each image. The probability of a character center is encoded with a Gaussian heat map, since this encoding is highly flexible when the target annotation region is not strictly bounded.
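For illustration, such a ground-truth heat map can be produced by warping an isotropic 2D Gaussian into each annotated character box (a sketch of this encoding; the patch size, the sigma, and the use of a perspective warp are assumptions, as the patent only states that a Gaussian heat map is used):

```python
import cv2
import numpy as np

def gaussian_patch(size: int = 64, sigma_ratio: float = 0.25) -> np.ndarray:
    """Isotropic 2D Gaussian on a square canvas with peak value 1.0 at the center."""
    ax = np.arange(size, dtype=np.float32) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    sigma = size * sigma_ratio
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))

def render_char_heatmap(heatmap: np.ndarray, char_box: np.ndarray) -> None:
    """Warp the Gaussian patch into one character quadrilateral (4 x 2 corner points, in order)."""
    size = 64
    src = np.float32([[0, 0], [size - 1, 0], [size - 1, size - 1], [0, size - 1]])
    M = cv2.getPerspectiveTransform(src, char_box.astype(np.float32))
    warped = cv2.warpPerspective(gaussian_patch(size), M, (heatmap.shape[1], heatmap.shape[0]))
    np.maximum(heatmap, warped, out=heatmap)  # keep the per-pixel maximum over characters
```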
In the early training stage, synthetic (non-real) pictures can be used for training, because synthetic pictures come with accurate character-box annotations. Once the text detection model has acquired some prediction ability, real pictures are used for training. First a text line is cropped, and the currently trained model predicts the position score of each pixel; then a watershed algorithm segments, from the distribution of position scores, the number and positions of the character boxes as judged by the current model, and these character boxes are used as labels to train the model. Since the accuracy of the character boxes predicted by the model is not guaranteed at this stage, the corresponding loss term is multiplied by a confidence probability. The actual number of characters (the length of the text label) is known, while the positions of the character boxes are unknown; the difference between the predicted and actual numbers of characters can therefore measure the accuracy of the prediction, i.e. the confidence probability is 1 minus the character-count difference divided by the actual character count:
$$s_{conf}(w) = \frac{l(w) - \min\bigl(l(w),\ \lvert l(w) - l^{c}(w) \rvert\bigr)}{l(w)}$$

The pixel-level confidence map $S_c$ of an image can then be computed as:

$$S_c(p) = \begin{cases} s_{conf}(w), & p \in R(w) \\ 1, & \text{otherwise} \end{cases}$$

where $R(w)$ and $l(w)$ denote the box region and the word length of the sample $w$, respectively, $l^{c}(w)$ is the character count obtained from the character segmentation, and $p$ denotes a pixel in the region $R(w)$.
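A small sketch of this confidence bookkeeping (the helper names are hypothetical, and axis-aligned word boxes are assumed for simplicity, whereas R(w) may in general be a quadrilateral):

```python
import numpy as np

def word_confidence(l_true: int, l_pred: int) -> float:
    """s_conf(w): 1 minus the character-count error over the true count, clamped at 0."""
    diff = abs(l_true - l_pred)
    return (l_true - min(l_true, diff)) / l_true

def confidence_map(shape, word_boxes, true_lengths, pred_lengths) -> np.ndarray:
    """Pixel-level map S_c: s_conf(w) inside each word region R(w), 1 elsewhere."""
    sc = np.ones(shape, dtype=np.float32)
    for (x0, y0, x1, y1), lt, lp in zip(word_boxes, true_lengths, pred_lengths):
        sc[y0:y1, x0:x1] = word_confidence(lt, lp)
    return sc
```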
In an embodiment, obtaining the single-character center-region probability and the adjacent-character center probability from the two branch results, performing network training with a weakly supervised learning method, outputting text boxes, and constructing the text detection model comprises:
training first with synthesized (non-real) images and then with real images, according to the following loss function:
$$L = \sum_{p} S_c(p) \cdot \Bigl( \bigl\lVert S_r(p) - S_r^{*}(p) \bigr\rVert_2^2 + \bigl\lVert S_a(p) - S_a^{*}(p) \bigr\rVert_2^2 \Bigr)$$

where $S_r^{*}(p)$ and $S_a^{*}(p)$ denote the ground-truth single-character center-region probability and the ground-truth adjacent-character center probability, respectively, and $S_r(p)$ and $S_a(p)$ denote the corresponding predicted probabilities. When training with synthetic data (i.e. synthesized images), $S_c(p)$ is set to 1.
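In PyTorch this loss might be written as follows (a minimal sketch consistent with the formula above; averaging instead of summing over pixels only rescales the objective):

```python
import torch

def detection_loss(region_pred: torch.Tensor, affinity_pred: torch.Tensor,
                   region_gt: torch.Tensor, affinity_gt: torch.Tensor,
                   confidence: torch.Tensor) -> torch.Tensor:
    """Confidence-weighted pixel-wise squared error over both branches.

    confidence is the map S_c; it is all ones when training on synthetic data.
    """
    per_pixel = (region_pred - region_gt) ** 2 + (affinity_pred - affinity_gt) ** 2
    return (confidence * per_pixel).mean()
```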
In step S110, the image to be detected is input to the trained text detection model, and a text box is output.
In an embodiment, inputting the image to be detected into the text detection model and outputting text boxes includes:
obtaining the position scores and the neighborhood scores of all pixel points through the text detection model;
marking as 1 every pixel whose position score or neighborhood score (at least one of the two) is higher than the corresponding threshold, marking all other pixels as 0, and then taking each connected set of pixels with value 1 as one text target;
and finding the shape with the smallest area that encloses the whole text target, and outputting it as the text box.
In this embodiment, after the position scores and neighborhood scores of all pixels are obtained, the results are integrated into final text boxes and output. A threshold can be set for the position score and another for the neighborhood score (they can be the same or different); every pixel for which at least one of the two scores exceeds its threshold is marked 1, the rest are marked 0, and each connected set of pixels with value 1 is taken as one text target. In the task of detecting rectangular text boxes, the rectangle with the smallest area that encloses the whole target can be found and output as the text box; in the task of detecting polygonal text boxes, a polygon can be constructed from the center point and width of each character and output as the text box.
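A sketch of this post-processing for the rectangular-box task, using OpenCV (the threshold values are illustrative assumptions):

```python
import cv2
import numpy as np

def extract_text_boxes(position_score: np.ndarray, neighborhood_score: np.ndarray,
                       pos_thresh: float = 0.7, nbr_thresh: float = 0.4):
    """Binarize the two score maps, group pixels into text targets, and fit minimum-area rectangles."""
    # A pixel is marked 1 if at least one of its two scores exceeds the corresponding threshold
    binary = ((position_score > pos_thresh) | (neighborhood_score > nbr_thresh)).astype(np.uint8)
    n_labels, labels = cv2.connectedComponents(binary, connectivity=4)
    boxes = []
    for label in range(1, n_labels):  # label 0 is the background
        ys, xs = np.where(labels == label)
        points = np.stack([xs, ys], axis=1).astype(np.float32)  # (x, y) point set of one text target
        rect = cv2.minAreaRect(points)     # smallest-area enclosing (rotated) rectangle
        boxes.append(cv2.boxPoints(rect))  # its 4 corner points
    return boxes
```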
In summary, the network architecture of the embodiments of the present invention is as follows: the down-sampling backbone for feature extraction adopts a convolutional network structure with residual learning and batch normalization; the decoder part adopts the U-Net connection mode with top-down feature aggregation; and finally two channels are output: the single-character center-region probability and the adjacent-character center probability.
With the method provided by the embodiments of the present invention, large and long texts can be predicted using a small receptive field; only character-level content needs attention, not whole text instances; the degradation problem caused by increasing network depth is alleviated; generalization is strong, so texts in arbitrary directions, curved texts, distorted texts and the like can be handled; and the accuracy of text-region segmentation and the computational efficiency are improved at the same time.
Referring to fig. 3, fig. 3 is a schematic block diagram of a character-level natural scene text detection device based on residual learning according to an embodiment of the present invention; the device 300 includes:
a first convolution unit 301, configured to input an original image of size w × h into a convolution layer for convolution and output a feature map of size w/2 × h/2;
a dimension reduction unit 302, configured to input the feature map of size w/2 × h/2 into a max pooling layer for dimensionality reduction and output a feature map of size w/4 × h/4;
a residual extraction unit 303, configured to input the feature map of size w/4 × h/4 into a first bottleneck residual module for convolution and feature extraction, input its output into a second bottleneck residual module for convolution and feature extraction, input that output into a third bottleneck residual module for convolution and feature extraction, input that output into a fourth bottleneck residual module for convolution and feature extraction, and output a feature map of size w/32 × h/32 from the fourth bottleneck residual module;
a first up-sampling unit 304, configured to input the feature map output by the fourth bottleneck residual module into a first up-sampling module for up-sampling and output a feature map of size w/16 × h/16;
a second up-sampling unit 305, configured to stack the feature maps output by the first up-sampling module and the third bottleneck residual module, input the result into the second up-sampling module for up-sampling, and output a feature map of size w/8 × h/8;
a third up-sampling unit 306, configured to stack the feature maps output by the second up-sampling module and the second bottleneck residual module, input the result into the third up-sampling module for up-sampling, and output a feature map of size w/4 × h/4;
a fourth up-sampling unit 307, configured to stack the feature maps output by the third up-sampling module and the first bottleneck residual module, input the result into the fourth up-sampling module for up-sampling, and output a feature map of size w/2 × h/2;
a second convolution unit 308, configured to convolve the feature map output by the fourth up-sampling module through a series of convolution layers and output two branch results: the first branch is the probability that each pixel is at the center of a character; the second branch is the probability that each pixel is in the gap between characters;
a training unit 309, configured to obtain the single-character center-region probability and the adjacent-character center probability from the two branch results respectively, perform network training with a weakly supervised learning method, output text boxes, and thereby construct a text detection model;
and a prediction unit 310, configured to input an image to be detected into the text detection model and output text boxes.
An embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the character-level natural scene text detection method based on residual learning described above.
An embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the character-level natural scene text detection method based on residual learning described above.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A character-level natural scene text detection method based on residual learning, comprising:
inputting an original image of size w × h into a convolution layer for convolution, and outputting a feature map of size w/2 × h/2;
inputting the feature map of size w/2 × h/2 into a max pooling layer for dimensionality reduction, and outputting a feature map of size w/4 × h/4;
inputting the feature map of size w/4 × h/4 into a first bottleneck residual module for convolution and feature extraction, inputting its output into a second bottleneck residual module for convolution and feature extraction, inputting that output into a third bottleneck residual module for convolution and feature extraction, inputting that output into a fourth bottleneck residual module for convolution and feature extraction, and finally outputting a feature map of size w/32 × h/32 from the fourth bottleneck residual module;
inputting the feature map output by the fourth bottleneck residual module into a first up-sampling module for up-sampling, and outputting a feature map of size w/16 × h/16;
stacking the feature maps output by the first up-sampling module and the third bottleneck residual module, inputting the result into a second up-sampling module for up-sampling, and outputting a feature map of size w/8 × h/8;
stacking the feature maps output by the second up-sampling module and the second bottleneck residual module, inputting the result into a third up-sampling module for up-sampling, and outputting a feature map of size w/4 × h/4;
stacking the feature maps output by the third up-sampling module and the first bottleneck residual module, inputting the result into a fourth up-sampling module for up-sampling, and outputting a feature map of size w/2 × h/2;
convolving the feature map output by the fourth up-sampling module through a series of convolution layers, and outputting two branch results: the first branch is the probability that each pixel is at the center of a character; the second branch is the probability that each pixel is in the gap between characters;
obtaining the single-character center-region probability and the adjacent-character center probability from the two branch results respectively, performing network training with a weakly supervised learning method, outputting text boxes, and thereby constructing a text detection model;
and inputting an image to be detected into the text detection model, and outputting text boxes.
2. The character-level natural scene text detection method based on residual learning according to claim 1, wherein inputting the feature map of size w/4 × h/4 into a first bottleneck residual module for convolution and feature extraction, inputting its output into a second bottleneck residual module for convolution and feature extraction, inputting that output into a third bottleneck residual module for convolution and feature extraction, inputting that output into a fourth bottleneck residual module for convolution and feature extraction, and finally outputting a feature map of size w/32 × h/32 from the fourth bottleneck residual module comprises:
inputting the feature map of size w/4 × h/4 into the first bottleneck residual module for convolution and feature extraction, and outputting a feature map of size w/4 × h/4 with the channel count doubled;
inputting that feature map into the second bottleneck residual module for convolution and feature extraction, and outputting a feature map of size w/8 × h/8 with the channel count doubled again;
inputting that feature map into the third bottleneck residual module for convolution and feature extraction, and outputting a feature map of size w/16 × h/16 with the channel count doubled again;
and inputting that feature map into the fourth bottleneck residual module for convolution and feature extraction, and outputting a feature map of size w/32 × h/32 with the channel count doubled again.
3. The method according to claim 1, wherein convolving the feature map output by the fourth up-sampling module through a series of convolution layers and outputting two branch results, the first branch being the probability that each pixel is at the center of a character and the second branch being the probability that each pixel is in the gap between characters, comprises:
inputting the feature map output by the fourth up-sampling module into a 3 × 3 × 32 convolution layer for convolution, inputting the output into a 3 × 3 × 16 convolution layer for convolution, then inputting that output into a 1 × 1 × 16 convolution layer for convolution, and outputting the two branch results: the first branch is the probability that each pixel is at the center of a character; the second branch is the probability that each pixel is in the gap between characters.
4. The character-level natural scene text detection method based on residual learning according to claim 1, wherein the first bottleneck residual module comprises 3 sequentially arranged first bottleneck residual structures, each comprising two paths, one of which comprises 3 sequentially arranged convolution layers with kernels of 1 × 1, 3 × 3 and 1 × 1;
the second bottleneck residual module comprises 4 sequentially arranged second bottleneck residual structures, each comprising two paths, one of which comprises 3 sequentially arranged convolution layers with kernels of 1 × 1, 3 × 3 and 1 × 1;
the third bottleneck residual module comprises 6 sequentially arranged third bottleneck residual structures, each comprising two paths, one of which comprises 3 sequentially arranged convolution layers with kernels of 1 × 1, 3 × 3 and 1 × 1;
and the fourth bottleneck residual module comprises 3 sequentially arranged fourth bottleneck residual structures, each comprising two paths, one of which comprises 3 sequentially arranged convolution layers with kernels of 1 × 1, 3 × 3 and 1 × 1.
5. The method of claim 4, wherein the other path of each first bottleneck residual structure comprises a convolution layer with a 1 × 1 kernel; the other path of each second bottleneck residual structure comprises a convolution layer with a 1 × 1 kernel; the other path of each third bottleneck residual structure comprises a convolution layer with a 1 × 1 kernel; and the other path of each fourth bottleneck residual structure comprises a convolution layer with a 1 × 1 kernel.
6. The character-level natural scene text detection method based on residual learning according to claim 1, wherein obtaining the single-character center-region probability and the adjacent-character center probability from the two branch results respectively, performing network training with a weakly supervised learning method, outputting text boxes, and constructing the text detection model comprises:
training first with synthesized (non-real) images and then with real images, according to the following loss function:
$$L = \sum_{p} S_c(p) \cdot \Bigl( \bigl\lVert S_r(p) - S_r^{*}(p) \bigr\rVert_2^2 + \bigl\lVert S_a(p) - S_a^{*}(p) \bigr\rVert_2^2 \Bigr)$$

where $S_r^{*}(p)$ and $S_a^{*}(p)$ denote the ground-truth single-character center-region probability and the ground-truth adjacent-character center probability, respectively, and $S_r(p)$ and $S_a(p)$ denote the corresponding predicted probabilities.
7. The character-level natural scene text detection method based on residual learning according to claim 1, wherein inputting the image to be detected into the text detection model and outputting text boxes comprises:
obtaining the position scores and the neighborhood scores of all pixel points through the text detection model;
marking as 1 every pixel whose position score or neighborhood score (at least one of the two) is higher than the corresponding threshold, marking all other pixels as 0, and then taking each connected set of pixels with value 1 as one text target;
and finding out the shape which has the smallest area and surrounds the whole text object to be output as a text box.
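An illustrative OpenCV sketch of this post-processing; the threshold value of 0.5 and the reading of the smallest enclosing shape as a minimum-area rotated rectangle are assumptions, since the claim fixes neither:

    import cv2
    import numpy as np

    def boxes_from_score_maps(pos_score, ngh_score, thresh=0.5):
        """Mark a pixel 1 if either its position score or its neighborhood
        score exceeds the threshold, take each connected component of
        1-pixels as one text target, and return the minimum-area rotated
        rectangle enclosing each target."""
        binary = ((pos_score > thresh) | (ngh_score > thresh)).astype(np.uint8)
        n_labels, labels = cv2.connectedComponents(binary)
        boxes = []
        for k in range(1, n_labels):  # label 0 is the background
            ys, xs = np.where(labels == k)
            pts = np.stack([xs, ys], axis=1).astype(np.float32)
            rect = cv2.minAreaRect(pts)        # ((cx, cy), (w, h), angle)
            boxes.append(cv2.boxPoints(rect))  # the 4 corners of the box
        return boxes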
8. A character-level natural scene text detection device based on residual learning, comprising:
a first convolution unit for inputting an original image of size w × h into a convolution layer for convolution and outputting a feature map of size w/2 × h/2;
a dimension reduction unit for inputting the w/2 × h/2 feature map into a maximum pooling layer for dimension reduction and outputting a feature map of size w/4 × h/4;
a residual extraction unit for inputting the w/4 × h/4 feature map into the first bottleneck residual module for convolution and feature extraction, inputting its output into the second bottleneck residual module for convolution and feature extraction, then into the third bottleneck residual module, then into the fourth bottleneck residual module, the fourth bottleneck residual module finally outputting a feature map of size w/32 × h/32;
a first upsampling unit for inputting the feature map output by the fourth bottleneck residual module into the first upsampling module for upsampling and outputting a feature map of size w/16 × h/16;
a second upsampling unit for stacking the feature maps output by the first upsampling module and the third bottleneck residual module, inputting the result into the second upsampling module for upsampling, and outputting a feature map of size w/8 × h/8;
a third upsampling unit for stacking the feature maps output by the second upsampling module and the second bottleneck residual module, inputting the result into the third upsampling module for upsampling, and outputting a feature map of size w/4 × h/4;
a fourth upsampling unit for stacking the feature maps output by the third upsampling module and the first bottleneck residual module, inputting the result into the fourth upsampling module for upsampling, and outputting a feature map of size w/2 × h/2;
a second convolution unit for passing the feature map output by the fourth upsampling module through a series of convolution layers and outputting two branch results, the first branch being the probability that each pixel point lies at a character center, and the second branch being the probability that each pixel point lies in the gap between adjacent characters;
a training unit for obtaining the single-character center-region probability and the adjacent-character center-region probability respectively from the two branch results, performing network training with a weakly supervised learning method, outputting a text box, and constructing the text detection model;
and a prediction unit for inputting the image to be detected into the text detection model and outputting a text box.
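Putting the units of claim 8 together, a skeleton forward pass might look as follows, reusing the Bottleneck/make_module sketch given after claim 5. The channel widths (following ResNet-50), the nearest-neighbor upsampling, the composition of each upsampling module, and the final sigmoid are assumptions; the claims fix only the spatial sizes:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextDetector(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 64, 7, stride=2, padding=3)  # -> w/2 x h/2
            self.pool = nn.MaxPool2d(3, stride=2, padding=1)       # -> w/4 x h/4
            self.mod1 = make_module(3, 64, 64, 256)                # stays at w/4
            self.mod2 = make_module(4, 256, 128, 512, stride=2)    # -> w/8
            self.mod3 = make_module(6, 512, 256, 1024, stride=2)   # -> w/16
            self.mod4 = make_module(3, 1024, 512, 2048, stride=2)  # -> w/32

            def up_block(in_ch, out_ch):  # assumed form of an upsampling module
                return nn.Sequential(
                    nn.Conv2d(in_ch, out_ch, 1), nn.ReLU(inplace=True),
                    nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

            self.up1 = up_block(2048, 1024)
            self.up2 = up_block(1024 + 1024, 512)
            self.up3 = up_block(512 + 512, 256)
            self.up4 = up_block(256 + 256, 128)
            self.head = nn.Sequential(  # the "series of convolution layers"
                nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 2, 1))    # ch 0: character center; ch 1: gap

        def forward(self, x):                                    # x: N x 3 x h x w
            f1 = self.mod1(self.pool(self.conv1(x)))             # w/4
            f2 = self.mod2(f1)                                   # w/8
            f3 = self.mod3(f2)                                   # w/16
            f4 = self.mod4(f3)                                   # w/32
            u1 = F.interpolate(self.up1(f4), scale_factor=2.0)   # w/16
            u2 = F.interpolate(self.up2(torch.cat([u1, f3], 1)), scale_factor=2.0)  # w/8
            u3 = F.interpolate(self.up3(torch.cat([u2, f2], 1)), scale_factor=2.0)  # w/4
            u4 = F.interpolate(self.up4(torch.cat([u3, f1], 1)), scale_factor=2.0)  # w/2
            return torch.sigmoid(self.head(u4))  # two score maps at w/2 x h/2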
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the residual-learning-based character-level natural scene text detection method of any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the residual-learning-based character-level natural scene text detection method of any one of claims 1 to 7.
CN202011141227.XA 2020-10-22 2020-10-22 Character-level text detection method and device, computer equipment and storage medium Pending CN112257708A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011141227.XA CN112257708A (en) 2020-10-22 2020-10-22 Character-level text detection method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112257708A true CN112257708A (en) 2021-01-22

Family

ID=74263246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011141227.XA Pending CN112257708A (en) 2020-10-22 2020-10-22 Character-level text detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112257708A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268538A (en) * 2014-10-13 2015-01-07 江南大学 Online visual inspection method for dot matrix sprayed code characters of beverage cans
US20190130204A1 (en) * 2017-10-31 2019-05-02 The University Of Florida Research Foundation, Incorporated Apparatus and method for detecting scene text in an image
CN111062386A (en) * 2019-11-28 2020-04-24 大连交通大学 Natural scene text detection method based on depth pyramid attention and feature fusion
CN111242125A (en) * 2020-01-14 2020-06-05 深圳大学 Natural scene image text detection method, storage medium and terminal device
CN111666937A (en) * 2020-04-17 2020-09-15 广州多益网络股份有限公司 Method and system for recognizing text in image
CN111798480A (en) * 2020-07-23 2020-10-20 北京思图场景数据科技服务有限公司 Character detection method and device based on single character and character connection relation prediction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Youngmin Baek et al.: "Character Region Awareness for Text Detection", arXiv *
PAN Haixia: "Deep Learning Engineer Certification: Elementary Tutorial" (《深度学习工程师认证初级教程》), 31 March 2020, Beihang University Press *
GAO Jingpeng: "Deep Learning: Convolutional Neural Network Technology and Practice" (《深度学习:卷积神经网络技术与实践》), 30 June 2020, China Machine Press *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361432A (en) * 2021-06-15 2021-09-07 电子科技大学 Video character end-to-end detection and identification method based on deep learning
CN113361432B (en) * 2021-06-15 2022-03-15 电子科技大学 Video character end-to-end detection and identification method based on deep learning
CN114241481A (en) * 2022-01-19 2022-03-25 湖南四方天箭信息科技有限公司 Text detection method and device based on text skeleton and computer equipment

Similar Documents

Publication Publication Date Title
CN108549893B (en) End-to-end identification method for scene text with any shape
CN110176027B (en) Video target tracking method, device, equipment and storage medium
Liu et al. Squeezedtext: A real-time scene text recognition by binary convolutional encoder-decoder network
CN107133622B (en) Word segmentation method and device
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN111680690B (en) Character recognition method and device
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN110443235B (en) Intelligent paper test paper total score identification method and system
CN112257708A (en) Character-level text detection method and device, computer equipment and storage medium
CN111881768A (en) Document layout analysis method
CN111914654A (en) Text layout analysis method, device, equipment and medium
CN111553351A (en) Semantic segmentation based text detection method for arbitrary scene shape
Kölsch et al. Recognizing challenging handwritten annotations with fully convolutional networks
CN111553290A (en) Text recognition method, device, equipment and storage medium
CN114519717A (en) Image processing method and device, computer equipment and storage medium
CN111476226B (en) Text positioning method and device and model training method
CN112580656A (en) End-to-end text detection method, system, terminal and storage medium
CN112365451A (en) Method, device and equipment for determining image quality grade and computer readable medium
CN114155540B (en) Character recognition method, device, equipment and storage medium based on deep learning
CN111553361A (en) Pathological section label identification method
CN116363105A (en) Method for identifying and positioning high-speed rail contact net parts based on Faster R-CNN
CN115909378A (en) Document text detection model training method and document text detection method
Goud et al. Text localization and recognition from natural scene images using ai
Lei et al. Noise-robust wagon text extraction based on defect-restore generative adversarial network
CN114155541A (en) Character recognition method and device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210122