CN111582021B - Text detection method and device in scene image and computer equipment - Google Patents
- Publication number
- CN111582021B (application CN202010223195.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- pixel points
- text prediction
- circumscribed rectangle
- confidence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
- G06V10/225—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on a marking or identifier characterising the area
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Abstract
The invention relates to the technical field of image processing, and in particular to a text detection method and device for scene images and to computer equipment. The method comprises the following steps: detecting and determining a plurality of text prediction boxes in the scene image through a trained full convolution network model; screening high-confidence pixel points within each text prediction box; calculating the minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points; calculating the overlapping degree between the text prediction box and the minimum circumscribed rectangle and, when the overlapping degree is greater than a preset overlapping degree threshold, adjusting the width of the text prediction box through the minimum circumscribed rectangle; and cutting the adjusted box out of the scene image to obtain a text image to be recognized and recognizing the text information in it. On the basis of text detection with the EAST method, the method provided by the embodiment of the invention corrects the width of the text prediction box through the high-confidence region, so that the width of the text prediction box is reliably narrowed and more accurate text recognition is achieved.
Description
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for detecting text in a scene image, and a computer device.
Background
Text recognition based on computer vision is of great use in the present big-data age. It is the basis of many intelligent functions, such as recommendation systems and machine translation. Text detection is a precondition of the text recognition process, and its accuracy has a marked influence on the text recognition result.
In complex natural scenes, text appears at arbitrary positions, in varied arrangements and inconsistent orientations, and in mixtures of languages, all of which make text detection a very challenging task.
A conventional text detection algorithm called CTPN detects text in natural scenes by splitting complete text into slices, detecting the slices, and then merging them. Detecting text through such segmentation and recombination is, on the one hand, not accurate enough and, on the other hand, too time-consuming, which harms the user experience. On this basis, a text detection method called EAST (An Efficient and Accurate Scene Text detector) was proposed. EAST performs feature extraction and learning with a fully convolutional network (FCN) architecture and is trained and optimized end to end, eliminating unnecessary intermediate steps.
However, EAST still has many limitations in practical application and cannot fully meet application requirements. For example, the width of the final text prediction box often does not match the actual text in the scene, so the conventional technology needs further improvement on the basis of EAST.
Disclosure of Invention
The invention aims to solve the technical problem that the recognition precision of the existing EAST algorithm cannot meet actual use requirements.
To solve the above technical problem, in a first aspect, an embodiment of the present invention provides a method for detecting text in a scene image, including: training and optimizing a full convolution network model;
detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model; screening pixel points whose confidence is greater than a preset confidence threshold within the text prediction box as high-confidence pixel points, where the confidence is the probability, output by the full convolution network model, that a pixel point belongs to the text prediction box; calculating the minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points, where the minimum circumscribed rectangle is the rectangle of minimum area containing all high-confidence pixel points in the text prediction box; calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle; when the overlapping degree is greater than a preset overlapping degree threshold, adjusting the width of the text prediction box through the minimum circumscribed rectangle; cutting the adjusted text prediction box out of the scene image to obtain a text image to be recognized; and recognizing the characters in the text image to be recognized.
Optionally, before calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle, the method further comprises:
calculating the average confidence of the high-confidence pixel points in the minimum circumscribed rectangle;
and, when the average confidence is smaller than a preset screening threshold, eliminating the minimum circumscribed rectangle.
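This optional screening step can be sketched in a few lines. The helper name and the 0.8 screening threshold below are illustrative assumptions, not values from the patent:

```python
def keep_rectangle(confidences, screening_threshold=0.8):
    """Return True if the mean confidence of the high-confidence pixels
    inside a minimum circumscribed rectangle reaches the screening
    threshold; rectangles below it are eliminated before the overlap check.
    """
    return sum(confidences) / len(confidences) >= screening_threshold

# A rectangle whose pixels average 0.9 survives; one averaging ~0.7 is dropped.
```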
Optionally, the training optimization on the full convolution network model includes: constructing a full convolution network model; labeling a training label and constructing a training data set; and training and optimizing the full convolution network model through the training data set and a preset loss function.
Optionally, calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle includes:
determining pixel points lying in both the text prediction box and the minimum circumscribed rectangle as first pixel points; determining pixel points belonging to only one of the text prediction box and the minimum circumscribed rectangle as second pixel points; calculating the sum of the numbers of first pixel points and second pixel points; and taking the ratio between the number of first pixel points and that sum as the overlapping degree.
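This pixel-counting definition is exactly intersection over union. A minimal sketch, assuming each region is given as a set of integer (x, y) pixel coordinates (the function and variable names are illustrative):

```python
def overlap_degree(box_a, box_b):
    """Overlapping degree between two pixel regions given as coordinate sets.

    First pixels lie in both regions (intersection); second pixels lie in
    exactly one region (symmetric difference). The ratio
    |intersection| / (|intersection| + |symmetric difference|)
    equals |intersection| / |union|, i.e. the IoU.
    """
    first = box_a & box_b    # pixels inside both boxes
    second = box_a ^ box_b   # pixels inside exactly one box
    total = len(first) + len(second)
    return len(first) / total if total else 0.0

# Two 3x3 regions offset by one column: intersection 6, union 12.
a = {(x, y) for x in range(3) for y in range(3)}
b = {(x, y) for x in range(1, 4) for y in range(3)}
```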
Optionally, when the overlapping degree is greater than a preset overlapping degree threshold, the text prediction box is adjusted by the following formula:
P1=w*p+(1-w)*d,
where P1 is the adjusted width of the text prediction box, w is a weight coefficient, p is the original width of the text prediction box, and d is the width of the corresponding minimum circumscribed rectangle.
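The adjustment formula above is a simple weighted blend, sketched below (the function name and the example weight 0.5 are illustrative; the patent does not fix a value for w):

```python
def adjust_width(p, d, w=0.5):
    """Blend the predicted box width p with the minimum circumscribed
    rectangle width d using weight w: P1 = w*p + (1-w)*d."""
    return w * p + (1 - w) * d

# Predicted width 100 and min-rect width 80 with w = 0.5 give width 90.
```

With w = 1 the prediction is kept unchanged; with w = 0 the box snaps fully to the minimum circumscribed rectangle.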
Optionally, calculating the minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points includes:
determining the two high-confidence pixel points that are farthest apart as length calibration pixel points;
taking the line connecting the length calibration pixel points as a first direction, and determining the two high-confidence pixel points farthest apart along a second direction perpendicular to the first direction as width calibration pixel points;
and enclosing the minimum circumscribed rectangle with the two sides that pass through the length calibration pixel points perpendicular to the first direction and the two sides that pass through the width calibration pixel points perpendicular to the second direction.
In a second aspect, an embodiment of the present invention provides a text detection apparatus for a scene image, including:
a training unit, used to train and optimize the full convolution network model; a text prediction box detection unit, used to detect and determine a plurality of text prediction boxes in the scene image through the trained full convolution network model; a screening unit, used to screen pixel points whose confidence is greater than a preset confidence threshold within the text prediction box as high-confidence pixel points, where the confidence is the probability, output by the full convolution network model, that a pixel point belongs to the text prediction box; a minimum circumscribed rectangle determining unit, used to calculate the minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points, where the minimum circumscribed rectangle is the rectangle of minimum area containing all high-confidence pixel points in the text prediction box; an overlapping degree calculating unit, used to calculate the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle; an adjusting unit, used to adjust the width of the text prediction box through the minimum circumscribed rectangle when the overlapping degree is greater than a preset overlapping degree threshold; a cutting unit, used to cut the adjusted text prediction box out of the scene image to obtain a text image to be recognized; and a text recognition unit, used to recognize the text information in the text image to be recognized.
Optionally, the apparatus further comprises: a confidence calculating unit, used to calculate the average confidence of the high-confidence pixel points in the minimum circumscribed rectangle; and a minimum circumscribed rectangle screening unit, used to eliminate the minimum circumscribed rectangle when the average confidence is smaller than a preset screening threshold.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the text detection method of the scene image when executing the computer program.
In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the above-described text detection method of a scene image.
According to the text detection method provided by the embodiment of the invention, on the basis of realizing text detection by using an EAST method, the width of the text prediction box is corrected and adjusted through the high-confidence region, so that the width of the text prediction box is reliably reduced, and more accurate text recognition is realized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
Fig. 2 is a flow chart of a text detection method of a scene image according to an embodiment of the present invention;
FIG. 3 is a flow chart of step 20 in FIG. 2;
FIG. 4 is a schematic flow chart of screening minimum bounding rectangles according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a text detection device for a scene image according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a text detection device for a scene image according to another embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The embodiment of the invention first provides a text detection method for scene images. On the basis of text detection with the EAST method, this method adjusts the width of the text prediction box through the high-confidence region, thereby achieving more accurate text recognition.
Referring to fig. 1, fig. 1 is a schematic diagram of a computer device 100 according to an embodiment of the invention. The computer device 100 may be a computer, a computer cluster, a mainframe, a computing device dedicated to providing online content, or a computer network comprising a group of computers operating in a centralized or distributed manner.
As shown in fig. 1, the computer device 100 includes: a processor 102, a memory, and a network interface 105 connected by a system bus 101; the memory may include a nonvolatile storage medium 103 and an internal memory 104.
In an embodiment of the present invention, the processor 102 may be a central processing unit (CPU); it may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor. The number of processors 102 may be one or more, and the one or more processors 102 may execute sequences of computer program instructions to perform the text detection method for scene images described in more detail below.
Computer program instructions are stored in the non-volatile storage medium 103 and are accessed and read from it for execution by the processor 102 to implement the text detection method disclosed in the embodiments of the present invention described below. For example, the non-volatile storage medium 103 stores a software application that performs this method. Further, the non-volatile storage medium 103 may store the entire software application or only the portion of it that is executable by the processor 102. Although only one block is shown in fig. 1, the non-volatile storage medium 103 may comprise a plurality of physical devices installed on a central processing device or on different computing devices.
The network interface 105 is used for network communication, such as providing for transmission of data information, etc. It will be appreciated by those skilled in the art that the structure shown in FIG. 1 is merely a block diagram of some of the structures associated with the present inventive arrangements and does not constitute a limitation of the computer device 100 to which the present inventive arrangements are applied, and that a particular computer device 100 may include more or less components than those shown, or may combine some of the components, or have a different arrangement of components.
The embodiment of the invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium. The computer readable storage medium stores a computer program which, when executed by a processor, implements the text detection method for scene images disclosed in the embodiments of the invention. The computer program product may be embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer program code.
When the computer device 100 is implemented in software, fig. 2 shows a schematic diagram of the text detection method for scene images according to an embodiment; the method of fig. 2 is described in detail below. Referring to fig. 2, the method includes the following steps.
Step 20: training and optimizing the full convolution network model.
The full convolution network model is a type of neural network model. Before use, it must be trained offline with training data to determine the connection weight parameters between neurons.
In some embodiments, as shown in fig. 3, the step 20 specifically includes the following steps:
Step 200: constructing a full convolution network model.
Feature extraction is performed on the input scene image data through the full convolution network model, finally generating a single-channel, pixel-level text score feature map and a multi-channel geometry feature map. Specifically, the network structure of the full convolution network model can be broken down into three parts: a feature extraction layer, a feature merging layer, and an output layer.
First, the feature extraction layer adopts a general convolutional network as its base network. During training, the parameters of the convolutional network are initialized and features are then extracted; after training, the optimized convolutional network parameters are obtained. In practical application, base networks such as PVANet (a design balancing performance versus accuracy) or VGG16 (from the Visual Geometry Group) can be selected according to actual requirements. In the embodiment of the invention, the convolutional network extracts four levels of feature maps, whose sizes are 1/32, 1/16, 1/8 and 1/4 of the input image. Since locating large text requires a large receptive field while locating small text regions requires a small one, using feature maps at different levels meets the requirements of natural scenes, where text region sizes vary greatly.
Second, the four levels of feature maps are merged layer by layer in a U-shaped fashion, which keeps the later computation cost low. Following the EAST formulation, with fi denoting the i-th feature map from the extraction stem, the layer-by-layer merging can be represented as:

gi = unpool(hi), if i ≤ 3;  gi = conv3×3(hi), if i = 4

hi = fi, if i = 1;  hi = conv3×3(conv1×1([gi−1; fi])), otherwise

The specific process is as follows: in each merging stage, the feature map from the previous stage is first fed to an unpooling layer (unpool) to expand its size, and is then concatenated with the current-level feature map. A conv1×1 layer then reduces the number of channels and the amount of computation, and a conv3×3 layer fuses local information to produce the output of the merging stage. After the last merging stage (i.e. i = 4), a conv3×3 layer generates the final feature map of the merging branch and sends it to the output layer.
Finally, the output layer outputs the text score feature map and the geometry feature map, each 1/4 the size of the original image; the text score feature map has 1 channel and the geometry feature map has 5 channels. The text score feature map indicates the confidence that each pixel belongs to a text prediction box.
Step 202: labeling training labels and constructing a training data set.
The labeling of the training labels can be accomplished in any suitable manner known in the prior art, and the labeled data serve as a training data set for the full convolution network model. In some cases, an existing training data set may also be used directly for training or testing.
Step 204: training and optimizing the full convolution network model through the training data set and a preset loss function.
Training optimization is a learning optimization process for parameters of the full convolution network model. When the parameter optimization is completed, the fully-convolution network model with the completed training can be applied to the text detection of the actual scene.
Besides the well-labeled training data, the optimization process also requires a suitable loss function for evaluating the performance of the full convolution network model; parameter optimization is realized by minimizing the loss.
In the present application, the loss function can be expressed as:

L = Ls + λg·Lg

where L is the total loss, Ls is the loss of the text score feature map, Lg is the loss of the geometry feature map, and λg balances the importance of the two losses and can be set to 1.
Specifically, the loss of the text score feature map may be computed with class-balanced cross entropy, and the loss of the geometry feature map with an overlap (IoU, intersection over union) loss function.
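A scalar sketch of the score-map loss and the total loss L = Ls + λg·Lg, assuming flattened 0/1 labels and using β = 1 − (positive fraction) as the class-balance weight; the function names are illustrative:

```python
import math

def balanced_cross_entropy(y_true, y_pred, eps=1e-7):
    """Class-balanced cross entropy for the text score map: the positive
    class is weighted by beta = 1 - (fraction of positive pixels)."""
    beta = 1 - sum(y_true) / len(y_true)
    loss = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss += -(beta * y * math.log(p)
                  + (1 - beta) * (1 - y) * math.log(1 - p))
    return loss / len(y_true)

def total_loss(l_score, l_geometry, lambda_g=1.0):
    """L = Ls + lambda_g * Lg, with lambda_g set to 1 as in the text."""
    return l_score + lambda_g * l_geometry
```

Perfect score-map predictions drive the balanced cross entropy toward zero, while uncertain predictions (e.g. a uniform 0.5) incur a clear penalty.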
Step 22: detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model.
The trained full convolution network model determines the text prediction boxes in the scene image to be detected, i.e. the regions of the scene image that contain text.
As described above, the output of the full convolution network model includes a text score feature map and a geometry feature map. The text score feature map records, for each pixel mapped back to the image to be detected, the probability that it belongs to a text prediction box. The geometry feature map records, for each such pixel, its distances to the borders of the text prediction box.
The full convolution network model typically outputs a large number of candidate text prediction boxes. Therefore, in a preferred embodiment, a non-maximum suppression algorithm may be applied to eliminate redundant candidates and keep only the best text prediction boxes, which are the text prediction boxes referred to in the embodiment of the present invention.
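For illustration, here is the standard form of non-maximum suppression over axis-aligned boxes (x1, y1, x2, y2, score). Note that EAST itself uses a locality-aware variant; this plain version only sketches the idea, and all names are illustrative:

```python
def nms(boxes, iou_threshold=0.5):
    """Keep the highest-scoring boxes; drop any box whose IoU with an
    already-kept box exceeds the threshold."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0

    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept

# Two heavily overlapping candidates plus one distant box: the weaker
# overlapping candidate is suppressed, leaving two boxes.
boxes = [(0, 0, 10, 10, 0.9), (1, 1, 10, 10, 0.8), (20, 20, 30, 30, 0.7)]
```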
In this embodiment, a scene image is an image captured in a real-world scene, for example, an image framed and shot with any suitable camera-equipped terminal.
Step 24: screening pixel points whose confidence is greater than a preset confidence threshold in the text prediction box as high-confidence pixel points.
The confidence is the probability, output by the full convolution network model, that a pixel point belongs to a text prediction box. The text score feature map records this confidence for every pixel point and thus indicates where text prediction boxes may exist. In this step, a suitable screening method selects the pixels with higher confidence, which can then be used to further adjust and optimize the text prediction box.
Specifically, high-confidence pixel points can be screened from the text score feature map by setting a suitable confidence threshold. For example, with the threshold set to 0.7, each pixel's score in the text score feature map is compared against it in turn: if the score exceeds the threshold, the pixel is kept as a high-confidence pixel point; otherwise it is discarded.
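The screening step above reduces to a simple threshold scan over the score map. A minimal sketch, with the 0.7 threshold taken from the example in the text and the function name illustrative:

```python
def screen_high_confidence(score_map, threshold=0.7):
    """Collect (row, col) coordinates whose score in the text score
    feature map exceeds the confidence threshold."""
    return [(r, c)
            for r, row in enumerate(score_map)
            for c, score in enumerate(row)
            if score > threshold]

# Toy 2x2 score map: only the 0.9 and 0.8 pixels survive the 0.7 threshold.
score_map = [[0.2, 0.9],
             [0.8, 0.6]]
```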
One image to be detected may contain several different text prediction boxes, so the high-confidence pixels may belong to different text regions in the scene. To avoid errors during adjustment or correction, the high-confidence pixels must therefore be marked and distinguished. Specifically, according to its position, each pixel point is assigned to the text prediction box it belongs to, so that the high-confidence pixel points are grouped into their corresponding text prediction boxes.
Step 26: calculate the minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points.
The minimum bounding rectangle (MBR), expressed in two-dimensional coordinates, is the maximum extent of the high-confidence pixel points in the same text prediction box. In other words, it is the rectangle, determined by the high-confidence pixel points of one text prediction box, that contains all of those pixel points and has the smallest area.
In particular, any suitable algorithm may be used to computationally determine the minimum bounding rectangle for each text prediction box.
In some embodiments, the method specifically includes the following steps:
First, among the high-confidence pixel points, the two that are farthest apart are determined as the length calibration pixel points.
Then, taking the line connecting the length calibration pixel points as a first direction, the two high-confidence pixel points that are farthest apart along a second direction perpendicular to the first direction are determined as the width calibration pixel points.
Finally, the first line segments, which pass through the length calibration pixel points and are perpendicular to the line connecting them, are taken as one pair of sides, and the second line segments, which pass through the width calibration pixel points and are perpendicular to the line connecting them, are taken as the other pair, so that the minimum circumscribed rectangle can be enclosed.
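The three steps above can be sketched as follows, with hypothetical names. Note that this farthest-pair construction follows the text's description; it is a heuristic, not a general minimum-area rectangle algorithm such as rotating calipers.

```python
import numpy as np

def min_bounding_rect(points):
    """Oriented rectangle from the farthest-pair heuristic described in the text.

    points: iterable of (x, y) high-confidence pixel coordinates.
    Returns the four corners of the rectangle, ordered around its boundary.
    """
    pts = np.asarray(points, dtype=float)
    # Length calibration: the two points that are farthest apart.
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    axis = (pts[j] - pts[i]) / d[i, j]      # first direction
    perp = np.array([-axis[1], axis[0]])    # second (perpendicular) direction
    # Width calibration: extreme projections along the perpendicular direction.
    t = pts @ perp
    lo, hi = t.min(), t.max()
    # Extents along the first direction give the other pair of sides.
    s = pts @ axis
    corners = [a * axis + b * perp for a in (s.min(), s.max()) for b in (lo, hi)]
    # Reorder so the corners trace the rectangle boundary.
    corners = [corners[0], corners[1], corners[3], corners[2]]
    return np.array(corners)
```

For points spread 10 units along x and 3 units along y, the enclosed rectangle has area 30.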
Step 28: calculate the degree of overlap between the text prediction box and the corresponding minimum circumscribed rectangle.
The degree of overlap (IoU, intersection over union), also called the overlap ratio, characterizes how much the text prediction box and the corresponding minimum circumscribed rectangle coincide. It is calculated as the ratio of the area of the intersection of the two boxes to the area of their union. The higher the overlap, the better the two boxes match.
In some embodiments, the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle may be calculated specifically by:
First, the pixel points lying in both the text prediction box and the minimum bounding rectangle are determined as first pixel points, and the pixel points belonging to only the text prediction box or only the minimum bounding rectangle are determined as second pixel points.
Then, the sum of the numbers of first pixel points and second pixel points is calculated.
Finally, the ratio of the number of first pixel points to that sum is calculated as the degree of overlap.
Step 30: when the degree of overlap is greater than a preset overlap threshold, adjust the width of the text prediction box through the minimum circumscribed rectangle.
The overlap threshold is an empirical value that can be set by the skilled person as the situation requires. Typically, the width of the minimum bounding rectangle is smaller than the width of the text prediction box, which indicates that the region within the minimum bounding rectangle is more likely to belong to the text region. Therefore, the text prediction box can be adjusted appropriately through the minimum circumscribed rectangle so that its width is correspondingly reduced.
Specifically, when the degree of overlap is greater than the preset overlap threshold, the text prediction box is adjusted by the following formula:
P1=w*p+(1-w)*d,
wherein P1 is the adjusted width of the text prediction box, w is a weight coefficient, p is the width of the text prediction box, and d is the width of the corresponding minimum circumscribed rectangle.
With the above formula, once a suitable value of w is given, the width of the text prediction box can be corrected toward the narrower, effective minimum circumscribed rectangle, so that the width of the text prediction box is reliably reduced and more accurate text recognition is achieved.
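A minimal sketch of the adjustment formula. The default value of w is an assumption; the patent only requires a suitable weight coefficient.

```python
def adjust_width(p, d, w=0.5):
    """P1 = w*p + (1-w)*d: blend the predicted box width p toward the
    minimum bounding rectangle width d. With d < p, the result is narrower than p."""
    return w * p + (1 - w) * d
```

For a predicted width of 100 and a rectangle width of 60, equal weighting yields 80, and w = 0.8 yields a gentler correction of 92.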
Step 32: cut the adjusted text prediction box out of the scene image to obtain the text image to be recognized.
The adjusted text prediction boxes indicate the regions of the scene image that contain text. These regions can therefore be cut out of the scene image as the text images to be recognized.
Step 34: recognize the text information in the text image to be recognized.
Specifically, any suitable algorithm or model can be selected to identify and acquire the text information in the text image, yielding the final text detection result for the scene image. Such recognition methods are well known to those skilled in the art and are not described in detail herein.
By applying the text detection method provided by the embodiment of the invention, the width of the text prediction box can be reliably reduced, more accurate text recognition is realized, the difficulty of subsequent processing is reduced, and the text detection accuracy is improved.
Because the minimum bounding rectangle is the standard against which the width of the text detection box is finally adjusted, it is necessary to ensure that the minimum bounding rectangle is reliable; otherwise, the subsequent adjustment may instead have adverse consequences.
In some embodiments, before performing step 28, the method may further include a step of screening the minimum bounding rectangles, as shown in fig. 4:
step 401: and calculating the confidence average value of the high-confidence pixel points in the minimum circumscribed rectangle.
The confidence average is the mean confidence of the high-confidence pixel points, and represents the probability that the minimum bounding rectangle as a whole belongs to a text region.
Step 402: and judging whether the confidence average value is smaller than a preset screening threshold value. If yes, go to step 403. If not, go to step 404.
Step 403: and eliminating the minimum circumscribed rectangle.
It will be appreciated that minimum bounding rectangles with a low confidence average do not, in fact, have a high reliability or probability of being text, and are therefore insufficient as a standard for correction. Such minimum bounding rectangles can be eliminated, and the width correction of the text prediction box is performed without them.
Step 404: and reserving the minimum bounding rectangle as an effective minimum bounding rectangle. These effective minimum bounding rectangles can be used for further processing as references to adjust the text detection box.
An embodiment of the present invention further provides a text detection device corresponding to the text detection method for a scene image in the foregoing embodiments. Referring to fig. 5, which is a block diagram of the text detection device for a scene image provided in an embodiment of the present invention, the text detection device 500 includes: a training unit 50, a text prediction box detection unit 52, a screening unit 54, a minimum bounding rectangle determination unit 56, an overlap calculation unit 58, an adjustment unit 60, a cutting unit 62, and a text recognition unit 64.
The training unit 50 is used for training and optimizing the full convolution network model.
The text prediction box detection unit 52 is configured to detect and determine a plurality of text prediction boxes in the scene image through the trained full convolution network model. The screening unit 54 is configured to screen the pixel points whose confidence is greater than a preset confidence threshold in the text prediction box as high-confidence pixel points, where the confidence is the probability, output by the full convolution network model, that a pixel point belongs to the text prediction box. The minimum bounding rectangle determining unit 56 is configured to calculate, according to the high-confidence pixel points, the minimum bounding rectangle corresponding to the text prediction box, where the minimum bounding rectangle is the rectangle that contains all high-confidence pixel points in the text prediction box and has the smallest area. The overlap calculating unit 58 is configured to calculate the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle. The adjusting unit 60 is configured to adjust the width of the text prediction box through the minimum bounding rectangle when the degree of overlap is greater than a preset overlap threshold. The cutting unit 62 is configured to cut the adjusted text prediction box out of the scene image to obtain the text image to be recognized. The text recognition unit 64 is configured to recognize the text information in the text image to be recognized.
According to the text detection device for scene images provided by the embodiment of the present invention, on the basis of text detection with the EAST method, the width of the text prediction box is corrected through the high-confidence region, so that the width of the text prediction box is reliably reduced and more accurate text recognition is achieved.
In some embodiments, as shown in fig. 6, the text detection apparatus 500 may further include, in addition to the functional modules shown in fig. 5: confidence level calculation unit 66 and minimum bounding rectangle screening unit 68.
The confidence calculating unit 66 is configured to calculate a confidence average value of the high-confidence pixel points in the minimum bounding rectangle. The minimum bounding rectangle filtering unit 68 is configured to reject the minimum bounding rectangle when the confidence average value is less than a preset filtering threshold.
The minimum bounding rectangle (MBR), expressed in two-dimensional coordinates, is the maximum extent of the high-confidence pixel points in the same text prediction box. It represents the rectangular region determined by the high-confidence pixel points of one text prediction box. The minimum bounding rectangle may be determined or computed by any suitable means; computing the minimum bounding rectangle of a given set of pixel points is well known to those skilled in the art and is not detailed herein.
By applying the text detection device for the scene image, which is provided by the embodiment of the invention, the width of the text prediction box can be reliably reduced, and more accurate text recognition can be realized.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, for the specific working procedures of the apparatus, device, and units described above, reference may be made to the corresponding procedures in the foregoing method embodiments, which are not repeated herein. Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two; to clearly illustrate this interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of function. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is merely a logical functional division, and in actual implementation there may be other division manners; units having the same function may be integrated into one unit, multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections via some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is defined by the claims.
Claims (9)
1. A method for text detection of a scene image, comprising:
training and optimizing the full convolution network model;
Detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model;
screening pixels with confidence coefficient larger than a preset confidence coefficient threshold value in the text prediction box as high-confidence coefficient pixels, wherein the confidence coefficient is the probability that the pixels belong to the text prediction box and are output by the full convolution network model;
calculating a minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points, wherein the minimum circumscribed rectangle is a rectangle with the minimum area and containing all the high-confidence pixel points in the text prediction box;
calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle;
When the overlapping degree is larger than a preset overlapping degree threshold value, the width of the text prediction box is adjusted through the minimum circumscribed rectangle;
Cutting the adjusted text prediction box in the scene image to obtain a text image to be identified;
Identifying text information in the text image to be identified;
The calculating the minimum circumscribed rectangle corresponding to the text prediction box according to the high confidence pixel points comprises the following steps:
Determining, among the high-confidence pixel points, the two pixel points that are farthest apart as length calibration pixel points;
Taking a connecting line between the length calibration pixel points as a first direction, and determining two high-confidence pixel points with the farthest distance as width calibration pixel points in a second direction perpendicular to the first direction;
and taking the first line segments, which pass through the length calibration pixel points and are perpendicular to the line connecting them, as one pair of sides, and the second line segments, which pass through the width calibration pixel points and are perpendicular to the line connecting them, as the other pair, so as to enclose the minimum circumscribed rectangle.
2. The method of text detection of a scene image of claim 1, wherein prior to calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle, the method further comprises:
calculating a confidence average value of the high-confidence pixel points in the minimum circumscribed rectangle;
and when the confidence coefficient average value is smaller than a preset screening threshold value, eliminating the minimum circumscribed rectangle.
3. The method for text detection of a scene image as recited in claim 2, wherein said training optimization of the full convolutional network model comprises:
constructing a full convolution network model;
labeling a training label and constructing a training data set;
And training and optimizing the full convolution network model through the training data set and a preset loss function.
4. The method of claim 1, wherein said calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle comprises:
Determining a pixel point which is simultaneously in the text prediction box and the minimum circumscribed rectangle as a first pixel point;
Determining that the pixel points only belonging to the text prediction box or the minimum circumscribed rectangle are second pixel points;
calculating the sum of the numbers of the first pixel points and the second pixel points;
and calculating the ratio between the number of the first pixel points and the sum of the number of the first pixel points and the number of the second pixel points as the overlapping degree.
5. The method of claim 1, wherein when the degree of overlap is greater than a preset degree of overlap threshold, the text prediction box is adjusted by the following formula:
P1 = w*p+(1-w)*d,
wherein P1 is the width of the text prediction frame after adjustment, w is a weight coefficient, P is the width of the text prediction frame, and d is the width of the corresponding minimum circumscribed rectangle.
6. A text detection device for a scene image, comprising:
The training unit is used for training and optimizing the full convolution network model;
A text prediction box detection unit, configured to detect and determine a plurality of text prediction boxes in the scene image through the trained full convolution network model;
The screening unit is used for screening pixels with confidence coefficient larger than a preset confidence coefficient threshold value in the text prediction frame as high-confidence coefficient pixels, wherein the confidence coefficient is the probability that the pixels belong to the text prediction frame and are output by the full convolution network model;
The minimum circumscribed rectangle determining unit is used for calculating a minimum circumscribed rectangle corresponding to the text prediction frame according to the high-confidence pixel points, wherein the minimum circumscribed rectangle is a rectangle with the minimum area and containing all the high-confidence pixel points in the text prediction frame;
The overlapping degree calculating unit is used for calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle;
The adjusting unit is used for adjusting the width of the text prediction frame through the minimum circumscribed rectangle when the overlapping degree is larger than a preset overlapping degree threshold value;
The cutting unit is used for cutting the adjusted text prediction box in the scene image to obtain a text image to be identified;
a text recognition unit for recognizing text information in the text image to be recognized;
The minimum circumscribed rectangle determining unit includes:
Determining, among the high-confidence pixel points, the two pixel points that are farthest apart as length calibration pixel points;
Taking a connecting line between the length calibration pixel points as a first direction, and determining two high-confidence pixel points with the farthest distance as width calibration pixel points in a second direction perpendicular to the first direction;
and taking the first line segments, which pass through the length calibration pixel points and are perpendicular to the line connecting them, as one pair of sides, and the second line segments, which pass through the width calibration pixel points and are perpendicular to the line connecting them, as the other pair, so as to enclose the minimum circumscribed rectangle.
7. The apparatus as recited in claim 6, further comprising:
The confidence coefficient calculating unit is used for calculating a confidence coefficient average value of the high-confidence coefficient pixel points in the minimum circumscribed rectangle;
And the minimum circumscribed rectangle screening unit is used for eliminating the minimum circumscribed rectangle when the confidence coefficient average value is smaller than a preset screening threshold value.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a text detection method of a scene image as claimed in any of claims 1 to 5 when the computer program is executed.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the text detection method of a scene image according to any of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010223195.1A CN111582021B (en) | 2020-03-26 | 2020-03-26 | Text detection method and device in scene image and computer equipment |
PCT/CN2020/131604 WO2021189889A1 (en) | 2020-03-26 | 2020-11-26 | Text detection method and apparatus in scene image, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010223195.1A CN111582021B (en) | 2020-03-26 | 2020-03-26 | Text detection method and device in scene image and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111582021A CN111582021A (en) | 2020-08-25 |
CN111582021B true CN111582021B (en) | 2024-07-05 |
Family
ID=72124246
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010223195.1A Active CN111582021B (en) | 2020-03-26 | 2020-03-26 | Text detection method and device in scene image and computer equipment |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111582021B (en) |
WO (1) | WO2021189889A1 (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582021B (en) * | 2020-03-26 | 2024-07-05 | 平安科技(深圳)有限公司 | Text detection method and device in scene image and computer equipment |
CN111932577B (en) * | 2020-09-16 | 2021-01-08 | 北京易真学思教育科技有限公司 | Text detection method, electronic device and computer readable medium |
CN111931784B (en) * | 2020-09-17 | 2021-01-01 | 深圳壹账通智能科技有限公司 | Bill recognition method, system, computer device and computer-readable storage medium |
CN112329765B (en) * | 2020-10-09 | 2024-05-24 | 中保车服科技服务股份有限公司 | Text detection method and device, storage medium and computer equipment |
CN112232340A (en) * | 2020-10-15 | 2021-01-15 | 马婧 | Method and device for identifying printed information on surface of object |
CN112613561B (en) * | 2020-12-24 | 2022-06-03 | 哈尔滨理工大学 | EAST algorithm optimization method |
CN112819937B (en) * | 2021-04-19 | 2021-07-06 | 清华大学 | Self-adaptive multi-object light field three-dimensional reconstruction method, device and equipment |
CN113298079B (en) * | 2021-06-28 | 2023-10-27 | 北京奇艺世纪科技有限公司 | Image processing method and device, electronic equipment and storage medium |
CN114067237A (en) * | 2021-10-28 | 2022-02-18 | 清华大学 | Video data processing method, device and equipment |
CN114037826A (en) * | 2021-11-16 | 2022-02-11 | 平安普惠企业管理有限公司 | Text recognition method, device, equipment and medium based on multi-scale enhanced features |
CN114495103B (en) * | 2022-01-28 | 2023-04-04 | 北京百度网讯科技有限公司 | Text recognition method and device, electronic equipment and medium |
CN115375987B (en) * | 2022-08-05 | 2023-09-05 | 北京百度网讯科技有限公司 | Data labeling method and device, electronic equipment and storage medium |
CN117649635B (en) * | 2024-01-30 | 2024-06-11 | 湖北经济学院 | Method, system and storage medium for detecting shadow eliminating point of narrow water channel scene |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110796082A (en) * | 2019-10-29 | 2020-02-14 | 上海眼控科技股份有限公司 | Nameplate text detection method and device, computer equipment and storage medium |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108304761A (en) * | 2017-09-25 | 2018-07-20 | 腾讯科技(深圳)有限公司 | Method for text detection, device, storage medium and computer equipment |
CN109886997B (en) * | 2019-01-23 | 2023-07-11 | 平安科技(深圳)有限公司 | Identification frame determining method and device based on target detection and terminal equipment |
CN109886174A (en) * | 2019-02-13 | 2019-06-14 | 东北大学 | A kind of natural scene character recognition method of warehouse shelf Sign Board Text region |
CN109977943B (en) * | 2019-02-14 | 2024-05-07 | 平安科技(深圳)有限公司 | Image target recognition method, system and storage medium based on YOLO |
CN110135424B (en) * | 2019-05-23 | 2021-06-11 | 阳光保险集团股份有限公司 | Inclined text detection model training method and ticket image text detection method |
CN110232713B (en) * | 2019-06-13 | 2022-09-20 | 腾讯数码(天津)有限公司 | Image target positioning correction method and related equipment |
CN110443140B (en) * | 2019-07-05 | 2023-10-03 | 平安科技(深圳)有限公司 | Text positioning method, device, computer equipment and storage medium |
CN110414499B (en) * | 2019-07-26 | 2021-06-04 | 第四范式(北京)技术有限公司 | Text position positioning method and system and model training method and system |
CN110874618B (en) * | 2020-01-19 | 2020-11-27 | 同盾控股有限公司 | OCR template learning method and device based on small sample, electronic equipment and medium |
CN111582021B (en) * | 2020-03-26 | 2024-07-05 | 平安科技(深圳)有限公司 | Text detection method and device in scene image and computer equipment |
2020
- 2020-03-26 CN CN202010223195.1A patent/CN111582021B/en active Active
- 2020-11-26 WO PCT/CN2020/131604 patent/WO2021189889A1/en active Application Filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110796082A (en) * | 2019-10-29 | 2020-02-14 | 上海眼控科技股份有限公司 | Nameplate text detection method and device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111582021A (en) | 2020-08-25 |
WO2021189889A1 (en) | 2021-09-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111582021B (en) | Text detection method and device in scene image and computer equipment | |
CN110751134B (en) | Target detection method, target detection device, storage medium and computer equipment | |
CN110458095B (en) | Effective gesture recognition method, control method and device and electronic equipment | |
CN110084299B (en) | Target detection method and device based on multi-head fusion attention | |
US10783643B1 (en) | Segmentation-based damage detection | |
CN110516541B (en) | Text positioning method and device, computer readable storage medium and computer equipment | |
CN111652181B (en) | Target tracking method and device and electronic equipment | |
CN111860398A (en) | Remote sensing image target detection method and system and terminal equipment | |
US20230137337A1 (en) | Enhanced machine learning model for joint detection and multi person pose estimation | |
CN112800955A (en) | Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid | |
CN111639513A (en) | Ship shielding identification method and device and electronic equipment | |
CN114549369B (en) | Data restoration method and device, computer and readable storage medium | |
CN114038004A (en) | Certificate information extraction method, device, equipment and storage medium | |
CN115631112B (en) | Building contour correction method and device based on deep learning | |
CN111368632A (en) | Signature identification method and device | |
CN112348116A (en) | Target detection method and device using spatial context and computer equipment | |
CN113487610A (en) | Herpes image recognition method and device, computer equipment and storage medium | |
CN113205047A (en) | Drug name identification method and device, computer equipment and storage medium | |
CN112417947A (en) | Method and device for optimizing key point detection model and detecting face key points | |
CN108446602B (en) | Device and method for detecting human face | |
CN116468702A (en) | Chloasma assessment method, device, electronic equipment and computer readable storage medium | |
CN113706705B (en) | Image processing method, device, equipment and storage medium for high-precision map | |
CN114926631A (en) | Target frame generation method and device, nonvolatile storage medium and computer equipment | |
CN113033593B (en) | Text detection training method and device based on deep learning | |
CN114494833A (en) | State identification method and device for port of optical cable cross-connecting cabinet |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code |
Ref country code: HK Ref legal event code: DE Ref document number: 40032042 Country of ref document: HK |
|
GR01 | Patent grant | ||