CN111582021B - Text detection method and device in scene image and computer equipment - Google Patents

Text detection method and device in scene image and computer equipment

Info

Publication number
CN111582021B
Authority
CN
China
Prior art keywords
text
pixel points
text prediction
circumscribed rectangle
confidence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010223195.1A
Other languages
Chinese (zh)
Other versions
CN111582021A (en)
Inventor
高远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010223195.1A priority Critical patent/CN111582021B/en
Publication of CN111582021A publication Critical patent/CN111582021A/en
Priority to PCT/CN2020/131604 priority patent/WO2021189889A1/en
Application granted granted Critical
Publication of CN111582021B publication Critical patent/CN111582021B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/225 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on a marking or identifier characterising the area
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of image processing, and in particular to a text detection method and device for scene images and a computer device. The method comprises the following steps: detecting and determining a plurality of text prediction boxes in the scene image through a trained full convolution network model; screening high-confidence pixel points in the text prediction box; calculating a minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points; calculating the overlapping degree between the text prediction box and the minimum circumscribed rectangle, and when the overlapping degree is larger than a preset overlapping degree threshold, adjusting the width of the text prediction box through the minimum circumscribed rectangle; and cutting the scene image to obtain a text image to be recognized and recognizing the text information therein. On the basis of realizing text detection with the EAST method, the method provided by the embodiment of the invention corrects and adjusts the width of the text prediction box through the high-confidence region, so that the width of the text prediction box is reliably reduced and more accurate text recognition is achieved.

Description

Text detection method and device in scene image and computer equipment
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for detecting text in a scene image, and a computer device.
Background
Word recognition based on computer vision is of great use in the present big data age. It is the basis for many intelligent functions, such as recommendation systems and machine translation. Text detection is the precondition of the text recognition process, and its detection accuracy has a remarkable influence on the text recognition effect.
In a complex natural scene, the text has the characteristics of distribution at various positions, various arrangement forms, inconsistent distribution directions, multi-language mixing and the like, so that the task of text detection is very challenging.
The conventional technology includes a text detection algorithm called CTPN, which realizes text detection in natural scenes based on the idea of dividing complete text into segments, detecting the segments, and then merging them. Detecting text by segmentation and recombination, however, is on the one hand not accurate enough and on the other hand too time-consuming, leading to a poor user experience. On this basis, a text detection method called EAST (An Efficient and Accurate Scene Text detector) has also been proposed. EAST performs feature extraction and learning by means of an FCN (fully convolutional network) architecture and is trained and optimized directly end to end, eliminating unnecessary intermediate steps.
However, in practical application EAST still has many limitations and cannot fully meet real-world requirements. For example, the width of the finally obtained text prediction box often does not coincide with the actual text in the scene. The conventional technology therefore needs further improvement on the basis of the practical application of EAST.
Disclosure of Invention
The invention aims to solve the technical problem that the recognition precision of the existing EAST algorithm cannot meet actual use requirements.
In order to solve the above technical problems, in a first aspect, an embodiment of the present invention provides a method for detecting text in a scene image, including: training and optimizing the full convolution network model;
Detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model; screening pixels with confidence coefficient larger than a preset confidence coefficient threshold value in the text prediction box as high-confidence coefficient pixels, wherein the confidence coefficient is the probability that the pixels belong to the text prediction box and are output by the full convolution network model; calculating a minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points, wherein the minimum circumscribed rectangle is a rectangle with the minimum area and containing all the high-confidence pixel points in the text prediction box; calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle; when the overlapping degree is larger than a preset overlapping degree threshold value, the width of the text prediction box is adjusted through the minimum circumscribed rectangle; cutting the adjusted text prediction box in the scene image to obtain a text image to be identified; and identifying the characters in the text image to be identified.
Optionally, before calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle, the method further comprises:
calculating a confidence average value of the high-confidence pixel points in the minimum circumscribed rectangle;
and when the confidence coefficient average value is smaller than a preset screening threshold value, eliminating the minimum circumscribed rectangle.
Optionally, the training optimization on the full convolution network model includes: constructing a full convolution network model; labeling a training label and constructing a training data set; and training and optimizing the full convolution network model through the training data set and a preset loss function.
Optionally, the calculating the overlapping degree between the text prediction box and the corresponding minimum bounding rectangle includes:
Determining a pixel point which is simultaneously in the text prediction box and the minimum circumscribed rectangle as a first pixel point; determining that the pixel points only belonging to the text prediction box or the minimum circumscribed rectangle are second pixel points; calculating the sum of the numbers of the first pixel points and the second pixel points; and calculating the ratio between the number of the first pixel points and the sum of the number of the first pixel points and the number of the second pixel points as the overlapping degree.
Optionally, when the overlapping degree is greater than a preset overlapping degree threshold, the text prediction box is adjusted by the following formula:
P1=w*p+(1-w)*d,
wherein P1 is the width of the text prediction box after adjustment, w is a weight coefficient, p is the width of the text prediction box before adjustment, and d is the width of the corresponding minimum circumscribed rectangle.
Optionally, the calculating, according to the high confidence pixel point, a minimum bounding rectangle corresponding to the text prediction box includes:
Determining two pixel points with the highest confidence coefficient in the pixel points with the high confidence coefficient as length calibration pixel points;
Taking a connecting line between the length calibration pixel points as a first direction, and determining two high-confidence pixel points with the farthest distance as width calibration pixel points in a second direction perpendicular to the first direction;
and the first line segment passing through the length calibration pixel points and perpendicular to the connecting line between the length calibration pixel points is taken as a long line, and the second line segment passing through the width calibration pixel points and perpendicular to the connecting line between the width calibration pixel points is taken as a wide line at the same time, so that the minimum circumscribed rectangle is formed.
In a second aspect, an embodiment of the present invention provides a text detection apparatus for a scene image, including:
The training unit is used for training and optimizing the full convolution network model; the text prediction box detection unit is used for detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model; the screening unit is used for screening pixels with confidence coefficient larger than a preset confidence coefficient threshold value in the text prediction frame as high-confidence coefficient pixels, wherein the confidence coefficient is the probability that the pixels belong to the text prediction frame and are output by the full convolution network model; the minimum circumscribed rectangle determining unit is used for calculating a minimum circumscribed rectangle corresponding to the text prediction frame according to the high-confidence pixel points, wherein the minimum circumscribed rectangle is a rectangle with the minimum area and containing all the high-confidence pixel points in the text prediction frame; the overlapping degree calculating unit is used for calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle; the adjusting unit is used for adjusting the width of the text prediction frame through the minimum circumscribed rectangle when the overlapping degree is larger than a preset overlapping degree threshold value; the cutting unit is used for cutting the adjusted text prediction box in the scene image to obtain a text image to be identified; and the text recognition unit is used for recognizing the text information in the text image to be recognized.
Optionally, the apparatus further comprises: a confidence calculating unit, used for calculating a confidence average value of the high-confidence pixel points in the minimum circumscribed rectangle; and a minimum circumscribed rectangle screening unit, used for eliminating the minimum circumscribed rectangle when the confidence average value is smaller than a preset screening threshold value.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the text detection method of the scene image when executing the computer program.
In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the above-described text detection method of a scene image.
According to the text detection method provided by the embodiment of the invention, on the basis of realizing text detection by using an EAST method, the width of the text prediction box is corrected and adjusted through the high-confidence region, so that the width of the text prediction box is reliably reduced, and more accurate text recognition is realized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
Fig. 2 is a flow chart of a text detection method of a scene image according to an embodiment of the present invention;
FIG. 3 is a flow chart of step 20 in FIG. 2;
FIG. 4 is a schematic flow chart of screening minimum bounding rectangles according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a text detection device for a scene image according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a text detection device for a scene image according to another embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The embodiment of the invention firstly provides a text detection method of a scene image, and the text detection method of the scene image can be applied to adjust the width of a text detection box through a region with high confidence on the basis of realizing text detection by using an EAST method, so as to realize more accurate text recognition.
Referring to fig. 1, fig. 1 is a schematic diagram of a computer device 100 according to an embodiment of the invention. The computer device 100 may be a computer, a cluster of computers, a mainframe computer, a computing device dedicated to providing online content, or a computer network comprising a group of computers operating in a centralized or distributed manner.
As shown in fig. 1, the computer device 100 includes: a processor 102, a memory, and a network interface 105 connected by a system bus 101; the memory may include a nonvolatile storage medium 103 and an internal memory 104.
In an embodiment of the present invention, the processor 102 may be a Central Processing Unit (CPU); the processor 102 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or any conventional processor. The number of processors 102 may be one or more, and the one or more processors 102 may execute sequences of computer program instructions to perform the text detection method for scene images, as will be described in more detail below.
Computer program instructions are stored in the nonvolatile storage medium 103 and are accessed and read from it for execution by the processor 102, so as to implement the text detection method disclosed in the embodiments of the present invention described below. For example, the nonvolatile storage medium 103 stores a software application that performs this method. Further, the nonvolatile storage medium 103 may store the entire software application or only the portion of the software application that is executable by the processor 102. It should be noted that although only one block is shown in fig. 1, the nonvolatile storage medium 103 may include a plurality of physical devices installed on a central computing device or on different computing devices.
The network interface 105 is used for network communication, such as providing for transmission of data information, etc. It will be appreciated by those skilled in the art that the structure shown in FIG. 1 is merely a block diagram of some of the structures associated with the present inventive arrangements and does not constitute a limitation of the computer device 100 to which the present inventive arrangements are applied, and that a particular computer device 100 may include more or less components than those shown, or may combine some of the components, or have a different arrangement of components.
The embodiment of the invention also provides a computer readable storage medium. The computer readable storage medium may be a non-volatile computer readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program when executed by a processor implements the text detection method of a scene image disclosed in the embodiment of the invention. The computer program product is embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer program code.
In the case of implementing the computer device 100 in software, fig. 2 shows a flow chart of the text detection method for scene images according to an embodiment, and the method in fig. 2 is described in detail below. Referring to fig. 2, the method includes the following steps:
Step 20: training and optimizing the full convolution network model.
The full convolution network model is one type of neural network model. Before use, offline training with training data is required to determine the transfer weight parameters among neurons.
In some embodiments, as shown in fig. 3, the step 20 specifically includes the following steps:
Step 200: constructing a full convolution network model.
Feature extraction is carried out on the image data of the input scene picture through the full convolution network model, finally generating a single-channel, pixel-level text score feature map and a multi-channel geometric figure feature map. Specifically, the network structure of the full convolution network model can be broken down into three parts: the feature extraction layer, the feature merging layer, and the output layer.
First, the feature extraction layer adopts a general convolutional network as the base network. During training, the parameters of the convolutional network are initialized and features are extracted; after training is completed, the optimized convolutional network parameters are obtained. In practical application, base networks such as PVANet (Performance Vs Accuracy network) or VGG16 (Visual Geometry Group) can be selected according to actual requirements. In the embodiment of the invention, the convolutional network extracts four levels of feature maps, whose sizes are successively 1/32, 1/16, 1/8 and 1/4 of the input image data. Locating large text requires a large receptive field, while locating small text regions correspondingly requires a small receptive field. Using feature maps of different levels therefore meets the requirement that text areas in natural scenes vary greatly in size.
Second, the four levels of feature maps are merged layer by layer following the U-shape idea, which reduces the later computation cost. The layer-by-layer merging can be represented by the following formulas, where f_i denotes the i-th level feature map extracted by the base network (f_1 being the smallest, 1/32-scale map), h_i the merged feature map, and g_i the merge base:

g_i = unpool(h_i), if i ≤ 3; g_i = conv_3×3(h_i), if i = 4

h_i = f_i, if i = 1; h_i = conv_3×3(conv_1×1([g_(i−1); f_i])), otherwise

The specific process is as follows: in each merging stage, the feature map from the previous stage is first fed to an unpooling layer (unpool layer) to double its size, and is then concatenated with the current-level feature map. Next, a conv 1×1 layer reduces the number of channels and the amount of computation, and a conv 3×3 layer fuses the local information to produce the output of the merging stage. After the last merging stage (i.e. i = 4), a conv 3×3 layer generates the final feature map of the merging branch and sends it to the output layer.
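As an illustrative sketch only (not part of the original disclosure), one merging stage can be expressed in PyTorch as follows; the channel widths 128, 64 and 32 and the bilinear unpooling are assumptions chosen for the example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MergeStage(nn.Module):
    """One merging stage: unpool the previous map, concatenate, conv1x1 + conv3x3."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=1)             # reduce channels
        self.conv3 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1) # fuse local information

    def forward(self, g_prev, f_cur):
        g = F.interpolate(g_prev, scale_factor=2, mode="bilinear", align_corners=False)  # "unpool": double size
        return self.conv3(self.conv1(torch.cat([g, f_cur], dim=1)))                     # merge with current level

# Usage sketch: f32 ... f4 stand in for the 1/32, 1/16, 1/8 and 1/4 feature maps
# (channel counts are illustrative, e.g. for a 256x256 input image).
f32, f16 = torch.randn(1, 384, 8, 8), torch.randn(1, 192, 16, 16)
f8, f4 = torch.randn(1, 96, 32, 32), torch.randn(1, 64, 64, 64)
m2, m3, m4 = MergeStage(384 + 192, 128), MergeStage(128 + 96, 64), MergeStage(64 + 64, 32)
final_conv = nn.Conv2d(32, 32, kernel_size=3, padding=1)    # the closing conv3x3 of the merging branch
merged = final_conv(m4(m3(m2(f32, f16), f8), f4))           # 1/4-resolution map sent to the output layer
```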
Finally, the output layer produces the text score feature map and the geometric figure feature map, each 1/4 the size of the original image; the text score feature map has 1 channel and the geometric figure feature map has 5 channels. The text score feature map indicates the confidence that each pixel belongs to a text prediction box.
And 202, labeling a training label and constructing a training data set.
The labeling of the training labels can be accomplished in any suitable manner in the prior art, and the training labels are used as a training data set to train the full convolution network model. In some cases, the training or testing may also be performed directly using an existing training data set.
And 204, training and optimizing the full convolution network model through the training data set and a preset loss function.
Training optimization is a learning optimization process for parameters of the full convolution network model. When the parameter optimization is completed, the fully-convolution network model with the completed training can be applied to the text detection of the actual scene.
In addition to the labeled training data, the optimization process also requires a suitable loss function for evaluating the effect of the full convolution network model; parameter optimization is achieved by minimizing the loss.
In the present application, the loss function can be expressed by the following expression:
L = Ls + λg·Lg
where L is the total loss, Ls is the loss of the text score feature map, Lg is the loss of the geometric figure feature map, and λg weights the importance between the two losses and can be set to 1.
Specifically, the loss of the text score feature map may be computed with class-balanced cross entropy, and the loss of the geometric figure feature map may be computed with an Intersection over Union (IoU) loss function.
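For illustration, a minimal sketch of this combined loss in PyTorch follows. The tensor layout (a 4-channel distance geometry, with the angle term of the 5-channel map omitted) and the exact balancing scheme are assumptions modelled on the EAST formulation, not values fixed by this disclosure:

```python
import torch

def balanced_bce(score_pred, score_gt, eps=1e-6):
    # Class-balanced cross entropy: weight positives by the fraction of
    # negatives so sparse text pixels are not drowned out by background.
    beta = 1.0 - score_gt.mean()
    return -(beta * score_gt * torch.log(score_pred + eps)
             + (1.0 - beta) * (1.0 - score_gt) * torch.log(1.0 - score_pred + eps)).mean()

def iou_loss(d_pred, d_gt, eps=1e-6):
    # d_*[:, 0..3] hold each pixel's distances to the top/bottom/left/right
    # box edges; -log(IoU) of the two rectangles is the geometry loss.
    area_pred = (d_pred[:, 0] + d_pred[:, 1]) * (d_pred[:, 2] + d_pred[:, 3])
    area_gt = (d_gt[:, 0] + d_gt[:, 1]) * (d_gt[:, 2] + d_gt[:, 3])
    h_i = torch.min(d_pred[:, 0], d_gt[:, 0]) + torch.min(d_pred[:, 1], d_gt[:, 1])
    w_i = torch.min(d_pred[:, 2], d_gt[:, 2]) + torch.min(d_pred[:, 3], d_gt[:, 3])
    inter = w_i * h_i
    return -torch.log((inter + eps) / (area_pred + area_gt - inter + eps)).mean()

def total_loss(score_pred, score_gt, geo_pred, geo_gt, lambda_g=1.0):
    return balanced_bce(score_pred, score_gt) + lambda_g * iou_loss(geo_pred, geo_gt)  # L = Ls + λg·Lg
```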
Step 22: detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model.
The trained full convolution network model determines the text prediction boxes in the scene image to be detected, i.e., the regions of the scene image that contain text.
As described above, the output layer of the full convolutional network model may include a text score feature map and a geometric feature map. The text score feature map records the probability that each pixel belongs to a text prediction box when the pixel is mapped to an image to be detected. The geometric figure feature map records the distance between each pixel and the text prediction box when the pixel is mapped to the image to be detected.
The full convolutional network model typically outputs a large number of candidate text prediction boxes. Therefore, in a preferred embodiment, a non-maximum suppression algorithm may also be applied to eliminate redundant candidates and determine the position of the best text prediction box, which serves as the text prediction box in the embodiment of the present invention.
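A minimal sketch of standard non-maximum suppression over axis-aligned candidate boxes is given below for illustration (EAST itself uses a locality-aware NMS variant, which this simplification does not reproduce; the (x1, y1, x2, y2) box layout is an assumption):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Keep the highest-scoring boxes, dropping candidates that overlap them too much."""
    order = scores.argsort()[::-1]                 # indices from best to worst score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the best remaining box with all others still in play.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]       # discard redundant candidates
    return keep
```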
In this embodiment, a scene picture is a picture taken in a real scene, for example one captured by framing through any suitable camera-equipped terminal.
Step 24: screening pixels whose confidence is greater than a preset confidence threshold in the text prediction box as high-confidence pixel points.
The confidence is the probability, output by the full convolution network model, that a pixel point belongs to the text prediction box. That is, the text score feature map records the confidence of each pixel point and thus reflects how likely text prediction boxes are to exist at different positions. In this step, a proper screening method selects the pixels with higher confidence, which can be used to further adjust and optimize the text prediction box.
Specifically, high-confidence pixel points can be screened in the text score feature map by setting a proper confidence threshold. For example, with the confidence threshold set to 0.7, each pixel point in the text score feature map is checked in turn: if its score is greater than the confidence threshold, the pixel point is determined to be a high-confidence pixel point; if not, the pixel point is discarded.
In one image to be detected, there may be a plurality of different text prediction boxes. Thus, these high confidence pixels may belong to different text boxes in the scene. Accordingly, to avoid errors in adjustment or correction, high confidence pixels need to be marked and distinguished. Specifically, according to the position of the pixel point, which text prediction box the pixel point specifically belongs to can be determined, so that the pixel point with high confidence coefficient is respectively classified into the corresponding text prediction boxes.
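For illustration, the screening and grouping of this step could be sketched as follows (the 0.7 threshold comes from the example above; representing the text prediction boxes as axis-aligned (x1, y1, x2, y2) rectangles is a simplifying assumption):

```python
import numpy as np

def high_confidence_pixels(score_map, boxes, conf_thresh=0.7):
    """Threshold the text score feature map and assign each surviving pixel to its box."""
    ys, xs = np.where(score_map > conf_thresh)            # keep only high-confidence pixels
    grouped = {i: [] for i in range(len(boxes))}
    for x, y in zip(xs, ys):
        for i, (x1, y1, x2, y2) in enumerate(boxes):      # attribute the pixel by its position
            if x1 <= x <= x2 and y1 <= y <= y2:
                grouped[i].append((x, y, float(score_map[y, x])))
                break
    return grouped                                        # {box index: [(x, y, confidence), ...]}
```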
Step 26: calculating the minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points.
The minimum bounding rectangle (MBR), expressed in two-dimensional coordinates, delimits the maximum extent of the high-confidence pixel points in the same text prediction box. It is the rectangular region spanned by the high-confidence pixel points of that text prediction box: it contains all of the box's high-confidence pixel points and has the smallest area.
In particular, any suitable algorithm may be used to computationally determine the minimum bounding rectangle for each text prediction box.
In some embodiments, the method specifically includes the following steps:
First, the two high-confidence pixel points farthest apart are determined from the high-confidence pixel points as length calibration pixel points.
Then, taking the connecting line between the length calibration pixel points as a first direction, the two high-confidence pixel points farthest apart in a second direction perpendicular to the first direction are determined as width calibration pixel points.
Finally, the first line segments passing through the length calibration pixel points and perpendicular to the connecting line between them, together with the second line segments passing through the width calibration pixel points and perpendicular to the connecting line between them, enclose the minimum bounding rectangle.
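The construction just described could be sketched as follows; the use of minimum/maximum extents along the two directions is an illustrative simplification, and cv2.minAreaRect is a common off-the-shelf alternative for the same purpose:

```python
import numpy as np

def min_bounding_rect(points):
    """Rectangle spanned by the farthest pixel pair and the perpendicular extremes."""
    pts = np.asarray(points, dtype=float)                   # (N, 2) pixel coordinates
    # Length calibration: the farthest pair of points fixes the first direction.
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    i, j = np.unravel_index(d2.argmax(), d2.shape)
    u = pts[j] - pts[i]
    u = u / np.linalg.norm(u)
    v = np.array([-u[1], u[0]])                             # second, perpendicular direction
    # Width calibration: extreme projections along the perpendicular direction.
    s, t = pts @ u, pts @ v
    # Corners from the min/max extents in the (u, v) frame, back in image coordinates.
    return np.array([a * u + b * v for a in (s.min(), s.max()) for b in (t.min(), t.max())])
```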
Step 28: calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle.
The degree of overlap (IoU), which may also be referred to as the overlap ratio, characterizes the degree of coincidence between a text prediction box and the corresponding minimum bounding rectangle. It is calculated as the ratio of the area of the intersection of the two boxes to the area of their union. The higher the overlap, the better the two boxes match.
In some embodiments, the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle may be calculated specifically by:
First, pixel points lying in both the text prediction box and the minimum bounding rectangle are determined to be first pixel points, and pixel points belonging to only one of the two (the text prediction box or the minimum bounding rectangle) are determined to be second pixel points;
then, the sum of the numbers of the first pixel points and the second pixel points is calculated.
And finally, calculating the ratio between the number of the first pixel points and the sum of the number of the first pixel points and the number of the second pixel points as the overlapping degree.
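This pixel-counting computation follows directly from the definitions of the first and second pixel points; a sketch with boolean region masks (an assumed input representation) is:

```python
import numpy as np

def pixel_overlap(mask_box, mask_rect):
    """Overlap = first pixels / (first pixels + second pixels), i.e. pixel-wise IoU."""
    first = np.logical_and(mask_box, mask_rect).sum()    # pixels in both regions
    second = np.logical_xor(mask_box, mask_rect).sum()   # pixels in exactly one region
    return first / (first + second)
```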
Step 30: when the overlapping degree is larger than a preset overlapping degree threshold, adjusting the width of the text prediction box through the minimum circumscribed rectangle.
The overlap threshold is an empirical value that can be set by the skilled person as required by the situation. Typically, the width of the smallest bounding rectangle is smaller than the width of the text prediction box, which indicates that the region within the smallest bounding rectangle has a greater likelihood of belonging to the text region. Therefore, the text prediction box can be properly adjusted through the minimum circumscribed rectangle, so that the width of the text prediction box is correspondingly reduced.
Specifically, when the overlapping degree is greater than a preset overlapping degree threshold value, the text prediction box is adjusted through the following formula:
P1=w*p+(1-w)*d,
wherein P1 is the width of the text prediction box after adjustment, w is a weight coefficient, p is the width of the text prediction box before adjustment, and d is the width of the corresponding minimum circumscribed rectangle.
Through the above formula, after the proper w value is given, the width of the text prediction frame can be corrected and adjusted according to the smaller effective minimum circumscribed rectangle, so that the width of the text prediction frame can be reliably reduced, and more accurate text recognition is realized.
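As a worked example (w = 0.5 is an assumed weight, not a value fixed by this disclosure): a text prediction box of width 100 and a minimum circumscribed rectangle of width 80 give an adjusted width of 0.5×100 + 0.5×80 = 90:

```python
def adjust_width(p, d, w=0.5):
    """P1 = w*p + (1-w)*d: pull the box width toward the minimum bounding rectangle."""
    return w * p + (1 - w) * d

assert adjust_width(100, 80) == 90.0
```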
Step 32: cutting the adjusted text prediction box in the scene image to obtain the text image to be recognized.
The adjusted text prediction boxes indicate the regions of the scene image that contain text. These regions can therefore be cut out of the scene image as the text images to be recognized.
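Cutting an adjusted box out of the scene image can be sketched as simple array slicing (assuming an axis-aligned box; a rotated box would instead need an affine or perspective warp):

```python
def crop_box(image, box):
    """Cut an (x1, y1, x2, y2) region out of an H x W (x C) image array."""
    x1, y1, x2, y2 = (int(round(c)) for c in box)
    return image[y1:y2, x1:x2]
```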
Step 34: recognizing text information in the text image to be recognized.
Specifically, any suitable algorithm or model can be selected to recognize and acquire the text information in the text image, yielding the final text detection result for the scene image. Such recognition methods are well known to those skilled in the art and are not described in detail here.
By applying the text detection method provided by the embodiment of the invention, the width of the text prediction box can be reliably reduced, more accurate text recognition is realized, the difficulty of subsequent processing is reduced, and the text detection accuracy is improved.
Since the minimum bounding rectangle is the reference for the final width adjustment of the text detection box, it is necessary to ensure that the minimum bounding rectangle is reliable; otherwise the subsequent adjustment may instead have adverse consequences.
In some embodiments, prior to performing step 28, the method may further include the step of screening the minimum bounding rectangle as shown in fig. 4:
Step 401: calculating the confidence average value of the high-confidence pixel points in the minimum circumscribed rectangle.
The confidence average is the mean confidence of the high-confidence pixel points within the rectangle; it represents the probability that the minimum bounding rectangle as a whole belongs to a text region.
Step 402: and judging whether the confidence average value is smaller than a preset screening threshold value. If yes, go to step 403. If not, go to step 404.
Step 403: and eliminating the minimum circumscribed rectangle.
It will be appreciated that minimum bounding rectangles with a low confidence average do not actually have a high reliability or probability of being text and are not suitable as correction references. They are therefore eliminated, and the width correction of the text prediction box is performed without them.
Step 404: and reserving the minimum bounding rectangle as an effective minimum bounding rectangle. These effective minimum bounding rectangles can be used for further processing as references to adjust the text detection box.
An embodiment of the present invention further provides a text detection device corresponding to the text detection method of a scene image in the foregoing embodiment, referring to fig. 5, fig. 5 provides a block diagram of the text detection device of a scene image provided in the embodiment of the present invention, and as shown in fig. 5, the text detection device 500 includes: training unit 50, text prediction box detection unit 52, screening unit 54, minimum bounding rectangle determination unit 56, overlap calculation unit 58, adjustment unit 60, cutting unit 62, and text recognition unit 64.
The training unit 50 is used for training and optimizing the full convolution network model.
The text prediction box detection unit 52 is configured to detect and determine a plurality of text prediction boxes in the scene image through the trained full convolution network model. The screening unit 54 is configured to screen pixels whose confidence is greater than a preset confidence threshold in the text prediction box as high-confidence pixel points, where the confidence is the probability, output by the full convolution network model, that a pixel belongs to the text prediction box. The minimum bounding rectangle determining unit 56 is configured to calculate the minimum bounding rectangle corresponding to the text prediction box according to the high-confidence pixel points, where the minimum bounding rectangle is the rectangle of smallest area containing all the high-confidence pixel points in the text prediction box. The overlap calculating unit 58 is configured to calculate the overlap between the text prediction box and the corresponding minimum bounding rectangle. The adjusting unit 60 is configured to adjust the width of the text prediction box through the minimum bounding rectangle when the overlap is greater than a preset overlap threshold. The cutting unit 62 is configured to cut the adjusted text prediction box in the scene image to obtain the text image to be recognized. The text recognition unit 64 is configured to recognize the text information in the text image to be recognized.
According to the text detection device for the scene image, provided by the embodiment of the invention, on the basis of realizing text detection by using an EAST method, the width of the text prediction box is corrected and adjusted through the high-confidence region, so that the width of the text prediction box is reliably reduced, and more accurate text recognition is realized.
In some embodiments, as shown in fig. 6, the text detection apparatus 500 may further include, in addition to the functional modules shown in fig. 5: confidence level calculation unit 66 and minimum bounding rectangle screening unit 68.
The confidence calculating unit 66 is configured to calculate a confidence average value of the high-confidence pixel points in the minimum bounding rectangle. The minimum bounding rectangle filtering unit 68 is configured to reject the minimum bounding rectangle when the confidence average value is less than a preset filtering threshold.
The minimum bounding rectangle (MBR), expressed in two-dimensional coordinates, is the smallest rectangle covering the high-confidence pixel points of the same text prediction box. It may be determined or calculated by any suitable means; computing the minimum bounding rectangle of a given set of pixel points is well known to those skilled in the art and is not detailed here.
By applying the text detection device for the scene image, which is provided by the embodiment of the invention, the width of the text prediction box can be reliably reduced, and more accurate text recognition can be realized.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working procedures of the apparatus, device and units described above may refer to the corresponding procedures in the foregoing method embodiments, which are not repeated here.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two; the elements and steps of the examples have been described above generally in terms of function to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the units is merely a logical function division, there may be another division manner in actual implementation, or units having the same function may be integrated into one unit, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (9)

1. A method for text detection of a scene image, comprising:
training and optimizing the full convolution network model;
Detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model;
screening pixels with confidence coefficient larger than a preset confidence coefficient threshold value in the text prediction box as high-confidence coefficient pixels, wherein the confidence coefficient is the probability that the pixels belong to the text prediction box and are output by the full convolution network model;
calculating a minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points, wherein the minimum circumscribed rectangle is a rectangle with the minimum area and containing all the high-confidence pixel points in the text prediction box;
calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle;
When the overlapping degree is larger than a preset overlapping degree threshold value, the width of the text prediction box is adjusted through the minimum circumscribed rectangle;
Cutting the adjusted text prediction box in the scene image to obtain a text image to be identified;
Identifying text information in the text image to be identified;
The calculating the minimum circumscribed rectangle corresponding to the text prediction box according to the high confidence pixel points comprises the following steps:
Determining two pixel points with the highest confidence coefficient in the pixel points with the high confidence coefficient as length calibration pixel points;
Taking a connecting line between the length calibration pixel points as a first direction, and determining two high-confidence pixel points with the farthest distance as width calibration pixel points in a second direction perpendicular to the first direction;
and the first line segment passing through the length calibration pixel points and perpendicular to the connecting line between the length calibration pixel points is taken as a long line, and the second line segment passing through the width calibration pixel points and perpendicular to the connecting line between the width calibration pixel points is taken as a wide line at the same time, so that the minimum circumscribed rectangle is formed.
2. The method of text detection of a scene image of claim 1, wherein prior to calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle, the method further comprises:
calculating a confidence average value of the high-confidence pixel points in the minimum circumscribed rectangle;
and when the confidence coefficient average value is smaller than a preset screening threshold value, eliminating the minimum circumscribed rectangle.
3. The method for text detection of a scene image as recited in claim 2, wherein said training optimization of the full convolutional network model comprises:
constructing a full convolution network model;
labeling a training label and constructing a training data set;
And training and optimizing the full convolution network model through the training data set and a preset loss function.
4. The method of claim 1, wherein said calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle comprises:
Determining a pixel point which is simultaneously in the text prediction box and the minimum circumscribed rectangle as a first pixel point;
Determining that the pixel points only belonging to the text prediction box or the minimum circumscribed rectangle are second pixel points;
calculating the sum of the numbers of the first pixel points and the second pixel points;
and calculating the ratio between the number of the first pixel points and the sum of the number of the first pixel points and the number of the second pixel points as the overlapping degree.
5. The method of claim 1, wherein when the degree of overlap is greater than a preset degree of overlap threshold, the text prediction box is adjusted by the following formula:
P1 = w*p+(1-w)*d,
wherein P1 is the width of the text prediction box after adjustment, w is a weight coefficient, p is the width of the text prediction box before adjustment, and d is the width of the corresponding minimum circumscribed rectangle.
6. A text detection device for a scene image, comprising:
The training unit is used for training and optimizing the full convolution network model;
A text prediction box detection unit, configured to detect and determine a plurality of text prediction boxes in the scene image through the trained full convolution network model;
The screening unit is used for screening pixels with confidence coefficient larger than a preset confidence coefficient threshold value in the text prediction frame as high-confidence coefficient pixels, wherein the confidence coefficient is the probability that the pixels belong to the text prediction frame and are output by the full convolution network model;
The minimum circumscribed rectangle determining unit is used for calculating a minimum circumscribed rectangle corresponding to the text prediction frame according to the high-confidence pixel points, wherein the minimum circumscribed rectangle is a rectangle with the minimum area and containing all the high-confidence pixel points in the text prediction frame;
The overlapping degree calculating unit is used for calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle;
The adjusting unit is used for adjusting the width of the text prediction frame through the minimum circumscribed rectangle when the overlapping degree is larger than a preset overlapping degree threshold value;
The cutting unit is used for cutting the adjusted text prediction box in the scene image to obtain a text image to be identified;
a text recognition unit for recognizing text information in the text image to be recognized;
The minimum circumscribed rectangle determining unit includes:
Determining two pixel points with the highest confidence coefficient in the pixel points with the high confidence coefficient as length calibration pixel points;
Taking a connecting line between the length calibration pixel points as a first direction, and determining two high-confidence pixel points with the farthest distance as width calibration pixel points in a second direction perpendicular to the first direction;
and the first line segment passing through the length calibration pixel points and perpendicular to the connecting line between the length calibration pixel points is taken as a long line, and the second line segment passing through the width calibration pixel points and perpendicular to the connecting line between the width calibration pixel points is taken as a wide line at the same time, so that the minimum circumscribed rectangle is formed.
7. The apparatus as recited in claim 6, further comprising:
The confidence coefficient calculating unit is used for calculating a confidence coefficient average value of the high-confidence coefficient pixel points in the minimum circumscribed rectangle;
And the minimum circumscribed rectangle screening unit is used for eliminating the minimum circumscribed rectangle when the confidence coefficient average value is smaller than a preset screening threshold value.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a text detection method of a scene image as claimed in any of claims 1 to 5 when the computer program is executed.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the text detection method of a scene image according to any of claims 1 to 5.
CN202010223195.1A 2020-03-26 2020-03-26 Text detection method and device in scene image and computer equipment Active CN111582021B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010223195.1A CN111582021B (en) 2020-03-26 2020-03-26 Text detection method and device in scene image and computer equipment
PCT/CN2020/131604 WO2021189889A1 (en) 2020-03-26 2020-11-26 Text detection method and apparatus in scene image, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010223195.1A CN111582021B (en) 2020-03-26 2020-03-26 Text detection method and device in scene image and computer equipment

Publications (2)

Publication Number Publication Date
CN111582021A CN111582021A (en) 2020-08-25
CN111582021B true CN111582021B (en) 2024-07-05

Family

ID=72124246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010223195.1A Active CN111582021B (en) 2020-03-26 2020-03-26 Text detection method and device in scene image and computer equipment

Country Status (2)

Country Link
CN (1) CN111582021B (en)
WO (1) WO2021189889A1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582021B (en) * 2020-03-26 2024-07-05 平安科技(深圳)有限公司 Text detection method and device in scene image and computer equipment
CN111932577B (en) * 2020-09-16 2021-01-08 北京易真学思教育科技有限公司 Text detection method, electronic device and computer readable medium
CN111931784B (en) * 2020-09-17 2021-01-01 深圳壹账通智能科技有限公司 Bill recognition method, system, computer device and computer-readable storage medium
CN112329765B (en) * 2020-10-09 2024-05-24 中保车服科技服务股份有限公司 Text detection method and device, storage medium and computer equipment
CN112232340A (en) * 2020-10-15 2021-01-15 马婧 Method and device for identifying printed information on surface of object
CN112613561B (en) * 2020-12-24 2022-06-03 哈尔滨理工大学 EAST algorithm optimization method
CN112819937B (en) * 2021-04-19 2021-07-06 清华大学 Self-adaptive multi-object light field three-dimensional reconstruction method, device and equipment
CN113298079B (en) * 2021-06-28 2023-10-27 北京奇艺世纪科技有限公司 Image processing method and device, electronic equipment and storage medium
CN114067237A (en) * 2021-10-28 2022-02-18 清华大学 Video data processing method, device and equipment
CN114037826A (en) * 2021-11-16 2022-02-11 平安普惠企业管理有限公司 Text recognition method, device, equipment and medium based on multi-scale enhanced features
CN114495103B (en) * 2022-01-28 2023-04-04 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and medium
CN115375987B (en) * 2022-08-05 2023-09-05 北京百度网讯科技有限公司 Data labeling method and device, electronic equipment and storage medium
CN117649635B (en) * 2024-01-30 2024-06-11 湖北经济学院 Method, system and storage medium for detecting shadow eliminating point of narrow water channel scene

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796082A (en) * 2019-10-29 2020-02-14 上海眼控科技股份有限公司 Nameplate text detection method and device, computer equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304761A (en) * 2017-09-25 2018-07-20 腾讯科技(深圳)有限公司 Method for text detection, device, storage medium and computer equipment
CN109886997B (en) * 2019-01-23 2023-07-11 平安科技(深圳)有限公司 Identification frame determining method and device based on target detection and terminal equipment
CN109886174A (en) * 2019-02-13 2019-06-14 东北大学 A kind of natural scene character recognition method of warehouse shelf Sign Board Text region
CN109977943B (en) * 2019-02-14 2024-05-07 平安科技(深圳)有限公司 Image target recognition method, system and storage medium based on YOLO
CN110135424B (en) * 2019-05-23 2021-06-11 阳光保险集团股份有限公司 Inclined text detection model training method and ticket image text detection method
CN110232713B (en) * 2019-06-13 2022-09-20 腾讯数码(天津)有限公司 Image target positioning correction method and related equipment
CN110443140B (en) * 2019-07-05 2023-10-03 平安科技(深圳)有限公司 Text positioning method, device, computer equipment and storage medium
CN110414499B (en) * 2019-07-26 2021-06-04 第四范式(北京)技术有限公司 Text position positioning method and system and model training method and system
CN110874618B (en) * 2020-01-19 2020-11-27 同盾控股有限公司 OCR template learning method and device based on small sample, electronic equipment and medium
CN111582021B (en) * 2020-03-26 2024-07-05 平安科技(深圳)有限公司 Text detection method and device in scene image and computer equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796082A (en) * 2019-10-29 2020-02-14 上海眼控科技股份有限公司 Nameplate text detection method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111582021A (en) 2020-08-25
WO2021189889A1 (en) 2021-09-30

Similar Documents

Publication Publication Date Title
CN111582021B (en) Text detection method and device in scene image and computer equipment
CN110751134B (en) Target detection method, target detection device, storage medium and computer equipment
CN110458095B (en) Effective gesture recognition method, control method and device and electronic equipment
CN110084299B (en) Target detection method and device based on multi-head fusion attention
US10783643B1 (en) Segmentation-based damage detection
CN110516541B (en) Text positioning method and device, computer readable storage medium and computer equipment
CN111652181B (en) Target tracking method and device and electronic equipment
CN111860398A (en) Remote sensing image target detection method and system and terminal equipment
US20230137337A1 (en) Enhanced machine learning model for joint detection and multi person pose estimation
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
CN111639513A (en) Ship shielding identification method and device and electronic equipment
CN114549369B (en) Data restoration method and device, computer and readable storage medium
CN114038004A (en) Certificate information extraction method, device, equipment and storage medium
CN115631112B (en) Building contour correction method and device based on deep learning
CN111368632A (en) Signature identification method and device
CN112348116A (en) Target detection method and device using spatial context and computer equipment
CN113487610A (en) Herpes image recognition method and device, computer equipment and storage medium
CN113205047A (en) Drug name identification method and device, computer equipment and storage medium
CN112417947A (en) Method and device for optimizing key point detection model and detecting face key points
CN108446602B (en) Device and method for detecting human face
CN116468702A (en) Chloasma assessment method, device, electronic equipment and computer readable storage medium
CN113706705B (en) Image processing method, device, equipment and storage medium for high-precision map
CN114926631A (en) Target frame generation method and device, nonvolatile storage medium and computer equipment
CN113033593B (en) Text detection training method and device based on deep learning
CN114494833A (en) State identification method and device for port of optical cable cross-connecting cabinet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40032042; Country of ref document: HK)

GR01 Patent grant