CN111582021B - Text detection method and device in scene image and computer equipment - Google Patents
- Publication number
- CN111582021B (application CN202010223195.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- pixel points
- text prediction
- circumscribed rectangle
- confidence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
- G06V10/225—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on a marking or identifier characterising the area
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Abstract
The invention relates to the technical field of image processing, and in particular to a text detection method and device for scene images and to computer equipment. The method comprises the following steps: detecting and determining a plurality of text prediction boxes in the scene image through a trained full convolution network model; screening high-confidence pixel points within each text prediction box; calculating the minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points; calculating the overlapping degree between the text prediction box and the minimum circumscribed rectangle and, when the overlapping degree is greater than a preset overlapping degree threshold, adjusting the width of the text prediction box through the minimum circumscribed rectangle; and cutting the adjusted box out of the scene image to obtain a text image to be recognized and recognizing the text information in it. On the basis of text detection with the EAST method, the method provided by the embodiment of the invention corrects the width of the text prediction box through the high-confidence region, so that the width of the text prediction box is reliably narrowed and more accurate text recognition is achieved.
Description
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for detecting text in a scene image, and a computer device.
Background
Text recognition based on computer vision is of great use in the present big-data age. It is the basis of many intelligent functions, such as recommendation systems and machine translation. Text detection is a precondition of the text recognition process, and its accuracy has a marked influence on the text recognition result.
In complex natural scenes, text appears at arbitrary positions, in varied arrangements and inconsistent orientations, and in mixtures of languages, all of which make text detection a very challenging task.
A conventional text detection algorithm called CTPN detects text in natural scenes by splitting complete text into slices, detecting the slices, and then merging them. Detecting text through such segmentation and recombination is, on the one hand, not accurate enough and, on the other hand, too time-consuming, which harms the user experience. On this basis, a text detection method called EAST (An Efficient and Accurate Scene Text detector) was proposed. EAST performs feature extraction and learning with a fully convolutional network (FCN) architecture and is trained and optimized end to end, eliminating unnecessary intermediate steps.
However, EAST still has many limitations in practical application and cannot fully meet application requirements. For example, the width of the final text prediction box often does not match the actual text in the scene, so the conventional technology needs further improvement on the basis of EAST.
Disclosure of Invention
The invention aims to solve the technical problem that the recognition precision of the existing EAST algorithm cannot meet actual use requirements.
To solve the above technical problem, in a first aspect, an embodiment of the present invention provides a method for detecting text in a scene image, including: training and optimizing a full convolution network model;
detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model; screening pixel points whose confidence is greater than a preset confidence threshold within the text prediction box as high-confidence pixel points, where the confidence is the probability, output by the full convolution network model, that a pixel point belongs to the text prediction box; calculating the minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points, where the minimum circumscribed rectangle is the rectangle of minimum area containing all high-confidence pixel points in the text prediction box; calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle; when the overlapping degree is greater than a preset overlapping degree threshold, adjusting the width of the text prediction box through the minimum circumscribed rectangle; cutting the adjusted text prediction box out of the scene image to obtain a text image to be recognized; and recognizing the characters in the text image to be recognized.
Optionally, before calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle, the method further comprises:
calculating the average confidence of the high-confidence pixel points in the minimum circumscribed rectangle;
and, when the average confidence is smaller than a preset screening threshold, eliminating the minimum circumscribed rectangle.
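This optional screening step can be sketched in a few lines. The helper name and the 0.8 screening threshold below are illustrative assumptions, not values from the patent:

```python
def keep_rectangle(confidences, screening_threshold=0.8):
    """Return True if the mean confidence of the high-confidence pixels
    inside a minimum circumscribed rectangle reaches the screening
    threshold; rectangles below it are eliminated before the overlap check.
    """
    return sum(confidences) / len(confidences) >= screening_threshold

# A rectangle whose pixels average 0.9 survives; one averaging ~0.7 is dropped.
```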
Optionally, the training optimization on the full convolution network model includes: constructing a full convolution network model; labeling a training label and constructing a training data set; and training and optimizing the full convolution network model through the training data set and a preset loss function.
Optionally, calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle includes:
determining pixel points lying in both the text prediction box and the minimum circumscribed rectangle as first pixel points; determining pixel points belonging to only one of the text prediction box and the minimum circumscribed rectangle as second pixel points; calculating the sum of the numbers of first pixel points and second pixel points; and taking the ratio between the number of first pixel points and that sum as the overlapping degree.
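This pixel-counting definition is exactly intersection over union. A minimal sketch, assuming each region is given as a set of integer (x, y) pixel coordinates (the function and variable names are illustrative):

```python
def overlap_degree(box_a, box_b):
    """Overlapping degree between two pixel regions given as coordinate sets.

    First pixels lie in both regions (intersection); second pixels lie in
    exactly one region (symmetric difference). The ratio
    |intersection| / (|intersection| + |symmetric difference|)
    equals |intersection| / |union|, i.e. the IoU.
    """
    first = box_a & box_b    # pixels inside both boxes
    second = box_a ^ box_b   # pixels inside exactly one box
    total = len(first) + len(second)
    return len(first) / total if total else 0.0

# Two 3x3 regions offset by one column: intersection 6, union 12.
a = {(x, y) for x in range(3) for y in range(3)}
b = {(x, y) for x in range(1, 4) for y in range(3)}
```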
Optionally, when the overlapping degree is greater than a preset overlapping degree threshold, the text prediction box is adjusted by the following formula:
P1=w*p+(1-w)*d,
where P1 is the adjusted width of the text prediction box, w is a weight coefficient, p is the original width of the text prediction box, and d is the width of the corresponding minimum circumscribed rectangle.
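The adjustment formula above is a simple weighted blend, sketched below (the function name and the example weight 0.5 are illustrative; the patent does not fix a value for w):

```python
def adjust_width(p, d, w=0.5):
    """Blend the predicted box width p with the minimum circumscribed
    rectangle width d using weight w: P1 = w*p + (1-w)*d."""
    return w * p + (1 - w) * d

# Predicted width 100 and min-rect width 80 with w = 0.5 give width 90.
```

With w = 1 the prediction is kept unchanged; with w = 0 the box snaps fully to the minimum circumscribed rectangle.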
Optionally, calculating the minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points includes:
determining the two high-confidence pixel points that are farthest apart as length calibration pixel points;
taking the line connecting the length calibration pixel points as a first direction, and determining the two high-confidence pixel points farthest apart along a second direction perpendicular to the first direction as width calibration pixel points;
and enclosing the minimum circumscribed rectangle with the two sides that pass through the length calibration pixel points perpendicular to the first direction and the two sides that pass through the width calibration pixel points perpendicular to the second direction.
In a second aspect, an embodiment of the present invention provides a text detection apparatus for a scene image, including:
a training unit, used to train and optimize the full convolution network model; a text prediction box detection unit, used to detect and determine a plurality of text prediction boxes in the scene image through the trained full convolution network model; a screening unit, used to screen pixel points whose confidence is greater than a preset confidence threshold within the text prediction box as high-confidence pixel points, where the confidence is the probability, output by the full convolution network model, that a pixel point belongs to the text prediction box; a minimum circumscribed rectangle determining unit, used to calculate the minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points, where the minimum circumscribed rectangle is the rectangle of minimum area containing all high-confidence pixel points in the text prediction box; an overlapping degree calculating unit, used to calculate the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle; an adjusting unit, used to adjust the width of the text prediction box through the minimum circumscribed rectangle when the overlapping degree is greater than a preset overlapping degree threshold; a cutting unit, used to cut the adjusted text prediction box out of the scene image to obtain a text image to be recognized; and a text recognition unit, used to recognize the text information in the text image to be recognized.
Optionally, the apparatus further comprises: a confidence calculating unit, used to calculate the average confidence of the high-confidence pixel points in the minimum circumscribed rectangle; and a minimum circumscribed rectangle screening unit, used to eliminate the minimum circumscribed rectangle when the average confidence is smaller than a preset screening threshold.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the text detection method of the scene image when executing the computer program.
In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the above-described text detection method of a scene image.
According to the text detection method provided by the embodiment of the invention, on the basis of realizing text detection by using an EAST method, the width of the text prediction box is corrected and adjusted through the high-confidence region, so that the width of the text prediction box is reliably reduced, and more accurate text recognition is realized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
Fig. 2 is a flow chart of a text detection method of a scene image according to an embodiment of the present invention;
FIG. 3 is a flow chart of step 20 in FIG. 2;
FIG. 4 is a schematic flow chart of screening minimum bounding rectangles according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a text detection device for a scene image according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a text detection device for a scene image according to another embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The embodiment of the invention first provides a text detection method for scene images. On the basis of text detection with the EAST method, this method adjusts the width of the text prediction box through the high-confidence region, thereby achieving more accurate text recognition.
Referring to fig. 1, fig. 1 is a schematic diagram of a computer device 100 according to an embodiment of the invention. The computer device 100 may be a computer, a computer cluster, a mainframe, a computing device dedicated to providing online content, or a computer network comprising a group of computers operating in a centralized or distributed manner.
As shown in fig. 1, the computer device 100 includes: a processor 102, a memory, and a network interface 105 connected by a system bus 101; the memory may include a nonvolatile storage medium 103 and an internal memory 104.
In an embodiment of the present invention, the processor 102 may be a central processing unit (CPU); it may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor. The number of processors 102 may be one or more, and the one or more processors 102 may execute sequences of computer program instructions to perform the text detection method for scene images described in more detail below.
Computer program instructions are stored in the non-volatile storage medium 103 and are accessed and read from it for execution by the processor 102 to implement the text detection method disclosed in the embodiments of the present invention described below. For example, the non-volatile storage medium 103 stores a software application that performs this method. Further, the non-volatile storage medium 103 may store the entire software application or only the portion of it that is executable by the processor 102. Although only one block is shown in fig. 1, the non-volatile storage medium 103 may comprise a plurality of physical devices installed on a central processing device or on different computing devices.
The network interface 105 is used for network communication, such as providing for transmission of data information, etc. It will be appreciated by those skilled in the art that the structure shown in FIG. 1 is merely a block diagram of some of the structures associated with the present inventive arrangements and does not constitute a limitation of the computer device 100 to which the present inventive arrangements are applied, and that a particular computer device 100 may include more or less components than those shown, or may combine some of the components, or have a different arrangement of components.
The embodiment of the invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium. The computer readable storage medium stores a computer program which, when executed by a processor, implements the text detection method for scene images disclosed in the embodiments of the invention. The computer program product may be embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer program code.
When the computer device 100 is implemented in software, fig. 2 shows a schematic diagram of the text detection method for scene images according to an embodiment; the method of fig. 2 is described in detail below. Referring to fig. 2, the method includes the following steps.
Step 20: training and optimizing the full convolution network model.
The full convolution network model is a type of neural network model. Before use, it must be trained offline with training data to determine the connection weight parameters between neurons.
In some embodiments, as shown in fig. 3, the step 20 specifically includes the following steps:
Step 200: constructing a full convolution network model.
Feature extraction is performed on the input scene image data through the full convolution network model, finally generating a single-channel, pixel-level text score feature map and a multi-channel geometry feature map. Specifically, the network structure of the full convolution network model can be broken down into three parts: a feature extraction layer, a feature merging layer, and an output layer.
First, the feature extraction layer adopts a general convolutional network as its base network. During training, the parameters of the convolutional network are initialized and features are then extracted; after training, the optimized convolutional network parameters are obtained. In practical application, base networks such as PVANet (a design balancing performance versus accuracy) or VGG16 (from the Visual Geometry Group) can be selected according to actual requirements. In the embodiment of the invention, the convolutional network extracts four levels of feature maps, whose sizes are 1/32, 1/16, 1/8 and 1/4 of the input image. Since locating large text requires a large receptive field while locating small text regions requires a small one, using feature maps at different levels meets the requirements of natural scenes, where text region sizes vary greatly.
Second, the four levels of feature maps are merged layer by layer in a U-shaped fashion, which keeps the later computation cost low. Following the EAST formulation, with fi denoting the i-th feature map from the extraction stem, the layer-by-layer merging can be represented as:

gi = unpool(hi), if i ≤ 3;  gi = conv3×3(hi), if i = 4

hi = fi, if i = 1;  hi = conv3×3(conv1×1([gi−1; fi])), otherwise

The specific process is as follows: in each merging stage, the feature map from the previous stage is first fed to an unpooling layer (unpool) to expand its size, and is then concatenated with the current-level feature map. A conv1×1 layer then reduces the number of channels and the amount of computation, and a conv3×3 layer fuses local information to produce the output of the merging stage. After the last merging stage (i.e. i = 4), a conv3×3 layer generates the final feature map of the merging branch and sends it to the output layer.
Finally, the output layer outputs the text score feature map and the geometry feature map, each 1/4 the size of the original image; the text score feature map has 1 channel and the geometry feature map has 5 channels. The text score feature map indicates the confidence that each pixel belongs to a text prediction box.
Step 202: labeling training labels and constructing a training data set.
The labeling of the training labels can be accomplished in any suitable manner known in the prior art, and the labeled data serve as a training data set for the full convolution network model. In some cases, an existing training data set may also be used directly for training or testing.
Step 204: training and optimizing the full convolution network model through the training data set and a preset loss function.
Training optimization is a learning optimization process for parameters of the full convolution network model. When the parameter optimization is completed, the fully-convolution network model with the completed training can be applied to the text detection of the actual scene.
Besides the well-labeled training data, the optimization process also requires a suitable loss function for evaluating the performance of the full convolution network model; parameter optimization is realized by minimizing the loss.
In the present application, the loss function can be expressed as:

L = Ls + λg·Lg

where L is the total loss, Ls is the loss of the text score feature map, Lg is the loss of the geometry feature map, and λg balances the importance of the two losses and can be set to 1.
Specifically, the loss of the text score feature map may be computed with class-balanced cross entropy, and the loss of the geometry feature map with an overlap (IoU, intersection over union) loss function.
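A scalar sketch of the score-map loss and the total loss L = Ls + λg·Lg, assuming flattened 0/1 labels and using β = 1 − (positive fraction) as the class-balance weight; the function names are illustrative:

```python
import math

def balanced_cross_entropy(y_true, y_pred, eps=1e-7):
    """Class-balanced cross entropy for the text score map: the positive
    class is weighted by beta = 1 - (fraction of positive pixels)."""
    beta = 1 - sum(y_true) / len(y_true)
    loss = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss += -(beta * y * math.log(p)
                  + (1 - beta) * (1 - y) * math.log(1 - p))
    return loss / len(y_true)

def total_loss(l_score, l_geometry, lambda_g=1.0):
    """L = Ls + lambda_g * Lg, with lambda_g set to 1 as in the text."""
    return l_score + lambda_g * l_geometry
```

Perfect score-map predictions drive the balanced cross entropy toward zero, while uncertain predictions (e.g. a uniform 0.5) incur a clear penalty.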
Step 22: detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model.
The trained full convolution network model determines the text prediction boxes in the scene image to be detected, i.e. the regions of the scene image that contain text.
As described above, the output of the full convolution network model includes a text score feature map and a geometry feature map. The text score feature map records, for each pixel mapped back to the image to be detected, the probability that it belongs to a text prediction box. The geometry feature map records, for each such pixel, its distances to the borders of the text prediction box.
The full convolution network model typically outputs a large number of candidate text prediction boxes. Therefore, in a preferred embodiment, a non-maximum suppression algorithm may be applied to eliminate redundant candidates and keep only the best text prediction boxes, which are the text prediction boxes referred to in the embodiment of the present invention.
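For illustration, here is the standard form of non-maximum suppression over axis-aligned boxes (x1, y1, x2, y2, score). Note that EAST itself uses a locality-aware variant; this plain version only sketches the idea, and all names are illustrative:

```python
def nms(boxes, iou_threshold=0.5):
    """Keep the highest-scoring boxes; drop any box whose IoU with an
    already-kept box exceeds the threshold."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0

    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept

# Two heavily overlapping candidates plus one distant box: the weaker
# overlapping candidate is suppressed, leaving two boxes.
boxes = [(0, 0, 10, 10, 0.9), (1, 1, 10, 10, 0.8), (20, 20, 30, 30, 0.7)]
```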
In this embodiment, a scene image is an image captured in a real-world scene, for example, an image framed and shot with any suitable camera-equipped terminal.
Step 24: screening pixel points whose confidence is greater than a preset confidence threshold in the text prediction box as high-confidence pixel points.
The confidence is the probability, output by the full convolution network model, that a pixel point belongs to a text prediction box. The text score feature map records this confidence for every pixel point and thus indicates where text prediction boxes may exist. In this step, a suitable screening method selects the pixels with higher confidence, which can then be used to further adjust and optimize the text prediction box.
Specifically, high-confidence pixel points can be screened from the text score feature map by setting a suitable confidence threshold. For example, with the threshold set to 0.7, each pixel's score in the text score feature map is compared against it in turn: if the score exceeds the threshold, the pixel is kept as a high-confidence pixel point; otherwise it is discarded.
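The screening step above reduces to a simple threshold scan over the score map. A minimal sketch, with the 0.7 threshold taken from the example in the text and the function name illustrative:

```python
def screen_high_confidence(score_map, threshold=0.7):
    """Collect (row, col) coordinates whose score in the text score
    feature map exceeds the confidence threshold."""
    return [(r, c)
            for r, row in enumerate(score_map)
            for c, score in enumerate(row)
            if score > threshold]

# Toy 2x2 score map: only the 0.9 and 0.8 pixels survive the 0.7 threshold.
score_map = [[0.2, 0.9],
             [0.8, 0.6]]
```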
One image to be detected may contain several different text prediction boxes, so the high-confidence pixels may belong to different text regions in the scene. To avoid errors during adjustment or correction, the high-confidence pixels must therefore be marked and distinguished. Specifically, according to its position, each pixel point is assigned to the text prediction box it belongs to, so that the high-confidence pixel points are grouped into their corresponding text prediction boxes.
Step 26: calculate the minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points.
The minimum bounding rectangle (MBR), expressed in two-dimensional coordinates, is the maximum extent of the high-confidence pixel points in the same text prediction box. In other words, it is the rectangle, determined by the high-confidence pixel points of one text prediction box, that contains all of those pixel points and has the smallest area.
In particular, any suitable algorithm may be used to computationally determine the minimum bounding rectangle for each text prediction box.
In some embodiments, the method specifically includes the following steps:
First, among the high-confidence pixel points, the two that are farthest apart are determined as the length calibration pixel points.
Then, taking the line connecting the length calibration pixel points as a first direction, the two high-confidence pixel points that are farthest apart along a second direction perpendicular to the first direction are determined as the width calibration pixel points.
Finally, the first line segments, which pass through the length calibration pixel points and are perpendicular to the line connecting them, are taken as one pair of sides, and the second line segments, which pass through the width calibration pixel points and are perpendicular to the line connecting them, are taken as the other pair, so that the minimum circumscribed rectangle can be enclosed.
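The three steps above can be sketched as follows, with hypothetical names. Note that this farthest-pair construction follows the text's description; it is a heuristic, not a general minimum-area rectangle algorithm such as rotating calipers.

```python
import numpy as np

def min_bounding_rect(points):
    """Oriented rectangle from the farthest-pair heuristic described in the text.

    points: iterable of (x, y) high-confidence pixel coordinates.
    Returns the four corners of the rectangle, ordered around its boundary.
    """
    pts = np.asarray(points, dtype=float)
    # Length calibration: the two points that are farthest apart.
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    axis = (pts[j] - pts[i]) / d[i, j]      # first direction
    perp = np.array([-axis[1], axis[0]])    # second (perpendicular) direction
    # Width calibration: extreme projections along the perpendicular direction.
    t = pts @ perp
    lo, hi = t.min(), t.max()
    # Extents along the first direction give the other pair of sides.
    s = pts @ axis
    corners = [a * axis + b * perp for a in (s.min(), s.max()) for b in (lo, hi)]
    # Reorder so the corners trace the rectangle boundary.
    corners = [corners[0], corners[1], corners[3], corners[2]]
    return np.array(corners)
```

For points spread 10 units along x and 3 units along y, the enclosed rectangle has area 30.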
Step 28: calculate the degree of overlap between the text prediction box and the corresponding minimum circumscribed rectangle.
The degree of overlap (IoU, intersection over union), also called the overlap ratio, characterizes how much the text prediction box and the corresponding minimum circumscribed rectangle coincide. It is calculated as the ratio of the area of the intersection of the two boxes to the area of their union. The higher the overlap, the better the two boxes match.
In some embodiments, the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle may be calculated specifically by:
First, the pixel points lying in both the text prediction box and the minimum bounding rectangle are determined as first pixel points, and the pixel points belonging to only the text prediction box or only the minimum bounding rectangle are determined as second pixel points.
Then, the sum of the numbers of first pixel points and second pixel points is calculated.
Finally, the ratio of the number of first pixel points to that sum is calculated as the degree of overlap.
Step 30: when the degree of overlap is greater than a preset overlap threshold, adjust the width of the text prediction box through the minimum circumscribed rectangle.
The overlap threshold is an empirical value that can be set by the skilled person as the situation requires. Typically, the width of the minimum bounding rectangle is smaller than the width of the text prediction box, which indicates that the region within the minimum bounding rectangle is more likely to belong to the text region. Therefore, the text prediction box can be adjusted appropriately through the minimum circumscribed rectangle so that its width is correspondingly reduced.
Specifically, when the degree of overlap is greater than the preset overlap threshold, the text prediction box is adjusted by the following formula:
P1=w*p+(1-w)*d,
wherein P1 is the adjusted width of the text prediction box, w is a weight coefficient, p is the width of the text prediction box, and d is the width of the corresponding minimum circumscribed rectangle.
With the above formula, once a suitable value of w is given, the width of the text prediction box can be corrected toward the narrower, effective minimum circumscribed rectangle, so that the width of the text prediction box is reliably reduced and more accurate text recognition is achieved.
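A minimal sketch of the adjustment formula. The default value of w is an assumption; the patent only requires a suitable weight coefficient.

```python
def adjust_width(p, d, w=0.5):
    """P1 = w*p + (1-w)*d: blend the predicted box width p toward the
    minimum bounding rectangle width d. With d < p, the result is narrower than p."""
    return w * p + (1 - w) * d
```

For a predicted width of 100 and a rectangle width of 60, equal weighting yields 80, and w = 0.8 yields a gentler correction of 92.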
Step 32: cut the adjusted text prediction box out of the scene image to obtain the text image to be recognized.
The adjusted text prediction boxes indicate the regions of the scene image that contain text. These regions can therefore be cut out of the scene image as the text images to be recognized.
Step 34: recognize the text information in the text image to be recognized.
Specifically, any suitable algorithm or model can be selected to identify and acquire the text information in the text image, yielding the final text detection result for the scene image. Such recognition methods are well known to those skilled in the art and are not described in detail herein.
By applying the text detection method provided by the embodiment of the invention, the width of the text prediction box can be reliably reduced, more accurate text recognition is realized, the difficulty of subsequent processing is reduced, and the text detection accuracy is improved.
Because the minimum bounding rectangle is the standard against which the width of the text detection box is finally adjusted, it is necessary to ensure that the minimum bounding rectangle is reliable; otherwise, the subsequent adjustment may instead have adverse consequences.
In some embodiments, before performing step 28, the method may further include a step of screening the minimum bounding rectangles, as shown in fig. 4:
step 401: and calculating the confidence average value of the high-confidence pixel points in the minimum circumscribed rectangle.
The confidence average is the mean confidence of the high-confidence pixel points, and represents the probability that the minimum bounding rectangle as a whole belongs to a text region.
Step 402: and judging whether the confidence average value is smaller than a preset screening threshold value. If yes, go to step 403. If not, go to step 404.
Step 403: and eliminating the minimum circumscribed rectangle.
It will be appreciated that minimum bounding rectangles with a low confidence average do not, in fact, have a high reliability or probability of being text, and are therefore insufficient as a standard for correction. Such minimum bounding rectangles can be eliminated, and the width correction of the text prediction box is performed without them.
Step 404: and reserving the minimum bounding rectangle as an effective minimum bounding rectangle. These effective minimum bounding rectangles can be used for further processing as references to adjust the text detection box.
An embodiment of the present invention further provides a text detection device corresponding to the text detection method for a scene image in the foregoing embodiments. Referring to fig. 5, which is a block diagram of the text detection device for a scene image provided in an embodiment of the present invention, the text detection device 500 includes: a training unit 50, a text prediction box detection unit 52, a screening unit 54, a minimum bounding rectangle determination unit 56, an overlap calculation unit 58, an adjustment unit 60, a cutting unit 62, and a text recognition unit 64.
The training unit 50 is used for training and optimizing the full convolution network model.
The text prediction box detection unit 52 is configured to detect and determine a plurality of text prediction boxes in the scene image through the trained full convolution network model. The screening unit 54 is configured to screen the pixel points whose confidence is greater than a preset confidence threshold in the text prediction box as high-confidence pixel points, where the confidence is the probability, output by the full convolution network model, that a pixel point belongs to the text prediction box. The minimum bounding rectangle determining unit 56 is configured to calculate, according to the high-confidence pixel points, the minimum bounding rectangle corresponding to the text prediction box, where the minimum bounding rectangle is the rectangle that contains all high-confidence pixel points in the text prediction box and has the smallest area. The overlap calculating unit 58 is configured to calculate the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle. The adjusting unit 60 is configured to adjust the width of the text prediction box through the minimum bounding rectangle when the degree of overlap is greater than a preset overlap threshold. The cutting unit 62 is configured to cut the adjusted text prediction box out of the scene image to obtain the text image to be recognized. The text recognition unit 64 is configured to recognize the text information in the text image to be recognized.
According to the text detection device for scene images provided by the embodiment of the present invention, on the basis of text detection with the EAST method, the width of the text prediction box is corrected through the high-confidence region, so that the width of the text prediction box is reliably reduced and more accurate text recognition is achieved.
In some embodiments, as shown in fig. 6, the text detection apparatus 500 may further include, in addition to the functional modules shown in fig. 5: confidence level calculation unit 66 and minimum bounding rectangle screening unit 68.
The confidence calculating unit 66 is configured to calculate a confidence average value of the high-confidence pixel points in the minimum bounding rectangle. The minimum bounding rectangle filtering unit 68 is configured to reject the minimum bounding rectangle when the confidence average value is less than a preset filtering threshold.
The minimum bounding rectangle (MBR), expressed in two-dimensional coordinates, is the maximum extent of the high-confidence pixel points in the same text prediction box. It represents the rectangular region determined by the high-confidence pixel points of one text prediction box. The minimum bounding rectangle may be determined or computed by any suitable means; computing the minimum bounding rectangle of a given set of pixel points is well known to those skilled in the art and is not detailed herein.
By applying the text detection device for the scene image, which is provided by the embodiment of the invention, the width of the text prediction box can be reliably reduced, and more accurate text recognition can be realized.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, for the specific working procedures of the apparatus, device, and units described above, reference may be made to the corresponding procedures in the foregoing method embodiments, which are not repeated herein. Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two; to clearly illustrate this interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of function. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is merely a logical functional division, and in actual implementation there may be other division manners; units having the same function may be integrated into one unit, multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections via some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is defined by the claims.
Claims (9)
1. A method for text detection of a scene image, comprising:
training and optimizing the full convolution network model;
Detecting and determining a plurality of text prediction boxes in the scene image through the trained full convolution network model;
screening pixels with confidence coefficient larger than a preset confidence coefficient threshold value in the text prediction box as high-confidence coefficient pixels, wherein the confidence coefficient is the probability that the pixels belong to the text prediction box and are output by the full convolution network model;
calculating a minimum circumscribed rectangle corresponding to the text prediction box according to the high-confidence pixel points, wherein the minimum circumscribed rectangle is a rectangle with the minimum area and containing all the high-confidence pixel points in the text prediction box;
calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle;
When the overlapping degree is larger than a preset overlapping degree threshold value, the width of the text prediction box is adjusted through the minimum circumscribed rectangle;
Cutting the adjusted text prediction box in the scene image to obtain a text image to be identified;
Identifying text information in the text image to be identified;
The calculating the minimum circumscribed rectangle corresponding to the text prediction box according to the high confidence pixel points comprises the following steps:
Determining, among the high-confidence pixel points, the two pixel points that are farthest apart as length calibration pixel points;
Taking a connecting line between the length calibration pixel points as a first direction, and determining two high-confidence pixel points with the farthest distance as width calibration pixel points in a second direction perpendicular to the first direction;
and taking the first line segments, which pass through the length calibration pixel points and are perpendicular to the line connecting them, as one pair of sides, and the second line segments, which pass through the width calibration pixel points and are perpendicular to the line connecting them, as the other pair, so as to enclose the minimum circumscribed rectangle.
2. The method of text detection of a scene image of claim 1, wherein prior to calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle, the method further comprises:
calculating a confidence average value of the high-confidence pixel points in the minimum circumscribed rectangle;
and when the confidence coefficient average value is smaller than a preset screening threshold value, eliminating the minimum circumscribed rectangle.
3. The method for text detection of a scene image as recited in claim 2, wherein said training optimization of the full convolutional network model comprises:
constructing a full convolution network model;
labeling a training label and constructing a training data set;
And training and optimizing the full convolution network model through the training data set and a preset loss function.
4. The method of claim 1, wherein said calculating the degree of overlap between the text prediction box and the corresponding minimum bounding rectangle comprises:
Determining a pixel point which is simultaneously in the text prediction box and the minimum circumscribed rectangle as a first pixel point;
Determining that the pixel points only belonging to the text prediction box or the minimum circumscribed rectangle are second pixel points;
calculating the sum of the numbers of the first pixel points and the second pixel points;
and calculating the ratio between the number of the first pixel points and the sum of the number of the first pixel points and the number of the second pixel points as the overlapping degree.
5. The method of claim 1, wherein when the degree of overlap is greater than a preset degree of overlap threshold, the text prediction box is adjusted by the following formula:
P1 = w*p+(1-w)*d,
wherein P1 is the width of the text prediction frame after adjustment, w is a weight coefficient, P is the width of the text prediction frame, and d is the width of the corresponding minimum circumscribed rectangle.
6. A text detection device for a scene image, comprising:
The training unit is used for training and optimizing the full convolution network model;
A text prediction box detection unit, configured to detect and determine a plurality of text prediction boxes in the scene image through the trained full convolution network model;
The screening unit is used for screening pixels with confidence coefficient larger than a preset confidence coefficient threshold value in the text prediction frame as high-confidence coefficient pixels, wherein the confidence coefficient is the probability that the pixels belong to the text prediction frame and are output by the full convolution network model;
The minimum circumscribed rectangle determining unit is used for calculating a minimum circumscribed rectangle corresponding to the text prediction frame according to the high-confidence pixel points, wherein the minimum circumscribed rectangle is a rectangle with the minimum area and containing all the high-confidence pixel points in the text prediction frame;
The overlapping degree calculating unit is used for calculating the overlapping degree between the text prediction box and the corresponding minimum circumscribed rectangle;
The adjusting unit is used for adjusting the width of the text prediction frame through the minimum circumscribed rectangle when the overlapping degree is larger than a preset overlapping degree threshold value;
The cutting unit is used for cutting the adjusted text prediction box in the scene image to obtain a text image to be identified;
a text recognition unit for recognizing text information in the text image to be recognized;
The minimum circumscribed rectangle determining unit includes:
Determining, among the high-confidence pixel points, the two pixel points that are farthest apart as length calibration pixel points;
Taking a connecting line between the length calibration pixel points as a first direction, and determining two high-confidence pixel points with the farthest distance as width calibration pixel points in a second direction perpendicular to the first direction;
and taking the first line segments, which pass through the length calibration pixel points and are perpendicular to the line connecting them, as one pair of sides, and the second line segments, which pass through the width calibration pixel points and are perpendicular to the line connecting them, as the other pair, so as to enclose the minimum circumscribed rectangle.
7. The apparatus as recited in claim 6, further comprising:
The confidence coefficient calculating unit is used for calculating a confidence coefficient average value of the high-confidence coefficient pixel points in the minimum circumscribed rectangle;
And the minimum circumscribed rectangle screening unit is used for eliminating the minimum circumscribed rectangle when the confidence coefficient average value is smaller than a preset screening threshold value.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a text detection method of a scene image as claimed in any of claims 1 to 5 when the computer program is executed.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the text detection method of a scene image according to any of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010223195.1A CN111582021B (en) | 2020-03-26 | 2020-03-26 | Text detection method and device in scene image and computer equipment |
PCT/CN2020/131604 WO2021189889A1 (en) | 2020-03-26 | 2020-11-26 | Text detection method and apparatus in scene image, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010223195.1A CN111582021B (en) | 2020-03-26 | 2020-03-26 | Text detection method and device in scene image and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111582021A CN111582021A (en) | 2020-08-25 |
CN111582021B true CN111582021B (en) | 2024-07-05 |
Family
ID=72124246
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010223195.1A Active CN111582021B (en) | 2020-03-26 | 2020-03-26 | Text detection method and device in scene image and computer equipment |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111582021B (en) |
WO (1) | WO2021189889A1 (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582021B (en) * | 2020-03-26 | 2024-07-05 | 平安科技(深圳)有限公司 | Text detection method and device in scene image and computer equipment |
CN111932577B (en) * | 2020-09-16 | 2021-01-08 | 北京易真学思教育科技有限公司 | Text detection method, electronic device and computer readable medium |
CN111931784B (en) * | 2020-09-17 | 2021-01-01 | 深圳壹账通智能科技有限公司 | Bill recognition method, system, computer device and computer-readable storage medium |
CN112329765B (en) * | 2020-10-09 | 2024-05-24 | 中保车服科技服务股份有限公司 | Text detection method and device, storage medium and computer equipment |
CN112232340A (en) * | 2020-10-15 | 2021-01-15 | 马婧 | Method and device for identifying printed information on surface of object |
CN112613561B (en) * | 2020-12-24 | 2022-06-03 | 哈尔滨理工大学 | EAST algorithm optimization method |
CN112819937B (en) * | 2021-04-19 | 2021-07-06 | 清华大学 | Self-adaptive multi-object light field three-dimensional reconstruction method, device and equipment |
CN113298079B (en) * | 2021-06-28 | 2023-10-27 | 北京奇艺世纪科技有限公司 | Image processing method and device, electronic equipment and storage medium |
CN114067237A (en) * | 2021-10-28 | 2022-02-18 | 清华大学 | Video data processing method, device and equipment |
CN114037826A (en) * | 2021-11-16 | 2022-02-11 | 平安普惠企业管理有限公司 | Text recognition method, device, equipment and medium based on multi-scale enhanced features |
CN114495103B (en) * | 2022-01-28 | 2023-04-04 | 北京百度网讯科技有限公司 | Text recognition method and device, electronic equipment and medium |
CN115375987B (en) * | 2022-08-05 | 2023-09-05 | 北京百度网讯科技有限公司 | Data labeling method and device, electronic equipment and storage medium |
CN117649635B (en) * | 2024-01-30 | 2024-06-11 | 湖北经济学院 | Method, system and storage medium for detecting shadow eliminating point of narrow water channel scene |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110796082A (en) * | 2019-10-29 | 2020-02-14 | 上海眼控科技股份有限公司 | Nameplate text detection method and device, computer equipment and storage medium |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108304761A (en) * | 2017-09-25 | 2018-07-20 | 腾讯科技(深圳)有限公司 | Method for text detection, device, storage medium and computer equipment |
CN109886997B (en) * | 2019-01-23 | 2023-07-11 | 平安科技(深圳)有限公司 | Identification frame determining method and device based on target detection and terminal equipment |
CN109886174A (en) * | 2019-02-13 | 2019-06-14 | 东北大学 | A kind of natural scene character recognition method of warehouse shelf Sign Board Text region |
CN109977943B (en) * | 2019-02-14 | 2024-05-07 | 平安科技(深圳)有限公司 | Image target recognition method, system and storage medium based on YOLO |
CN110135424B (en) * | 2019-05-23 | 2021-06-11 | 阳光保险集团股份有限公司 | Inclined text detection model training method and ticket image text detection method |
CN110232713B (en) * | 2019-06-13 | 2022-09-20 | 腾讯数码(天津)有限公司 | Image target positioning correction method and related equipment |
CN110443140B (en) * | 2019-07-05 | 2023-10-03 | 平安科技(深圳)有限公司 | Text positioning method, device, computer equipment and storage medium |
CN110414499B (en) * | 2019-07-26 | 2021-06-04 | 第四范式(北京)技术有限公司 | Text position positioning method and system and model training method and system |
CN110874618B (en) * | 2020-01-19 | 2020-11-27 | 同盾控股有限公司 | OCR template learning method and device based on small sample, electronic equipment and medium |
CN111582021B (en) * | 2020-03-26 | 2024-07-05 | 平安科技(深圳)有限公司 | Text detection method and device in scene image and computer equipment |
2020
- 2020-03-26 CN CN202010223195.1A patent/CN111582021B/en active Active
- 2020-11-26 WO PCT/CN2020/131604 patent/WO2021189889A1/en active Application Filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110796082A (en) * | 2019-10-29 | 2020-02-14 | 上海眼控科技股份有限公司 | Nameplate text detection method and device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111582021A (en) | 2020-08-25 |
WO2021189889A1 (en) | 2021-09-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111582021B (en) | Text detection method and device in scene image and computer equipment | |
CN110751134B (en) | Target detection method, target detection device, storage medium and computer equipment | |
CN110458095B (en) | Effective gesture recognition method, control method and device and electronic equipment | |
CN110084299B (en) | Target detection method and device based on multi-head fusion attention | |
US10783643B1 (en) | Segmentation-based damage detection | |
CN110516541B (en) | Text positioning method and device, computer readable storage medium and computer equipment | |
CN111652181B (en) | Target tracking method and device and electronic equipment | |
CN111860398A (en) | Remote sensing image target detection method and system and terminal equipment | |
US20230137337A1 (en) | Enhanced machine learning model for joint detection and multi person pose estimation | |
CN112800955A (en) | Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid | |
CN111639513A (en) | Ship shielding identification method and device and electronic equipment | |
CN114549369B (en) | Data restoration method and device, computer and readable storage medium | |
CN114038004A (en) | Certificate information extraction method, device, equipment and storage medium | |
CN115631112B (en) | Building contour correction method and device based on deep learning | |
CN111368632A (en) | Signature identification method and device | |
CN112348116A (en) | Target detection method and device using spatial context and computer equipment | |
CN113487610A (en) | Herpes image recognition method and device, computer equipment and storage medium | |
CN113205047A (en) | Drug name identification method and device, computer equipment and storage medium | |
CN112417947A (en) | Method and device for optimizing key point detection model and detecting face key points | |
CN108446602B (en) | Device and method for detecting human face | |
CN116468702A (en) | Chloasma assessment method, device, electronic equipment and computer readable storage medium | |
CN113706705B (en) | Image processing method, device, equipment and storage medium for high-precision map | |
CN114926631A (en) | Target frame generation method and device, nonvolatile storage medium and computer equipment | |
CN113033593B (en) | Text detection training method and device based on deep learning | |
CN114494833A (en) | State identification method and device for port of optical cable cross-connecting cabinet |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code |
Ref country code: HK Ref legal event code: DE Ref document number: 40032042 Country of ref document: HK |
|
GR01 | Patent grant | ||