CN109993040B - Text recognition method and device

Info

Publication number: CN109993040B
Application number: CN201810004874.2A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN109993040A
Inventor: 高立宁
Applicant and current assignee: Beijing Century TAL Education Technology Co Ltd
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/29: Graphical models, e.g. Bayesian networks
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00: Character recognition; recognising digital ink; document-oriented image-based pattern recognition
    • G06V 30/40: Document-oriented image-based pattern recognition
    • G06V 30/41: Analysis of document content
    • G06V 30/414: Extracting the geometrical structure, e.g. layout tree; block segmentation, e.g. bounding boxes for graphics or text

Abstract

Embodiments of the invention provide a text recognition method and device. The text recognition method includes: acquiring a text image to be detected, the text image containing information of a plurality of characters; performing a multi-scale transformation on the text image to obtain a plurality of sub-text images of different sizes; performing text detection on each sub-text image by using a convolutional neural network model to obtain a candidate text detection box corresponding to each character in each sub-text image; performing non-maximum suppression (NMS) processing on the multiple candidate text detection boxes of the same character across all sub-text images, and filtering the processed candidate text detection boxes to determine effective text detection boxes; and performing text recognition on the text image based on the effective text detection boxes to obtain a text recognition result. The embodiments of the invention greatly improve the accuracy of text detection and recognition on text images.

Description

Text recognition method and device
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a text recognition method and a text recognition device.
Background
With the development of computer and Internet technologies, computer- and network-assisted learning and teaching has become a trend. Photo-based question search has gradually become an important means for students to get questions answered during learning and teaching.
In photo-based question search, a user photographs a question with a mobile phone or similar device and submits the image to a learning platform; the platform searches its database according to the photographed content and returns the question stem and analysis of the matching question. Text detection on the photographed content is one of the key technologies in this process.
Existing text detection techniques mainly rely on manually designed text features, such as MSER (Maximally Stable Extremal Regions) detection and SWT (Stroke Width Transform) detection. However, such feature extraction essentially compresses information. Although these methods detect text well on good-quality images, i.e., images that are sharp and have little background noise, their performance degrades severely on poor-quality images with complex backgrounds, distorted text, or blur.
Therefore, how to accurately detect and recognize text in images containing text questions, especially poor-quality images, has become an urgent problem to solve.
Disclosure of Invention
In view of this, embodiments of the present invention provide a text recognition scheme to address the low accuracy of prior-art text detection on images containing text questions, especially images of poor quality.
According to a first aspect of the embodiments of the present invention, there is provided a text recognition method, including: acquiring a text image to be detected, wherein the text image contains information of a plurality of characters; performing a multi-scale transformation on the text image to obtain a plurality of sub-text images of different sizes; performing text detection on each sub-text image by using a convolutional neural network model to obtain a candidate text detection box corresponding to each character in each sub-text image; performing non-maximum suppression (NMS) processing on the multiple candidate text detection boxes of the same character across all sub-text images, and filtering the processed candidate text detection boxes to determine effective text detection boxes; and performing text recognition on the text image based on the effective text detection boxes to obtain a text recognition result.
According to a second aspect of the embodiments of the present invention, there is also provided a text recognition apparatus, including: a first acquisition module, configured to acquire a text image to be detected, the text image containing information of a plurality of characters; a second acquisition module, configured to perform a multi-scale transformation on the text image to obtain a plurality of sub-text images of different sizes; a third acquisition module, configured to perform text detection on each sub-text image by using a convolutional neural network model to obtain a candidate text detection box corresponding to each character in each sub-text image; a determining module, configured to perform non-maximum suppression (NMS) processing on the multiple candidate text detection boxes of the same character across all sub-text images, filter the processed candidate text detection boxes, and determine effective text detection boxes; and a recognition module, configured to perform text recognition on the text image based on the effective text detection boxes to obtain a text recognition result.
According to the scheme provided by the embodiments of the present invention, when performing text detection and recognition, a multi-scale transformation is first applied to the text image to be detected to generate a plurality of sub-text images of different sizes; then each sub-text image is detected by a convolutional neural network model to obtain a candidate text detection box for each character in each sub-text image, so that each character has multiple corresponding candidate text detection boxes across the multiple sub-text images; further, all candidate text detection boxes are processed and filtered with non-maximum suppression, so that the effective text detection boxes used for detection can finally be determined; and text recognition is performed based on the effective text detection boxes to obtain a text recognition result. In this scheme, on the one hand, generating multiple sub-text images of different sizes from the text image to be detected effectively improves the text detection rate; on the other hand, text detection is performed by a convolutional neural network model, which extracts features directly from the image without compressing the image information, so features can be extracted effectively regardless of whether image quality is good or poor. Furthermore, after the convolutional neural network model extracts features and outputs candidate text detection boxes, non-maximum suppression and filtering are applied to them to determine the most effective candidate text detection boxes for text recognition. The accuracy of text detection and recognition on text images is thereby greatly improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below cover only some embodiments of the present invention, and a person skilled in the art can derive other drawings from them.
FIG. 1 is a flow chart illustrating steps of a text recognition method according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of a text recognition method according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of a text line generated in the embodiment shown in FIG. 2;
FIG. 4 is a schematic illustration of the vertical-direction overlap-rate calculation in the embodiment of FIG. 2;
FIG. 5 is a schematic illustration of a generated text line group in the embodiment shown in FIG. 2;
FIG. 6 is a diagram of a segmented text line in the embodiment of FIG. 2;
FIG. 7 is a diagram of a rectified text line in the embodiment of FIG. 2;
FIG. 8 is a block diagram of a text recognition apparatus according to a third embodiment of the present invention.
Detailed Description
Of course, it is not necessary for any particular embodiment of the invention to achieve all of the above advantages at the same time.
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention shall fall within the scope of the protection of the embodiments of the present invention.
Example one
Referring to fig. 1, a flowchart illustrating steps of a text recognition method according to a first embodiment of the present invention is shown.
The text recognition method of the embodiment comprises the following steps:
step S102: and acquiring a text image to be detected.
The text image to be detected contains information of a plurality of characters, wherein the characters include but are not limited to one or more of characters, symbols, numbers and letters.
In the embodiment of the invention, the text image to be detected is an image of a problem topic. But is not limited thereto, other text images are also equally applicable to the solution of the embodiment of the present invention.
Step S104: and performing multi-scale transformation on the text image to be detected to obtain a plurality of sub-text images with different sizes.
Multi-scale image techniques, also called multi-resolution techniques (MRA), refer to the use of multi-scale representations of images and processing them separately at different scales. The multi-scale image may be obtained by performing a multi-scale transformation on the original image. Multi-scale images are very common in extracting image features because in many cases features that are not readily apparent in one scale or that are acquired are easily found or extracted at some other scale. The multi-scale transformation techniques employed to obtain multi-scale images can be basically divided into three major categories: scale-space techniques, time-scale techniques, and time-frequency techniques. In practical applications, a person skilled in the art may perform multi-scale transformation on a text image to be detected by using any appropriate multi-scale transformation manner to obtain a plurality of sub-text images with different sizes, which is not limited in this embodiment of the present invention.
Step S106: and performing text detection on each sub-text image by using a convolutional neural network model to obtain a candidate text detection box corresponding to each character in each sub-text image.
The convolutional neural network model in the embodiment of the invention is used for carrying out text detection on the image, and can be a neural network model trained by a third party and can be directly used; the convolutional neural network model can also be obtained by training the convolutional neural network model in advance before text recognition.
Each character corresponds to one candidate text detection box in each sub-text image, and correspondingly, the character corresponds to a plurality of text detection boxes in a plurality of sub-text images. Suppose that the text image a to be detected contains N characters, which are respectively characters 1, 2, … …, and N, and after performing multi-scale transformation on the text image a to be detected, 4 sub-text images a1, a2, A3, and a4 are generated. Text detection is performed on a1, a2, A3 and a4 by using a convolutional neural network model, respectively, so that the detection result (i.e., candidate text detection box) of the character 1 in a1 is 11, the detection result in a2 is 12, the detection result in A3 is 13 and the detection result in a4 is 14; by analogy, the detection results of the character N in a1, a2, A3, and a4 are N1, N2, N3, and N4, respectively.
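The text does not spell out how boxes detected in differently sized sub-text images are aligned for comparison; a common choice, assumed in the following sketch, is to map each box back to the coordinate system of the original text image by dividing by the sub-image's scale factor (the helper name is illustrative):

```python
# Assumed helper (not specified in the text): map a candidate box detected in a
# scaled sub-text image back to the original image's coordinates so that the
# detections of the same character from A1..A4 line up and can be fed to NMS.
def to_original_coords(box, scale):
    """box = [xmin, ymin, xmax, ymax] in sub-image pixels; scale is the
    enlargement factor used to create that sub-image."""
    return [coord / scale for coord in box]
```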
Step S108: performing non-maximum suppression (NMS) processing on the multiple candidate text detection boxes of the same character across all sub-text images, filtering the processed candidate text detection boxes, and determining effective text detection boxes.
NMS (Non-Maximum Suppression) searches for local maxima and suppresses non-maximum elements; in object detection it is used to extract the window with the highest score. In this embodiment specifically, NMS is used to extract the optimal candidate text detection box for each character, each character being treated as one target object.
In some cases, the detection results of the multiple sub-text images may include non-character detection boxes. Therefore, after NMS processing, the remaining candidate text detection boxes are filtered so as to remove invalid candidates and determine the final valid text detection boxes.
Step S110: performing text recognition on the text image to be detected based on the valid text detection boxes to obtain a text recognition result.
After the valid text detection boxes are determined, the characters in them can be recognized; that is, the text image is subjected to text recognition, yielding a text recognition result such as the text data of a question.
According to this embodiment, when performing text detection and recognition, a multi-scale transformation is first applied to the text image to be detected to generate a plurality of sub-text images of different sizes; then each sub-text image is detected by a convolutional neural network model to obtain a candidate text detection box for each character in each sub-text image, so that each character has multiple corresponding candidate text detection boxes across the multiple sub-text images; further, all candidate text detection boxes corresponding to each character are processed and filtered with non-maximum suppression, so that the effective text detection boxes used for detection can finally be determined; and text recognition is performed based on the effective text detection boxes to obtain a text recognition result. In this scheme, on the one hand, generating multiple sub-text images of different sizes from the text image to be detected effectively improves the text detection rate; on the other hand, text detection is performed by a convolutional neural network model, which extracts features directly from the image without compressing the image information, so features can be extracted effectively regardless of whether image quality is good or poor. Furthermore, after the convolutional neural network model extracts features and outputs candidate text detection boxes, non-maximum suppression and filtering are applied to them to determine the most effective candidate text detection boxes for text recognition. The accuracy of text detection and recognition on text images is thereby greatly improved.
The text recognition method of the embodiment can be implemented by any suitable device or apparatus with data processing function, including but not limited to various terminals and servers.
Example two
Referring to fig. 2, a flowchart illustrating steps of a text recognition method according to a second embodiment of the present invention is shown.
The text recognition method of the embodiment comprises the following steps:
step S202: and training a convolutional neural network model for text detection of the text image.
The step is an optional step, and as described above, in actual use, a convolutional neural network model trained by a third party can also be directly used for text detection.
For training, a large batch of text images can be generated automatically from existing question text data using tools such as FreeType and Pygame, and the diversity of the text images can be enhanced using one or more techniques such as image blurring, geometric transformations (image rotation, scaling, warping, and the like), image contrast transformations, and image noise pollution. In this way, a large batch of sample images for training the convolutional neural network model is acquired.
In the embodiment of the present invention, the convolutional neural network model adopts the VGG16 network and, optionally, an SSD network model structure built on VGG16. SSD (Single Shot MultiBox Detector) is a detection framework for object detection that achieves accurate detection at a fast detection speed. In this embodiment, the SSD network model mainly includes a feature extraction part and a text box prediction part: the feature extraction part extracts features from the text image, the text box prediction part predicts text boxes from different convolution layers on top of the feature extraction part, and the loss function of the SSD network model is composed of a text/non-text classification term and a position-offset term.
It should be noted that, unlike conventional SSD training, which scales the sample image to a fixed size, in this embodiment the sample image is cropped to the fixed size, ensuring that the sample image is not scaled and keeps its original resolution. The SSD network model is trained with the conventional stochastic gradient descent (SGD) algorithm, and parameters such as the learning rate need to be tuned during training.
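Illustratively, the fixed-size crop (as opposed to SSD's usual rescaling) might look like the following minimal sketch; the 300x300 target size is an assumption borrowed from the common SSD300 configuration, not a value given in the text, and the white padding for undersized images is likewise an illustrative choice.

```python
import numpy as np

def random_fixed_crop(image: np.ndarray, crop_h: int = 300, crop_w: int = 300) -> np.ndarray:
    """Cut a crop_h x crop_w window at a random position, preserving resolution.

    Images smaller than the target are padded with white rather than scaled,
    so character strokes keep their original pixel dimensions.
    """
    h, w = image.shape[:2]
    pad_h, pad_w = max(crop_h - h, 0), max(crop_w - w, 0)
    if pad_h or pad_w:
        pad = ((0, pad_h), (0, pad_w)) + ((0, 0),) * (image.ndim - 2)
        image = np.pad(image, pad, mode="constant", constant_values=255)
        h, w = image.shape[:2]
    top = np.random.randint(0, h - crop_h + 1)
    left = np.random.randint(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]
```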
The sample images used in training carry text annotation information, which may be produced by a previously trained annotation model. Since some of these automatic annotations may be erroneous, they are corrected manually; this keeps the annotation accurate while still reducing the annotation workload.
The SSD network model is then tuned on the annotated sample images to improve its generalization ability. If the detection precision of the SSD network model reaches the expected index, training ends; otherwise, the training loop continues.
After the training of the SSD network model is completed, corresponding text detection can be performed on the text image subsequently based on the model.
Step S204: acquiring a text image to be detected.
The text image to be detected contains information of a plurality of characters, where the characters include, but are not limited to, one or more of Chinese characters, numbers, symbols, letters, and the like.
In this embodiment, the text image to be detected is exemplified by a test paper image or an exercise image containing a question.
Step S206: performing a multi-scale transformation on the text image to be detected to obtain a plurality of sub-text images of different sizes.
The multi-scale transformation may be implemented in any appropriate manner according to actual needs, which is not limited by the embodiment of the present invention. For example, the width and height of the text image to be detected may each be magnified by 2 times and by 4 times, so as to improve the subsequent text detection rate.
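A minimal sketch of this step, assuming OpenCV; keeping the original 1x image alongside the 2x and 4x enlargements is an illustrative choice, not something the text mandates.

```python
import cv2

def multi_scale(image, scales=(1.0, 2.0, 4.0)):
    """Return one resized copy of the text image per scale factor."""
    h, w = image.shape[:2]
    return [cv2.resize(image, (int(w * s), int(h * s))) for s in scales]

# sub_images = multi_scale(text_image)
```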
Step S208: performing text detection on each sub-text image by using the convolutional neural network model to obtain a candidate text detection box corresponding to each character in each sub-text image.
In this step, the SSD convolutional neural network model trained in step S202 is used to perform text detection on the multiple sub-text images obtained in step S206, yielding a text detection result for each sub-text image; each detection result is a detection box of a character, that is, a candidate text detection box. Optionally, in this embodiment, the detection result is the upright circumscribed rectangle of the character, that is, a bounding box. A bounding box is an axis-aligned (unrotated) circumscribed rectangle of the character, described by the four coordinates [xmin, ymin, xmax, ymax]; comparing and judging text detection boxes is easier with bounding boxes.
Step S210: performing NMS processing on the multiple candidate text detection boxes of the same character across all sub-text images.
In this embodiment, the NMS processing of all candidate text detection boxes in all sub-text images may include: for the candidate text detection boxes of the same character in the multiple sub-text images, determining the candidate text detection box with the highest score, the score being obtained from the output of the convolutional neural network model; judging, one by one, the overlap rate between the highest-scoring candidate text detection box and each other candidate text detection box; filtering out candidate text detection boxes whose overlap rate with the highest-scoring box exceeds a first set overlap rate; and determining, from the remaining candidate text detection boxes, the one with the highest confidence and filtering out the others. The first set overlap rate can be set by those skilled in the art according to actual requirements.
NMS is an iterate-traverse-eliminate process. For a given target object, NMS sorts all detection boxes by score (the classification score output by the convolutional neural network model) and selects the highest-scoring detection box; it traverses the remaining detection boxes and deletes any whose overlap rate with the current highest-scoring box exceeds a certain threshold; it then selects the highest-scoring box among the still-unprocessed detection boxes and repeats the process, finally obtaining the optimal detection boxes.
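The NMS loop just described can be sketched as follows. This is a minimal illustration, assuming boxes are NumPy arrays of [xmin, ymin, xmax, ymax] in a shared coordinate system and scores come from the detector; the 0.5 default threshold and the Jaccard (intersection-over-union) overlap follow the optional values mentioned in the next paragraph.

```python
import numpy as np

def jaccard_overlap(box, boxes):
    """Intersection-over-union of one box against an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + areas - inter + 1e-9)

def nms(boxes: np.ndarray, scores: np.ndarray, overlap_thresh: float = 0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = scores.argsort()[::-1]          # indices, best score first
    keep = []
    while order.size:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        overlaps = jaccard_overlap(boxes[best], boxes[rest])
        order = rest[overlaps <= overlap_thresh]
    return keep
```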
Based on this, the present embodiment considers that multiple mutually overlapping detection results (bounding boxes) are generated at the same position, and that the better detection result should be retained while the sub-optimal ones are removed. Therefore, on the one hand, using the overlap rate between bounding boxes, detection results whose overlap rate exceeds a certain threshold need to be suppressed; optionally, the overlap-rate threshold may be 0.5, and the overlap rate may be computed with the Jaccard overlap criterion widely applied in the field of object detection. On the other hand, multiple overlapping detection results are compared by the bounding boxes' confidence scores, and the one with the highest confidence is retained.
Step S212: filtering the candidate text detection boxes processed by the NMS to determine effective text detection boxes.
Optionally, the candidate text detection boxes after processing by the NMS may be filtered according to at least one of an area, a length, and a width of the candidate text detection boxes after processing by the NMS.
For example, the mean area of all NMS-processed candidate text detection boxes is calculated, the absolute difference between each candidate box's area and the mean area is obtained, and the candidate text detection boxes for which the ratio of this absolute difference to the mean area is greater than a first set threshold are filtered out. The first set threshold may be set as appropriate according to actual needs, for example to 0.1.
For another example, the mean length of all NMS-processed candidate text detection boxes is calculated, the absolute difference between each candidate box's length and the mean length is obtained, and the candidate text detection boxes for which the ratio of this absolute difference to the mean length is greater than a second set threshold are filtered out. The second set threshold may likewise be set as appropriate, for example to 0.1.
For another example, the mean width of all NMS-processed candidate text detection boxes is calculated, the absolute difference between each candidate box's width and the mean width is obtained, and the candidate text detection boxes for which the ratio of this absolute difference to the mean width is greater than a third set threshold are filtered out. The third set threshold may likewise be set as appropriate, for example to 0.1.
The above-mentioned modes can be used alternatively or in combination, and the first, second and third setting thresholds may be the same or different.
In this way, spurious bounding boxes are filtered out using geometric-feature constraints, which improves the efficiency and accuracy of text detection.
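A minimal sketch of such a geometric filter, again assuming NumPy boxes of [xmin, ymin, xmax, ymax]; "length" and "width" are taken here as the boxes' horizontal and vertical extents, and a single shared 0.1 threshold stands in for the first, second, and third set thresholds, which the text allows to differ.

```python
import numpy as np

def filter_by_geometry(boxes: np.ndarray, thresh: float = 0.1) -> np.ndarray:
    """Drop boxes whose area, horizontal extent, or vertical extent deviates
    from the respective mean by more than `thresh` times that mean."""
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    keep = np.ones(len(boxes), dtype=bool)
    for feature in (widths * heights, widths, heights):
        mean = feature.mean()
        keep &= np.abs(feature - mean) / mean <= thresh
    return boxes[keep]
```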
Step S214: based on the effective text detection box, performing text structure analysis on a plurality of characters in the text image to be detected; and obtaining at least one text line according to the text structure analysis result.
Wherein each text line comprises at least one valid text detection box.
In practical use, for example in a photo-based question search scenario, the text image to be detected is usually photographed by the user, so various non-standard situations such as text distortion and deformation may occur. The characters in such a text image can be corrected through text structure analysis for subsequent correct processing. It should be noted that this step is optional: in scenarios where text images are more regular, steps S214 and S216 may be omitted.
In this embodiment, after the text structure of the text image to be detected is analyzed, the relationship between the effective text detection boxes corresponding to the respective characters may be determined, and one or more text lines are generated according to the relationship, where each text line includes one or more characters, and correspondingly, each character corresponds to one effective text detection box.
In an alternative, performing text structure analysis on a plurality of characters in a text image to be detected based on a valid text detection box may include: determining the horizontal position relation among the effective text detection boxes according to the horizontal coordinates of the central points of the effective text detection boxes; according to the horizontal position relation, judging whether a corresponding text line exists or not for each effective text detection box; if yes, adding the corresponding text line; if not, a text line is newly established; generating a plurality of text line groups by using a plurality of text lines with the overlapping rate in the vertical direction meeting a second set overlapping rate; and determining the upper and lower position relation among the text line groups according to the vertical coordinates of the text line groups. The second set overlap ratio may be set by a person skilled in the art according to actual requirements, and the embodiment of the present invention is not limited thereto.
For each effective text detection box, judging whether a corresponding text line exists may proceed as follows: judging whether any text line currently exists; if at least one text line exists, acquiring the last effective text detection box (by horizontal position) of each text line; among the text lines whose last box has a vertical overlap rate with the current effective text detection box greater than a third set overlap rate, determining the text line whose last box is horizontally closest to the current box as the text line corresponding to the current effective text detection box; and if no text line currently exists, creating a new text line. The third set overlap rate may be set by a person skilled in the art according to actual requirements, which is not limited by the embodiment of the present invention.
Step S216: performing text rectification on the text lines.
The method comprises the following steps: determining the height average value of all effective text detection boxes in each text line; calculating the difference value between the height of each effective text detection box in the text line and the average value of the heights; taking the effective text detection box with the difference value larger than the fourth set threshold as a segmentation point to segment the text line into a smooth text line and a non-smooth text line; and rectifying the non-smooth text line. The fourth setting threshold may be set by a person skilled in the art according to actual requirements, and the embodiment of the present invention is not limited thereto.
Hereinafter, the text structure analysis of step S214 and the text rectification of step S216 are described using bounding boxes as an example.
The process is as follows:
1) Bounding box ordering.
The bounding boxes are sorted by the horizontal coordinate x of their center points; ordering x from small to large determines the horizontal positional relationship among the bounding boxes, i.e., their left-to-right order in the text image.
2) Text line generation.
The input to this process is a queue of bounding boxes, and the output is a sequence of text lines formed from those bounding boxes.
Let a text line be represented by line (several bounding boxes in the same row form a line), and let the collection of text lines be represented by lines. First, a bounding box box is dequeued from the queue, and it is judged in turn whether box is in the same row as each line in lines (lines is initially empty). If no valid same-row line exists, a new line is generated, box is added to the line, and the line is added to lines; if valid same-row lines exist, box is added to the optimal one among them.
When judging whether a valid same-row line exists, it may be checked whether the current box and the last box of a line satisfy two conditions: a high overlap rate in the vertical direction and a small distance in the horizontal direction. Further, among all lines satisfying these same-row conditions, the line with the smallest horizontal distance to the current box may be determined as the optimal same-row line, and the current box is added to it.
A schematic of boxes and lines generated is shown in FIG. 3, where solid line boxes represent boxes and dashed line boxes represent lines. As can be seen in FIG. 3, for a more canonical text image, a line may include an entire line of boxes.
In addition, in this example, the overlap rate between two boxes in the vertical direction may be calculated as:
Overlap = 1 - D / (H1/2 + H2/2),
where D is the vertical distance between the center points of the two boxes, and H1 and H2 are the heights of the two boxes, as shown in FIG. 4.
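A direct transcription of this overlap measure, plus the same-row test of step 2) built on it; the 0.7 minimum overlap and the one-character-width gap limit are illustrative assumptions, since the text leaves these thresholds open.

```python
def vertical_overlap(box_a, box_b):
    """Overlap = 1 - D / (H1/2 + H2/2); equals 1 for identical vertical spans
    and drops to <= 0 once the two boxes no longer overlap vertically."""
    h1, h2 = box_a[3] - box_a[1], box_b[3] - box_b[1]
    d = abs((box_a[1] + box_a[3]) / 2 - (box_b[1] + box_b[3]) / 2)
    return 1 - d / (h1 / 2 + h2 / 2)

def same_row(prev_box, box, min_overlap=0.7):
    """Same-row test of step 2): high vertical overlap, small horizontal gap."""
    gap = box[0] - prev_box[2]              # distance to the line's last box
    max_gap = box[2] - box[0]               # roughly one character width
    return vertical_overlap(prev_box, box) >= min_overlap and gap <= max_gap
```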
3) Text line group generation.
The purpose of generating text line groups is to group together text lines that are far apart horizontally but have a high vertical overlap rate, which makes the subsequent vertical sorting of text lines convenient. Without grouping, lines that lie in approximately the same row but are far apart horizontally would be ordered incorrectly in the subsequent sorting.
Fig. 5 shows a schematic diagram of a generated text line group, in fig. 5, although the left text line and the right text line are far apart in the horizontal direction, the overlapping rate of the left text line and the right text line in the vertical direction is high, and therefore, the text lines can be grouped into one group.
4) Text line group sorting.
When sorting the text line groups, the y coordinate of the first element of the first line of each group is used as the reference, and the groups are sorted by it from small to large; the lines within each group are sorted from small to large by their x coordinates. This determines the vertical positional relationship among the text line groups.
5) Text line segmentation.
First, the median of the heights of all boxes in the text line is taken as an estimate of the line height. Boxes are then taken out from left to right in turn to form a box group boxes, and the minimum circumscribed rectangle (a rectangle with rotation) min_bounding_box of the group is computed. If the height of min_bounding_box stays close to the estimated line height, the boxes in the group are smoothly distributed; if it exceeds the estimate by a certain threshold, the current position is set as a segmentation point and the line is split there. This process is repeated until all boxes of the line have been taken out. A segmented text line is shown in FIG. 6.
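A hedged sketch of this segmentation step, assuming OpenCV and boxes already sorted left to right; the 1.5x ratio used as the split threshold is an assumption, as the text only says "a certain threshold".

```python
import numpy as np
import cv2

def split_text_line(boxes, ratio_thresh=1.5):
    """Walk boxes left to right; close the current segment whenever the rotated
    min-area rectangle of the accumulated group grows much taller than the
    median single-box height (the non-smooth case described above)."""
    line_height = float(np.median([b[3] - b[1] for b in boxes]))
    segments, group = [], []
    for box in boxes:
        group.append(box)
        corners = np.array([(x, y) for b in group
                            for (x, y) in ((b[0], b[1]), (b[2], b[1]),
                                           (b[2], b[3]), (b[0], b[3]))],
                           dtype=np.float32)
        (_, _), (w, h), _ = cv2.minAreaRect(corners)
        if min(w, h) > ratio_thresh * line_height and len(group) > 1:
            segments.append(group[:-1])     # the smooth part so far
            group = [box]                   # current box starts a new segment
    segments.append(group)
    return segments
```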
6) Text line rectification.
The purpose of text line rectification is to rectify rotated text lines into horizontal text lines. First, the rotated minimum bounding rectangle of the boxes is computed; an affine transformation matrix is then calculated from this minimum bounding rectangle and its upright counterpart, and the text line is warped with this matrix. A rectified text line is shown in FIG. 7.
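A hedged sketch of this rectification step with OpenCV: the rotated minimum bounding rectangle of the line's box corners is estimated with cv2.minAreaRect, and an affine rotation derived from it warps the line horizontal. The angle normalization follows the (-90, 0] convention of older OpenCV versions and may need adjustment for a given OpenCV release.

```python
import numpy as np
import cv2

def rectify_line(image, boxes):
    """Rotate `image` so the text line spanned by `boxes` becomes horizontal."""
    corners = np.array([(x, y) for b in boxes
                        for (x, y) in ((b[0], b[1]), (b[2], b[1]),
                                       (b[2], b[3]), (b[0], b[3]))],
                       dtype=np.float32)
    (cx, cy), (w, h), angle = cv2.minAreaRect(corners)  # rotated min rectangle
    if angle < -45:                 # normalize the (-90, 0] angle convention
        angle += 90
    matrix = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)  # affine rotation
    return cv2.warpAffine(image, matrix, (image.shape[1], image.shape[0]))
```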
Through the text structure analysis, the text line segmentation and the text line correction, the text image under the degradation conditions of text distortion, deformation and the like can be effectively dealt with. Meanwhile, the algorithm complexity is low, and the text detection and identification efficiency can be effectively ensured.
Step S218: performing text recognition on the text image to be detected based on the effective text detection boxes to obtain a text recognition result.
It should be noted that, if the valid text detection box is processed in the above steps S214 and S216, in this step, based on the processed valid text detection box, that is, the corrected valid text detection box, text recognition is performed on the text image to be detected, so as to obtain a text recognition result.
After the text recognition result is obtained, the question database can be queried according to the result so as to obtain answers and/or analysis of the questions.
According to this embodiment, when performing text detection and recognition, a multi-scale transformation is first applied to the text image to be detected to generate a plurality of sub-text images of different sizes; then each sub-text image is detected by a convolutional neural network model to obtain a candidate text detection box for each character in each sub-text image, so that each character has multiple corresponding candidate text detection boxes across the multiple sub-text images; further, all candidate text detection boxes are processed and filtered with non-maximum suppression, so that the effective text detection boxes used for detection can finally be determined; and text recognition is performed based on the effective text detection boxes to obtain a text recognition result. In this scheme, on the one hand, generating multiple sub-text images of different sizes from the text image to be detected effectively improves the text detection rate; on the other hand, text detection is performed by a convolutional neural network model, which extracts features directly from the image without compressing the image information, so features can be extracted effectively regardless of whether image quality is good or poor. Furthermore, after the convolutional neural network model extracts features and outputs candidate text detection boxes, non-maximum suppression and filtering are applied to them to determine the most effective candidate text detection boxes for text recognition. The accuracy of text detection and recognition on text images is thereby greatly improved.
The text recognition method of the embodiment can be implemented by any suitable device or apparatus with data processing function, including but not limited to various terminals and servers.
Example three
Referring to fig. 8, a block diagram of a text recognition apparatus according to a third embodiment of the present invention is shown.
The text recognition apparatus of the present embodiment includes: a first obtaining module 302, configured to obtain a text image to be detected, where the text image includes information of a plurality of characters; a second obtaining module 304, configured to perform multi-scale transformation on the text image to obtain a plurality of sub-text images with different sizes; a third obtaining module 306, configured to perform text detection on each sub-text image by using a convolutional neural network model, and obtain a candidate text detection box corresponding to each character in each sub-text image; the determining module 308 is configured to perform non-maximum suppression NMS processing on a plurality of candidate text detection boxes of all sub-text images of the same character, filter the processed candidate text detection boxes, and determine an effective text detection box; and the identification module 310 is configured to perform text identification on the text image based on the valid text detection box to obtain a text identification result.
Optionally, the determining module 308 includes: an NMS module 3082, configured to perform non-maximum suppression (NMS) processing on the multiple candidate text detection boxes of the same character across all sub-text images; and a filtering module 3084, configured to filter the processed candidate text detection boxes according to at least one of the area, length, and width of the NMS-processed candidate text detection boxes.
Optionally, the filtering module 3084 includes: the area filtering module is used for calculating the area average value of all the processed candidate text detection boxes, obtaining the absolute value of the difference between the area of each candidate text box and the area average value, and filtering the candidate text detection boxes of which the ratio of the absolute value of the difference to the area average value is greater than a first set threshold value; and/or the length filtering module is used for calculating the length average value of all the processed candidate text detection boxes, obtaining the absolute value of the difference between the length of each candidate text box and the length average value, and filtering the candidate text detection boxes of which the ratio of the absolute value of the difference to the length average value is greater than a second set threshold value; and/or the width filtering module is used for calculating the width average value of all the processed candidate text detection boxes, obtaining the absolute value of the difference between the width of each candidate text box and the width average value, and filtering the candidate text detection boxes of which the ratio of the absolute value of the difference to the width average value is greater than a third set threshold value.
Optionally, the candidate text detection box is the upright circumscribed rectangular box of each character.
Optionally, the NMS module 3082 is configured to determine, for multiple candidate text detection boxes in multiple sub-text images corresponding to the same character, a candidate text detection box with a highest score among the multiple candidate text detection boxes, where the score is obtained through an output result of the convolutional neural network model; judging the overlapping rate of the candidate text detection box with the highest score and other candidate text detection boxes one by one; filtering out candidate text detection boxes with the overlapping rate of the candidate text detection boxes with the highest score exceeding a first set overlapping rate; and determining the candidate text detection box with the highest confidence coefficient from the rest candidate text detection boxes, and filtering out other candidate text detection boxes.
Optionally, the text recognition apparatus of this embodiment further includes: a structure analysis module 312, configured to perform text structure analysis on a plurality of characters in the text image based on the valid text detection box before the recognition module 310 performs text recognition on the text image based on the valid text detection box to obtain a text recognition result; the fourth obtaining module 314 is configured to obtain at least one text line according to the text structure analysis result, where each text line includes at least one valid text detection box.
Optionally, the structure analysis module 312 is configured to determine a horizontal position relationship between the effective text detection boxes according to the horizontal coordinate of the central point of each effective text detection box; according to the horizontal position relation, judging whether a corresponding text line exists or not for each effective text detection box; if yes, adding the corresponding text line; if not, a text line is newly established; generating a plurality of text line groups by using a plurality of text lines with the overlapping rate in the vertical direction meeting a second set overlapping rate; and determining the upper and lower position relation among the text line groups according to the vertical coordinates of the text line groups.
Optionally, the text recognition apparatus of this embodiment further includes: a rectification module 316, configured to determine, for each text line, a height average of all valid text detection boxes in the text line; calculating the difference value between the height of each effective text detection box in the text line and the average value of the heights; taking the effective text detection box with the difference value larger than the fourth set threshold as a segmentation point to segment the text line into a smooth text line and a non-smooth text line; and rectifying the non-smooth text line.
Optionally, the structure analysis module 312 determines whether there is a text line currently when determining whether there is a corresponding text line for each valid text detection box; if at least one text line exists, acquiring the last effective text detection box of the horizontal position of each text line; determining a text line corresponding to the effective text detection box with the smallest distance in the horizontal direction with the current effective text detection box as a text line corresponding to the current effective text detection box, wherein the overlapping rate of the current effective text detection box in the vertical direction is greater than a third set overlapping rate; and if the text line does not exist at present, a new text line is created.
The text recognition apparatus of this embodiment is used to implement the corresponding text recognition method in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the technical solutions above, or the portions of them that contribute to the prior art, may be embodied in the form of a software product stored on a computer-readable storage medium, where such a medium includes any mechanism for storing or transmitting information in a machine-readable form. For example, a machine-readable medium includes read-only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash storage media, and electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals), among others. The computer software product includes instructions for causing a computing device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the embodiments of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus (device), or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (7)

1. A text recognition method, comprising:
acquiring a text image to be detected, wherein the text image contains information of a plurality of characters;
performing multi-scale transformation on the text image to obtain a plurality of sub-text images with different sizes;
performing text detection on each sub-text image by using a convolutional neural network model to obtain a candidate text detection box corresponding to each character in each sub-text image, wherein the convolutional neural network model is an SSD network model, and the SSD network model crops the sample image to a fixed size in the training process;
carrying out non-maximum suppression NMS processing on a plurality of candidate text detection boxes of all sub-text images of the same character, and filtering the processed candidate text detection boxes to determine effective text detection boxes;
performing text recognition on the text image based on the effective text detection box to obtain a text recognition result;
before the text recognition is performed on the text image based on the valid text detection box to obtain a text recognition result, the method further includes:
performing text structure analysis on a plurality of characters in the text image based on the valid text detection box;
obtaining at least one text line according to the text structure analysis result, wherein each text line comprises at least one effective text detection box;
wherein the performing text structure analysis on the plurality of characters in the text image based on the valid text detection box comprises:
determining the horizontal position relation among the effective text detection boxes according to the horizontal coordinates of the central points of the effective text detection boxes;
according to the horizontal position relation, judging whether a corresponding text line exists or not for each effective text detection box; if yes, adding the corresponding text line; if not, a text line is newly established;
generating a plurality of text line groups by using a plurality of text lines with the overlapping rate in the vertical direction meeting a second set overlapping rate;
determining the upper and lower position relation among the text line groups according to the vertical coordinates of the text line groups;
wherein the method further comprises:
determining the height average value of all effective text detection boxes in each text line;
calculating the difference value between the height of each effective text detection box in the text line and the average value of the heights;
taking the effective text detection box with the difference value larger than the fourth set threshold as a segmentation point to segment the text line into a smooth text line and a non-smooth text line;
and rectifying the non-smooth text line.
2. The method of claim 1, wherein the filtering the processed candidate text detection box comprises:
and filtering the processed candidate text detection box according to at least one of the area, the length and the width of the processed candidate text detection box.
3. The method according to claim 2, wherein the filtering the processed candidate text detection boxes according to at least one of the area, the length, and the width of the processed candidate text detection boxes comprises:
calculating the average area of all the processed candidate text detection boxes, obtaining the absolute value of the difference between the area of each candidate text detection box and the average area, and filtering out the candidate text detection boxes for which the ratio of that absolute difference to the average area is greater than a first set threshold;
and/or,
calculating the average length of all the processed candidate text detection boxes, obtaining the absolute value of the difference between the length of each candidate text detection box and the average length, and filtering out the candidate text detection boxes for which the ratio of that absolute difference to the average length is greater than a second set threshold;
and/or,
calculating the average width of all the processed candidate text detection boxes, obtaining the absolute value of the difference between the width of each candidate text detection box and the average width, and filtering out the candidate text detection boxes for which the ratio of that absolute difference to the average width is greater than a third set threshold.
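A non-limiting Python sketch of this relative-deviation filter follows. Taking w as the "length" and h as the "width" of a box is an assumption about the patent's wording, as are the function and parameter names; the three thresholds play the role of the first, second, and third set thresholds.

```python
def filter_boxes(boxes, area_thr=None, length_thr=None, width_thr=None):
    """Keep only boxes whose area/length/width stay close to the mean.

    boxes: list of (x, y, w, h) candidate text detection boxes.
    A threshold of None skips that test, mirroring the claim's "and/or".
    """
    def deviates(value, avg, thr):
        return abs(value - avg) / avg > thr   # relative-deviation test

    avg_area = sum(w * h for (_, _, w, h) in boxes) / len(boxes)
    avg_len = sum(w for (_, _, w, _) in boxes) / len(boxes)
    avg_wid = sum(h for (_, _, _, h) in boxes) / len(boxes)

    kept = []
    for (x, y, w, h) in boxes:
        if area_thr is not None and deviates(w * h, avg_area, area_thr):
            continue                          # area too far from the mean
        if length_thr is not None and deviates(w, avg_len, length_thr):
            continue
        if width_thr is not None and deviates(h, avg_wid, width_thr):
            continue
        kept.append((x, y, w, h))
    return kept
```

Normalizing by the mean makes each threshold a scale-free fraction (e.g., 0.5 removes boxes deviating by more than 50%), which is one natural way to read the "ratio of the absolute difference to the average" in the claim.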
4. The method according to any one of claims 1-3, wherein each candidate text detection box is a circumscribed rectangular box (bounding box) of a character.
5. The method according to any one of claims 1-3, wherein the performing non-maximum suppression (NMS) on the plurality of candidate text detection boxes of all sub-text images of the same character comprises:
determining, among the candidate text detection boxes in the plurality of sub-text images corresponding to the same character, the candidate text detection box with the highest score, wherein the score is obtained from the output of the convolutional neural network model;
checking, one by one, the overlap rate between the highest-scoring candidate text detection box and each of the other candidate text detection boxes;
filtering out the candidate text detection boxes whose overlap rate with the highest-scoring candidate text detection box exceeds a first set overlap rate; and
determining the candidate text detection box with the highest confidence among the remaining candidate text detection boxes, and filtering out the other candidate text detection boxes.
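For reference, a minimal Python sketch of this per-character NMS step, assuming axis-aligned (x, y, w, h) boxes and intersection-over-union as the overlap rate; the claim does not fix the overlap measure, and treating "score" and "confidence" as two separate model outputs is likewise an assumption based on its wording.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms_per_character(candidates, overlap_thr):
    """candidates: list of (box, score, confidence) for one character across
    all sub-text images; overlap_thr is the first set overlap rate.
    Returns the single box kept for this character.
    """
    best = max(candidates, key=lambda c: c[1])            # highest score
    survivors = [c for c in candidates
                 if c is best or iou(best[0], c[0]) <= overlap_thr]
    return max(survivors, key=lambda c: c[2])[0]          # highest confidence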
6. The method of claim 1, wherein the judging whether a corresponding text line exists for each valid text detection box comprises:
judging whether any text line currently exists;
if at least one text line exists, obtaining, for each text line, the valid text detection box at its last horizontal position;
determining, among the text lines whose overlap rate with the current valid text detection box in the vertical direction is greater than a third set overlap rate, the text line whose last valid text detection box is closest to the current valid text detection box in the horizontal direction as the text line corresponding to the current valid text detection box; and
if no text line currently exists, creating a new text line.
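An illustrative Python sketch of this line-assignment rule is given below. Measuring the vertical overlap against the line's last box, normalizing it by the smaller box height, and taking the horizontal distance from that box's right edge are all assumptions; the claim does not pin down these definitions.

```python
def vertical_overlap(a, b):
    """Overlap of the y-extents of two (x, y, w, h) boxes,
    normalized by the smaller box height (one plausible definition)."""
    top = max(a[1], b[1])
    bottom = min(a[1] + a[3], b[1] + b[3])
    return max(0.0, bottom - top) / min(a[3], b[3])

def assign_to_line(box, lines, overlap_thr):
    """Append box to the best existing text line, or start a new one.

    lines: list of lists of (x, y, w, h) boxes, each ordered left to right;
    overlap_thr is the third set overlap rate of claim 6.
    """
    candidates = [line for line in lines
                  if vertical_overlap(line[-1], box) > overlap_thr]
    if not candidates:
        lines.append([box])            # no matching line: create a new one
        return
    # Pick the line whose last box is horizontally closest to this box.
    best = min(candidates,
               key=lambda line: abs(box[0] - (line[-1][0] + line[-1][2])))
    best.append(box)
```

Feeding the valid boxes to assign_to_line in left-to-right order of their center x-coordinates reproduces the horizontal position relationship established in claim 1.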
7. A text recognition apparatus, comprising:
a first acquisition module, configured to acquire a text image to be detected, wherein the text image contains information of a plurality of characters;
a second acquisition module, configured to perform multi-scale transformation on the text image to obtain a plurality of sub-text images of different sizes;
a third acquisition module, configured to perform text detection on each sub-text image by using a convolutional neural network model to obtain a candidate text detection box corresponding to each character in each sub-text image, wherein the convolutional neural network model is an SSD network model, and the SSD network model crops sample images to a fixed size during training;
a determining module, configured to perform non-maximum suppression (NMS) on the plurality of candidate text detection boxes of all sub-text images of the same character, filter the processed candidate text detection boxes, and determine valid text detection boxes; and
a recognition module, configured to perform text recognition on the text image based on the valid text detection boxes to obtain a text recognition result;
wherein the text recognition apparatus further comprises:
a structure analysis module, configured to perform text structure analysis on the plurality of characters in the text image based on the valid text detection boxes, before the recognition module performs text recognition on the text image based on the valid text detection boxes to obtain the text recognition result; and
a fourth acquisition module, configured to obtain at least one text line according to the result of the text structure analysis, wherein each text line comprises at least one valid text detection box;
wherein the structure analysis module is further configured to: determine the horizontal position relationship among the valid text detection boxes according to the horizontal coordinates of their center points; judge, for each valid text detection box and according to the horizontal position relationship, whether a corresponding text line exists; if so, add the box to that text line; if not, create a new text line; group text lines whose overlap rate in the vertical direction satisfies a second set overlap rate into a plurality of text line groups; and determine the top-to-bottom order of the text line groups according to their vertical coordinates;
wherein the text recognition apparatus further comprises a rectification module, configured to: determine the average height of all valid text detection boxes in each text line; calculate the difference between the height of each valid text detection box in the text line and the average height; take each valid text detection box whose difference exceeds a fourth set threshold as a segmentation point to segment the text line into smooth text lines and non-smooth text lines; and rectify the non-smooth text lines.
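Finally, a minimal sketch of the multi-scale transformation performed by the second acquisition module, assuming a simple image pyramid over a fixed set of scale factors; the claims do not enumerate the scales, so the scale set, function name, and the use of OpenCV's cv2.resize here are all illustrative choices.

```python
import cv2  # OpenCV, used only for illustration

def multi_scale_transform(image, scales=(0.5, 0.75, 1.0, 1.5, 2.0)):
    """Return sub-text images of different sizes for multi-scale detection.

    image: an H x W (x C) numpy array; scales: assumed scale factors.
    """
    h, w = image.shape[:2]
    return [cv2.resize(image, (max(1, int(w * s)), max(1, int(h * s))))
            for s in scales]
```

Running the detector on each scale and mapping the resulting boxes back to the original image coordinates is what produces, for a single character, the multiple candidate boxes that the NMS step of the determining module then reduces to one valid box.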
CN201810004874.2A 2018-01-03 2018-01-03 Text recognition method and device Active CN109993040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810004874.2A CN109993040B (en) 2018-01-03 2018-01-03 Text recognition method and device

Publications (2)

Publication Number Publication Date
CN109993040A CN109993040A (en) 2019-07-09
CN109993040B true CN109993040B (en) 2021-07-30

Family

ID=67128429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810004874.2A Active CN109993040B (en) 2018-01-03 2018-01-03 Text recognition method and device

Country Status (1)

Country Link
CN (1) CN109993040B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414499B (en) * 2019-07-26 2021-06-04 第四范式(北京)技术有限公司 Text position positioning method and system and model training method and system
CN110489747A (en) * 2019-07-31 2019-11-22 北京大米科技有限公司 A kind of image processing method, device, storage medium and electronic equipment
CN110458170A (en) * 2019-08-06 2019-11-15 汕头大学 Chinese character positioning and recognition methods in a kind of very noisy complex background image
CN110619333B (en) * 2019-08-15 2022-06-14 平安国际智慧城市科技股份有限公司 Text line segmentation method, text line segmentation device and electronic equipment
CN110598566A (en) * 2019-08-16 2019-12-20 深圳中兴网信科技有限公司 Image processing method, device, terminal and computer readable storage medium
CN110705560A (en) * 2019-10-14 2020-01-17 上海眼控科技股份有限公司 Tire text acquisition method and device and tire specification detection method
CN111291752A (en) * 2020-01-22 2020-06-16 山东浪潮通软信息科技有限公司 Invoice identification method, equipment and medium
CN113536826A (en) * 2020-04-13 2021-10-22 富士通株式会社 Method, apparatus, and computer-readable storage medium for recognizing object in image
CN111539438B (en) * 2020-04-28 2024-01-12 北京百度网讯科技有限公司 Text content identification method and device and electronic equipment
CN111476210B (en) * 2020-05-11 2021-03-30 上海西井信息科技有限公司 Image-based text recognition method, system, device and storage medium
CN111898411B (en) * 2020-06-16 2021-08-31 华南理工大学 Text image labeling system, method, computer device and storage medium
CN111738263A (en) * 2020-08-24 2020-10-02 北京易真学思教育科技有限公司 Target detection method and device, electronic equipment and storage medium
RU2768211C1 (en) * 2020-11-23 2022-03-23 Общество с ограниченной ответственностью "Аби Продакшн" Optical character recognition by means of combination of neural network models
CN113591829B (en) * 2021-05-25 2024-02-13 上海一谈网络科技有限公司 Character recognition method, device, equipment and storage medium

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102054271A (en) * 2009-11-02 2011-05-11 富士通株式会社 Text line detection method and device
CN102542268A (en) * 2011-12-29 2012-07-04 中国科学院自动化研究所 Method for detecting and positioning text area in video
CN103856689A (en) * 2013-10-31 2014-06-11 北京中科模识科技有限公司 Character dialogue subtitle extraction method oriented to news video
CN103942550A (en) * 2014-05-04 2014-07-23 厦门大学 Scene text recognition method based on sparse coding characteristics
CN104182750A (en) * 2014-07-14 2014-12-03 上海交通大学 Extremum connected domain based Chinese character detection method in natural scene image
CN105184292A (en) * 2015-08-26 2015-12-23 北京云江科技有限公司 Method for analyzing and recognizing structure of handwritten mathematical formula in natural scene image
CN105608454A (en) * 2015-12-21 2016-05-25 上海交通大学 Text structure part detection neural network based text detection method and system
CN105631426A (en) * 2015-12-29 2016-06-01 中国科学院深圳先进技术研究院 Image text detection method and device
CN106384112A (en) * 2016-09-08 2017-02-08 西安电子科技大学 Rapid image text detection method based on multi-channel and multi-dimensional cascade filter
CN106570497A (en) * 2016-10-08 2017-04-19 中国科学院深圳先进技术研究院 Text detection method and device for scene image
CN106980858A (en) * 2017-02-28 2017-07-25 中国科学院信息工程研究所 The language text detection of a kind of language text detection with alignment system and the application system and localization method
CN107016417A (en) * 2017-03-28 2017-08-04 青岛伟东云教育集团有限公司 A kind of method and device of character recognition
CN107180239A (en) * 2017-06-09 2017-09-19 科大讯飞股份有限公司 Line of text recognition methods and system

Also Published As

Publication number Publication date
CN109993040A (en) 2019-07-09

Similar Documents

Publication Publication Date Title
CN109993040B (en) Text recognition method and device
US10896349B2 (en) Text detection method and apparatus, and storage medium
CN111476067B (en) Character recognition method and device for image, electronic equipment and readable storage medium
CN109241894B (en) Bill content identification system and method based on form positioning and deep learning
WO2020252917A1 (en) Fuzzy face image recognition method and apparatus, terminal device, and medium
CN108805076B (en) Method and system for extracting table characters of environmental impact evaluation report
CN109241861B (en) Mathematical formula identification method, device, equipment and storage medium
CN111401171B (en) Face image recognition method and device, electronic equipment and storage medium
CN112487848B (en) Character recognition method and terminal equipment
RU2697649C1 (en) Methods and systems of document segmentation
CN108734159B (en) Method and system for detecting sensitive information in image
CN109360179B (en) Image fusion method and device and readable storage medium
CN110674744A (en) Age identification method and device and electronic equipment
CN109815823B (en) Data processing method and related product
CN110929635A (en) False face video detection method and system based on face cross-over ratio under trust mechanism
CN112686258A (en) Physical examination report information structuring method and device, readable storage medium and terminal
CN111340023A (en) Text recognition method and device, electronic equipment and storage medium
CN111652140A (en) Method, device, equipment and medium for accurately segmenting questions based on deep learning
CN116152824A (en) Invoice information extraction method and system
CN111368632A (en) Signature identification method and device
CN113657370B (en) Character recognition method and related equipment thereof
CN111160107A (en) Dynamic region detection method based on feature matching
CN113963149A (en) Medical bill picture fuzzy judgment method, system, equipment and medium
CN116433494B (en) File scanning image automatic correction and trimming method based on deep learning
CN112686219A (en) Handwritten text recognition method and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant