CN110309769B - Method for segmenting character strings in picture


Info

Publication number: CN110309769B (granted 2021-06-15); earlier publication: CN110309769A (published 2019-10-08)
Application number: CN201910576925.3A (filed 2019-06-28; priority date 2019-06-28)
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 张春红, 胡铮, 邵文良
Assignee (current and original): Beijing University of Posts and Telecommunications
Legal status: Expired - Fee Related

Classifications

    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/24147: Distances to closest patterns, e.g. nearest neighbour classification
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Combinations of networks
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06V10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V30/413: Classification of content, e.g. text, photographs or tables
    • G06V30/10: Character recognition


Abstract

The invention discloses a method for segmenting character strings in a picture, belonging to the field of computer vision. First, a number of character-string pictures are collected and divided into training samples and test samples, and each training sample is preprocessed to obtain a set of corresponding sub-pictures; each sub-picture of each training sample is then labeled as a sequence in the IOBES scheme. Next, a combined bidirectional long short-term memory neural network and conditional random field model is trained on the labeled samples for sequence labeling. At test time, a test sample is fed into the trained model to obtain the highest-scoring label sequence. Finally, the highest-scoring label sequence is used to place the segmentation lines that split the test sample. The invention avoids the manual threshold setting required when segmenting with rule-based algorithms such as the projection method, needs no other prior knowledge, and is easy to port.

Description

Method for segmenting character strings in picture
Technical Field
The invention belongs to the field of computer vision, relates to image page segmentation, and particularly relates to a method for segmenting character strings in a picture.
Background
Character string segmentation belongs to the field of text detection in computer vision. In most image text detection tasks, detection targets natural scenes, where it can be treated as an object detection task and handled with conventional object detection algorithms. However, the image text encountered in many text detection settings often occupies a single line in the image, as with license plate numbers, house numbers, or table text. In such tasks, characters are generally recognized by first detecting the text region and then recognizing the characters within it.
For the laboratory sheet table shown in fig. 1, detecting the characters in the table generally requires splitting the table into rows and then into columns. A common approach is to detect the characters directly with an object detection model such as SSD or Faster R-CNN. However, detection-based algorithms easily miss targets in character-dense tasks such as tables; moreover, the characters in a table are arranged with some regularity, so a simpler method can achieve a better result.
In the prior art, table text detection is generally realized by image page segmentation: a page segmentation algorithm repeatedly splits the picture to obtain a series of image regions containing characters. The most common page segmentation algorithm is the projection method, as disclosed in Document 1 (License plate character segmentation algorithm based on projection feature values [J]. Application Research of Computers, 2006, 23(7)), which is mainly used for routine horizontal segmentation tasks such as license plate character segmentation. The image is first binarized and normalized to black background with white characters; the sum of the pixel values in each row or column is then computed, and several thresholds chosen from prior knowledge are used to find reasonable split points.
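For illustration, a minimal sketch of this column-projection idea, assuming a NumPy array holding the binarized image (black background 0, white text 255); the threshold parameter stands in for the hand-tuned thresholds the method requires and is not a value from the text:

```python
import numpy as np

def projection_cut_columns(binary_img: np.ndarray, threshold: int = 0) -> list:
    """Return the column indices whose white-pixel count is <= threshold.

    Runs of such columns are the candidate split points between characters.
    """
    column_sums = (binary_img > 0).sum(axis=0)   # white pixels per column
    return [i for i, s in enumerate(column_sums) if s <= threshold]
```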
Although this method works well on images with a regular format and clear strokes, such as license plates, many OCR tasks involve text pictures with no regular layout, making it difficult to find split points with any fixed threshold. In a typical OCR task, many hand-crafted rules must be added, based on the characteristics of the data, to refine the segmentation result. Such methods usually require a large amount of prior knowledge and lead to very unwieldy systems. A general solution is therefore urgently needed: a machine learning method trained on a given data set, so that the model learns the characteristics of the split points automatically and the heavy labor cost of searching for prior rules is avoided.
Disclosure of Invention
Aiming at these problems, the invention adopts a sequence labeling approach: the rows and columns of the image are labeled so that a model can predict the segmentation lines of the character regions. This achieves higher accuracy than rule-based algorithms, yields a general model, removes threshold parameters, requires no prior knowledge, and demonstrates the value of introducing sequence labeling into the computer vision field. Specifically, the invention is a method for segmenting character strings in a picture.
The method comprises the following steps:
Step one, collecting a number of character-string pictures and dividing them into training samples and test samples;
Step two, preprocessing each training sample to obtain a set of sub-pictures corresponding to that sample;
The preprocessing is: first binarize the picture and scale it to a height of 25 pixels; then group every 5 adjacent columns of pixel points into one sub-picture, so that each sub-picture has dimension 5 × 25 = 125.
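A minimal sketch of this preprocessing, assuming OpenCV and NumPy are available; the Otsu binarization and the file-path input are illustrative choices, not fixed by the text:

```python
import cv2
import numpy as np

def preprocess(image_path: str) -> np.ndarray:
    """Binarize, scale to height 25, and cut into 5-column sub-pictures."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Otsu's threshold as one way to get a black-background/white-text image.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    h, w = binary.shape
    binary = cv2.resize(binary, (max(5, round(w * 25 / h)), 25))
    n = binary.shape[1] // 5                 # number of complete sub-pictures
    # Each 25x5 slice is flattened to the 125-dimensional vector described above.
    return np.stack([binary[:, 5 * i:5 * (i + 1)].reshape(125) for i in range(n)])
```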
Step three, for each training sample, labeling each of its sub-pictures as a sequence in the IOBES scheme.
The IOBES labels are: if the sub-picture input is the beginning of a text region, it is labeled B; if it is inside a text region, it is labeled I; if it is the end of a text region, it is labeled E; if it forms a text region on its own, it is labeled S; and if it does not belong to any text region, it is labeled O.
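As one illustration, a hypothetical helper that derives these labels from a boolean mask indicating which sub-pictures overlap a ground-truth text region (how that mask is built from the annotations is assumed):

```python
def iobes_labels(is_text: list) -> list:
    """Map a per-sub-picture text/non-text mask to IOBES labels."""
    labels, n = [], len(is_text)
    for i, t in enumerate(is_text):
        if not t:
            labels.append("O")
            continue
        prev = i > 0 and is_text[i - 1]          # previous sub-picture is text
        nxt = i < n - 1 and is_text[i + 1]       # next sub-picture is text
        if not prev and not nxt:
            labels.append("S")                   # single-sub-picture region
        elif not prev:
            labels.append("B")                   # region begins here
        elif not nxt:
            labels.append("E")                   # region ends here
        else:
            labels.append("I")                   # strictly inside a region
    return labels

# iobes_labels([0, 1, 1, 1, 0, 1, 0]) -> ['O', 'B', 'I', 'E', 'O', 'S', 'O']
```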
Step four, training the combined bidirectional long short-term memory neural network and conditional random field model with the sequence-labeled training samples;
The specific steps are as follows:
Step 401, adopting a bidirectional long short-term memory network structure, so that the information of the units before and after the current unit is concatenated.
Step 402, for a training sample, feeding the pixel points of each sub-picture in the sample into the concatenated long short-term memory networks and outputting five probability values, one for each of the IOBES labels of the sub-picture;
Let the set of sub-picture input pixel points be $X = (x_1, x_2, \ldots, x_i, \ldots, x_n)$, where $x_i$ is the $i$-th pixel point of the sub-picture input. The probability values output by the concatenated long short-term memory networks are:

$$P_i = W\left[\overrightarrow{h_i};\,\overleftarrow{h_i}\right]$$

where $W$ is the fully connected layer whose five-dimensional output corresponds to IOBES, $\overleftarrow{h_i}$ is the value of the corresponding unit of the backward long short-term memory network, and $\overrightarrow{h_i}$ is the value of the corresponding unit of the forward long short-term memory network.
Step 403, adding a conditional random field model after the concatenated long short-term memory networks and computing a score for each training sample;
The $j$-th labeled training sample comprises $m$ sub-pictures in total, with label set $y = (y_1, y_2, \ldots, y_m)$;
First, the sum of the probabilities of transferring from the label $y_l$ of one sub-picture to the label $y_{l+1}$ of the next is computed:

$$\sum_{l=0}^{m} A_{y_l, y_{l+1}}$$

Then, the sum of the label probability values of all the sub-pictures is computed:

$$\sum_{l=1}^{m} P_{l, y_l}$$

where $P_{l, y_l}$ denotes the probability value that the $l$-th sub-picture has label $y_l$.
Finally, the score of the training sample is obtained; the calculation formula is:

$$s(X, y) = \sum_{l=0}^{m} A_{y_l, y_{l+1}} + \sum_{l=1}^{m} P_{l, y_l}$$

Step 404, setting the constraint condition for training the bidirectional long short-term memory network and conditional random field model;
The constraint condition is as follows: the label probability values of all sub-pictures in each training sample are passed through a softmax, ensuring that the probabilities sum to 1 and that the expression is differentiable:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

The exponent of $e^{s(X, y)}$ is the score of the correct label sequence annotated in the current training sample; $Y_X$ is the set of all possible label sequences, obtained by selecting one of the IOBES labels for each sub-picture and combining the labels of all the sub-pictures into a sequence.
Step 405, maximizing the log-likelihood of the correct label sequence, with the model parameters optimized by the back-propagation algorithm:

$$\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$
Step five, inputting the test sample into the trained bidirectional long short-term memory network and conditional random field model to obtain the highest-scoring label sequence;
at test time, the highest-scoring label sequence is computed by the Viterbi algorithm.
Step six, using the highest-scoring label sequence to place the segmentation lines and splitting the test sample accordingly.
The process is as follows:
First, all sub-picture spans covered by B...I...E sequences are found, followed by the sub-pictures classified as a lone S.
Then, artificial rules are defined: when several I labels follow several O labels, rule correction converts the first I label into a B label; similarly, when several O labels follow several I labels, the last I label is converted into an E label.
Finally, the sub-pictures judged to be character regions are concatenated through this post-processing to obtain the detected character regions, completing character detection.
The invention has the following advantages:
The method for segmenting character strings in a picture applies the sequence labeling problem to image page segmentation, thereby avoiding the manual threshold setting required when segmenting with rule-based algorithms such as the projection method. Moreover, since a model performs the segmentation, only retraining is needed for a different data set; no other prior knowledge is required, and the method is easy to port.
Drawings
Fig. 1 shows a typical laboratory sheet table used for character string segmentation.
Fig. 2 is a flowchart of a method for segmenting a character string in a picture according to the present invention.
Fig. 3 is a picture with a line of text as employed in an embodiment of the present invention.
FIG. 4 is a schematic diagram of the bidirectional long-short term memory neural network and the model of the conditional random field for sequence labeling according to the present invention.
Detailed description of the preferred embodiments
The invention will be described in further detail below with reference to the drawings and examples.
The method performs image page segmentation by sequence labeling with a combined long short-term memory neural network and conditional random field model: the sequence labeling model finds the segmentation lines in the image and thereby segments the character strings in the picture. The application environment is as follows:
CPU: Intel(R) Xeon(R) CPU [email protected]
Memory: 32G
GPU: Nvidia TITAN Xp
Operating system: Ubuntu 16.04 LTS
Development language: Python
As shown in fig. 2, the method is divided into the following steps:
Step one, collecting a number of character-string pictures and dividing them into training samples and test samples;
Step two, preprocessing each training sample to obtain a set of sub-pictures corresponding to that sample;
The preprocessing is: first binarize the picture and scale it to a height of 25 pixels; then group every five adjacent columns of pixel points into one sub-picture, so that each sub-picture has dimension 5 × 25 = 125.
Step three, for each training sample, labeling each of its sub-pictures as a sequence in the IOBES scheme.
The IOBES labels are: if the sub-picture input is the beginning of a text region, it is labeled B; if it is inside a text region, it is labeled I; if it is the end of a text region, it is labeled E; if it forms a text region on its own, it is labeled S; and if it does not belong to any text region, it is labeled O.
Step four, training the combined bidirectional long short-term memory neural network and conditional random field model with the sequence-labeled training samples;
The specific steps are as follows:
Step 401, adopting a bidirectional long short-term memory network structure, so that the information of the units before and after the current unit is concatenated.
Step 402, for a training sample, feeding the pixel points of each sub-picture in the sample into the concatenated long short-term memory networks and outputting five probability values, one for each of the IOBES labels of the sub-picture;
Let the set of sub-picture input pixel points be $X = (x_1, x_2, \ldots, x_i, \ldots, x_n)$, where $x_i$ is the $i$-th pixel point of the sub-picture input. The picture input of each unit consists of the values of all pixel points in a picture area of width 5 and height 25. The probability values output by each unit of the front-and-back concatenated long short-term memory networks are:

$$P_i = W\left[\overrightarrow{h_i};\,\overleftarrow{h_i}\right]$$

where $W$ is a fully connected layer whose input is the vector formed by concatenating the two directions of the bidirectional long short-term memory network and whose five-dimensional output corresponds to the IOBES labels; $\overleftarrow{h_i}$ is the value of the corresponding unit of the backward long short-term memory network, and $\overrightarrow{h_i}$ is the value of the corresponding unit of the forward long short-term memory network. The network first computes the values of the corresponding units of the forward and backward long short-term memory networks, and then concatenates the two output values.
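A minimal PyTorch sketch of this bidirectional emission network; the 125-dimensional flattened sub-picture input and the 300 units per direction come from this text (see the parameter settings below), while the class name, single-layer choice, and other details are illustrative assumptions:

```python
import torch
import torch.nn as nn

class BiLSTMEmitter(nn.Module):
    """Bidirectional LSTM that emits the 5 IOBES scores per sub-picture."""

    def __init__(self, input_dim: int = 125, hidden: int = 300, n_tags: int = 5):
        super().__init__()
        # bidirectional=True concatenates the forward and backward unit values.
        self.lstm = nn.LSTM(input_dim, hidden, batch_first=True,
                            bidirectional=True)
        # The fully connected layer W reduces 2*hidden dimensions to 5 (IOBES).
        self.fc = nn.Linear(2 * hidden, n_tags)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, m, 125) sub-picture vectors -> P: (batch, m, 5) scores
        h, _ = self.lstm(x)
        return self.fc(h)
```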
Step 403, adding a conditional random field model after the concatenated long short-term memory networks and computing a score for each training sample;
The $j$-th labeled training sample comprises $m$ sub-pictures in total, with label set $y = (y_1, y_2, \ldots, y_m)$;
First, the sum of the probabilities of transferring from the label $y_l$ of one sub-picture to the label $y_{l+1}$ of the next is computed:

$$\sum_{l=0}^{m} A_{y_l, y_{l+1}}$$

Then, the sum of the label probability values of all the sub-pictures is computed:

$$\sum_{l=1}^{m} P_{l, y_l}$$

where $P_{l, y_l}$ denotes the probability value that the $l$-th sub-picture has label $y_l$.
Finally, the score of the training sample is obtained; the calculation formula is:

$$s(X, y) = \sum_{l=0}^{m} A_{y_l, y_{l+1}} + \sum_{l=1}^{m} P_{l, y_l}$$

Step 404, setting the constraint condition for training the bidirectional long short-term memory network and conditional random field model;
The constraint condition is as follows: the label probability values of all sub-pictures in each training sample are passed through a softmax, ensuring that the probabilities sum to 1 and that the expression is differentiable:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

The exponent of $e^{s(X, y)}$ is the score of the correct label sequence annotated in the current training sample; $Y_X$ is the set of all possible label sequences, obtained by selecting one of the IOBES labels for each sub-picture and combining the labels of all the sub-pictures into a sequence.
Step 405, maximizing the log-likelihood of the correct label sequence, with the model parameters optimized by the back-propagation algorithm:

$$\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$
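For concreteness, a small sketch of the score $s(X, y)$ from step 403, assuming PyTorch tensors: P is the (m, 5) emission matrix produced by the network, and A is treated as a (7, 7) transition matrix with two extra start and end states so that the l = 0 and l = m transition terms are defined; that start/end convention is an assumption, not something the text specifies:

```python
import torch

def crf_score(P: torch.Tensor, y: list, A: torch.Tensor,
              start: int = 5, end: int = 6) -> torch.Tensor:
    """s(X, y): transition terms A[y_l, y_{l+1}] plus emission terms P[l, y_l]."""
    tags = [start] + list(y) + [end]
    trans = sum(A[tags[l], tags[l + 1]] for l in range(len(tags) - 1))
    emit = sum(P[l, y_l] for l, y_l in enumerate(y))
    return trans + emit
```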
Step five, during testing, inputting the test sample into the trained bidirectional long short-term memory network and conditional random field model to obtain the highest-scoring label sequence;
the highest-scoring label sequence is computed by the Viterbi algorithm.
Step six, using the highest-scoring label sequence to place the segmentation lines and splitting the test sample accordingly.
The process is as follows:
First, all sub-picture spans covered by B...I...E sequences are found, followed by the sub-pictures classified as a lone S.
Then, artificial rules are defined: when several I labels follow several O labels, rule correction converts the first I label into a B label; similarly, when several O labels follow several I labels, the last I label is converted into an E label.
Finally, the sub-pictures judged to be character regions are concatenated through this post-processing to obtain the detected character regions, completing character detection. A sketch of this rule correction and region extraction follows.
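A minimal sketch of this post-processing, assuming plain Python lists of label strings; the function name and the (first, last) span representation are illustrative:

```python
def correct_and_extract(labels: list) -> list:
    """Apply the two rule corrections, then collect B...E spans and lone S tags."""
    labels = list(labels)
    for i in range(1, len(labels)):
        if labels[i] == "I" and labels[i - 1] == "O":
            labels[i] = "B"          # I after O is illegal: a region must start with B
        if labels[i] == "O" and labels[i - 1] == "I":
            labels[i - 1] = "E"      # O after I is illegal: the region must end with E
    regions, start = [], None
    for i, tag in enumerate(labels):
        if tag == "S":
            regions.append((i, i))   # a single sub-picture forms a region
        elif tag == "B":
            start = i
        elif tag == "E" and start is not None:
            regions.append((start, i))
            start = None
    return regions                   # (first, last) sub-picture index per region

# correct_and_extract(['O', 'I', 'I', 'O', 'S']) -> [(1, 2), (4, 4)]
```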
Example:
As shown in fig. 3, the invention takes a picture containing one line of text as input. The picture is first preprocessed into a set of sub-pictures, and each group of input sub-pictures is given its IOBES label in sequence.
Image page segmentation is then realized with a sequence labeling model combining a bidirectional long short-term memory neural network and a conditional random field. Applying sequence labeling to image page segmentation converts character-string segmentation in an image into a sequence labeling problem, which is then solved by the neural network model.
The unidirectional long short-term memory network has the disadvantage that each unit can only obtain information from the preceding units, not the following ones. The invention therefore adopts a bidirectional architecture built from long short-term memory units, which were proposed by Hochreiter et al. in 1997 and more recently improved and popularized by Alex Graves. The earliest neural networks applied to sequence models were recurrent neural networks (RNNs), which pass the output of one network module on to the next, allowing information to persist; they perform well in many tasks, such as speech recognition and machine translation. To solve the long-range dependency problem of recurrent networks, Hochreiter et al. designed the long short-term memory network, which acquires long-distance information through a carefully designed gating system. The bidirectional structure integrates the information of the units before and after the current unit. After the long short-term memory network, each picture input already corresponds to an output value; a problem remains, however, in that the labels of the sequence are interrelated.
Under the IOBES labeling scheme, many label sequences are illegal: an I label cannot follow an O label, because a region must begin with a B label; and neither an O label nor a B label can follow an I label, because a region must end with an E label. The long short-term memory network alone cannot learn such relationships between labels, so the invention adds a conditional random field module after it to handle the dependencies between labels.
A conditional random field is a common model for sequence labeling that learns the interrelationships between labels.
As shown in fig. 4, in this experiment the conditional random field is used as follows. The set of each sub-picture's input pixel points is $X = (x_1, x_2, \ldots, x_i, \ldots, x_n)$, and the probability values output by each unit of the front-and-back concatenated long short-term memory networks are:

$$P_i = W\left[\overrightarrow{h_i};\,\overleftarrow{h_i}\right]$$

A conditional random field model is added after the concatenated long short-term memory networks, and the corresponding output labels form the set $y = (y_1, y_2, \ldots, y_m)$.
For each label sequence $y = (y_1, y_2, \ldots, y_m)$, a score is computed:

$$s(X, y) = \sum_{l=0}^{m} A_{y_l, y_{l+1}} + \sum_{l=1}^{m} P_{l, y_l}$$

A constraint is then set: the scores of all possible label sequences are passed through a softmax so that their probabilities sum to 1 and the expression is differentiable:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

During training, the log-likelihood of the correct label sequence is maximized:

$$\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$

During testing, the highest-scoring label sequence is computed by the Viterbi algorithm and used as the output of the whole sequence labeling, as in the sketch below.
The post-processing first finds all sub-picture spans covered by B...I...E sequences, and then the sub-pictures corresponding to a lone S classification. In addition, because B, E and S labels are far fewer than I and O labels, the data are unbalanced, and some B and E labels may fail to be identified. Some manual rules are therefore defined to refine the result: when several I labels follow several O labels, rule correction converts the first I label into a B label; similarly, when several O labels follow several I labels, a rule converts the last I label into an E label.
The sub-pictures judged to be character regions are then concatenated through this post-processing to obtain the detected character regions, completing character detection.
Parameter settings:
In this text detection task, the number of long short-term memory units in each direction is 300. An Adam optimizer with a learning rate of 0.001 is used to train the network parameters.
Results:
The experiment uses a data set of laboratory sheet table pictures provided by a medical internet company, containing 500 pictures in total. The intersection over union (IoU) of each detected region and the real region is computed with a threshold of 0.8: a detected character region is considered correct when its IoU exceeds 0.8. The model is evaluated on precision, recall, and F1-score, where F1-score is the harmonic mean of precision and recall.
Model               Precision (%)   Recall (%)   F1-score
Faster R-CNN        85.73           87.26        86.45
SSD                 85.03           88.60        86.78
Projection method   88.25           88.66        88.45
Our model           91.23           91.78        91.50
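For reference, a small sketch of the IoU criterion used in this evaluation, assuming detected and ground-truth regions are given as axis-aligned boxes (x1, y1, x2, y2); the box representation is an assumption:

```python
def iou(a: tuple, b: tuple) -> float:
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# A detection is counted as correct when iou(detected, ground_truth) > 0.8.
```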
The experimental results are compared with two image object detection models, Faster R-CNN and SSD, and with the rule-based segmentation algorithm using the projection method. The experiments show that the sequence-labeling-based image segmentation algorithm outperforms the other models on the table-region text detection task.

Claims (4)

1. A method for segmenting character strings in a picture, characterized by comprising the following steps:
step one, collecting a number of character-string pictures and dividing them into training samples and test samples;
step two, preprocessing each training sample to obtain a set of sub-pictures corresponding to that sample;
step three, for each training sample, labeling each of its sub-pictures as a sequence in the IOBES scheme;
step four, training the combined bidirectional long short-term memory neural network and conditional random field model with the sequence-labeled training samples;
the specific steps being:
step 401, adopting a bidirectional long short-term memory network structure, so that the information of the units before and after the current unit is concatenated;
step 402, for a training sample, feeding the pixel points of each sub-picture in the sample into the concatenated long short-term memory networks and outputting five probability values, one for each of the IOBES labels of the sub-picture;
letting the set of sub-picture input pixel points be $X = (x_1, x_2, \ldots, x_i, \ldots, x_n)$, where $x_i$ is the $i$-th pixel point of the sub-picture input; the probability values output by the concatenated long short-term memory networks being:

$$P_i = W\left[\overrightarrow{h_i};\,\overleftarrow{h_i}\right]$$

where $W$ is the fully connected layer whose five-dimensional output corresponds to IOBES, $\overleftarrow{h_i}$ is the value of the corresponding unit of the backward long short-term memory network, and $\overrightarrow{h_i}$ is the value of the corresponding unit of the forward long short-term memory network;
step 403, adding a conditional random field model after the concatenated long short-term memory networks and computing a score for each training sample;
specifically:
the $j$-th labeled training sample comprising $m$ sub-pictures in total, with label set $y = (y_1, y_2, \ldots, y_m)$;
first, computing the sum of the probabilities of transferring from the label $y_l$ of one sub-picture to the label $y_{l+1}$ of the next:

$$\sum_{l=0}^{m} A_{y_l, y_{l+1}}$$

then, computing the sum of the label probability values of all the sub-pictures:

$$\sum_{l=1}^{m} P_{l, y_l}$$

where $P_{l, y_l}$ denotes the probability value that the $l$-th sub-picture has label $y_l$;
finally, obtaining the score of the training sample, with the calculation formula:

$$s(X, y) = \sum_{l=0}^{m} A_{y_l, y_{l+1}} + \sum_{l=1}^{m} P_{l, y_l}$$

step 404, setting the constraint condition for training the bidirectional long short-term memory network and conditional random field model;
the constraint condition being: passing the label probability values of all sub-pictures in each training sample through a softmax, ensuring that the probabilities sum to 1 and that the expression is differentiable:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

the exponent of $e^{s(X, y)}$ being the score of the correct label sequence in the current training sample, and $Y_X$ being the set of all possible label sequences, obtained by selecting one of the IOBES labels for each sub-picture and combining the labels of all the sub-pictures into a sequence;
step 405, maximizing the log-likelihood of the correct label sequence, with the model parameters optimized by the back-propagation algorithm:

$$\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$

step five, inputting the test sample into the trained bidirectional long short-term memory network and conditional random field model to obtain the highest-scoring label sequence;
during testing, the highest-scoring label sequence being computed by the Viterbi algorithm;
step six, using the highest-scoring label sequence to place the segmentation lines and splitting the test sample accordingly.
2. The method as claimed in claim 1, wherein the preprocessing in step two is: first binarizing the picture and scaling it to a height of 25 pixels; then grouping every 5 adjacent columns of pixel points into one sub-picture, so that each sub-picture has dimension 5 × 25 = 125.
3. The method according to claim 1, wherein the IOBES labeling in step three is as follows: if the sub-picture input is the beginning of a text region, it is labeled B; if it is inside a text region, it is labeled I; if it is the end of a text region, it is labeled E; if it forms a text region on its own, it is labeled S; and if it does not belong to any text region, it is labeled O.
4. The method for segmenting character strings in a picture as claimed in claim 1, wherein the process of step six is as follows:
first, finding all sub-picture spans covered by B...I...E sequences, and then the sub-pictures corresponding to a lone S classification;
then, defining artificial rules: when several I labels follow several O labels, rule correction converts the first I label into a B label; similarly, when several O labels follow several I labels, the last I label is converted into an E label;
finally, concatenating the sub-pictures judged to be character regions through this post-processing to obtain the detected character regions, completing character detection.


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
CF01: Termination of patent right due to non-payment of annual fee
Granted publication date: 2021-06-15