CN111062264A - Document object classification method based on dual-channel hybrid convolution network - Google Patents

Document object classification method based on dual-channel hybrid convolution network

Info

Publication number
CN111062264A
Authority
CN
China
Prior art keywords
picture
dimensional
blank
sta
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911180193.2A
Other languages
Chinese (zh)
Inventor
张盛峰
田朝阳
黄胜
贾艳秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201911180193.2A priority Critical patent/CN111062264A/en
Publication of CN111062264A publication Critical patent/CN111062264A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a document object classification method based on a dual-channel hybrid convolution network, used to segment and classify each logical object (text, formula, table and image) in a document picture. First, a multi-pattern-matching recursive RLSA analysis is performed on the input picture to determine the segmentation coordinates; the input picture is then divided into different logical regions according to these coordinates. The regions are labeled, denoised and class-balanced to obtain a classification data set. The two-dimensional region pictures are then fed into a two-dimensional CNN for training, while the projections of each picture in the two directions are extracted and fed into a one-dimensional CNN for training. Finally, the first seven layers of the two convolutional networks are used as feature extractors, and the final model is trained through a dual-channel hybrid classification network; this model can then predict the object class of a region picture. By using the original two-dimensional picture and its two directional projections as inputs, the invention exploits complementary features and improves classification accuracy.

Description

Document object classification method based on dual-channel hybrid convolution network
Technical Field
The invention relates to the field of document object detection and identification, in particular to a document object classification method based on a dual-channel hybrid convolution network.
Background
With the rapid development of machine learning and deep learning in recent years, Document Image Understanding (DIU) technology has received more and more attention. As the name implies, document image understanding means understanding the content of a document from its picture. It can be divided into page segmentation (also called region segmentation), region classification (also called block labeling), document object recognition and other steps; the present invention addresses the first two steps, namely document object detection and identification.
Current page segmentation techniques fall into two categories. The first is based on pixel processing: a series of rules is formulated according to the distribution of pixels in the picture to segment the different region blocks. Specific methods include projection analysis, RLSA (Run Length Smoothing Algorithm) analysis (Cesarini F, Lastri M, Marinai S, et al. Encoding of modified X-Y trees for document classification [C]// Proc. 6th International Conference on Document Analysis and Recognition, 2001), blank (white-space) analysis, connected-component extraction, and so on. The second uses deep-learning object detection, such as sliding windows or selective-search algorithms, which mainly traverse candidate windows exhaustively and keep the window with the highest score. Such methods generalize well, but data sets for document object detection are small and the accuracy is low.
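For reference, the following is a minimal sketch of the classic horizontal RLSA smoothing described above, assuming a 0/1 binary image where 1 marks a black pixel; the function names and the smoothing threshold `c` are illustrative, not from the patent:

```python
import numpy as np

def rlsa_row(row: np.ndarray, c: int) -> np.ndarray:
    """Fill white runs shorter than c that lie between two black pixels."""
    out = row.copy()
    gap_start = None
    for i, v in enumerate(row):
        if v == 0:
            if gap_start is None:
                gap_start = i
        else:
            # close a gap only if it started after a black pixel
            if gap_start is not None and gap_start > 0 and i - gap_start < c:
                out[gap_start:i] = 1
            gap_start = None
    return out

def rlsa_horizontal(binary: np.ndarray, c: int = 30) -> np.ndarray:
    """Apply row-wise run-length smoothing to the whole binary image."""
    return np.stack([rlsa_row(r, c) for r in binary])
```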
However, current learning-based methods lack large data sets and give lower accuracy. Among the existing rule-based algorithms, projection analysis and RLSA analysis both binarize the projection data and do not distinguish more complex pixel distributions; blank analysis and RLSA analysis judge only by continuous run length and do not distinguish different structural characteristics; and connected-component extraction requires erosion and dilation of pixels and cannot effectively segment a single line of text.
Region classification methods are generally divided into two types, deep-learning-based and rule-based. Rule-based methods are the more traditional ones, mainly including methods based on homogeneous regions, on color distribution, and on morphological comparison. The homogeneous-region methods compute the pixel distribution of each region according to self-defined rules, compute a difference index between two regions by some algorithm, and finally decide with a threshold. Color-distribution classification relies on differences in the color distribution of pixels between background and foreground and between different classes. Morphological comparison matches extracted connected components against specific characters (formula symbols) or structures (table frames) to determine the category. Deep-learning methods mainly train different convolutional neural networks (CNNs) for classification (Yi X, Gao L, Liao Y, et al. CNN based page object detection in document images [C]// 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2017); commonly used networks include small networks such as LeNet, AlexNet (Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks [C]// NIPS, 2012), the ZF network, VGGNet and others.
However, rule-based classification generally requires different rules for different document types, so generalization is poor; rule design depends on experience, which strongly affects the results. Existing deep-learning methods extract and compute only a large number of two-dimensional features of the two-dimensional picture, ignoring the one-dimensional distribution features of a document, such as its distribution characteristics in the horizontal and vertical directions.
The multi-pattern-matching recursive RLSA analysis proposed by the invention can separate out single-line text regions, which makes the classification results easier to recognize, and its formula-line detection and multi-pattern matching improve the classification of complex formula structures. In addition, during classification two feature extractors are used to extract two-dimensional picture features and one-dimensional projection features respectively for hybrid classification training, which increases the number of input features of the network and improves classification accuracy.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a document object classification method based on a dual-channel hybrid convolutional network. During segmentation it uses a multi-pattern-matching RLSA segmentation method that converts the projection data into multiple levels and discriminates between different object structure types so as to apply different segmentation operations, improving segmentation accuracy. During classification, a two-dimensional AlexNet network and a one-dimensional AlexNet network taking pixel projections as input are used as extractors and fed into a dual-channel classification network, overcoming the shortcoming that a two-dimensional network attends only to two-dimensional features while ignoring one-dimensional features, and improving classification precision.
The invention discloses a document object classification method based on a dual-channel hybrid convolution network, which comprises the following specific schemes and steps:
step 1, performing multi-pattern-matching recursive RLSA analysis on the input picture to determine the segmentation coordinates, comprising the following substeps:
step 1-1, performing color-space conversion on the original picture to obtain a gray image, setting a threshold to convert it into a binary image, and initializing the region coordinate library with the diagonal coordinates of the picture; at this point the library contains only one region, the whole original picture; entering step 1-2;
step 1-2, loading each region picture in turn from the coordinate library as the input picture and projecting and segmenting it in the horizontal direction, which specifically divides into the following steps:
step 1-2-1, counting the black pixels of each row in the horizontal direction and quantizing the counts into three levels according to their distribution, denoted 0, 1 and 2, wherein 0 represents blank or nearly blank, 1 represents a few black pixels, and 2 represents many black pixels; storing the result in a one-dimensional array and entering step 1-2-2 (a minimal sketch of this quantization follows this step);
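A minimal sketch of this three-level quantization, assuming a 0/1 binary image (1 = black); the level thresholds `low` and `high` are illustrative, since the patent does not state their values:

```python
import numpy as np

def trivalue_profile(binary: np.ndarray, low: int = 2, high: int = 40) -> np.ndarray:
    """Quantize the horizontal projection (black pixels per row) into
    levels 0 (blank or near blank), 1 (a few black pixels), 2 (many)."""
    counts = binary.sum(axis=1)           # black-pixel count of each row
    levels = np.zeros(len(counts), dtype=np.int64)
    levels[counts > low] = 1
    levels[counts > high] = 2
    return levels
```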
step 1-2-2, traversing the array from the start and dividing it, according to the level of each value, into three states denoted sta0, sta1 and sta2, corresponding to the three levels of step 1-2-1, where a run in sta1 or sta2 may also be written sta_b, the black state, and determining the length each state is held from the transitions between states, denoted sta0_h, sta2_h and sta_b_h respectively; the rules are as follows, wherein min_cut_blank represents the minimum boundary blank height (10 may be taken), min_txt the minimum text-line height (10 may be taken), max_contain_blank the maximum contained blank height (5 may be taken), and formula_line the formula-line height (3 may be taken); a simplified code sketch follows the rule list:
1) the sta_b run between two division points is determined to be a black block;
2) if the start or the end of the array is in the sta_b state, it automatically serves as a division point;
3) if sta0_h > min_cut_blank, the boundaries of the adjacent sta_b runs are directly marked as division points;
4) if sta2_h <= formula_line and sta0_h <= max_contain_blank on both sides, a formula structure is identified and fused with the black blocks before and after the determined division points;
5) if sta_b_h > min_txt and its adjacent sta0_h > max_contain_blank, both ends of the state are directly marked as division points; this run is a black block;
6) if sta_b_h > min_txt and sta0_h <= max_contain_blank holds at either end, a parent structure is matched and fused with the child structure at the qualifying end; otherwise the sta at that end is marked as a division point;
7) if sta_b_h <= min_txt and sta0_h <= max_contain_blank holds at either end, a substructure is matched and fused with the black block at the qualifying end; otherwise the sta at that end is marked as a division point;
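As a simplified sketch of this traversal, the fragment below applies only the plain blank-gap rule 3) to the level array and returns the black blocks as row ranges; the formula fusion and parent/child matching of rules 4) to 7) are omitted. Names follow the rule list above:

```python
def blank_split_blocks(levels, min_cut_blank: int = 10):
    """Split the level array wherever a level-0 run exceeds min_cut_blank;
    returns (start_row, end_row) pairs of the surviving black blocks."""
    blocks, start, blank = [], None, 0
    for i, lv in enumerate(levels):
        if lv == 0:
            blank += 1
            if start is not None and blank > min_cut_blank:
                blocks.append((start, i - blank + 1))   # close the block
                start = None
        else:
            if start is None:
                start = i                               # open a black block
            blank = 0
    if start is not None:
        blocks.append((start, len(levels)))             # rule 2): array end
    return blocks
```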
1-2-3, updating each region coordinate of the picture according to all the segmentation points, and storing the region coordinates into a region coordinate library;
step 1-3, loading each region picture in turn from the coordinate library as the input picture and projecting and segmenting it in the vertical direction; the vertical projection is much simpler than the horizontal one, specifically:
step 1-3-1, counting the number of black pixels in the vertical direction, storing the number of the black pixels as a one-dimensional array, dividing the black pixels into two levels according to different black pixel number distributions, respectively representing the levels by 0 and 1, wherein 0 represents blank or is close to blank, and 1 represents pixel distribution, and entering step 1-3-2;
step 1-3-2, traversing the array from the start and dividing it, according to the level of each value, into two states, sta_white and sta_black, and determining the length each state is held from the transitions between states, denoted sta_white_h and sta_black_h respectively; the rules are as follows, wherein h_min_cut_blank represents the minimum splittable blank width (20 may be taken):
1) the sta_black run between two division points is determined to be a black block;
2) if the start or the end of the array is in the sta_black state, it automatically serves as a division point;
3) sta_white_h > h_min_cut_blank triggers the split mode, and the boundaries of the sta_black runs at both ends are marked as division points;
1-3-3, updating each region coordinate of the picture according to all the segmentation points, and storing the region coordinates into a region coordinate library;
step 1-4, repeating step 1-2 and step 1-3 for each segmented region until the height of a region is smaller than the minimum text-line height or its coordinates no longer change between passes, in which case the region is skipped;
step 1-5, once no region can be subdivided any further, transferring the stored coordinate library of all regions to step 2 (a sketch of this recursive driver follows);
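A compact sketch of the recursive driver of steps 1-2 to 1-5; `split_h` and `split_v` stand for the horizontal and vertical passes above and are assumed to return the sub-boxes of a region (a single box when nothing splits). Both callables, and the (x0, y0, x1, y1) box convention, are illustrative:

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]          # (x0, y0, x1, y1)

def segment_page(page: Box,
                 split_h: Callable[[Box], List[Box]],
                 split_v: Callable[[Box], List[Box]],
                 min_txt: int = 10) -> List[Box]:
    """Re-queue every region until no pass subdivides it any further."""
    final, work = [], [page]
    while work:
        box = work.pop()
        if box[3] - box[1] < min_txt:    # below minimum text-line height
            final.append(box)
            continue
        parts = split_h(box)
        if len(parts) == 1:
            parts = split_v(box)
        if len(parts) == 1:              # stable: keep the region
            final.append(box)
        else:
            work.extend(parts)           # recurse on the new sub-regions
    return final
```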
step 2, dividing the input color picture into region pictures containing the different logical objects according to the division coordinates of step 1; during model invocation, the region pictures are passed directly as input to the dual-channel hybrid classification network of step 6;
step 3, labeling the regions according to the data-set annotations, removing noise and performing class balancing to obtain a classification data set of region pictures; this step is skipped during model invocation:
step 3-1, parsing the data-set annotation file to obtain the true label and coordinates of each object region, then comparing each annotated region with the segmentation results of step 2 by IoU (the intersection of the two regions over their union; a sketch follows), with the threshold set to 0.8, to obtain labels for all segmented regions;
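A minimal sketch of the IoU comparison, assuming boxes given as (x0, y0, x1, y1); a segmented region inherits the annotated label when the value reaches the 0.8 threshold:

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```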
step 3-2, performing error correction according to the label and size information of each region to ensure the accuracy of the data set, mainly as follows:
1) if a region of the table or picture category is smaller than a certain threshold (20), it is regarded as noise interference and discarded;
2) if a region of the text category is taller than two lines of text, it is discarded;
3) if a region of the text category is narrower than a certain threshold, it is horizontally copied and spliced; the splice-count formula is:
(The splice-count formula appears only as an image in the original publication; a hypothetical reconstruction follows below.)
wherein copy_num is the number of spliced copies, round denotes rounding, avg_width is the average width of the region pictures, width is the current picture width, and interval is the splice interval;
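Since the formula itself is not reproduced in the text, the following is only a hypothetical reconstruction consistent with the variable list: the number of copies that, separated by `interval` pixels, roughly reaches the average region width:

```python
def copy_num(avg_width: float, width: float, interval: float) -> int:
    # Hypothetical reading of the image-only formula:
    # n copies of `width` plus (n - 1) intervals should approximate avg_width,
    # so n is about (avg_width + interval) / (width + interval).
    return max(1, round((avg_width + interval) / (width + interval)))
```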
step 3-3, because the numbers of pictures of the different classes in the data set differ greatly, performing data-set balancing to reduce training imbalance: the region pictures of under-represented classes are flipped to expand their number, and a random subset of the over-represented classes is selected as the training set (see the sketch below).
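A sketch of the balancing step under the stated scheme: under-represented classes are grown by flipping (OpenCV's cv2.flip is one way to do this) and over-represented ones are subsampled; the target count and function name are illustrative:

```python
import random
import cv2

def balance_class(images, target: int):
    """Return roughly `target` pictures for one class."""
    if len(images) >= target:
        return random.sample(images, target)     # subsample large classes
    out = list(images)
    for img in images:                           # expand small classes
        if len(out) >= target:
            break
        out.append(cv2.flip(img, 1))             # 1 = horizontal flip
    return out
```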
Step 4, sending the two-dimensional color region pictures of the processed data set into a two-dimensional CNN for training and saving the trained weights as the two-dimensional feature extractor of the dual-channel hybrid classification network; this step is skipped during model invocation. It specifically comprises the following substeps:
step 4-1, resizing the input pictures to 248 × 450 to conform to the statistics of the original picture sizes in the training set;
step 4-2, taking the AlexNet network as the prototype, keeping the structure of the convolutional part unchanged, initializing it with publicly available pretrained parameters and keeping those parameters fixed, and attaching three fully connected layers after the convolutional part with input/output sizes fc1: 6 × 12 × 256 → 4096, fc2: 4096 → 4096, fc3: 4096 → 4, wherein dropout is applied after the fc1 and fc2 layers and the fully connected layers are randomly initialized (a sketch follows);
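A minimal PyTorch sketch of this design, assuming torchvision's pretrained AlexNet as the convolutional prototype; the flattened size is probed dynamically rather than hard-coded to 6 × 12 × 256, since exact padding may differ from the patent's figures:

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoDChannelNet(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        backbone = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
        self.features = backbone.features            # pretrained conv stack
        for p in self.features.parameters():
            p.requires_grad = False                   # keep conv weights fixed
        with torch.no_grad():                         # probe flattened size
            n = self.features(torch.zeros(1, 3, 248, 450)).numel()
        self.classifier = nn.Sequential(              # randomly initialized
            nn.Linear(n, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):                             # x: (N, 3, 248, 450)
        return self.classifier(torch.flatten(self.features(x), 1))
```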
step 4-3, training and testing on the processed classification data set with learning rate 0.001, dropout rate 0.5 and batch_size 200, keeping the model parameters with the highest test accuracy across iterations, and saving and freezing all parameters as the extractor.
Step 5, extracting the projections of each two-dimensional picture in the two directions, merging them into one-dimensional data and sending them into a one-dimensional CNN for training, and saving the trained weights as the one-dimensional feature extractor of the dual-channel hybrid classification network; this step is skipped during model invocation. It specifically comprises the following substeps:
step 5-1, resizing each region picture to 384 × 640, projecting it in the horizontal and vertical directions by counting black pixels, and concatenating the two projections into a 1024-dimensional vector as input (a sketch follows);
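A sketch of the projection input under these sizes: 384 row counts plus 640 column counts give the 1024-dimensional vector (note that cv2.resize takes (width, height), and the binarization threshold of 180 is carried over from step 1-1):

```python
import cv2
import numpy as np

def projection_vector(region_bgr: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (640, 384))                  # width 640, height 384
    _, binary = cv2.threshold(gray, 180, 1, cv2.THRESH_BINARY_INV)  # 1 = black
    h_proj = binary.sum(axis=1)                          # 384 row counts
    v_proj = binary.sum(axis=0)                          # 640 column counts
    return np.concatenate([h_proj, v_proj]).astype(np.float32)      # 1024 values
```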
step 5-2, taking the AlexNet network as the prototype and keeping only the first dimension of the kernel sizes and strides of the convolutional layers (the second dimension is 1), with the remaining parameters matching the convolutional part of step 4, followed by three fully connected layers of sizes fc1: 30 × 256 → 4096, fc2: 4096 → 4096, fc3: 4096 → 4, wherein dropout is applied after the fc1 and fc2 layers and all layers are randomly initialized (a sketch follows);
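A corresponding sketch of the one-dimensional network, transcribing AlexNet's kernel sizes, strides and channel counts to Conv1d; as above, the flattened size is probed rather than hard-coded to 30 × 256:

```python
import torch
import torch.nn as nn

class OneDChannelNet(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(                 # AlexNet conv stack in 1-D
            nn.Conv1d(1, 64, 11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool1d(3, stride=2),
            nn.Conv1d(64, 192, 5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool1d(3, stride=2),
            nn.Conv1d(192, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(384, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool1d(3, stride=2),
        )
        with torch.no_grad():                          # probe flattened size
            n = self.features(torch.zeros(1, 1, 1024)).numel()
        self.classifier = nn.Sequential(
            nn.Linear(n, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):                              # x: (N, 1, 1024)
        return self.classifier(torch.flatten(self.features(x), 1))
```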
step 5-3, training and testing on the same data set with learning rate 0.001, dropout rate 0.5 and batch_size 200, keeping the model parameters with the highest test accuracy across iterations, and saving and freezing all parameters as the extractor;
step 6, using the first seven layers of the convolutional network models trained in steps 4 and 5 as feature extractors and taking the extracted feature data as the input of a dual-channel classification network to form the dual-channel hybrid classification network, performing the final classification training on this network, and saving the trained weights to obtain the final model, specifically comprising the following substeps:
step 6-1, the first fully connected layer is split into two parallel layers connected to the two feature extractors, wherein the layer connected to the 1D CNN extractor has relatively few nodes and the layer connected to the 2D CNN extractor relatively many; the ratio is chosen according to the specific situation, e.g. 1:4 with 1024 and 4096 nodes respectively;
step 6-2, the two parallel fully connected layers of the first layer are connected to a single merged fully connected layer (4096 nodes may be taken), followed by a fully connected layer with as many nodes as there are classes for outputting the class probabilities; only these three layers are trained (a sketch follows);
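A sketch of the fusion head, taking the two extractor feature vectors as inputs with the 1:4 node split of step 6-1; the feature dimensions `feat1d` and `feat2d` come from the frozen extractors and are left as parameters:

```python
import torch
import torch.nn as nn

class DualChannelHead(nn.Module):
    def __init__(self, feat1d: int, feat2d: int, num_classes: int = 4):
        super().__init__()
        self.branch1d = nn.Linear(feat1d, 1024)        # 1-D channel, fewer nodes
        self.branch2d = nn.Linear(feat2d, 4096)        # 2-D channel, more nodes
        self.fuse = nn.Sequential(                     # merged layer + class layer
            nn.Linear(1024 + 4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, f1d: torch.Tensor, f2d: torch.Tensor):
        merged = torch.cat([torch.relu(self.branch1d(f1d)),
                            torch.relu(self.branch2d(f2d))], dim=1)
        return self.fuse(merged)
```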
step 6-3, training again with the classification data set at learning rate 0.01, dropout rate 0.5 and batch_size 200, and saving the best-performing parameters as the final model.
Compared with the prior art, the invention has the following advantages:
1) during segmentation, the multi-pattern-matching RLSA segmentation method converts the projection data into multiple levels and discriminates between different object structure types so as to apply different segmentation operations, improving segmentation accuracy;
2) text is segmented into single lines, which removes the classification confusion between parts of texts and tables, and among texts themselves, caused by block regions containing different numbers of lines; this effectively improves classification accuracy and facilitates extensions such as recognition and recovery;
3) during classification, a two-dimensional AlexNet taking two-dimensional features as input and a one-dimensional AlexNet taking pixel projections as input are used as extractors feeding a dual-channel classification network, overcoming the shortcoming that a two-dimensional network attends only to two-dimensional features while ignoring one-dimensional ones; the network input picture size is also adapted to the statistical distribution of the data set, improving the applicability to document object pictures and the classification precision;
4) the first layer of the final dual-channel classification network is split into two parallel sublayers corresponding to the outputs of the two feature-extraction networks, giving the classification network better control over the proportion contributed by each network, effectively making full use of the features and improving classification precision.
Drawings
In order to make the object, technical scheme and beneficial effect of the invention more clear, the invention provides the following drawings for explanation:
FIG. 1 is a flowchart of the overall steps of the present invention;
FIG. 2 is a flow chart of the multi-pattern matching recursive RLSA analysis method of the present invention;
FIG. 3 is a diagram of the improved 2-dimensional AlexNet CNN network architecture of the present invention;
FIG. 4 is a diagram of the improved 1-dimensional AlexNet CNN network architecture of the present invention;
FIG. 5 is a diagram of the dual-channel hybrid classification network architecture of the present invention.
Detailed Description of the Preferred Embodiments
The present invention will be described in detail with reference to the following examples. It should be noted that the described embodiments are for illustrative purposes only and are not limiting on the scope of the invention.
The invention discloses a document object classification method based on a dual-channel hybrid convolution network, which comprises the following specific schemes and steps:
step 1, performing multi-pattern-matching recursive RLSA analysis on the input picture to determine the segmentation coordinates, comprising the following substeps:
step 1-1, performing color-space conversion on the original picture with OpenCV, converting it to a gray image with CV_RGB2GRAY, setting the threshold to 180 to convert it into a binary image, and initializing the region coordinate library with the diagonal coordinates of the picture; at this point the library contains only one region, the whole original picture; entering step 1-2;
step 1-2, loading each region picture in turn from the coordinate library as the input picture and projecting and segmenting it in the horizontal direction, which specifically divides into the following steps:
step 1-2-1, counting the black pixels of each row in the horizontal direction and quantizing the counts into three levels according to their distribution, denoted 0, 1 and 2, wherein 0 represents blank or nearly blank, 1 represents a few black pixels, and 2 represents many black pixels; storing the result in a one-dimensional array and entering step 1-2-2;
step 1-2-2, traversing the array from the start and dividing it, according to the level of each value, into three states denoted sta0, sta1 and sta2, corresponding to the three levels of step 1-2-1, where a run in sta1 or sta2 may also be written sta_b, the black state, and determining the length each state is held from the transitions between states, denoted sta0_h, sta2_h and sta_b_h respectively; the rules are as follows, with min_cut_blank = 10 the minimum boundary blank height, min_txt = 10 the minimum text-line height, max_contain_blank = 5 the maximum contained blank height, and formula_line = 3 the formula-line height:
1) the sta_b run between two division points is determined to be a black block;
2) if the start or the end of the array is in the sta_b state, it automatically serves as a division point;
3) if sta0_h > 10, the boundaries of the adjacent sta_b runs are directly marked as division points;
4) if sta2_h <= 3 and sta0_h <= 5 on both sides, a formula structure is identified and fused with the black blocks before and after the determined division points;
5) if sta_b_h > 10 and its adjacent sta0_h > 5, both ends of the state are directly marked as division points; this run is a black block;
6) if sta_b_h > 10 and sta0_h <= 5 holds at either end, a parent structure is matched and fused with the child structure at the qualifying end; otherwise the sta at that end is marked as a division point;
7) if sta_b_h <= 10 and sta0_h <= 5 holds at either end, a substructure is matched and fused with the black block at the qualifying end; otherwise the sta at that end is marked as a division point;
1-2-3, updating each region coordinate of the picture according to all the segmentation points, and storing the region coordinates into a region coordinate library;
step 1-3, loading each region picture in turn from the coordinate library as the input picture and projecting and segmenting it in the vertical direction; the vertical projection is much simpler than the horizontal one, specifically:
step 1-3-1, counting the number of black pixels in each column in the vertical direction, storing the counts as a one-dimensional array, and quantizing them into two levels denoted 0 and 1, wherein 0 represents blank or nearly blank and 1 represents distributed pixels; entering step 1-3-2;
step 1-3-2, traversing the array from the start and dividing it, according to the level of each value, into two states, sta_white and sta_black, and determining the length each state is held from the transitions between states, denoted sta_white_h and sta_black_h respectively; the rules are as follows, with h_min_cut_blank = 20 the minimum splittable blank width:
1) the sta_black run between two division points is determined to be a black block;
2) if the start or the end of the array is in the sta_black state, it automatically serves as a division point;
3) sta_white_h > 20 triggers the split mode, and the boundaries of the sta_black runs at both ends are marked as division points;
1-3-3, updating each region coordinate of the picture according to all the segmentation points, and storing the region coordinates into a region coordinate library;
step 1-4, repeating step 1-2 and step 1-3 for each segmented region until the height of a region is smaller than the minimum text-line height or its coordinates no longer change between passes, in which case the region is skipped;
step 1-5, once no region can be subdivided any further, transferring the stored coordinate library of all regions to step 2;
step 2, dividing the input color picture into region pictures containing the different logical objects according to the division coordinates of step 1; during model invocation, the region pictures are passed directly as input to the dual-channel hybrid classification network of step 6;
step 3, labeling the regions according to the data-set annotations, removing noise and performing class balancing to obtain a classification data set of region pictures; this step is skipped during model invocation:
step 3-1, parsing the data-set annotation file to obtain the true label and coordinates of each object region, then comparing each annotated region with the segmentation results of step 2 by IoU (the intersection of the two regions over their union), with the threshold set to 0.8, to obtain labels for all segmented regions;
step 3-2, performing error correction according to the label and size information of each region to ensure the accuracy of the data set, mainly as follows:
1) if a region of the table or picture category is smaller than the threshold 20, the region is discarded;
2) if a region of the text category is taller than the two-line text height of 20, the region is discarded;
3) if a region of the text category is narrower than the threshold 20, horizontal copying and splicing is performed; the splice-count formula is:
(The splice-count formula appears only as an image in the original publication.)
wherein copy_num is the number of spliced copies, round denotes rounding, avg_width is the average picture width (406), width is the current picture width, and interval is the splice interval (8);
step 3-3, because the numbers of pictures of the different classes in the data set differ greatly, performing data-set balancing to reduce training imbalance: the region pictures of under-represented classes are flipped to expand their number, and a random subset of the over-represented classes is selected as the training set. The text, formula, table and picture classes number 46892, 3937, 699 and 1978 pictures respectively; therefore the table class is flipped three times, the picture class once, and the text class is capped at 3000, giving 3000, 3937, 2796 and 3956 pictures respectively after balancing.
Step 4, sending the two-dimensional color region pictures of the processed data set into a two-dimensional CNN for training and saving the trained weights as the two-dimensional feature extractor of the dual-channel hybrid classification network; this step is skipped during model invocation. It specifically comprises the following substeps:
step 4-1, resizing the input pictures to 248 × 450 to conform to the statistics of the original picture sizes in the training set;
step 4-2, taking the AlexNet network as the prototype, keeping the structure of the convolutional part unchanged, initializing it with publicly available pretrained parameters and keeping those parameters fixed, and attaching three fully connected layers after the convolutional part with input/output sizes fc1: 6 × 12 × 256 → 4096, fc2: 4096 → 4096, fc3: 4096 → 4, wherein dropout is applied after the fc1 and fc2 layers and the fully connected layers are randomly initialized;
step 4-3, training and testing on the processed classification data set with learning rate 0.001, dropout rate 0.5 and batch_size 200, keeping the model parameters with the highest test accuracy across iterations, and saving and freezing all parameters as the extractor;
Step 5, extracting the projections of each two-dimensional picture in the two directions, merging them into one-dimensional data and sending them into a one-dimensional CNN for training, and saving the trained weights as the one-dimensional feature extractor of the dual-channel hybrid classification network; this step is skipped during model invocation. It specifically comprises the following substeps:
step 5-1, resizing each region picture to 384 × 640, projecting it in the horizontal and vertical directions by counting black pixels, and concatenating the two projections into a 1024-dimensional vector as input;
step 5-2, taking the AlexNet network as the prototype and keeping only the first dimension of the kernel sizes and strides of the convolutional layers (the second dimension is 1), with the remaining parameters matching the convolutional part of step 4, followed by three fully connected layers of sizes fc1: 30 × 256 → 4096, fc2: 4096 → 4096, fc3: 4096 → 4, wherein dropout is applied after the fc1 and fc2 layers and all layers are randomly initialized;
step 5-3, training and testing on the same data set with learning rate 0.001, dropout rate 0.5 and batch_size 200, keeping the model parameters with the highest test accuracy across iterations, and saving and freezing all parameters as the extractor;
The specific parameters of these two networks are shown in Table 1 (the table appears only as an image in the original publication).
step 6, using the first seven layers of the convolutional network models trained in steps 4 and 5 as feature extractors and taking the extracted feature data as the input of a dual-channel classification network to form the dual-channel hybrid classification network, performing the final classification training on this network, and saving the trained weights to obtain the final model, specifically comprising the following substeps:
step 6-1, the first fully connected layer is split into two parallel layers connected to the two feature extractors, wherein the layer connected to the 1D CNN extractor has relatively few nodes and the layer connected to the 2D CNN extractor relatively many; the ratio is chosen according to the specific situation, e.g. 1:4 with 1024 and 4096 nodes respectively;
step 6-2, the two parallel fully connected layers of the first layer are connected to a single merged fully connected layer (4096 nodes may be taken), followed by a fully connected layer with as many nodes as there are classes for outputting the class probabilities; only these three layers are trained;
step 6-3, training again with the classification data set at learning rate 0.01, dropout rate 0.5 and batch_size 200, and saving the best-performing parameters as the final model; the best result is obtained after 20 iterations, with an accuracy of 98.02%.
The invention discloses a document object classification method based on a dual-channel hybrid convolution network, and there are many ways to implement the technical scheme. The above description is only a preferred embodiment of the invention; the protection scope is not limited to it, and any change or alternative that can readily be conceived by those skilled in the art within the technical scope of the invention shall be covered. Therefore, the protection scope of the present invention shall be subject to the claims.

Claims (4)

1. A document object classification method based on a dual-channel hybrid convolutional network, characterized by comprising two modes, model training and model invocation, wherein model invocation omits some of the steps of model training; the method comprises the following steps, which by default describe model training:
step 1, performing multi-pattern-matching recursive RLSA analysis on the input picture to determine the segmentation coordinates;
step 2, dividing the input color picture into region pictures containing the different logical objects according to the division coordinates of step 1; during model invocation, the region pictures are passed directly as input to the dual-channel hybrid classification network of step 6;
step 3, labeling the regions according to the data-set annotations, removing noise and performing class balancing to obtain a classification data set of region pictures; this step is skipped during model invocation;
step 4, sending the two-dimensional color region pictures of the processed data set into a two-dimensional CNN for training and saving the trained weights as the two-dimensional feature extractor of the dual-channel hybrid classification network; this step is skipped during model invocation;
step 5, extracting the projections of each two-dimensional picture in the two directions, merging them into one-dimensional data and sending them into a one-dimensional CNN for training, and saving the trained weights as the one-dimensional feature extractor of the dual-channel hybrid classification network; this step is skipped during model invocation;
and step 6, using the first seven layers of the convolutional network models trained in steps 4 and 5 as feature extractors and taking the extracted feature data as the input of the dual-channel classification network to form the dual-channel hybrid classification network, performing the final classification training on this network, and saving the trained weights to obtain the final model.
2. The method of claim 1, characterized in that step 1 quantizes the projection data into three values representing three different states and applies different segmentation rules according to those states to distinguish different cases, comprising the following substeps:
step 1-1, graying and binarizing an original picture, and initializing an area coordinate library by using a diagonal coordinate of the picture, wherein only one area in the coordinate library is the original picture;
step 1-2, loading area pictures as input pictures in sequence according to a coordinate library, and projecting and dividing the input pictures in the horizontal direction, wherein the steps can be specifically divided into the following steps;
step 1-2-1, counting the number of black pixels in the horizontal direction, dividing the number into three levels according to the difference of black pixel number distribution, respectively representing the three levels by 0, 1 and 2, wherein 0 represents blank or is close to blank, 1 represents that a small number of black pixels are distributed, and 2 represents that a large number of black pixels are distributed, and storing the counting result into a one-dimensional array;
step 1-2-2, traversing the array from the start and dividing it, according to the level of each value, into three states denoted sta0, sta1 and sta2, corresponding to the three levels of step 1-2-1, where a run in sta1 or sta2 may also be written sta_b, the black state, and determining the length each state is held from the transitions between states, denoted sta0_h, sta2_h and sta_b_h respectively; the rules are as follows, wherein min_cut_blank represents the minimum boundary blank height, min_txt the minimum text-line height, max_contain_blank the maximum contained blank height, and formula_line the formula-line height:
1) the sta_b run between two division points is determined to be a black block;
2) if the start or the end of the array is in the sta_b state, it automatically serves as a division point;
3) if sta0_h > min_cut_blank, the boundaries of the adjacent sta_b runs are directly marked as division points;
4) if sta2_h <= formula_line and sta0_h <= max_contain_blank on both sides, a formula structure is identified and fused with the black blocks before and after the determined division points;
5) if sta_b_h > min_txt and its adjacent sta0_h > max_contain_blank, both ends of the state are directly marked as division points; this run is a black block;
6) if sta_b_h > min_txt and sta0_h <= max_contain_blank holds at either end, a parent structure is matched and fused with the child structure at the qualifying end; otherwise the sta at that end is marked as a division point;
7) if sta_b_h <= min_txt and sta0_h <= max_contain_blank holds at either end, a substructure is matched and fused with the black block at the qualifying end; otherwise the sta at that end is marked as a division point;
1-2-3, updating each region coordinate of the picture according to all the segmentation points, and storing the region coordinates into a region coordinate library;
step 1-3, loading each region picture in turn from the coordinate library as the input picture and projecting and segmenting it in the vertical direction; the vertical projection is much simpler than the horizontal one, specifically:
step 1-3-1, counting the number of black pixels in each column in the vertical direction, storing the counts as a one-dimensional array, and quantizing them into two levels denoted 0 and 1, wherein 0 represents blank or nearly blank and 1 represents distributed pixels; entering step 1-3-2;
step 1-3-2, traversing the array from the start and dividing it, according to the level of each value, into two states, sta_white and sta_black, and determining the length each state is held from the transitions between states, denoted sta_white_h and sta_black_h respectively; the rules are as follows, wherein h_min_cut_blank represents the minimum splittable blank width:
1) the sta_black run between two division points is determined to be a black block;
2) if the start or the end of the array is in the sta_black state, it automatically serves as a division point;
3) sta_white_h > h_min_cut_blank triggers the split mode, and the boundaries of the sta_black runs at both ends are marked as division points;
1-3-3, updating each region coordinate of the picture according to all the segmentation points, and storing the region coordinates into a region coordinate library;
step 1-4, repeating step 1-2 and step 1-3 for each segmented region until the height of a region is smaller than the minimum text-line height or its coordinates no longer change between passes, in which case the region is skipped;
and step 1-5, once no region can be subdivided any further, transferring the stored coordinate library of all regions to step 2.
3. The method of claim 1, characterized in that labels are assigned by comparing the parsed annotation labels with the segmented regions, after which data equalization and error-correction processing are performed, the error correction specifically comprising:
performing error correction according to the label and size information of each region to ensure the accuracy of the data set, mainly as follows:
1) if a region of the table or picture category is smaller than a certain threshold, it is regarded as noise interference and discarded;
2) if a region of the text category is taller than two lines of text, it is discarded;
3) if a region of the text category is narrower than a certain threshold, horizontal copying and splicing is performed; the splice-count formula is:
(The splice-count formula appears only as an image in the original publication.)
wherein copy_num is the number of spliced copies, round denotes rounding, avg_width is the average picture width, width is the current picture width, and interval is the splice interval.
4. The method of claim 1, characterized in that the dual-channel hybrid classification network of step 6 comprises the following:
step 6-1, taking the first 7 layers of the model trained in step 4 as the two-dimensional feature extractor of the region pictures, denoted the 2D CNN extractor, whose input is the two-dimensional color region picture;
step 6-2, taking the first 7 layers of the model trained in step 5 as the one-dimensional feature extractor of the region pictures, denoted the 1D CNN extractor, whose input is the one-dimensional vector;
and step 6-3, the final classification network is a three-layer fully connected network, differing from a traditional fully connected network in the following specifics:
1) the first fully connected layer is split into two parallel layers connected to the two feature extractors, wherein the layer connected to the 1D CNN extractor has relatively few nodes and the layer connected to the 2D CNN extractor relatively many; the ratio is chosen according to the specific situation, e.g. 1:4 with 1024 and 4096 nodes respectively;
2) the two parallel fully connected layers of the first layer are connected to a single merged fully connected layer (4096 nodes may be taken), followed by a fully connected layer with as many nodes as there are classes for outputting the class probabilities; only these three layers are trained.
CN201911180193.2A 2019-11-27 2019-11-27 Document object classification method based on dual-channel hybrid convolution network Pending CN111062264A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911180193.2A CN111062264A (en) 2019-11-27 2019-11-27 Document object classification method based on dual-channel hybrid convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911180193.2A CN111062264A (en) 2019-11-27 2019-11-27 Document object classification method based on dual-channel hybrid convolution network

Publications (1)

Publication Number Publication Date
CN111062264A true CN111062264A (en) 2020-04-24

Family

ID=70298769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911180193.2A Pending CN111062264A (en) 2019-11-27 2019-11-27 Document object classification method based on dual-channel hybrid convolution network

Country Status (1)

Country Link
CN (1) CN111062264A (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE60302191D1 (en) * 2002-08-27 2005-12-15 Oce Print Logic Technologies S Determination of the skew of document images
US20050281463A1 (en) * 2004-04-22 2005-12-22 Samsung Electronics Co., Ltd. Method and apparatus for processing binary image
CN102496018A (en) * 2011-12-08 2012-06-13 方正国际软件有限公司 Document skew detection method and system
CN107220641A (en) * 2016-03-22 2017-09-29 华南理工大学 A kind of multi-language text sorting technique based on deep learning
CN108108731A (en) * 2016-11-25 2018-06-01 中移(杭州)信息技术有限公司 Method for text detection and device based on generated data
CN106650721A (en) * 2016-12-28 2017-05-10 吴晓军 Industrial character identification method based on convolution neural network
CN107609549A (en) * 2017-09-20 2018-01-19 北京工业大学 The Method for text detection of certificate image under a kind of natural scene
CN108614997A (en) * 2018-04-04 2018-10-02 南京信息工程大学 A kind of remote sensing images recognition methods based on improvement AlexNet

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANGELOS P. GIOTIS ET AL.: "A survey of document image word spotting techniques", Pattern Recognition *
BERAT KURAR BARAKAT ET AL.: "Binarization Free Layout Analysis for Arabic Historical Documents Using Fully Convolutional Networks", 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition *
HE JINGYU: "Extraction and Analysis of Formulas and Text in Document Images with Complex Layouts", China Master's Theses Full-text Database, Information Science and Technology *
HUANG SHENG ET AL.: "Resume information entity extraction method based on deep learning", Computer Engineering and Design *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128395A (en) * 2021-04-16 2021-07-16 重庆邮电大学 Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113128395B (en) * 2021-04-16 2022-05-20 重庆邮电大学 Video action recognition method and system based on hybrid convolution multistage feature fusion model

Similar Documents

Publication Publication Date Title
CN110276765B (en) Image panorama segmentation method based on multitask learning deep neural network
US10552705B2 (en) Character segmentation method, apparatus and electronic device
CN111639646B (en) Test paper handwritten English character recognition method and system based on deep learning
US9552536B2 (en) Image processing device, information storage device, and image processing method
US20190188528A1 (en) Text detection method and apparatus, and storage medium
US8155445B2 (en) Image processing apparatus, method, and processing program for image inversion with tree structure
US10423852B1 (en) Text image processing using word spacing equalization for ICR system employing artificial neural network
CN109241861B (en) Mathematical formula identification method, device, equipment and storage medium
CN112287941B (en) License plate recognition method based on automatic character region perception
CN110334709B (en) License plate detection method based on end-to-end multi-task deep learning
CN110598581B (en) Optical music score recognition method based on convolutional neural network
US11915465B2 (en) Apparatus and methods for converting lineless tables into lined tables using generative adversarial networks
CN113673338A (en) Natural scene text image character pixel weak supervision automatic labeling method, system and medium
CN113361432A (en) Video character end-to-end detection and identification method based on deep learning
CN112819840A (en) High-precision image instance segmentation method integrating deep learning and traditional processing
WO2000062243A1 (en) Character string extracting device and method based on basic component in document image
CN113591831A (en) Font identification method and system based on deep learning and storage medium
CN116824608A (en) Answer sheet layout analysis method based on target detection technology
CN115578741A (en) Mask R-cnn algorithm and type segmentation based scanned file layout analysis method
CN114330234A (en) Layout structure analysis method and device, electronic equipment and storage medium
US9710703B1 (en) Method and apparatus for detecting texts included in a specific image
KR101571681B1 (en) Method for analysing structure of document using homogeneous region
CN111062264A (en) Document object classification method based on dual-channel hybrid convolution network
CN111476226A (en) Text positioning method and device and model training method
KR102026280B1 (en) Method and system for scene text detection using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200424