CN111062264A - Document object classification method based on dual-channel hybrid convolution network - Google Patents

Document object classification method based on dual-channel hybrid convolution network

Info

Publication number
CN111062264A
Authority
CN
China
Prior art keywords
picture
dimensional
blank
sta
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911180193.2A
Other languages
Chinese (zh)
Inventor
张盛峰
田朝阳
黄胜
贾艳秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201911180193.2A priority Critical patent/CN111062264A/en
Publication of CN111062264A publication Critical patent/CN111062264A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a document object classification method based on a dual-channel hybrid convolution network, used to segment and classify each logical object (text, formula, table and image) in a document picture. First, a multi-pattern-matching recursive RLSA analysis is performed on the input picture to determine the segmentation coordinates; the input picture is then divided into different logical regions according to these coordinates. The regions are labeled, denoised and class-balanced to obtain a classification data set. The two-dimensional region pictures are then fed into a two-dimensional CNN for training, while the projections of each picture in the two directions are extracted and fed into a one-dimensional CNN for training. Finally, the first seven layers of the two convolutional networks are used as feature extractors, and the final model is trained through a dual-channel hybrid classification network; this model can then predict the object class of a region picture. By using the original two-dimensional picture and its two directional projections as inputs, the invention exploits complementary features and improves classification accuracy.

Description

Document object classification method based on dual-channel hybrid convolution network
Technical Field
The invention relates to the field of document object detection and identification, in particular to a document object classification method based on a dual-channel hybrid convolution network.
Background
With the rapid development of machine learning and deep learning in recent years, Document Image Understanding (DIU) technology has received more and more attention. As the name implies, document image understanding means understanding the content of a document from its picture. It can be divided into page segmentation (also called region segmentation), region classification (also called block labeling), document object recognition and other steps; the present invention addresses the first two steps, namely document object detection and identification.
Current page segmentation techniques fall into two categories. The first is based on pixel processing: a series of rules is formulated according to the distribution of pixels in the picture to segment the different region blocks. Specific methods include projection analysis, RLSA (Run Length Smoothing Algorithm) analysis (Cesarini F, Lastri M, Marinai S, et al. Encoding of modified X-Y trees for document classification [C]// Proc. 6th International Conference on Document Analysis and Recognition, 2001), blank (white-space) analysis, connected-component extraction, and so on. The second uses deep-learning object detection, such as sliding windows or selective-search algorithms, which mainly traverse candidate windows exhaustively and keep the window with the highest score. Such methods generalize well, but data sets for document object detection are small and the accuracy is low.
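For reference, the following is a minimal sketch of the classic horizontal RLSA smoothing described above, assuming a 0/1 binary image where 1 marks a black pixel; the function names and the smoothing threshold `c` are illustrative, not from the patent:

```python
import numpy as np

def rlsa_row(row: np.ndarray, c: int) -> np.ndarray:
    """Fill white runs shorter than c that lie between two black pixels."""
    out = row.copy()
    gap_start = None
    for i, v in enumerate(row):
        if v == 0:
            if gap_start is None:
                gap_start = i
        else:
            # close a gap only if it started after a black pixel
            if gap_start is not None and gap_start > 0 and i - gap_start < c:
                out[gap_start:i] = 1
            gap_start = None
    return out

def rlsa_horizontal(binary: np.ndarray, c: int = 30) -> np.ndarray:
    """Apply row-wise run-length smoothing to the whole binary image."""
    return np.stack([rlsa_row(r, c) for r in binary])
```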
However, current learning-based methods lack large data sets and give lower accuracy. Among the existing rule-based algorithms, projection analysis and RLSA analysis both binarize the projection data and do not distinguish more complex pixel distributions; blank analysis and RLSA analysis judge only by continuous run length and do not distinguish different structural characteristics; and connected-component extraction requires erosion and dilation of pixels and cannot effectively segment a single line of text.
Region classification methods are generally divided into two types, deep-learning-based and rule-based. Rule-based methods are the more traditional ones, mainly including methods based on homogeneous regions, on color distribution, and on morphological comparison. The homogeneous-region methods compute the pixel distribution of each region according to self-defined rules, compute a difference index between two regions by some algorithm, and finally decide with a threshold. Color-distribution classification relies on differences in the color distribution of pixels between background and foreground and between different classes. Morphological comparison matches extracted connected components against specific characters (formula symbols) or structures (table frames) to determine the category. Deep-learning methods mainly train different convolutional neural networks (CNNs) for classification (Yi X, Gao L, Liao Y, et al. CNN based page object detection in document images [C]// 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2017); commonly used networks include small networks such as LeNet, AlexNet (Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks [C]// NIPS, 2012), the ZF network, VGGNet and others.
However, rule-based classification generally requires different rules for different document types, so generalization is poor; rule design depends on experience, which strongly affects the results. Existing deep-learning methods extract and compute only a large number of two-dimensional features of the two-dimensional picture, ignoring the one-dimensional distribution features of a document, such as its distribution characteristics in the horizontal and vertical directions.
The multi-pattern-matching recursive RLSA analysis proposed by the invention can separate out single-line text regions, which makes the classification results easier to recognize, and its formula-line detection and multi-pattern matching improve the classification of complex formula structures. In addition, during classification two feature extractors are used to extract two-dimensional picture features and one-dimensional projection features respectively for hybrid classification training, which increases the number of input features of the network and improves classification accuracy.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a document object classification method based on a dual-channel hybrid convolutional network. During segmentation it uses a multi-pattern-matching RLSA segmentation method that converts the projection data into multiple levels and discriminates between different object structure types so as to apply different segmentation operations, improving segmentation accuracy. During classification, a two-dimensional AlexNet network and a one-dimensional AlexNet network taking pixel projections as input are used as extractors and fed into a dual-channel classification network, overcoming the shortcoming that a two-dimensional network attends only to two-dimensional features while ignoring one-dimensional features, and improving classification precision.
The invention discloses a document object classification method based on a dual-channel hybrid convolution network, which comprises the following specific schemes and steps:
step 1, performing multi-pattern-matching recursive RLSA analysis on the input picture to determine the segmentation coordinates, comprising the following substeps:
step 1-1, performing color-space conversion on the original picture to obtain a gray image, setting a threshold to convert it into a binary image, and initializing the region coordinate library with the diagonal coordinates of the picture; at this point the library contains only one region, the whole original picture; entering step 1-2;
step 1-2, loading each region picture in turn from the coordinate library as the input picture and projecting and segmenting it in the horizontal direction, which specifically divides into the following steps:
step 1-2-1, counting the black pixels of each row in the horizontal direction and quantizing the counts into three levels according to their distribution, denoted 0, 1 and 2, wherein 0 represents blank or nearly blank, 1 represents a few black pixels, and 2 represents many black pixels; storing the result in a one-dimensional array and entering step 1-2-2 (a minimal sketch of this quantization follows this step);
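A minimal sketch of this three-level quantization, assuming a 0/1 binary image (1 = black); the level thresholds `low` and `high` are illustrative, since the patent does not state their values:

```python
import numpy as np

def trivalue_profile(binary: np.ndarray, low: int = 2, high: int = 40) -> np.ndarray:
    """Quantize the horizontal projection (black pixels per row) into
    levels 0 (blank or near blank), 1 (a few black pixels), 2 (many)."""
    counts = binary.sum(axis=1)           # black-pixel count of each row
    levels = np.zeros(len(counts), dtype=np.int64)
    levels[counts > low] = 1
    levels[counts > high] = 2
    return levels
```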
step 1-2-2, traversing the array from the start and dividing it, according to the level of each value, into three states denoted sta0, sta1 and sta2, corresponding to the three levels of step 1-2-1, where a run in sta1 or sta2 may also be written sta_b, the black state, and determining the length each state is held from the transitions between states, denoted sta0_h, sta2_h and sta_b_h respectively; the rules are as follows, wherein min_cut_blank represents the minimum boundary blank height (10 may be taken), min_txt the minimum text-line height (10 may be taken), max_contain_blank the maximum contained blank height (5 may be taken), and formula_line the formula-line height (3 may be taken); a simplified code sketch follows the rule list:
1) the sta_b run between two division points is determined to be a black block;
2) if the start or the end of the array is in the sta_b state, it automatically serves as a division point;
3) if sta0_h > min_cut_blank, the boundaries of the adjacent sta_b runs are directly marked as division points;
4) if sta2_h <= formula_line and sta0_h <= max_contain_blank on both sides, a formula structure is identified and fused with the black blocks before and after the determined division points;
5) if sta_b_h > min_txt and its adjacent sta0_h > max_contain_blank, both ends of the state are directly marked as division points; this run is a black block;
6) if sta_b_h > min_txt and sta0_h <= max_contain_blank holds at either end, a parent structure is matched and fused with the child structure at the qualifying end; otherwise the sta at that end is marked as a division point;
7) if sta_b_h <= min_txt and sta0_h <= max_contain_blank holds at either end, a substructure is matched and fused with the black block at the qualifying end; otherwise the sta at that end is marked as a division point;
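As a simplified sketch of this traversal, the fragment below applies only the plain blank-gap rule 3) to the level array and returns the black blocks as row ranges; the formula fusion and parent/child matching of rules 4) to 7) are omitted. Names follow the rule list above:

```python
def blank_split_blocks(levels, min_cut_blank: int = 10):
    """Split the level array wherever a level-0 run exceeds min_cut_blank;
    returns (start_row, end_row) pairs of the surviving black blocks."""
    blocks, start, blank = [], None, 0
    for i, lv in enumerate(levels):
        if lv == 0:
            blank += 1
            if start is not None and blank > min_cut_blank:
                blocks.append((start, i - blank + 1))   # close the block
                start = None
        else:
            if start is None:
                start = i                               # open a black block
            blank = 0
    if start is not None:
        blocks.append((start, len(levels)))             # rule 2): array end
    return blocks
```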
1-2-3, updating each region coordinate of the picture according to all the segmentation points, and storing the region coordinates into a region coordinate library;
step 1-3, loading each region picture in turn from the coordinate library as the input picture and projecting and segmenting it in the vertical direction; the vertical projection is much simpler than the horizontal one, specifically:
step 1-3-1, counting the number of black pixels in the vertical direction, storing the number of the black pixels as a one-dimensional array, dividing the black pixels into two levels according to different black pixel number distributions, respectively representing the levels by 0 and 1, wherein 0 represents blank or is close to blank, and 1 represents pixel distribution, and entering step 1-3-2;
step 1-3-2, traversing the array from the start and dividing it, according to the level of each value, into two states, sta_white and sta_black, and determining the length each state is held from the transitions between states, denoted sta_white_h and sta_black_h respectively; the rules are as follows, wherein h_min_cut_blank represents the minimum splittable blank width (20 may be taken):
1) the sta_black run between two division points is determined to be a black block;
2) if the start or the end of the array is in the sta_black state, it automatically serves as a division point;
3) sta_white_h > h_min_cut_blank triggers the split mode, and the boundaries of the sta_black runs at both ends are marked as division points;
1-3-3, updating each region coordinate of the picture according to all the segmentation points, and storing the region coordinates into a region coordinate library;
step 1-4, repeating step 1-2 and step 1-3 for each segmented region until the height of a region is smaller than the minimum text-line height or its coordinates no longer change between passes, in which case the region is skipped;
step 1-5, once no region can be subdivided any further, transferring the stored coordinate library of all regions to step 2 (a sketch of this recursive driver follows);
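A compact sketch of the recursive driver of steps 1-2 to 1-5; `split_h` and `split_v` stand for the horizontal and vertical passes above and are assumed to return the sub-boxes of a region (a single box when nothing splits). Both callables, and the (x0, y0, x1, y1) box convention, are illustrative:

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]          # (x0, y0, x1, y1)

def segment_page(page: Box,
                 split_h: Callable[[Box], List[Box]],
                 split_v: Callable[[Box], List[Box]],
                 min_txt: int = 10) -> List[Box]:
    """Re-queue every region until no pass subdivides it any further."""
    final, work = [], [page]
    while work:
        box = work.pop()
        if box[3] - box[1] < min_txt:    # below minimum text-line height
            final.append(box)
            continue
        parts = split_h(box)
        if len(parts) == 1:
            parts = split_v(box)
        if len(parts) == 1:              # stable: keep the region
            final.append(box)
        else:
            work.extend(parts)           # recurse on the new sub-regions
    return final
```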
step 2, dividing the input color picture into region pictures containing the different logical objects according to the division coordinates of step 1; during model invocation, the region pictures are passed directly as input to the dual-channel hybrid classification network of step 6;
step 3, labeling the regions according to the data-set annotations, removing noise and performing class balancing to obtain a classification data set of region pictures; this step is skipped during model invocation:
step 3-1, parsing the data-set annotation file to obtain the true label and coordinates of each object region, then comparing each annotated region with the segmentation results of step 2 by IoU (the intersection of the two regions over their union; a sketch follows), with the threshold set to 0.8, to obtain labels for all segmented regions;
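A minimal sketch of the IoU comparison, assuming boxes given as (x0, y0, x1, y1); a segmented region inherits the annotated label when the value reaches the 0.8 threshold:

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```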
step 3-2, performing error correction according to the label and size information of each region to ensure the accuracy of the data set, mainly as follows:
1) if a region of the table or picture category is smaller than a certain threshold (20), it is regarded as noise interference and discarded;
2) if a region of the text category is taller than two lines of text, it is discarded;
3) if a region of the text category is narrower than a certain threshold, it is horizontally copied and spliced; the splice-count formula is:
(The splice-count formula appears only as an image in the original publication; a hypothetical reconstruction follows below.)
wherein copy_num is the number of spliced copies, round denotes rounding, avg_width is the average width of the region pictures, width is the current picture width, and interval is the splice interval;
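Since the formula itself is not reproduced in the text, the following is only a hypothetical reconstruction consistent with the variable list: the number of copies that, separated by `interval` pixels, roughly reaches the average region width:

```python
def copy_num(avg_width: float, width: float, interval: float) -> int:
    # Hypothetical reading of the image-only formula:
    # n copies of `width` plus (n - 1) intervals should approximate avg_width,
    # so n is about (avg_width + interval) / (width + interval).
    return max(1, round((avg_width + interval) / (width + interval)))
```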
step 3-3, because the numbers of pictures of the different classes in the data set differ greatly, performing data-set balancing to reduce training imbalance: the region pictures of under-represented classes are flipped to expand their number, and a random subset of the over-represented classes is selected as the training set (see the sketch below).
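A sketch of the balancing step under the stated scheme: under-represented classes are grown by flipping (OpenCV's cv2.flip is one way to do this) and over-represented ones are subsampled; the target count and function name are illustrative:

```python
import random
import cv2

def balance_class(images, target: int):
    """Return roughly `target` pictures for one class."""
    if len(images) >= target:
        return random.sample(images, target)     # subsample large classes
    out = list(images)
    for img in images:                           # expand small classes
        if len(out) >= target:
            break
        out.append(cv2.flip(img, 1))             # 1 = horizontal flip
    return out
```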
Step 4, sending the two-dimensional color region pictures of the processed data set into a two-dimensional CNN for training and saving the trained weights as the two-dimensional feature extractor of the dual-channel hybrid classification network; this step is skipped during model invocation. It specifically comprises the following substeps:
step 4-1, resizing the input pictures to 248 × 450 to conform to the statistics of the original picture sizes in the training set;
step 4-2, taking the AlexNet network as the prototype, keeping the structure of the convolutional part unchanged, initializing it with publicly available pretrained parameters and keeping those parameters fixed, and attaching three fully connected layers after the convolutional part with input/output sizes fc1: 6 × 12 × 256 → 4096, fc2: 4096 → 4096, fc3: 4096 → 4, wherein dropout is applied after the fc1 and fc2 layers and the fully connected layers are randomly initialized (a sketch follows);
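A minimal PyTorch sketch of this design, assuming torchvision's pretrained AlexNet as the convolutional prototype; the flattened size is probed dynamically rather than hard-coded to 6 × 12 × 256, since exact padding may differ from the patent's figures:

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoDChannelNet(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        backbone = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
        self.features = backbone.features            # pretrained conv stack
        for p in self.features.parameters():
            p.requires_grad = False                   # keep conv weights fixed
        with torch.no_grad():                         # probe flattened size
            n = self.features(torch.zeros(1, 3, 248, 450)).numel()
        self.classifier = nn.Sequential(              # randomly initialized
            nn.Linear(n, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):                             # x: (N, 3, 248, 450)
        return self.classifier(torch.flatten(self.features(x), 1))
```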
step 4-3, training and testing on the processed classification data set with learning rate 0.001, dropout rate 0.5 and batch_size 200, keeping the model parameters with the highest test accuracy across iterations, and saving and freezing all parameters as the extractor.
Step 5, extracting the projections of each two-dimensional picture in the two directions, merging them into one-dimensional data and sending them into a one-dimensional CNN for training, and saving the trained weights as the one-dimensional feature extractor of the dual-channel hybrid classification network; this step is skipped during model invocation. It specifically comprises the following substeps:
step 5-1, resizing each region picture to 384 × 640, projecting it in the horizontal and vertical directions by counting black pixels, and concatenating the two projections into a 1024-dimensional vector as input (a sketch follows);
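A sketch of the projection input under these sizes: 384 row counts plus 640 column counts give the 1024-dimensional vector (note that cv2.resize takes (width, height), and the binarization threshold of 180 is carried over from step 1-1):

```python
import cv2
import numpy as np

def projection_vector(region_bgr: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (640, 384))                  # width 640, height 384
    _, binary = cv2.threshold(gray, 180, 1, cv2.THRESH_BINARY_INV)  # 1 = black
    h_proj = binary.sum(axis=1)                          # 384 row counts
    v_proj = binary.sum(axis=0)                          # 640 column counts
    return np.concatenate([h_proj, v_proj]).astype(np.float32)      # 1024 values
```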
step 5-2, taking the AlexNet network as the prototype and keeping only the first dimension of the kernel sizes and strides of the convolutional layers (the second dimension is 1), with the remaining parameters matching the convolutional part of step 4, followed by three fully connected layers of sizes fc1: 30 × 256 → 4096, fc2: 4096 → 4096, fc3: 4096 → 4, wherein dropout is applied after the fc1 and fc2 layers and all layers are randomly initialized (a sketch follows);
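A corresponding sketch of the one-dimensional network, transcribing AlexNet's kernel sizes, strides and channel counts to Conv1d; as above, the flattened size is probed rather than hard-coded to 30 × 256:

```python
import torch
import torch.nn as nn

class OneDChannelNet(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(                 # AlexNet conv stack in 1-D
            nn.Conv1d(1, 64, 11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool1d(3, stride=2),
            nn.Conv1d(64, 192, 5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool1d(3, stride=2),
            nn.Conv1d(192, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(384, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv1d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool1d(3, stride=2),
        )
        with torch.no_grad():                          # probe flattened size
            n = self.features(torch.zeros(1, 1, 1024)).numel()
        self.classifier = nn.Sequential(
            nn.Linear(n, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):                              # x: (N, 1, 1024)
        return self.classifier(torch.flatten(self.features(x), 1))
```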
step 5-3, training and testing on the same data set with learning rate 0.001, dropout rate 0.5 and batch_size 200, keeping the model parameters with the highest test accuracy across iterations, and saving and freezing all parameters as the extractor;
step 6, using the first seven layers of the convolutional network models trained in steps 4 and 5 as feature extractors and taking the extracted feature data as the input of a dual-channel classification network to form the dual-channel hybrid classification network, performing the final classification training on this network, and saving the trained weights to obtain the final model, specifically comprising the following substeps:
step 6-1, the first fully connected layer is split into two parallel layers connected to the two feature extractors, wherein the layer connected to the 1D CNN extractor has relatively few nodes and the layer connected to the 2D CNN extractor relatively many; the ratio is chosen according to the specific situation, e.g. 1:4 with 1024 and 4096 nodes respectively;
step 6-2, the two parallel fully connected layers of the first layer are connected to a single merged fully connected layer (4096 nodes may be taken), followed by a fully connected layer with as many nodes as there are classes for outputting the class probabilities; only these three layers are trained (a sketch follows);
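A sketch of the fusion head, taking the two extractor feature vectors as inputs with the 1:4 node split of step 6-1; the feature dimensions `feat1d` and `feat2d` come from the frozen extractors and are left as parameters:

```python
import torch
import torch.nn as nn

class DualChannelHead(nn.Module):
    def __init__(self, feat1d: int, feat2d: int, num_classes: int = 4):
        super().__init__()
        self.branch1d = nn.Linear(feat1d, 1024)        # 1-D channel, fewer nodes
        self.branch2d = nn.Linear(feat2d, 4096)        # 2-D channel, more nodes
        self.fuse = nn.Sequential(                     # merged layer + class layer
            nn.Linear(1024 + 4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, f1d: torch.Tensor, f2d: torch.Tensor):
        merged = torch.cat([torch.relu(self.branch1d(f1d)),
                            torch.relu(self.branch2d(f2d))], dim=1)
        return self.fuse(merged)
```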
step 6-3, training again with the classification data set at learning rate 0.01, dropout rate 0.5 and batch_size 200, and saving the best-performing parameters as the final model.
Compared with the prior art, the invention has the following advantages:
1) during segmentation, the multi-pattern-matching RLSA segmentation method converts the projection data into multiple levels and discriminates between different object structure types so as to apply different segmentation operations, improving segmentation accuracy;
2) text is segmented into single lines, which removes the classification confusion between parts of texts and tables, and among texts themselves, caused by block regions containing different numbers of lines; this effectively improves classification accuracy and facilitates extensions such as recognition and recovery;
3) during classification, a two-dimensional AlexNet taking two-dimensional features as input and a one-dimensional AlexNet taking pixel projections as input are used as extractors feeding a dual-channel classification network, overcoming the shortcoming that a two-dimensional network attends only to two-dimensional features while ignoring one-dimensional ones; the network input picture size is also adapted to the statistical distribution of the data set, improving the applicability to document object pictures and the classification precision;
4) the first layer of the final dual-channel classification network is split into two parallel sublayers corresponding to the outputs of the two feature-extraction networks, giving the classification network better control over the proportion contributed by each network, effectively making full use of the features and improving classification precision.
Drawings
In order to make the object, technical scheme and beneficial effect of the invention more clear, the invention provides the following drawings for explanation:
FIG. 1 is a flowchart of the overall steps of the present invention;
FIG. 2 is a flow chart of the multi-pattern matching recursive RLSA analysis method of the present invention;
FIG. 3 is a diagram of the improved 2-dimensional AlexNet CNN network architecture of the present invention;
FIG. 4 is a diagram of the improved 1-dimensional AlexNet CNN network architecture of the present invention;
FIG. 5 is a diagram of the dual-channel hybrid classification network architecture of the present invention.
Detailed Description of the Preferred Embodiments
The present invention will be described in detail with reference to the following examples. It should be noted that the described embodiments are for illustrative purposes only and are not limiting on the scope of the invention.
The invention discloses a document object classification method based on a dual-channel hybrid convolution network, which comprises the following specific schemes and steps:
step 1, performing multi-pattern-matching recursive RLSA analysis on the input picture to determine the segmentation coordinates, comprising the following substeps:
step 1-1, performing color-space conversion on the original picture with OpenCV, converting it to a gray image with CV_RGB2GRAY, setting the threshold to 180 to convert it into a binary image, and initializing the region coordinate library with the diagonal coordinates of the picture; at this point the library contains only one region, the whole original picture; entering step 1-2;
step 1-2, loading each region picture in turn from the coordinate library as the input picture and projecting and segmenting it in the horizontal direction, which specifically divides into the following steps:
step 1-2-1, counting the black pixels of each row in the horizontal direction and quantizing the counts into three levels according to their distribution, denoted 0, 1 and 2, wherein 0 represents blank or nearly blank, 1 represents a few black pixels, and 2 represents many black pixels; storing the result in a one-dimensional array and entering step 1-2-2;
step 1-2-2, traversing the array from the start and dividing it, according to the level of each value, into three states denoted sta0, sta1 and sta2, corresponding to the three levels of step 1-2-1, where a run in sta1 or sta2 may also be written sta_b, the black state, and determining the length each state is held from the transitions between states, denoted sta0_h, sta2_h and sta_b_h respectively; the rules are as follows, with min_cut_blank = 10 the minimum boundary blank height, min_txt = 10 the minimum text-line height, max_contain_blank = 5 the maximum contained blank height, and formula_line = 3 the formula-line height:
1) the sta_b run between two division points is determined to be a black block;
2) if the start or the end of the array is in the sta_b state, it automatically serves as a division point;
3) if sta0_h > 10, the boundaries of the adjacent sta_b runs are directly marked as division points;
4) if sta2_h <= 3 and sta0_h <= 5 on both sides, a formula structure is identified and fused with the black blocks before and after the determined division points;
5) if sta_b_h > 10 and its adjacent sta0_h > 5, both ends of the state are directly marked as division points; this run is a black block;
6) if sta_b_h > 10 and sta0_h <= 5 holds at either end, a parent structure is matched and fused with the child structure at the qualifying end; otherwise the sta at that end is marked as a division point;
7) if sta_b_h <= 10 and sta0_h <= 5 holds at either end, a substructure is matched and fused with the black block at the qualifying end; otherwise the sta at that end is marked as a division point;
1-2-3, updating each region coordinate of the picture according to all the segmentation points, and storing the region coordinates into a region coordinate library;
step 1-3, loading each region picture in turn from the coordinate library as the input picture and projecting and segmenting it in the vertical direction; the vertical projection is much simpler than the horizontal one, specifically:
step 1-3-1, counting the number of black pixels in each column in the vertical direction, storing the counts as a one-dimensional array, and quantizing them into two levels denoted 0 and 1, wherein 0 represents blank or nearly blank and 1 represents distributed pixels; entering step 1-3-2;
step 1-3-2, traversing the array from the start and dividing it, according to the level of each value, into two states, sta_white and sta_black, and determining the length each state is held from the transitions between states, denoted sta_white_h and sta_black_h respectively; the rules are as follows, with h_min_cut_blank = 20 the minimum splittable blank width:
1) the sta_black run between two division points is determined to be a black block;
2) if the start or the end of the array is in the sta_black state, it automatically serves as a division point;
3) sta_white_h > 20 triggers the split mode, and the boundaries of the sta_black runs at both ends are marked as division points;
1-3-3, updating each region coordinate of the picture according to all the segmentation points, and storing the region coordinates into a region coordinate library;
step 1-4, repeating step 1-2 and step 1-3 for each segmented region until the height of a region is smaller than the minimum text-line height or its coordinates no longer change between passes, in which case the region is skipped;
step 1-5, once no region can be subdivided any further, transferring the stored coordinate library of all regions to step 2;
step 2, dividing the input color picture into region pictures containing the different logical objects according to the division coordinates of step 1; during model invocation, the region pictures are passed directly as input to the dual-channel hybrid classification network of step 6;
step 3, labeling the regions according to the data-set annotations, removing noise and performing class balancing to obtain a classification data set of region pictures; this step is skipped during model invocation:
step 3-1, parsing the data-set annotation file to obtain the true label and coordinates of each object region, then comparing each annotated region with the segmentation results of step 2 by IoU (the intersection of the two regions over their union), with the threshold set to 0.8, to obtain labels for all segmented regions;
step 3-2, performing error correction according to the label and size information of each region to ensure the accuracy of the data set, mainly as follows:
1) if a region of the table or picture category is smaller than the threshold 20, the region is discarded;
2) if a region of the text category is taller than the two-line text height of 20, the region is discarded;
3) if a region of the text category is narrower than the threshold 20, horizontal copying and splicing is performed; the splice-count formula is:
(The splice-count formula appears only as an image in the original publication.)
wherein copy_num is the number of spliced copies, round denotes rounding, avg_width is the average picture width (406), width is the current picture width, and interval is the splice interval (8);
step 3-3, because the numbers of pictures of the different classes in the data set differ greatly, performing data-set balancing to reduce training imbalance: the region pictures of under-represented classes are flipped to expand their number, and a random subset of the over-represented classes is selected as the training set. The text, formula, table and picture classes number 46892, 3937, 699 and 1978 pictures respectively; therefore the table class is flipped three times, the picture class once, and the text class is capped at 3000, giving 3000, 3937, 2796 and 3956 pictures respectively after balancing.
Step 4, sending the two-dimensional color region pictures of the processed data set into a two-dimensional CNN for training and saving the trained weights as the two-dimensional feature extractor of the dual-channel hybrid classification network; this step is skipped during model invocation. It specifically comprises the following substeps:
step 4-1, resizing the input pictures to 248 × 450 to conform to the statistics of the original picture sizes in the training set;
step 4-2, taking the AlexNet network as the prototype, keeping the structure of the convolutional part unchanged, initializing it with publicly available pretrained parameters and keeping those parameters fixed, and attaching three fully connected layers after the convolutional part with input/output sizes fc1: 6 × 12 × 256 → 4096, fc2: 4096 → 4096, fc3: 4096 → 4, wherein dropout is applied after the fc1 and fc2 layers and the fully connected layers are randomly initialized;
step 4-3, training and testing on the processed classification data set with learning rate 0.001, dropout rate 0.5 and batch_size 200, keeping the model parameters with the highest test accuracy across iterations, and saving and freezing all parameters as the extractor;
Step 5, extracting the projections of each two-dimensional picture in the two directions, merging them into one-dimensional data and sending them into a one-dimensional CNN for training, and saving the trained weights as the one-dimensional feature extractor of the dual-channel hybrid classification network; this step is skipped during model invocation. It specifically comprises the following substeps:
step 5-1, resizing each region picture to 384 × 640, projecting it in the horizontal and vertical directions by counting black pixels, and concatenating the two projections into a 1024-dimensional vector as input;
step 5-2, taking the AlexNet network as the prototype and keeping only the first dimension of the kernel sizes and strides of the convolutional layers (the second dimension is 1), with the remaining parameters matching the convolutional part of step 4, followed by three fully connected layers of sizes fc1: 30 × 256 → 4096, fc2: 4096 → 4096, fc3: 4096 → 4, wherein dropout is applied after the fc1 and fc2 layers and all layers are randomly initialized;
step 5-3, training and testing on the same data set with learning rate 0.001, dropout rate 0.5 and batch_size 200, keeping the model parameters with the highest test accuracy across iterations, and saving and freezing all parameters as the extractor;
The specific parameters of these two networks are shown in Table 1 (the table appears only as an image in the original publication).
step 6, using the first seven layers of the convolutional network models trained in steps 4 and 5 as feature extractors and taking the extracted feature data as the input of a dual-channel classification network to form the dual-channel hybrid classification network, performing the final classification training on this network, and saving the trained weights to obtain the final model, specifically comprising the following substeps:
step 6-1, the first fully connected layer is split into two parallel layers connected to the two feature extractors, wherein the layer connected to the 1D CNN extractor has relatively few nodes and the layer connected to the 2D CNN extractor relatively many; the ratio is chosen according to the specific situation, e.g. 1:4 with 1024 and 4096 nodes respectively;
step 6-2, the two parallel fully connected layers of the first layer are connected to a single merged fully connected layer (4096 nodes may be taken), followed by a fully connected layer with as many nodes as there are classes for outputting the class probabilities; only these three layers are trained;
step 6-3, training again with the classification data set at learning rate 0.01, dropout rate 0.5 and batch_size 200, and saving the best-performing parameters as the final model; the best result is obtained after 20 iterations, with an accuracy of 98.02%.
The invention discloses a document object classification method based on a dual-channel hybrid convolution network, and there are many ways to implement the technical scheme. The above description is only a preferred embodiment of the invention; the protection scope is not limited to it, and any change or alternative that can readily be conceived by those skilled in the art within the technical scope of the invention shall be covered. Therefore, the protection scope of the present invention shall be subject to the claims.

Claims (4)

1. A document object classification method based on a dual-channel hybrid convolutional network, characterized by comprising two modes, model training and model invocation, wherein model invocation omits some of the steps of model training; the method comprises the following steps, which by default describe model training:
step 1, performing multi-pattern-matching recursive RLSA analysis on the input picture to determine the segmentation coordinates;
step 2, dividing the input color picture into region pictures containing the different logical objects according to the division coordinates of step 1; during model invocation, the region pictures are passed directly as input to the dual-channel hybrid classification network of step 6;
step 3, labeling the regions according to the data-set annotations, removing noise and performing class balancing to obtain a classification data set of region pictures; this step is skipped during model invocation;
step 4, sending the two-dimensional color region pictures of the processed data set into a two-dimensional CNN for training and saving the trained weights as the two-dimensional feature extractor of the dual-channel hybrid classification network; this step is skipped during model invocation;
step 5, extracting the projections of each two-dimensional picture in the two directions, merging them into one-dimensional data and sending them into a one-dimensional CNN for training, and saving the trained weights as the one-dimensional feature extractor of the dual-channel hybrid classification network; this step is skipped during model invocation;
and step 6, using the first seven layers of the convolutional network models trained in steps 4 and 5 as feature extractors and taking the extracted feature data as the input of the dual-channel classification network to form the dual-channel hybrid classification network, performing the final classification training on this network, and saving the trained weights to obtain the final model.
2. The method of claim 1, characterized in that step 1 quantizes the projection data into three values representing three different states and applies different segmentation rules according to those states to distinguish different cases, comprising the following substeps:
step 1-1, graying and binarizing an original picture, and initializing an area coordinate library by using a diagonal coordinate of the picture, wherein only one area in the coordinate library is the original picture;
step 1-2, loading area pictures as input pictures in sequence according to a coordinate library, and projecting and dividing the input pictures in the horizontal direction, wherein the steps can be specifically divided into the following steps;
step 1-2-1, counting the number of black pixels in the horizontal direction, dividing the number into three levels according to the difference of black pixel number distribution, respectively representing the three levels by 0, 1 and 2, wherein 0 represents blank or is close to blank, 1 represents that a small number of black pixels are distributed, and 2 represents that a large number of black pixels are distributed, and storing the counting result into a one-dimensional array;
step 1-2-2, traversing the array from the start and dividing it, according to the level of each value, into three states denoted sta0, sta1 and sta2, corresponding to the three levels of step 1-2-1, where a run in sta1 or sta2 may also be written sta_b, the black state, and determining the length each state is held from the transitions between states, denoted sta0_h, sta2_h and sta_b_h respectively; the rules are as follows, wherein min_cut_blank represents the minimum boundary blank height, min_txt the minimum text-line height, max_contain_blank the maximum contained blank height, and formula_line the formula-line height:
1) the sta_b run between two division points is determined to be a black block;
2) if the start or the end of the array is in the sta_b state, it automatically serves as a division point;
3) if sta0_h > min_cut_blank, the boundaries of the adjacent sta_b runs are directly marked as division points;
4) if sta2_h <= formula_line and sta0_h <= max_contain_blank on both sides, a formula structure is identified and fused with the black blocks before and after the determined division points;
5) if sta_b_h > min_txt and its adjacent sta0_h > max_contain_blank, both ends of the state are directly marked as division points; this run is a black block;
6) if sta_b_h > min_txt and sta0_h <= max_contain_blank holds at either end, a parent structure is matched and fused with the child structure at the qualifying end; otherwise the sta at that end is marked as a division point;
7) if sta_b_h <= min_txt and sta0_h <= max_contain_blank holds at either end, a substructure is matched and fused with the black block at the qualifying end; otherwise the sta at that end is marked as a division point;
1-2-3, updating each region coordinate of the picture according to all the segmentation points, and storing the region coordinates into a region coordinate library;
step 1-3, loading each region picture in turn from the coordinate library as the input picture and projecting and segmenting it in the vertical direction; the vertical projection is much simpler than the horizontal one, specifically:
step 1-3-1, counting the number of black pixels in each column in the vertical direction, storing the counts as a one-dimensional array, and quantizing them into two levels denoted 0 and 1, wherein 0 represents blank or nearly blank and 1 represents distributed pixels; entering step 1-3-2;
step 1-3-2, traversing the array from the start and dividing it, according to the level of each value, into two states, sta_white and sta_black, and determining the length each state is held from the transitions between states, denoted sta_white_h and sta_black_h respectively; the rules are as follows, wherein h_min_cut_blank represents the minimum splittable blank width:
1) the sta_black run between two division points is determined to be a black block;
2) if the start or the end of the array is in the sta_black state, it automatically serves as a division point;
3) sta_white_h > h_min_cut_blank triggers the split mode, and the boundaries of the sta_black runs at both ends are marked as division points;
1-3-3, updating each region coordinate of the picture according to all the segmentation points, and storing the region coordinates into a region coordinate library;
step 1-4, repeating step 1-2 and step 1-3 for each segmented region until the height of a region is smaller than the minimum text-line height or its coordinates no longer change between passes, in which case the region is skipped;
and step 1-5, once no region can be subdivided any further, transferring the stored coordinate library of all regions to step 2.
3. The method of claim 1, characterized in that labels are assigned by comparing the parsed annotation labels with the segmented regions, after which data equalization and error-correction processing are performed, the error correction specifically comprising:
performing error correction according to the label and size information of each region to ensure the accuracy of the data set, mainly as follows:
1) if a region of the table or picture category is smaller than a certain threshold, it is regarded as noise interference and discarded;
2) if a region of the text category is taller than two lines of text, it is discarded;
3) if a region of the text category is narrower than a certain threshold, horizontal copying and splicing is performed; the splice-count formula is:
(The splice-count formula appears only as an image in the original publication.)
wherein copy_num is the number of spliced copies, round denotes rounding, avg_width is the average picture width, width is the current picture width, and interval is the splice interval.
4. The method of claim 1, characterized in that the dual-channel hybrid classification network of step 6 comprises the following:
step 6-1, taking the first 7 layers of the model trained in step 4 as the two-dimensional feature extractor of the region pictures, denoted the 2D CNN extractor, whose input is the two-dimensional color region picture;
step 6-2, taking the first 7 layers of the model trained in step 5 as the one-dimensional feature extractor of the region pictures, denoted the 1D CNN extractor, whose input is the one-dimensional vector;
and step 6-3, the final classification network is a three-layer fully connected network, differing from a traditional fully connected network in the following specifics:
1) the first fully connected layer is split into two parallel layers connected to the two feature extractors, wherein the layer connected to the 1D CNN extractor has relatively few nodes and the layer connected to the 2D CNN extractor relatively many; the ratio is chosen according to the specific situation, e.g. 1:4 with 1024 and 4096 nodes respectively;
2) the two parallel fully connected layers of the first layer are connected to a single merged fully connected layer (4096 nodes may be taken), followed by a fully connected layer with as many nodes as there are classes for outputting the class probabilities; only these three layers are trained.
CN201911180193.2A 2019-11-27 2019-11-27 Document object classification method based on dual-channel hybrid convolution network Pending CN111062264A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911180193.2A CN111062264A (en) 2019-11-27 2019-11-27 Document object classification method based on dual-channel hybrid convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911180193.2A CN111062264A (en) 2019-11-27 2019-11-27 Document object classification method based on dual-channel hybrid convolution network

Publications (1)

Publication Number Publication Date
CN111062264A true CN111062264A (en) 2020-04-24

Family

ID=70298769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911180193.2A Pending CN111062264A (en) 2019-11-27 2019-11-27 Document object classification method based on dual-channel hybrid convolution network

Country Status (1)

Country Link
CN (1) CN111062264A (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE60302191D1 (en) * 2002-08-27 2005-12-15 Oce Print Logic Technologies S Determination of the skew of document images
US20050281463A1 (en) * 2004-04-22 2005-12-22 Samsung Electronics Co., Ltd. Method and apparatus for processing binary image
CN102496018A (en) * 2011-12-08 2012-06-13 方正国际软件有限公司 Document skew detection method and system
CN107220641A (en) * 2016-03-22 2017-09-29 华南理工大学 A kind of multi-language text sorting technique based on deep learning
CN108108731A (en) * 2016-11-25 2018-06-01 中移(杭州)信息技术有限公司 Method for text detection and device based on generated data
CN106650721A (en) * 2016-12-28 2017-05-10 吴晓军 Industrial character identification method based on convolution neural network
CN107609549A (en) * 2017-09-20 2018-01-19 北京工业大学 The Method for text detection of certificate image under a kind of natural scene
CN108614997A (en) * 2018-04-04 2018-10-02 南京信息工程大学 A kind of remote sensing images recognition methods based on improvement AlexNet

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANGELOS P. GIOTIS ET AL.: "A survey of document image word spotting techniques", Pattern Recognition *
BERAT KURAR BARAKAT ET AL.: "Binarization Free Layout Analysis for Arabic Historical Documents Using Fully Convolutional Networks", 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition *
HE JINGYU: "Extraction and Analysis of Formulas and Text in Document Images with Complex Layouts", China Master's Theses Full-text Database, Information Science and Technology *
HUANG SHENG ET AL.: "Resume information entity extraction method based on deep learning", Computer Engineering and Design *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128395A (en) * 2021-04-16 2021-07-16 重庆邮电大学 Video motion recognition method and system based on hybrid convolution and multi-level feature fusion model
CN113128395B (en) * 2021-04-16 2022-05-20 重庆邮电大学 Video action recognition method and system based on hybrid convolution multistage feature fusion model

Similar Documents

Publication Publication Date Title
CN110276765B (en) Image panorama segmentation method based on multitask learning deep neural network
US10552705B2 (en) Character segmentation method, apparatus and electronic device
CN111639646B (en) Test paper handwritten English character recognition method and system based on deep learning
US9552536B2 (en) Image processing device, information storage device, and image processing method
US20190188528A1 (en) Text detection method and apparatus, and storage medium
US8155445B2 (en) Image processing apparatus, method, and processing program for image inversion with tree structure
US10423852B1 (en) Text image processing using word spacing equalization for ICR system employing artificial neural network
CN109241861B (en) Mathematical formula identification method, device, equipment and storage medium
CN112287941B (en) License plate recognition method based on automatic character region perception
CN110334709B (en) License plate detection method based on end-to-end multi-task deep learning
CN110598581B (en) Optical music score recognition method based on convolutional neural network
US11915465B2 (en) Apparatus and methods for converting lineless tables into lined tables using generative adversarial networks
CN113673338A (en) Natural scene text image character pixel weak supervision automatic labeling method, system and medium
CN113361432A (en) Video character end-to-end detection and identification method based on deep learning
CN112819840A (en) High-precision image instance segmentation method integrating deep learning and traditional processing
WO2000062243A1 (en) Character string extracting device and method based on basic component in document image
CN113591831A (en) Font identification method and system based on deep learning and storage medium
CN116824608A (en) Answer sheet layout analysis method based on target detection technology
CN115578741A (en) Mask R-cnn algorithm and type segmentation based scanned file layout analysis method
CN114330234A (en) Layout structure analysis method and device, electronic equipment and storage medium
US9710703B1 (en) Method and apparatus for detecting texts included in a specific image
KR101571681B1 (en) Method for analysing structure of document using homogeneous region
CN111062264A (en) Document object classification method based on dual-channel hybrid convolution network
CN111476226A (en) Text positioning method and device and model training method
KR102026280B1 (en) Method and system for scene text detection using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200424