CN112163599B - Image classification method based on multi-scale and multi-level fusion - Google Patents

Image classification method based on multi-scale and multi-level fusion

Info

Publication number
CN112163599B
Authority
CN
China
Prior art keywords: scale, classifier, image, classification, fusion
Prior art date
Legal status
Active
Application number
CN202010932190.6A
Other languages
Chinese (zh)
Other versions
CN112163599A (en)
Inventor
万玉钗
刘峡壁
王穆荣
郑中枢
朱正
赵心明
Current Assignee
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date
Filing date
Publication date
Application filed by Beijing Technology and Business University
Priority to CN202010932190.6A
Publication of CN112163599A
Application granted
Publication of CN112163599B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems


Abstract

The invention discloses an image classification method based on multi-scale and multi-level fusion. The input image is subjected to scale transformation, and the resulting multi-scale images are taken as input, so that both global and local features of the image can be extracted to form a comprehensive description of the image target. The descriptive features of the image at the multiple scales are then fused at multiple levels to obtain the final classification and recognition result. In this way, the classification performance of a convolutional neural network image classifier is effectively improved from the two aspects of multi-scale image description and multi-level information fusion.

Description

Image classification method based on multi-scale and multi-level fusion
Technical Field
The invention belongs to the technical fields of pattern recognition, deep learning and artificial intelligence, and particularly relates to an image classification method based on multi-scale multi-level fusion.
Background
With the development of computer technology, deep learning has attracted the attention of more and more researchers and has achieved major breakthroughs in fields such as image classification, video analysis and natural language processing. The convolutional neural network (CNN) is a classical and widely used neural network architecture in deep learning, generally composed of convolutional layers, pooling layers, fully connected layers and the like. Its characteristics of local connectivity, weight sharing and pooling effectively reduce the complexity of the network and make it easy to train and optimize.
In the field of image recognition and classification, convolutional neural networks are generally applied in two modes. The first is as a feature extractor: abstract visual features of the image are extracted by the convolutional neural network and fed as input to other algorithms. Compared with traditional hand-designed features, features extracted by a convolutional neural network have stronger descriptive power, which benefits from the large parameter space and automatic learning capability of the network. The second, and more common in image recognition, is as an end-to-end classification model: the whole image is taken as input to the network, and after computation the classification result is produced at the output layer. Compared with traditional image classification methods, classification based on convolutional neural networks has brought large gains in classification accuracy and has therefore attracted wide attention.
Nevertheless, most existing convolutional-neural-network-based image classification algorithms feed the image into the network for analysis and perform fusion only at the image feature level. The classification information of the image is therefore not fully exploited, metrics such as classification accuracy still leave room for improvement, and the classification results are not yet fully satisfactory.
Disclosure of Invention
In view of the above, the invention provides an image classification method based on multi-scale and multi-level fusion, which can realize accurate classification of images.
The invention provides an image classification method based on multi-scale and multi-level fusion, which comprises the following steps:
step 1, representing an image to be classified as a plurality of images with different scales, respectively inputting the images with different scales into corresponding scale classifiers based on a convolutional neural network, and obtaining a plurality of scale classification decision results of the image to be classified;
step 2, the visual features extracted by the scale classifier are connected in series to form fusion features, and the fusion features are used as an input layer of a feature fusion classifier based on a convolutional neural network; determining the weight of the full connection layer of the feature fusion classifier according to the contribution of the fusion features to classification recognition; the output layer of the feature fusion classifier is a feature fusion classification decision result of the image to be classified;
and step 3, adopting evidence theory, calculating a basic probability distribution function m(·) of the scale classifiers and the feature fusion classifier according to the scale classification decision results and the feature fusion classification decision result, as shown in the following formula:

$$m_i(\{C_s\}) = acc_i \cdot p_i^j(C_s), \quad s = 1, \dots, n; \qquad m_i(\Theta) = 1 - acc_i$$

wherein $acc_i$ represents the average classification accuracy of the scale classifier or feature fusion classifier $i$ over the whole data set; $1 - acc_i$ represents the uncertainty of the classifier; $p_i^j(C_s)$, derived from the output of the scale classifier or feature fusion classifier $i$, represents the probability that classifier $i$ classifies image $j$ as class $C_s$; $n$ is the total number of image classes; and $\Theta$ is the frame of discernment, a finite set of mutually exclusive classes;
calculating a conflict coefficient according to the basic probability distribution function, and fusing the scale classification decision result and the feature fusion classification decision result to obtain a final classification decision result when the conflict coefficient is smaller than a set threshold value; and when the conflict coefficient is greater than a threshold value, calculating the confidence coefficient of the scale classifier and the feature fusion classifier, and taking the output of the classifier with the highest confidence coefficient as a final classification decision result.
Further, the images of different scales include three-scale images of sizes from small to large.
Further, when the area where the classification target is located is taken as an analysis object, the images with different scales are three image blocks with different scales, which are obtained by cutting the image to be classified around the area where the target is located; when the whole image is taken as an analysis object, the images with different scales are three-scale representations obtained by downsampling the image to be classified by adopting a Gaussian pyramid method.
Further, the feature fusion classifier in the step 2 adopts a transfer learning strategy to perform parameter training.
Further, in the step 3, the output value of the scale classifier or the feature fusion classifier is converted into a probability value by adopting a soft-max function.
The beneficial effects are that:
according to the invention, the input image is subjected to scale transformation, the multi-scale image is taken as input, the global and local features of the image can be extracted to form comprehensive description of the image target, and then the multi-scale description features of the image are subjected to multi-level fusion to obtain a final classification and identification result, namely, the classification performance of the convolutional neural network image classifier is effectively improved from the aspects of multi-scale image description and multi-level information fusion.
Drawings
Fig. 1 is a frame diagram of an image classification method based on multi-scale and multi-level fusion.
Fig. 2 (a) is a schematic diagram of a multi-scale image formed by a clipping method based on an image area, which is used in the image classification method based on multi-scale multi-level fusion.
Fig. 2 (b) is a schematic diagram of a multi-scale image formed by a gaussian pyramid method based on an image ensemble, which is used in the image classification method based on multi-scale and multi-level fusion.
Detailed Description
The invention will now be described in detail by way of example with reference to the accompanying drawings.
The basic idea of the image classification method based on multi-scale and multi-level fusion provided by the invention is as follows: through a multi-scale representation of the image, comprehensive description information covering both the global image and its local details is extracted; feature-level fusion is then performed, i.e. the visual feature information of the multiple scales is fused, convolutional neural networks are used to construct a classifier for each scale and a classifier on the fused features, and classification decisions are obtained from the per-scale information and from the fused information; finally, decision-level fusion combines the multiple decision results into the final classification decision.
As shown in Fig. 1, the image classification method based on multi-scale and multi-level fusion provided by the invention comprises three parts: multi-scale representation of the image, feature-level fusion, and decision-level fusion. It specifically comprises the following steps:
step 1, representing an image to be classified as a plurality of images with different scales, respectively inputting the images with different scales into corresponding scale classifiers based on a convolutional neural network, and obtaining a plurality of scale classification decision results of the image to be classified.
The invention represents an image at three scales, denoted "Large", "Middle" and "Small". At different scales, the recognition target exhibits different visual characteristics, so three different sets of features can be extracted from the three scale images.
For generating the multi-scale images, the invention provides two strategies for different application requirements. If the region containing the classification target is the object of analysis, three image blocks of different scales are cropped around the target region, as shown in Fig. 2(a). If the whole image is the object of analysis, the original image is downsampled with the Gaussian pyramid method to obtain three scale representations of the original image, as shown in Fig. 2(b).
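For illustration, the two strategies can be sketched in a few lines of Python; the function names, the OpenCV dependency and the center-crop convention are assumptions of this sketch rather than part of the disclosure (the crop sizes follow the experimental settings reported below):

```python
import cv2

def multiscale_by_cropping(image, center, sizes=(256, 128, 64)):
    """Crop three nested blocks ("Large", "Middle", "Small") around the target region."""
    cx, cy = center
    blocks = []
    for s in sizes:
        half = s // 2
        x0, y0 = max(cx - half, 0), max(cy - half, 0)
        blocks.append(image[y0:y0 + s, x0:x0 + s])
    return blocks

def multiscale_by_pyramid(image, levels=3):
    """Gaussian-pyramid downsampling: each level halves the resolution."""
    scales = [image]
    for _ in range(levels - 1):
        scales.append(cv2.pyrDown(scales[-1]))  # Gaussian blur + 2x downsample
    return scales
```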
The invention uses convolutional neural networks to extract the visual description features of the image. A neural network is constructed and trained for the image set of each scale, yielding three convolutional neural networks, one per scale. For each image, the activations of the fully connected layer are extracted as its visual feature representation, giving feature vectors at the three scales; each network then further classifies these visual features, so that the three scales yield three corresponding classification decisions.
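A minimal sketch of per-scale feature extraction with PyTorch/torchvision; the choice of ResNet18 as backbone is taken from the experiments below, while the helper name and the use of the penultimate activations as the fully connected layer features are assumptions of this sketch:

```python
import torch
import torchvision.models as models

# One network per scale ("Large", "Middle", "Small"); in the method each is
# trained separately on the image set of its own scale.
backbones = {s: models.resnet18(weights="IMAGENET1K_V1")
             for s in ("large", "middle", "small")}

def extract_features(model, x):
    """Return the activations that feed the final fully connected layer."""
    trunk = torch.nn.Sequential(*list(model.children())[:-1])  # drop model.fc
    return torch.flatten(trunk(x), 1)  # shape (batch, 512) for ResNet18
```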
Step 2: the visual features extracted by the scale classifiers are concatenated to form the fused feature, which serves as the input layer of a feature fusion classifier based on a convolutional neural network; the weights of the fully connected layer of the feature fusion classifier are determined according to the contribution of the fused features to classification recognition; the output layer of the feature fusion classifier gives the feature-fusion classification decision for the image to be classified.
Because the targets in the images differ in size, orientation, position and so on, images of different scales have different visual characteristics. To make full use of the multi-scale visual feature information, the features of the image must be fused at the feature level, and classification must be performed on the fused features.
For feature-level fusion, a three-layer neural network structure is adopted as the feature fusion classifier. First, the features of the three scales are concatenated and used as the input layer of the network. Then, different fully connected layer weights are assigned to the feature values according to the contribution of each concatenated feature value to classification recognition. After computation in the fully connected layers, the classification result is produced at the output layer, i.e. the classification decision after feature-level fusion of the multi-scale information.
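A sketch of such a fusion classifier in PyTorch; the hidden width and activation are assumptions of this sketch, since the patent specifies only a three-layer structure whose input layer is the concatenated features:

```python
import torch
import torch.nn as nn

class FeatureFusionClassifier(nn.Module):
    def __init__(self, feat_dim=512, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * feat_dim, 256),  # input: concatenated 3-scale features
            nn.ReLU(),
            nn.Linear(256, n_classes),     # output layer: classification decision
        )

    def forward(self, f_large, f_middle, f_small):
        fused = torch.cat([f_large, f_middle, f_small], dim=1)  # feature-level fusion
        return self.net(fused)
```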
For training the feature fusion classifier, a transfer learning strategy can be used for parameter learning: the neural network parameters are first pre-trained on a large existing data set, the network is then transferred to the specific application domain, and its parameters are fine-tuned with a small amount of in-domain data.
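A sketch of this strategy with torchvision, where pre-training is replaced by loading released ImageNet weights; the momentum value is an assumption, while the learning rate follows the experimental settings reported below:

```python
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1")      # pre-trained on ImageNet
model.fc = torch.nn.Linear(model.fc.in_features, 2)   # new head, e.g. benign/malignant

# Fine-tune on the small in-domain data set with a low learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
```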
Step 3: adopting evidence theory, the basic probability distribution functions of the scale classifiers and the feature fusion classifier are calculated from the scale classification decision results and the feature-fusion classification decision result; a conflict coefficient K is calculated from the basic probability distribution functions; when K is smaller than a threshold, the scale classification decision results and the feature-fusion classification decision result are fused to obtain the final classification decision; when K is greater than the threshold, the confidence of each scale classifier and of the feature fusion classifier is calculated, and the output of the classifier with the highest confidence is taken as the final classification decision.
Through steps 1 and 2, four convolutional neural network classifiers are constructed: one for each of the three scales "Large", "Middle" and "Small", and one for the fused features. For each image, the four classifiers give four decision results. Because recognition targets differ in size, position, orientation and so on, the optimal classification decision may come from a classifier of any scale. Therefore, to improve the accuracy and stability of the image classifier, the invention fuses the multi-scale decision information at the decision level.
Step 3.1: determining the basic probability distribution function.
D-S evidence theory is a classical mathematical theory and method for uncertainty reasoning: different information sources are regarded as different pieces of evidence, and the pieces of evidence are fused according to the Dempster rule to reach a final decision. Let Θ denote a finite set of mutually exclusive hypotheses, called the frame of discernment. For any subset A of Θ, each piece of evidence can provide a basic belief value m(A), and m(A) satisfies

$$m(\varnothing) = 0, \qquad \sum_{A \subseteq \Theta} m(A) = 1 \tag{1}$$

where ∅ is the empty set, representing a state that never occurs; the function m(·) is called the basic probability distribution function (Basic Probability Assignment, BPA). For w pieces of evidence over the same frame of discernment, the classical combination rule can be adopted to integrate them into a consistent decision:

$$m(A) = \frac{1}{1 - K} \sum_{A_1 \cap \cdots \cap A_w = A} \; \prod_{i=1}^{w} m_i(A_i), \quad A \neq \varnothing, \qquad K = \sum_{A_1 \cap \cdots \cap A_w = \varnothing} \; \prod_{i=1}^{w} m_i(A_i) \tag{2}$$
in the present invention, let n denote the number of image categories in the dataset, then Θ= { C 1 ,C 2 ,…,C n }, wherein C n Representing the nth class of image categories. For A, according to the characteristic that the convolutional neural network classifier only outputs the corresponding probability of each category, the invention only considers the following subset A= { C of Θ 1 },A={C 2 },…,A={C n And a=Θ, the basic probability value corresponding to the other subset is set to 0. Where a=Θ indicates that the classifier cannot give an explicit classification, i.e.: uncertainty of the classifier.
In applications of D-S evidence theory, defining the BPA is a recognized difficulty, because knowledge of the actual problem must be turned into a probabilistic expression. The invention combines the overall uncertainty of a classifier with its classification output on the individual image to design the BPA. For the convolutional neural network classifier $i$ ($i = 1, 2, 3, 4$), the BPA for image $j$ is defined by formula (3):

$$m_i(\{C_s\}) = acc_i \cdot p_i^j(C_s), \quad s = 1, \dots, n; \qquad m_i(\Theta) = 1 - acc_i \tag{3}$$

where $acc_i$ represents the average classification accuracy of classifier $i$ over the whole data set, and $1 - acc_i$ represents the uncertainty of the classifier. $p_i^j(C_s)$ is derived from the output of classifier $i$ and denotes the probability that classifier $i$ classifies image $j$ as class $C_s$. Since the raw outputs of a convolutional neural network classifier do not lie in $[0, 1]$, the invention converts them into probability values with the soft-max function:

$$p_i^j(C_s) = \frac{\exp\big(o_i^j(C_s)\big)}{\sum_{t=1}^{n} \exp\big(o_i^j(C_t)\big)}$$

where $o_i^j(C_t)$ denotes the raw output (logit) of classifier $i$ for class $C_t$ on image $j$.
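Formula (3) translates directly into code; a minimal sketch in which the example logits and accuracy value are purely illustrative:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))   # subtract the max for numerical stability
    return e / e.sum()

def bpa(logits, acc):
    """BPA of formula (3): m({C_s}) = acc * p(C_s), m(Theta) = 1 - acc."""
    p = softmax(np.asarray(logits, dtype=float))
    return acc * p, 1.0 - acc   # masses sum to acc * 1 + (1 - acc) = 1

# e.g. a classifier with 92% average accuracy judging one image:
masses, theta_mass = bpa([2.1, -0.4], acc=0.92)
```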
Step 3.2: conflict elimination.
The traditional D-S evidence fusion method has the following problem: different pieces of evidence may conflict with one another, and when the classical method is used to fuse highly conflicting evidence, the result can be erroneous. To eliminate conflict, the invention improves on traditional D-S evidence theory and proposes an adaptive fusion method.
In formula (2), K is the normalization coefficient and at the same time the conflict coefficient: a larger K indicates higher conflict between the pieces of evidence, a smaller K lower conflict. The method therefore sets a threshold on K. If K is smaller than the threshold, the four convolutional neural network classifiers are fused with the traditional D-S combination rule; if K is larger than the threshold, the confidence of each piece of evidence is calculated, and the classification decision of the evidence with the highest confidence is taken as the fused classification decision. Confidence measures how strongly one piece of evidence is supported by the other evidence: high confidence indicates high consistency with the other evidence and hence high importance.
The confidence is calculated as follows: first, the support matrix of the evidence is calculated; then the similarity between each piece of evidence and the others is calculated, and these similarities are summed to obtain the support of evidence i; finally, the supports are normalized to obtain the confidence of each piece of evidence.
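A sketch of this confidence computation; the patent fixes the procedure but not the similarity measure, so the cosine similarity between BPA vectors used here is an assumption of this sketch:

```python
import numpy as np

def confidences(bpas):
    """Normalized support of each piece of evidence, with BPAs as (p, u) pairs."""
    vecs = [np.append(p, u) for p, u in bpas]
    w = len(vecs)
    sim = np.zeros((w, w))               # support matrix
    for i in range(w):
        for j in range(w):
            if i != j:
                sim[i, j] = vecs[i] @ vecs[j] / (
                    np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[j]))
    support = sim.sum(axis=1)            # how strongly the others back evidence i
    return support / support.sum()       # normalize to confidences
```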
The decision-level fusion method is summarized as follows. For an image, the four convolutional neural network classifiers (the three scale classifiers and the feature fusion classifier) output four classification results, which the invention fuses with its improved D-S evidence method. First, the conflict coefficient K of the four classifiers is calculated; if K is smaller than the threshold, the classification results are fused according to the traditional D-S combination rule; if K is larger than the threshold, the confidences of the four pieces of evidence are calculated, and the classification result of the classifier with the highest confidence is taken as the fused classification result. Algorithm 1 describes the decision-level fusion method in detail.
Algorithm 1: Decision-level fusion algorithm
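Since the listing of Algorithm 1 is not reproduced here, the following sketch shows one reading of it, reusing combine_two and confidences from the sketches above; combining the four BPAs sequentially and tracking the largest pairwise conflict is an assumption of this sketch (the patent computes a single conflict coefficient K for the four classifiers), and the threshold follows the experimental settings below:

```python
import numpy as np

def fuse_decisions(bpas, threshold=0.85):
    """Adaptive decision-level fusion over four classifier BPAs (p, u)."""
    fused, worst_K = bpas[0], 0.0
    for m in bpas[1:]:
        fused, K = combine_two(fused, m)  # Dempster rule, sketched above
        worst_K = max(worst_K, K)
    if worst_K < threshold:               # low conflict: trust the fused BPA
        return int(np.argmax(fused[0]))
    conf = confidences(bpas)              # high conflict: most-supported evidence wins
    best = int(np.argmax(conf))
    return int(np.argmax(bpas[best][0]))
```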
To verify the image classification method based on multi-scale and multi-level fusion, benign-malignant classification experiments on liver lesions were carried out on a liver MRI image data set.
Experimental setup
A) Data set: the experiments are conducted on a liver MRI image data set collected and labeled by the Cancer Hospital, Chinese Academy of Medical Sciences. All MRI images in the data set were acquired with a US GE Signa Excite HD 3.0T superconducting MR scanner. The data set covers 85 patients (35 benign cases, 50 malignant cases); each patient underwent a conventional scan, a dynamic contrast-enhanced scan and so on, yielding 5 scan images from different phases. The data set therefore contains 425 liver MRI images, 175 benign and 250 malignant. The lesion areas in the images were annotated manually by experienced radiologists of the same hospital to provide the reference standard for the experiments.
B) Comparison methods: to verify the generality of the proposed method, several widely used and well-performing convolutional neural network architectures are selected for the experiments: ResNet18, VGG11 and AlexNet. All three networks are pre-trained on the large-scale ImageNet image data set and then transferred to this data set for parameter fine-tuning. For each architecture, the classification performance at a single scale (i.e. using images of only one scale, without fusion), with feature-level fusion alone, and with the proposed multi-scale multi-level fusion method is compared. The experiments classify based on the lesion area: in each MRI image, lesion-area images of three different scales are cropped around the lesion. The image sizes of the three scales "Large", "Middle" and "Small" are 256×256, 128×128 and 64×64 pixels, respectively. To match the input size requirements of the three network architectures, all lesion-area images are resized to the required size with bilinear interpolation.
The experiments are also compared with traditional, non-deep-learning classification methods. For the traditional methods, lesions are described with the widely used gray-level co-occurrence matrix (GLCM) features, and classification experiments are run with AdaBoost, support vector machine (SVM) and random forest (RF) classifiers.
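The traditional baseline can be sketched with scikit-image and scikit-learn; the GLCM distances, angles and texture properties below are assumptions of this sketch, as the patent states only that GLCM features are used:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.ensemble import RandomForestClassifier

def glcm_features(gray_image):
    """Texture descriptor from a gray-level co-occurrence matrix (uint8 input)."""
    glcm = graycomatrix(gray_image, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    props = ("contrast", "homogeneity", "energy", "correlation")
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])

# features = np.stack([glcm_features(img) for img in train_images])
# clf = RandomForestClassifier().fit(features, train_labels)
```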
C) Evaluation indices: five commonly used indices are adopted to evaluate the classifiers: classification accuracy, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV). Accuracy measures the proportion of correctly classified images over the whole data set; sensitivity and specificity measure the proportion of correctly classified images among the positive and negative data, respectively; PPV and NPV measure the proportion of truly positive and truly negative images among those classified as positive and negative, respectively.
If a positive sample is correctly classified, it is called a true positive (TP), otherwise a false negative (FN) results; if a negative sample is correctly classified, it is called a true negative (TN), otherwise a false positive (FP) results. The five indices are computed as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP},$$

$$\text{PPV} = \frac{TP}{TP + FP}, \qquad \text{NPV} = \frac{TN}{TN + FN}$$
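These five indices are a direct transcription of the formulas above:

```python
def metrics(tp, fn, tn, fp):
    """Evaluation indices computed from the confusion-matrix counts."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall on the positive class
        "specificity": tn / (tn + fp),
        "ppv":         tp / (tp + fp),   # positive predictive value
        "npv":         tn / (tn + fn),   # negative predictive value
    }
```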
D) Settings: the models are trained and tested with 5-fold cross-validation, and the average over the 5 runs is reported as the final result. The neural network parameters are optimized with stochastic gradient descent; the learning rate, number of iterations and batch size are set to 0.001, 500 and 10, respectively. The conflict coefficient threshold is set to 0.85. All experiments are run on a computer with 10 GB of memory and an Nvidia GeForce 1080 Ti GPU.
Experimental results
Benign-malignant liver lesion classification experiments are carried out with all the comparison methods; the results are shown in Table 1. Table 1 groups the results by neural network framework and traditional method, and marks the best and second-best result in each group in bold and italics, respectively.
Table 1: Comparison of the classification performance of the different methods on benign-malignant liver lesion classification
The following can be seen from Table 1:
1) The deep-learning-based methods clearly outperform the traditional methods in classification performance.
2) Compared with the single-scale settings ("Large", "Middle" or "Small"), feature-level fusion improves classification performance overall: except for the specificity and PPV indices under the ResNet18 framework, feature-level fusion outperforms the single-scale methods on all five indices.
3) The proposed multi-scale multi-level fusion method achieves the best classification performance, obtaining the highest results on all five indices under all three convolutional neural network frameworks.
4) Comparing 2) and 3): feature-level fusion outperforms the single-scale methods, and decision-level fusion further improves performance on top of feature-level fusion. This shows that feature-level fusion and decision-level fusion are both necessary and complementary, and demonstrates the superiority of the proposed multi-scale multi-level fusion method in classification performance.
5) Across the three common convolutional neural network frameworks used for comparison, the proposed multi-scale multi-level fusion method achieves the best performance, indicating strong stability and generality.
Further, to observe the feature representation after the multi-scale features are concatenated at the feature level, the experiment reduces the dimensionality of the concatenated features with principal component analysis (PCA) and maps them into a 2-dimensional space for visualization. Comparing the 2-D projections of the GLCM features with those of the concatenated features of each convolutional neural network framework shows that under the GLCM features most benign and malignant lesions are mixed together and hard to separate, whereas the deep features separate the lesions far better. The separability of the deep features increases from framework to framework and the amount of indistinguishable data gradually decreases, which matches the gradual increase of the average classification accuracy of the three frameworks in Table 1.
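A sketch of this visualization with scikit-learn and matplotlib; the variable names and plotting details are assumptions of this sketch:

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_features_2d(feats, labels):
    """Project concatenated features to 2-D with PCA and scatter-plot by class."""
    z = PCA(n_components=2).fit_transform(feats)   # (n_images, 2)
    plt.scatter(z[:, 0], z[:, 1], c=labels, cmap="coolwarm", s=12)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()
```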
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. The image classification method based on multi-scale and multi-level fusion is characterized by comprising the following steps of:
step 1, representing an image to be classified as a plurality of images with different scales, respectively inputting the images with different scales into corresponding scale classifiers based on a convolutional neural network, and obtaining a plurality of scale classification decision results of the image to be classified;
step 2, the visual features extracted by the scale classifier are connected in series to form fusion features, and the fusion features are used as an input layer of a feature fusion classifier based on a convolutional neural network; determining the weight of the full connection layer of the feature fusion classifier according to the contribution of the fusion features to classification recognition; the output layer of the feature fusion classifier is a feature fusion classification decision result of the image to be classified;
and step 3, adopting evidence theory, calculating a basic probability distribution function m(·) of the scale classifiers and the feature fusion classifier according to the scale classification decision results and the feature fusion classification decision result, as shown in the following formula:

$$m_i(\{C_s\}) = acc_i \cdot p_i^j(C_s), \quad s = 1, \dots, n; \qquad m_i(\Theta) = 1 - acc_i$$

wherein $acc_i$ represents the average classification accuracy of the scale classifier or feature fusion classifier $i$ over the whole data set; $1 - acc_i$ represents the uncertainty of the classifier; $p_i^j(C_s)$, derived from the output of the scale classifier or feature fusion classifier $i$, represents the probability that classifier $i$ classifies image $j$ as class $C_s$; $n$ is the total number of image classes; and $\Theta$ is the frame of discernment, a finite set of mutually exclusive classes;
calculating a conflict coefficient according to the basic probability distribution function, and fusing the scale classification decision result and the feature fusion classification decision result to obtain a final classification decision result when the conflict coefficient is smaller than a set threshold value; and when the conflict coefficient is greater than a threshold value, calculating the confidence coefficient of the scale classifier and the feature fusion classifier, and taking the output of the classifier with the highest confidence coefficient as a final classification decision result.
2. The method of claim 1, wherein the different scale images comprise three scale images from small to large in size.
3. The method according to claim 2, wherein when the area where the classification target is located is taken as an analysis object, the images of different scales are three kinds of image blocks of different scales obtained by cropping the image to be classified around the area where the target is located; when the whole image is taken as an analysis object, the images with different scales are three-scale representations obtained by downsampling the image to be classified by adopting a Gaussian pyramid method.
4. The method according to claim 1, wherein the feature fusion classifier in step 2 performs parameter training using a transfer learning strategy.
5. The method according to claim 1, characterized in that in step 3 the output values of the scale classifier or feature fusion classifier are converted into probability values using a soft-max function.
CN202010932190.6A 2020-09-08 2020-09-08 Image classification method based on multi-scale and multi-level fusion Active CN112163599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010932190.6A CN112163599B (en) 2020-09-08 2020-09-08 Image classification method based on multi-scale and multi-level fusion


Publications (2)

Publication Number Publication Date
CN112163599A CN112163599A (en) 2021-01-01
CN112163599B (en) 2023-09-01

Family

ID=73859191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010932190.6A Active CN112163599B (en) 2020-09-08 2020-09-08 Image classification method based on multi-scale and multi-level fusion

Country Status (1)

Country Link
CN (1) CN112163599B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861876A (en) * 2021-01-25 2021-05-28 北京小白世纪网络科技有限公司 Automatic liver cancer ultrasonic image identification method and device based on convolutional neural network
CN113052012B (en) * 2021-03-08 2021-11-19 广东技术师范大学 Eye disease image identification method and system based on improved D-S evidence
CN113409316B (en) * 2021-08-19 2021-12-21 明品云(北京)数据科技有限公司 Image classification method and device, computer readable storage medium and equipment
CN113865859B (en) * 2021-08-25 2024-05-14 西北工业大学 Gear box state fault diagnosis method for multi-scale multi-source heterogeneous information fusion
CN113807324A (en) * 2021-11-02 2021-12-17 中国人民解放军32021部队 Sonar image recognition method and device, electronic equipment and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3432784B1 (en) * 2016-03-24 2020-09-23 The Regents of The University of California Deep-learning-based cancer classification using a hierarchical classification framework

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292328A (en) * 2016-03-31 2017-10-24 武汉大学 The remote sensing image shadow Detection extracting method and system of multiple dimensioned multiple features fusion
CN110555446A (en) * 2019-08-19 2019-12-10 北京工业大学 Remote sensing image scene classification method based on multi-scale depth feature fusion and transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lung cancer recognition method based on multi-scale and feature fusion; 石陆魁; 杜伟昉; 马红祺; 张军; Computer Engineering and Design, No. 05; full text *

Also Published As

Publication number Publication date
CN112163599A (en) 2021-01-01

Similar Documents

Publication Publication Date Title
CN112163599B (en) Image classification method based on multi-scale and multi-level fusion
CN111192245B (en) Brain tumor segmentation network and method based on U-Net network
CN107506761B (en) Brain image segmentation method and system based on significance learning convolutional neural network
CN108898160B (en) Breast cancer histopathology grading method based on CNN and imaging omics feature fusion
CN106570505B (en) Method and system for analyzing histopathological images
CN109409416B (en) Feature vector dimension reduction method, medical image identification method, device and storage medium
CN110705555B (en) Abdomen multi-organ nuclear magnetic resonance image segmentation method, system and medium based on FCN
CN110309860B (en) Method for classifying malignancy degree of lung nodule based on convolutional neural network
CN109447998B (en) Automatic segmentation method based on PCANet deep learning model
CN111353463A (en) Hyperspectral image classification method based on random depth residual error network
CN114841257B (en) Small sample target detection method based on self-supervision comparison constraint
CN113763442B (en) Deformable medical image registration method and system
CN112270666A (en) Non-small cell lung cancer pathological section identification method based on deep convolutional neural network
CN111627024A (en) U-net improved kidney tumor segmentation method
Nazir et al. Ecsu-net: an embedded clustering sliced u-net coupled with fusing strategy for efficient intervertebral disc segmentation and classification
US20220036140A1 (en) Classification device, classification method, program, and information recording medium
CN114596467A (en) Multimode image classification method based on evidence deep learning
CN115147600A (en) GBM multi-mode MR image segmentation method based on classifier weight converter
Priya Resnet based feature extraction with decision tree classifier for classificaton of mammogram images
Kurmi et al. Microscopic images classification for cancer diagnosis
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
Aman et al. Content-based image retrieval on CT colonography using rotation and scale invariant features and bag-of-words model
CN113128564A (en) Typical target detection method and system based on deep learning under complex background
Matsui et al. Feature selection by genetic algorithm for MRI segmentation
Makrogiannis et al. Discriminative localized sparse approximations for Mass characterization in Mammograms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant