Disclosure of Invention
The application aims to provide a method for identifying the degree of coronary artery stenosis based on multi-classifier fusion, which automatically classifies and predicts the degree of stenosis and spares the patient the injury caused by invasive procedures.
The technical scheme adopted by the application is that the method for identifying the degree of coronary artery stenosis based on multi-classifier fusion is implemented according to the following steps:
step 1, constructing an image sample library;
step 2, performing denoising and binarizing segmentation on the original CT image sequence extracted from cardiac CTA to obtain a coronary artery extraction map;
step 3, extracting features from the segmented image, namely three main categories of features of interest: texture features, gray-level features and geometric features;
step 4, dividing the 500 samples into a training group and a test group at a ratio of 7:3 by random indexing; screening the three categories of radiomics features (texture, gray-level and geometric) extracted in step 3 with a multi-class ReliefF feature weighting algorithm; performing ten-fold cross-validation on the features; calculating the correlation between each feature and the prediction result; and eliminating the features with low correlation;
step 5, forming a feature set from the texture, gray-level and geometric features selected in step 4 and establishing a multi-classifier fusion prediction model, in which three classifiers that perform well on medical image classification, namely a support vector machine (SVM), a random forest (RF) and an extreme learning machine (ELM), are fused to predict the degree of coronary artery lesion; the weights of the three classifiers in the fusion classifier are determined by a weighting method; a sample is judged normal when the degree of stenosis is lower than 50% and judged a lesion sample when the degree of stenosis is higher than 50%.
The present application is also characterized in that,
the step 1 is specifically as follows:
patient information and images of patients who have undergone both cardiac CTA and coronary angiography (DSA) examinations within the last three years are collected from the hospital data system, so that each CT image can be matched with its gold-standard coronary stenosis data; after the patients' basic information in the images is hidden, 500 coronary CT images of patients that meet the image quality requirements are selected as input samples and labeled with their class.
The step 2 is specifically as follows:
step 2.1, sorting all pixels in the neighborhood of each Gaussian noise point in the original CT image by gray value and taking the gray value of the middle pixel as the gray value of the noise point, so as to denoise the image; the principle is expressed as:
$$g_{ij} = \underset{A}{\mathrm{Med}}\,\{ f_{ij} \}$$
where $i, j$ are the coordinates of the pixel, $g_{ij}$ is the gray value assigned to the noise point, $A$ is the neighborhood taken around the noise point, $\{f_{ij}\}$ is the data sequence of gray values in that neighborhood, and Med denotes the median operation.
The image quality of the CT image can be improved through denoising, and meanwhile, the denoised image can more clearly reflect coronary artery structure information in the CT image, so that the segmentation operation in the step 2.2 is facilitated.
step 2.2, letting $R$ denote the whole image, segmentation is regarded as the process of dividing the whole denoised CT image $R$ into $c$ sub-regions $R_x$ that simultaneously satisfy the following conditions ① to ④, where $P(\cdot)$ is the similarity predicate of the growth criterion:
① $\bigcup_{x=1}^{c} R_x = R$, where each $R_x$ is a connected sub-region;
② $R_x \cap R_y = \varnothing$ for any $x, y = 1, 2, 3, \ldots, c$ with $x \neq y$;
③ $P(R_x) = \mathrm{True}$ for $x = 1, 2, 3, \ldots, c$;
④ $P(R_x \cup R_y) = \mathrm{False}$ for $x \neq y$;
coronary vessels forming continuous regions are extracted from the original CT image by the region growing segmentation algorithm, yielding the coronary artery extraction map.
The step 3 is specifically as follows:
step 3.1, extracting gray features in six aspects of mean value, variance, energy, entropy, kurtosis and skewness of the coronary artery extraction map in the step 2 by adopting a gray histogram method;
step 3.2, constructing a gray level co-occurrence matrix, selecting a sliding window of 5 multiplied by 5, calculating gray level characteristic values of each pixel point of the coronary artery extraction map in the step 2, and extracting texture characteristics of the image;
step 3.3, extracting geometric features of the coronary artery image by the Hu invariant moment method based on the coronary artery extraction map obtained in step 2: first the second-order and third-order central moments of the coronary artery image are calculated, then they are normalized to obtain a group of invariant moments, and this group of invariant moments describes the geometric shape features of the coronary artery extraction map.
The step 4 is specifically as follows:
step 4.1, selecting, with the ReliefF feature weighting algorithm, the top d features with the largest correlation from all texture, gray-level and geometric radiomics features extracted in step 3, and forming d feature subsets that contain 1 to d features in sequence;
step 4.2, performing ten-fold cross-validation: the sample set is divided into 10 subsets, one subset is used as the test set each time while the remaining 9 subsets are used as the training set, this is repeated 10 times, and the average recognition accuracy over the 10 runs is taken as the result;
step 4.3, calculating the prediction error rate of each feature subset according to the above process, and selecting the feature subset with the lowest prediction error rate as the input features of the multi-classifier fusion prediction model in step 5.
The ReliefF feature weighting algorithm in step 4.1 is specifically as follows:
each time, a sample $S$ is randomly drawn from the training sample set, and $k$ nearest-neighbor samples $H_l$ and $M_l$ are found among the samples of the same class as $S$ and among the samples of the different classes, respectively; the weight of each of the texture, gray-level and geometric features extracted in step 3 in the prediction process is then updated, and features whose weight is below the set threshold are rejected; the feature weight is calculated as:
$$W(A) = W(A) - \sum_{l=1}^{k} \frac{\mathrm{diff}(A, S, H_l)}{mk} + \sum_{C \neq \mathrm{class}(S)} \left[ \frac{p(C)}{1 - p(\mathrm{class}(S))} \sum_{l=1}^{k} \frac{\mathrm{diff}(A, S, M_l(C))}{mk} \right]$$
where $m$ is the number of sampling iterations, $k$ is the number of nearest-neighbor samples, $l = 1, \ldots, k$, $\mathrm{diff}(A, S, H_l)$ is the difference between sample $S$ and sample $H_l$ on feature $A$, $M_l(C)$ is the $l$-th nearest neighbor of $S$ in class $C$, $C$ is a sample class, $p(C)$ is the ratio of the number of class-$C$ samples to the total number of samples, and $p(\mathrm{class}(S))$ is the ratio of the number of samples in the class of $S$ to the total number of samples.
The step 5 is specifically as follows:
step 5.1, first passing the feature sample set screened in step 4 through the three single classifiers, namely the support vector machine (SVM), the extreme learning machine (ELM) and the random forest (RF), to obtain each classifier's recognition result for the degree of coronary artery stenosis, that is, the three class labels obtained when each classifier predicts the class of the sample to be recognized; the weight of each single classifier in the final multi-classifier fusion prediction model is calculated from that classifier's classification accuracy;
step 5.2, fusing the classification results of the three single classifiers (SVM, ELM and RF) by weighted majority voting: an output of +1 from a classifier denotes the normal class, that is, a degree of stenosis lower than 50%, and an output of -1 denotes the lesion class, that is, a degree of stenosis higher than 50%; the classification result of each classifier is multiplied by the corresponding weight obtained in step 5.1 and the three products are added to give the classification result of the multi-classifier fusion prediction model, which is judged to be the normal class when the sum is positive and the lesion class when the sum is negative.
In step 5.1, the weight of each classifier is determined by its classification accuracy. In the accuracy formula of the classification model, $a = 1, 2, 3$ indexes the three classifiers; $n$ ranges over the two classes (stenotic, non-stenotic); $e'_n$ and $e_n$ are accumulated counts of samples assigned to the normal class or the abnormal class; $y \in \{+1, -1\}$ is the label of a training sample; and the model outputs represent the classification results of the respective models;
the weight $w_a$ of each model is then calculated from its accuracy;
in step 5.2, the result obtained by each model is multiplied by its weight $w_a$ and the products are added to obtain the final output;
when the output is positive, the classification result is the normal class, that is, the degree of stenosis is lower than 50%; when the output is negative, the classification result is the lesion class, that is, the degree of stenosis is higher than 50%.
The method for identifying the degree of coronary artery stenosis based on multi-classifier fusion has the following advantages: the CTA and DSA images and diagnostic reports of existing patients can be matched, so that a multi-classifier fusion prediction model can be established by machine learning and the gold-standard degree of coronary artery stenosis of a patient can be predicted directly from the cardiac CTA result, supporting the choice of a treatment plan. The method uses an in-vitro classification prediction approach, which avoids the adverse reactions and trauma that invasive coronary angiography causes the patient and removes the need for a separate coronary angiography procedure, thereby widening the applicability of coronary lesion diagnosis. In addition, fusing multiple classifiers combines the advantages of the individual classifiers, performs well in both prediction accuracy and prediction speed, and improves the diagnostic efficiency of the clinician.
Detailed Description
The application will be described in detail below with reference to the drawings and the detailed description.
Starting from the goal of noninvasive identification of the degree of coronary stenosis, the application provides a machine learning method that automatically identifies the degree of disease with a fusion classifier, using two-dimensional CT images checked against the gold standard, thereby improving clinical diagnostic efficiency. As shown in the overall framework of FIG. 1, the method mainly comprises six basic modules: sample library construction, image preprocessing, feature extraction, feature screening, fusion classifier model construction and experimental verification. It can be understood as comprising two main stages, sample acquisition and modeling. In the sample acquisition stage, the various processing steps on the training samples are completed; in the modeling stage, a machine learning model is established and the classifier structure and parameter tuning are determined. Finally, the effectiveness of the proposed method is verified and evaluated. It should be noted that the application is not limited to the scheme described here; besides the present research context, it is also applicable to the diagnosis of other diseases.
The application relates to a method for identifying the degree of coronary artery stenosis based on multi-classifier fusion, which is implemented by combining fig. 1 and fig. 2, and specifically comprises the following steps:
step 1, constructing an image sample library;
the step 1 is specifically as follows:
patient information and images of patients who have undergone both cardiac CTA and coronary angiography (DSA) examinations within the last three years are collected from the hospital data system, so that each CT image can be matched with its gold-standard coronary stenosis data; after the patients' basic information in the images is hidden, 500 coronary CT images of patients that meet the image quality requirements are selected as input samples and labeled with their class.
Step 2, performing denoising and binarizing segmentation on the original CT image sequence extracted from cardiac CTA to obtain a coronary artery extraction map;
the step 2 is specifically as follows:
step 2.1, because the application targets CT images, in which the noise introduced is mainly Gaussian noise, median filtering is adopted to achieve a good denoising effect: all pixels in the neighborhood of each Gaussian noise point in the original CT image are sorted by gray value, and the gray value of the middle pixel is taken as the gray value of the noise point; the principle is expressed as:
$$g_{ij} = \underset{A}{\mathrm{Med}}\,\{ f_{ij} \}$$
where $i, j$ are the coordinates of the pixel, $g_{ij}$ is the gray value assigned to the noise point, $A$ is the neighborhood taken around the noise point, $\{f_{ij}\}$ is the data sequence of gray values in that neighborhood, and Med denotes the median operation.
The image quality of the CT image can be improved through denoising, and meanwhile, the denoised image can more clearly reflect coronary artery structure information in the CT image, so that the segmentation operation in the step 2.2 is facilitated.
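The median filtering of step 2.1 can be illustrated with a minimal pure-Python sketch. The function name `median_filter`, the 3 x 3 window and the border handling (coordinates clamped to the image) are assumptions made for illustration; the text does not specify them.

```python
from statistics import median


def median_filter(image, size=3):
    """Replace each pixel by the median of its size x size neighborhood A.

    `image` is a list of rows of gray values. Border pixels are handled by
    clamping coordinates to the image, which is an illustrative assumption.
    """
    h, w = len(image), len(image[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Collect the gray values f_ij of the neighborhood A.
            window = [
                image[min(max(i + di, 0), h - 1)][min(max(j + dj, 0), w - 1)]
                for di in range(-r, r + 1)
                for dj in range(-r, r + 1)
            ]
            out[i][j] = median(window)  # g_ij = Med{f_ij} over A
    return out


# A flat patch with one bright outlier in the center.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
denoised = median_filter(noisy)
```

In this toy patch the outlier is replaced by the value of its neighbors, while the flat background is left unchanged.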
step 2.2, given the shape characteristics of blood vessels, the region to be extracted differs markedly from the surrounding region, so the image is segmented with a region-growing algorithm, whose idea is to assign pixels with similar characteristics to the same region. First, a seed point is selected in each region to be segmented as the starting point of region growth; pixels around the seed point whose characteristics are the same as or similar to those of the seed are merged into the region containing the seed according to the growth criterion; the newly merged pixels then serve as seeds and growth continues in the same way until the whole image has been traversed. When no remaining pixel that satisfies the preset condition or criterion can be merged into the seed regions, the region growing segmentation process ends.
The region growing segmentation algorithm segments connected regions with the same characteristics well and provides good boundary information. Let $R$ denote the whole image; segmentation is regarded as the process of dividing the whole denoised CT image $R$ into $c$ sub-regions $R_x$ that simultaneously satisfy the following conditions ① to ④, where $P(\cdot)$ is the similarity predicate of the growth criterion:
① $\bigcup_{x=1}^{c} R_x = R$, where each $R_x$ is a connected sub-region;
② $R_x \cap R_y = \varnothing$ for any $x, y = 1, 2, 3, \ldots, c$ with $x \neq y$;
③ $P(R_x) = \mathrm{True}$ for $x = 1, 2, 3, \ldots, c$;
④ $P(R_x \cup R_y) = \mathrm{False}$ for $x \neq y$;
Region growing segmentation is a process of aggregating pixels or sub-regions into larger regions according to a predefined criterion; coronary vessels forming continuous regions are extracted from the original CT image by this algorithm, yielding the coronary artery extraction map.
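The region-growing procedure of step 2.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the segmentation actually used: a single seed, 4-connectivity, and a growth criterion that accepts a pixel when its gray value is within `tol` of the seed's gray value. The name `region_grow` and the parameter `tol` are hypothetical.

```python
from collections import deque


def region_grow(image, seed, tol=10):
    """Grow a region from `seed` and return a binary mask (1 = in region).

    Assumed growth criterion: a 4-connected neighbor joins the region when
    its gray value differs from the seed's gray value by at most `tol`.
    """
    h, w = len(image), len(image[0])
    seed_val = image[seed[0]][seed[1]]
    mask = [[0] * w for _ in range(h)]
    mask[seed[0]][seed[1]] = 1
    queue = deque([seed])
    while queue:  # stops when no further pixel can be merged
        i, j = queue.popleft()
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < h and 0 <= nj < w and not mask[ni][nj]:
                if abs(image[ni][nj] - seed_val) <= tol:
                    mask[ni][nj] = 1  # merge the pixel into the region
                    queue.append((ni, nj))  # merged pixel becomes a new seed
    return mask


# Bright "vessel" pixels (200) on a dark background (20).
image = [
    [200, 200, 20, 20],
    [20, 200, 20, 20],
    [20, 200, 200, 20],
    [20, 20, 20, 200],
]
mask = region_grow(image, seed=(0, 0))
```

The returned mask is already binary, which corresponds to the binarized extraction map. The bright pixel in the bottom-right corner is excluded because it touches the region only diagonally.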
Step 3, extracting features from the segmented image; as shown in fig. 3, three main categories of features of interest, namely texture features, gray-level features and geometric features, are extracted according to the characteristics of medical images;
the step 3 is specifically as follows:
step 3.1, extracting gray features in six aspects of mean value, variance, energy, entropy, kurtosis and skewness of the coronary artery extraction map in the step 2 by adopting a gray histogram method;
step 3.2, constructing a gray level co-occurrence matrix, selecting a sliding window of 5 multiplied by 5, calculating gray level characteristic values of each pixel point of the coronary artery extraction map in the step 2, and extracting texture characteristics of the image;
step 3.3, extracting geometric features of the coronary artery image by the Hu invariant moment method based on the coronary artery extraction map obtained in step 2. In statistics, moments describe the distribution of a random variable; extending this to images, if the gray values of an image are regarded as a density distribution function, moments can be used to extract image features. The Hu invariant moment method characterizes the geometry of an image region: first the second-order and third-order central moments of the coronary artery image are calculated, then they are normalized to obtain a group of invariant moments, and this group describes the geometric shape features of the coronary artery extraction map.
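As an illustration of step 3.1, the six gray-level features can be computed from the normalized gray histogram. The text names the features but does not give their formulas, so the definitions below (moments of the histogram, base-2 entropy, non-excess kurtosis) are common conventions assumed for this sketch; the function name `histogram_features` is hypothetical.

```python
import math
from collections import Counter


def histogram_features(pixels):
    """Six first-order features from the normalized gray-level histogram.

    `pixels` is a flat list of gray values from the extraction map.
    """
    n = len(pixels)
    hist = {g: c / n for g, c in Counter(pixels).items()}  # p(g)
    mean = sum(g * p for g, p in hist.items())
    variance = sum((g - mean) ** 2 * p for g, p in hist.items())
    std = math.sqrt(variance)
    energy = sum(p * p for p in hist.values())
    entropy = -sum(p * math.log2(p) for p in hist.values())
    if std > 0:
        skewness = sum((g - mean) ** 3 * p for g, p in hist.items()) / std ** 3
        kurtosis = sum((g - mean) ** 4 * p for g, p in hist.items()) / std ** 4
    else:  # constant image: higher moments are undefined, report 0
        skewness = kurtosis = 0.0
    return {
        "mean": mean,
        "variance": variance,
        "energy": energy,
        "entropy": entropy,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


# A binary image with equal numbers of black and white pixels.
features = histogram_features([0, 0, 255, 255])
```

For this symmetric two-level example the mean is 127.5, the entropy is 1 bit and the skewness is 0.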
Step 4, dividing the 500 samples into a training group and a test group at a ratio of 7:3 by random indexing; screening the three categories of radiomics features (texture, gray-level and geometric) extracted in step 3 with a multi-class ReliefF feature weighting algorithm; performing ten-fold cross-validation on the features; calculating the correlation between each feature and the prediction result; and eliminating the features with low correlation;
the step 4 is specifically as follows:
step 4.1, selecting, with the ReliefF feature weighting algorithm, the top d features with the largest correlation from all texture, gray-level and geometric radiomics features extracted in step 3, and forming d feature subsets that contain 1 to d features in sequence;
step 4.2, performing ten-fold cross-validation: the sample set is divided into 10 subsets, one subset is used as the test set each time while the remaining 9 subsets are used as the training set, this is repeated 10 times, and the average recognition accuracy over the 10 runs is taken as the result;
step 4.3, calculating the prediction error rate of each feature subset according to the above process, and selecting the feature subset with the lowest prediction error rate as the input features of the multi-classifier fusion prediction model in step 5.
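The 7:3 random split and the ten-fold cross-validation loop can be sketched in a few lines. The helper names `split_train_test` and `k_fold`, the fixed random seed, and the choice to build the folds from the training group are assumptions for illustration; the text does not state which sample set the folds are drawn from.

```python
import random


def split_train_test(n_samples, train_ratio=0.7, seed=0):
    """Shuffle sample indices and split them 7:3 into training and test."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = round(n_samples * train_ratio)
    return idx[:cut], idx[cut:]


def k_fold(indices, k=10):
    """Yield (train, test) index lists; each fold is the test set once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


train_idx, test_idx = split_train_test(500)
folds = list(k_fold(train_idx))
```

In step 4.3 this loop would be run once per candidate feature subset, averaging the error rate over the ten folds and keeping the subset with the lowest average.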
The ReliefF feature weighting algorithm in step 4.1 is specifically as follows:
as shown in the flowchart of fig. 4, features are the basis of machine learning, but redundancy and correlation among features can reduce classification accuracy. In particular, the application is a small-sample learning model, so too many features not only increase model complexity but also reduce the generalization ability of the model to some extent. The application therefore optimizes and selects the features extracted in step 3 with the ReliefF feature weighting algorithm, which assigns a weight to each feature so that features whose weight is below the set threshold can be removed.
Each time, a sample $S$ is randomly drawn from the training sample set, and $k$ nearest-neighbor samples $H_l$ and $M_l$ are found among the samples of the same class as $S$ and among the samples of the different classes, respectively; the weight of each of the texture, gray-level and geometric features extracted in step 3 in the prediction process is then updated, and features whose weight is below the set threshold are rejected; the feature weight is calculated as:
$$W(A) = W(A) - \sum_{l=1}^{k} \frac{\mathrm{diff}(A, S, H_l)}{mk} + \sum_{C \neq \mathrm{class}(S)} \left[ \frac{p(C)}{1 - p(\mathrm{class}(S))} \sum_{l=1}^{k} \frac{\mathrm{diff}(A, S, M_l(C))}{mk} \right]$$
where $m$ is the number of sampling iterations, $k$ is the number of nearest-neighbor samples, $l = 1, \ldots, k$, $\mathrm{diff}(A, S, H_l)$ is the difference between sample $S$ and sample $H_l$ on feature $A$, $M_l(C)$ is the $l$-th nearest neighbor of $S$ in class $C$, $C$ is a sample class, $p(C)$ is the ratio of the number of class-$C$ samples to the total number of samples, and $p(\mathrm{class}(S))$ is the ratio of the number of samples in the class of $S$ to the total number of samples.
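A ReliefF-style weight update of the kind described above can be sketched in pure Python. This is a simplified illustration, not the implementation used in the application: `diff` is taken as the absolute difference normalized by the feature's range, neighbors are ranked by the sum of these differences, and the function name `relieff`, the toy data and the default parameters are all hypothetical.

```python
import random


def relieff(X, y, m=20, k=2, seed=0):
    """Return one ReliefF-style weight per feature (higher = more relevant)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Range of each feature, used to normalize diff() into [0, 1].
    span = [
        (max(r[a] for r in X) - min(r[a] for r in X)) or 1.0 for a in range(d)
    ]

    def diff(a, s, t):
        return abs(X[s][a] - X[t][a]) / span[a]

    def dist(s, t):
        return sum(diff(a, s, t) for a in range(d))

    classes = sorted(set(y))
    prior = {c: y.count(c) / n for c in classes}  # p(C)
    w = [0.0] * d
    for _ in range(m):
        s = rng.randrange(n)  # randomly drawn sample S
        for c in classes:
            # k nearest neighbors of S within class c (hits or misses).
            near = sorted(
                (t for t in range(n) if t != s and y[t] == c),
                key=lambda t: dist(s, t),
            )[:k]
            if c == y[s]:
                coef = -1.0  # near hits H_l lower the weight
            else:
                coef = prior[c] / (1.0 - prior[y[s]])  # near misses M_l(C)
            for a in range(d):
                w[a] += coef * sum(diff(a, s, t) for t in near) / (m * k)
    return w


# Feature 0 separates the two classes; feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [1, 1, 1, -1, -1, -1]
weights = relieff(X, y)
```

On this toy data the discriminative feature receives a clearly positive weight and the noise feature does not, so a threshold on the weights would keep only feature 0.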
Step 5, forming a feature set from the texture, gray-level and geometric features selected in step 4 and establishing a multi-classifier fusion prediction model, in which three classifiers that perform well on medical image classification, namely a support vector machine (SVM), a random forest (RF) and an extreme learning machine (ELM), are fused to predict the degree of coronary artery lesion; the weights of the three classifiers in the fusion classifier are determined by a weighting method so as to optimize the prediction; a sample is judged normal when the degree of stenosis is lower than 50% and judged a lesion sample when the degree of stenosis is higher than 50%.
The step 5 is specifically as follows:
as shown in the classifier topology of fig. 7, a fusion classifier consisting of a support vector machine (SVM), an extreme learning machine (ELM) and a random forest (RF) is used to classify the sample set; the principle of each classifier is as follows:
as shown in fig. 6 (a), the basic principle of the support vector machine (SVM) is to find an optimal hyperplane that separates samples of different classes; solving for the optimal hyperplane corresponds to a convex quadratic programming problem, that is, finding the objective function and determining the constraints. The SVM avoids the curse of dimensionality, is robust and generalizes well. Its classification performance is affected by a number of factors, two of which are 1) the error penalty parameter C and 2) the form of the kernel function and its parameter g. The error penalty parameter adjusts the trade-off between the confidence range and the empirical risk in the feature subspace so that the learning machine generalizes best. The radial basis function is nonlinear, has few parameters and can map the original features to an infinite-dimensional space, so the application selects the radial basis function as the kernel function of the support vector machine.
As shown in fig. 6 (b), the basic structure of the extreme learning machine is a single-hidden-layer neural network, which has better generalization ability and faster learning speed than the traditional BP neural network. In short, the network structure of the extreme learning machine (ELM) model is the same as that of a single-hidden-layer feedforward neural network (SLFN), but the training stage no longer uses the gradient-based algorithm (backpropagation) common in traditional neural networks; instead, the input-layer weights and biases are set randomly and the output-layer weights are calculated by generalized inverse matrix theory. Training of the ELM is complete once the weights and biases of all network nodes have been obtained, and at test time the prediction for new data is calculated with the output-layer weights just obtained. In the implementation, the inputs are the data set, the number of hidden-layer neurons and the activation function, and the output is the output-layer weight β; the input weights and hidden-layer biases are generated randomly, and the hidden-layer output and the output-layer weights are then calculated.
As shown in fig. 6 (c), the inputs of the random forest are the training data set and the number of sample subsets, and the output is the final strong classifier. The random forest is widely applicable in machine learning and does not require a complex parameter tuning process. Normally only one tree can be built from one data set; with the idea of bootstrap aggregating (bagging), multiple mutually related data subsets are drawn from the same data set to build multiple trees, and the final class is determined by voting over the classification results of the decision trees.
Step 5.1, as shown in the fusion schematic of fig. 7, first passing the feature sample set screened in step 4 through the three single classifiers, namely the support vector machine (SVM), the extreme learning machine (ELM) and the random forest (RF), to obtain each classifier's recognition result for the degree of coronary artery stenosis, that is, the three class labels obtained when each classifier predicts the class of the sample to be recognized; the weight of each single classifier in the final multi-classifier fusion prediction model is calculated from that classifier's classification accuracy;
step 5.2, fusing the classification results of the three single classifiers (SVM, ELM and RF) by weighted majority voting: an output of +1 from a classifier denotes the normal class, that is, a degree of stenosis lower than 50%, and an output of -1 denotes the lesion class, that is, a degree of stenosis higher than 50%; the classification result of each classifier is multiplied by the corresponding weight obtained in step 5.1 and the three products are added to give the classification result of the multi-classifier fusion prediction model, which is judged to be the normal class when the sum is positive and the lesion class when the sum is negative.
In step 5.1, the weight of each classifier is determined by its classification accuracy. In the accuracy formula of the classification model, $a = 1, 2, 3$ indexes the three classifiers; $n$ ranges over the two classes (stenotic, non-stenotic); $e'_n$ and $e_n$ are accumulated counts of samples assigned to the normal class or the abnormal class; $y \in \{+1, -1\}$ is the label of a training sample; and the model outputs represent the classification results of the respective models;
the weight $w_a$ of each model is then calculated from its accuracy;
in step 5.2, the result obtained by each model is multiplied by its weight $w_a$ and the products are added to obtain the final output;
when the output is positive, the classification result is the normal class, that is, the degree of stenosis is lower than 50%; when the output is negative, the classification result is the lesion class, that is, the degree of stenosis is higher than 50%.
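The weighted voting of steps 5.1 and 5.2 can be illustrated with a short sketch. The weighting rule used here, each classifier's accuracy divided by the sum of the three accuracies, is an assumption made for illustration, since the exact weight formula is not reproduced in the text. The helper names, the toy labels and the handling of a zero sum (treated as the lesion class) are likewise hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def fusion_weights(accuracies):
    """Assumed rule: normalize the accuracies so the weights sum to 1."""
    total = sum(accuracies)
    return [acc / total for acc in accuracies]


def fuse(predictions, weights):
    """Weighted vote over outputs in {+1, -1}; +1 = normal, -1 = lesion."""
    score = sum(w * p for w, p in zip(weights, predictions))
    # The text defines only positive and negative sums; a zero sum is
    # treated as lesion here, which is a conservative assumption.
    return 1 if score > 0 else -1


# Toy labels and single-classifier outputs (order: SVM, ELM, RF).
y_true = [1, 1, -1, -1, 1]
svm_out = [1, 1, -1, -1, -1]   # 4 of 5 correct
elm_out = [1, -1, -1, 1, 1]    # 3 of 5 correct
rf_out = [1, 1, -1, -1, 1]     # 5 of 5 correct

weights = fusion_weights([accuracy(y_true, o) for o in (svm_out, elm_out, rf_out)])
label_a = fuse([1, -1, -1], weights)  # only the SVM votes normal
label_b = fuse([1, 1, -1], weights)   # SVM and ELM together outvote RF
```

With these toy weights the most accurate classifier carries the largest vote but can still be outvoted when the other two agree.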
The technical scheme adopted by the application comprises the design of two main components: an image processing stage and a classifier modeling stage. First, a database is collected, the constructed image sample library is preprocessed, and coronary artery segmentation maps are extracted for the subsequent classification learning. A fusion classifier model is then constructed, the classifier topology and the output mode are determined, the degree of stenosis is identified and classified on the labeled training samples, and the classification results are fused. Finally, classification prediction is performed on the test set, and SPSS software can be used to analyze the accuracy, sensitivity, specificity, negative predictive value and positive predictive value of the predictions. In this process, with the aim of determining the degree of coronary stenosis noninvasively, two label classes are defined: degree of stenosis greater than 50% and degree of stenosis less than 50%. Clinically, coronary heart disease can be diagnosed when the degree of stenosis exceeds 50%, so cases classified as having more than 50% stenosis require attention and a treatment plan. The fusion classifier is one of the key components for identifying the degree and labeling the class: its parameters are trained on the labeled training set, and it is then applied to the test set for identification and labeling.
To obtain higher classification accuracy, the method uses a feature screening algorithm to counter the overfitting caused by too many features, and combines three single classifiers with good image classification performance. Because a single classifier involves tuning several hyperparameters, combining multiple classifiers lets them compensate for one another and addresses the parameter tuning problem, so that the accuracy of the combined result can exceed that of the individual classifiers (a "one plus one is greater than two" effect). The weighted fusion algorithm gives a larger weight to the classifier with better classification performance, which makes the classification result more reliable.
The application uses SPSS software to analyze the accuracy, sensitivity, specificity, negative predictive value and positive predictive value of the predictions, and uses the test set to test the classification performance of the model. In the designed technical scheme, the proportion of each classifier of step 5 in the fusion classifier is assigned according to its prediction ability. The classification standard follows CAD-RADS, the latest international standard for coronary artery stenosis diagnosis: when the degree of stenosis is less than 50%, observation and prevention are the main measures; when it is greater than 50%, treatment such as medication or surgery is considered.
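The five evaluation measures named above follow directly from the confusion matrix. The application performs this analysis in SPSS; the sketch below is only an equivalent hand calculation in Python, with the lesion class (-1) taken as the positive class by assumption and with hypothetical names and toy labels. It assumes every denominator is nonzero.

```python
def diagnostic_metrics(y_true, y_pred, positive=-1):
    """Accuracy, sensitivity, specificity, PPV and NPV from label lists.

    `positive` is the label counted as a positive finding; the lesion
    class (-1) is assumed here.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # lesions correctly detected
        "specificity": tn / (tn + fp),  # normals correctly cleared
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Toy test-set labels: 4 lesion samples (-1) and 4 normal samples (+1).
y_true = [-1, -1, -1, 1, 1, 1, 1, -1]
y_pred = [-1, -1, 1, 1, 1, -1, 1, -1]
metrics = diagnostic_metrics(y_true, y_pred)
```

In this toy case one lesion is missed and one normal sample is flagged, so all five measures equal 0.75.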
By learning from the comparison between the CTA and DSA diagnostic results of existing patients, the application can classify the degree of stenosis directly and ultimately predict, from a coronary CT image, the degree of stenosis that DSA would report, that is, accurately determine the gold-standard measure of coronary lesions from the CTA image. A treatment plan can thus be given without invasive examination, which assists doctors in reaching a diagnosis, improves working efficiency and greatly relieves the suffering of patients, and is therefore of important clinical significance.