CN107944479B - Disease prediction model establishing method and device based on semi-supervised learning

Info

Publication number: CN107944479B
Application number: CN201711135644.1A
Authority: CN
Inventors: 王宏志; 宋扬
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2017-11-16
Filing date: 2017-11-16
Publication date: 2020-10-30
Anticipated expiration: 2037-11-16
Also published as: CN107944479A

Abstract

The invention relates to a disease prediction model building method and device based on semi-supervised learning, which comprises the following steps: classifying the labeled data to obtain a basic classification model of the labeled data; selecting part of non-label data; classifying the selected part of non-label data by a clustering method, marking the selected part of non-label data by using the basic classification model, obtaining a marking result of the non-label data according to a clustering result and a prediction result of the non-label data, combining the marking result with the labeled data for classification to obtain an updated basic classification model, continuously selecting part of non-label data from the rest non-label data for modeling again, and iterating until all the non-label data are processed to obtain a final classification model. The invention models the label-free data, specifically combines a labeled classification method and a label-free clustering method, and improves the prediction precision in an iteration mode, thereby better improving the model prediction precision.

Description

Disease prediction model establishing method and device based on semi-supervised learning

Technical Field

The invention relates to the field of data processing, in particular to a disease prediction model building method and device based on semi-supervised learning, and a disease prediction method and device based on semi-supervised learning.

Background

Disease prediction is an important research subject. A prediction model obtained by analyzing medical data makes better use of disease data and helps doctors and individuals assess diseases. The data modeling methods in current use are mainly supervised learning methods: a model is built from known labeled cases and is then used to mark unlabeled data. However, supervised learning models only the labeled data, whose effective amount is very limited, while the amount of unlabeled data is huge. As a result, many data models fit the data poorly or even over-fit it.

Disclosure of Invention

In view of the above defects in the prior art, the technical problem to be solved by the present invention is to provide a disease prediction model establishing method and device based on semi-supervised learning, which uses a semi-supervised learning method to model unlabeled data, combines a classification method for labeled data with a clustering method for unlabeled data, makes adjustments according to the data classification results, and improves prediction accuracy iteratively.

In order to solve the above technical problem, a first aspect of the present invention provides a disease prediction model building method based on semi-supervised learning, including the following steps:

S1, classifying the labeled data to obtain a basic classification model of the labeled data;

S2, selecting part of the unlabeled data from the unlabeled data;

S3, classifying the part of the unlabeled data selected in step S2 by a clustering method to obtain a clustering result M₁ of the unlabeled data, and marking the part of the unlabeled data selected in step S2 by using the basic classification model to obtain a prediction result T₁; obtaining a labeling result C of the unlabeled data according to the clustering result M₁ and the prediction result T₁;

S4, combining the labeling result C of the unlabeled data with the labeled data for classification to obtain an updated basic classification model, returning to step S2 to select part of the unlabeled data from the remaining unlabeled data and execute steps S3 and S4, and iterating in this way until all the unlabeled data are processed to obtain the final classification model.

Preferably, in step S2, if q₂ is far greater than q₁, where q₁ is the total amount of labeled data and q₂ is the total amount of unlabeled data, the number of unlabeled data selected is a × q₂, where 15% ≤ a ≤ 25%; otherwise, the number of unlabeled data selected is b × q₁, where 45% ≤ b ≤ 55%.

Preferably, in step S2, if q₂ > 10q₁, the number of unlabeled data selected is a × q₂, where a = 20%; if q₁ ≤ q₂ ≤ 10q₁, the number of unlabeled data selected is b × q₁, where b = 50%.

Preferably, in the step S3, the labeling result C of the unlabeled data is calculated by using the following linear formula:

C＝αT₁+βM₁；

wherein α and β are classification coefficients; α = 50% q₁/(q₁+q₂), β = q₁/(q₁+q₂).

Preferably, step S3 further includes: if C > 1.5q₁/(q₁+q₂), the labeling result C takes the value 1, indicating true; if C ≤ 1.5q₁/(q₁+q₂), the labeling result C takes the value 0, indicating false.

Preferably, in step S1, the labeled data is classified by any one of the following classification methods: neural networks, naive bayes, or multivariate linear regression analysis methods.

Preferably, the clustering method used in the step S3 is a K-means or hierarchical clustering method.

In a second aspect of the present invention, a disease prediction method based on semi-supervised learning is provided, wherein a final classification model established by the disease prediction model establishing method based on semi-supervised learning is adopted to process disease data to obtain a disease prediction result.

In a third aspect of the present invention, a disease prediction model building apparatus based on semi-supervised learning is provided, including:

the first processing unit is used for classifying the labeled data to obtain a basic classification model of the labeled data;

the second processing unit is used for selecting part of non-label data from the non-label data;

a third processing unit for classifying the part of the unlabeled data selected by the second processing unit by a clustering method to obtain a clustering result M₁ of the unlabeled data, and marking the part of the unlabeled data selected by the second processing unit by using the basic classification model to obtain a prediction result T₁; and obtaining a labeling result C of the unlabeled data according to the clustering result M₁ and the prediction result T₁;

and the fourth processing unit is used for combining the labeling result C of the non-label data and the labeled data for classification to obtain an updated basic classification model, then starting the second processing unit to continuously select part of the non-label data from the rest non-label data for modeling, and iterating until all the non-label data are processed to obtain a final classification model.

In a fourth aspect of the present invention, a disease prediction apparatus based on semi-supervised learning is provided, including: the disease prediction model building device based on semi-supervised learning described above, which is used for obtaining a final classification model; and a disease prediction unit connected to the model building device, which is used for processing the disease data by using the final classification model to obtain a disease prediction result.

The implementation of the invention has the following beneficial effects: the invention utilizes a semi-supervised learning method to model unlabelled data, specifically combines a labeled classification method and an unlabelled clustering method, adjusts according to a data classification result, and improves the prediction precision in an iteration mode, thereby avoiding the situations of over-fitting or incomplete fitting caused by too little labeled data, and further better improving the model prediction precision.

Drawings

FIG. 1 is a flow chart of a method for building a semi-supervised learning based disease prediction model according to a preferred embodiment of the present invention;

FIG. 2 is a diagram illustrating a process of building a semi-supervised learning based disease prediction model according to a preferred embodiment of the present invention;

FIG. 3 is a block diagram of a semi-supervised learning based disease prediction model building apparatus according to the present invention;

FIG. 4 is a graph comparing the disease prediction effects of the conventional method and the method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art without making any inventive step are within the scope of protection of the present invention.

The invention provides a disease prediction model establishing method and a disease prediction method based on semi-supervised learning, which can utilize unlabeled data and improve the accuracy of a prediction model in an iterative mode. Fig. 1 is a flowchart illustrating a method for building a semi-supervised learning based disease prediction model according to a preferred embodiment of the present invention. Please refer to fig. 2, which is a schematic diagram of a process for establishing a semi-supervised learning based disease prediction model according to a preferred embodiment of the present invention. As shown in fig. 1 and 2, the method for building a disease prediction model based on semi-supervised learning according to the embodiment includes the following steps:

First, in step S101, the labeled data is classified to obtain a data classification model of the labeled data as the basic classification model, where the total amount of labeled data is q₁. Let the total amount of unlabeled data be q₂; the total amount of labeled and unlabeled data is then q = q₁ + q₂. The data classification model is preferably a basic data classification model based on disease prediction. Preferably, the labeled data and the unlabeled data are both medical data, i.e. medical data of a certain disease, including but not limited to chronic diseases such as heart disease, hypertension, cancer, and cardiovascular and cerebrovascular diseases. In this step, the labeled data is classified by any one of the following classification methods: neural networks, naive Bayes, or multivariate linear regression analysis.

Subsequently, in step S102, part of the unlabeled data is selected from the whole unlabeled data for the subsequent modeling processing, that is, a certain amount of unlabeled data is selected each time for the subsequent modeling processing.

In a preferred embodiment of the invention, if q₂ is far greater than q₁, i.e. q₂ >> q₁, the number of unlabeled data selected each time is a × q₂, where 15% ≤ a ≤ 25%; otherwise, the number of unlabeled data selected each time is b × q₁, where 45% ≤ b ≤ 55%.

In general, the total amount of unlabeled data is equal to or greater than the total amount of labeled data, i.e. q₂ ≥ q₁. Thus, in a more preferred embodiment of the invention, if q₂ > 10q₁, the number of unlabeled data selected each time is a × q₂, where a = 20%; if q₁ ≤ q₂ ≤ 10q₁, the number is b × q₁, where b = 50%. That is, 10q₁ serves as the criterion for "far greater": when the total amount of unlabeled data is far greater than that of the labeled data, 20% of q₂ unlabeled data are selected each time for the subsequent modeling processing. When q₂ ≥ q₁ and q₂ is within 10 times q₁, 50% of q₁ unlabeled data are selected each time for the subsequent modeling processing. These selection proportions are the optimal proportions obtained after a large number of experiments and accumulated experience, and give a better data modeling effect.
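The selection rule of the more preferred embodiment can be sketched as follows. This is a minimal illustration, not part of the disclosed method: the function name `batch_size`, the rounding to the nearest whole record, and the rejection of q₂ < q₁ (a case the text does not define) are all assumptions.

```python
def batch_size(q1: int, q2: int, a: float = 0.20, b: float = 0.50) -> int:
    """Number of unlabeled records to select per iteration.

    q1: total amount of labeled data, q2: total amount of unlabeled data.
    a and b default to the 20% and 50% of the more preferred embodiment.
    """
    if q1 <= 0 or q2 <= 0:
        raise ValueError("q1 and q2 must be positive")
    if q2 > 10 * q1:
        # "Far greater" case: take a fraction of the unlabeled pool.
        size = a * q2
    elif q2 >= q1:
        # q1 <= q2 <= 10*q1: take a fraction of the labeled set size.
        size = b * q1
    else:
        # Assumption: the text only covers q2 >= q1, so refuse other input.
        raise ValueError("the rule is only defined for q2 >= q1")
    # Assumption: round to a whole record and select at least one.
    return max(1, round(size))


print(batch_size(100, 2000))  # far-greater case
print(batch_size(100, 200))   # q2 within 10 times q1
```

With q₁ = 100 and q₂ = 2000 the sketch selects 400 unlabeled data per iteration; with q₁ = 100 and q₂ = 200 it selects 50.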

Subsequently, in step S103, the part of the unlabeled data selected in step S102 is classified by a clustering method to obtain a clustering result M₁ of the unlabeled data. Preferably, the clustering method used in step S103 is K-means or hierarchical clustering. Meanwhile, the basic classification model is used to mark the part of the unlabeled data selected in step S102 to obtain a prediction result T₁; a labeling result C of the unlabeled data is then obtained according to the clustering result M₁ and the prediction result T₁.

Preferably, the labeling result C of the non-label data is calculated in this step S103 using the following linear formula (1):

C＝αT₁+βM₁； (1)

wherein α and β are classification coefficients; preferably, α = 50% q₁/(q₁+q₂), β = q₁/(q₁+q₂).

The invention combines the classification method of the labeled data and the clustering method of the unlabeled data, fine adjustment is carried out according to the data classification result, and the final classification result is determined according to a certain proportion, thus obtaining the labeling result C.

Step S103 further includes: if C > 1.5q₁/(q₁+q₂), the labeling result C takes the value 1, representing true; if C ≤ 1.5q₁/(q₁+q₂), the labeling result C takes the value 0, representing false. The clustering result M₁, the prediction result T₁ and the labeling result C are all 0/1 values.
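The linear combination and the subsequent thresholding can be sketched as below. This is an illustrative sketch only: the function name `fuse_labels` is hypothetical, and α, β and the threshold are passed in as arguments rather than computed, because how they are derived from q₁ and q₂ is left to the formulas in the text. The coefficient values in the usage lines are placeholders chosen for illustration and are not taken from the source.

```python
from typing import List, Sequence


def fuse_labels(
    t1: Sequence[int],
    m1: Sequence[int],
    alpha: float,
    beta: float,
    threshold: float,
) -> List[int]:
    """Combine classifier predictions T1 and cluster labels M1 (both 0/1).

    Computes C = alpha*T1 + beta*M1 per record, then maps it to 1 when
    C is strictly greater than the threshold and to 0 otherwise.
    """
    if len(t1) != len(m1):
        raise ValueError("t1 and m1 must have the same length")
    return [
        1 if alpha * t + beta * m > threshold else 0
        for t, m in zip(t1, m1)
    ]


# Placeholder coefficients (assumption, for illustration only).
labels = fuse_labels([1, 1, 0, 0], [1, 0, 1, 0], alpha=0.4, beta=0.6, threshold=0.5)
print(labels)
```

Passing the coefficients explicitly keeps the sketch independent of any particular choice of α and β, so the same function can be reused when the parameters are fine-tuned.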

Subsequently, in step S104, the labeling result C of the unlabeled data and the labeled data are combined for classification to obtain an updated basic classification model. That is, the data marked with the labeling result C are merged into the previous training data set, and the model is trained again to obtain the updated basic classification model.

Subsequently, in step S105, it is determined whether all the unlabeled data have been processed. If so, the method proceeds to step S106; otherwise it returns to step S102, where part of the unlabeled data is selected from the remaining unlabeled data and steps S103 and S104 are executed again. That is, the newly selected unlabeled data are classified by the clustering method to obtain a new clustering result M₁; meanwhile, the basic classification model updated in step S104 is used to mark the newly selected unlabeled data to obtain a new prediction result T₁; and a new labeling result C of the unlabeled data is calculated using linear formula (1). The new labeling result C is then combined with the labeled data (which now include both the q₁ original labeled data of step S101 and the unlabeled data marked in the previous iteration) for classification to obtain an updated basic classification model. These steps are repeated until all the unlabeled data have been processed and marked, yielding the final classification model. Preferably, the same number of unlabeled data is selected each time in step S102; when the number of remaining unlabeled data is smaller than the number to be selected each time, all the remaining unlabeled data are selected for the subsequent modeling processing.

Subsequently, in step S106, after the above iterations, all the unlabeled data have been labeled and the final classification model is obtained.
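Steps S101 to S106 can be summarized by the following sketch. It is a simplified illustration under stated assumptions, not the disclosed implementation: the classifier, the clustering routine and the fusion rule are passed in as callables, and the names `build_model`, `fit_midpoint` and `cluster_by_mean` as well as the one-dimensional toy data are hypothetical. The clustering callable is assumed to return 0/1 labels already aligned with the class meaning, and the logical AND used for fusion in the usage lines is a stand-in for formula (1).

```python
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
Labeled = Tuple[Sample, int]
Model = Callable[[Sample], int]  # maps one sample to a 0/1 prediction


def build_model(
    labeled: List[Labeled],
    unlabeled: List[Sample],
    fit: Callable[[List[Labeled]], Model],
    cluster: Callable[[List[Sample]], List[int]],
    fuse: Callable[[int, int], int],
    batch: int,
) -> Model:
    train = list(labeled)
    model = fit(train)                              # S101: basic model
    for start in range(0, len(unlabeled), batch):   # S102/S105: next part
        chunk = unlabeled[start:start + batch]      # last chunk takes the rest
        m1 = cluster(chunk)                         # S103: clustering result M1
        t1 = [model(x) for x in chunk]              # S103: prediction result T1
        c = [fuse(t, m) for t, m in zip(t1, m1)]    # S103: labeling result C
        train.extend(zip(chunk, c))                 # S104: merge into training set
        model = fit(train)                          # S104: updated basic model
    return model                                    # S106: final model


def fit_midpoint(train: List[Labeled]) -> Model:
    """Toy classifier: cut halfway between the two class means."""
    pos = [x[0] for x, y in train if y == 1]
    neg = [x[0] for x, y in train if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x[0] > cut else 0


def cluster_by_mean(chunk: List[Sample]) -> List[int]:
    """Toy clustering: split the chunk at its own mean."""
    mean = sum(x[0] for x in chunk) / len(chunk)
    return [1 if x[0] > mean else 0 for x in chunk]


labeled = [((1.0,), 0), ((9.0,), 1)]
unlabeled = [(2.0,), (8.0,), (3.0,), (7.0,)]
model = build_model(labeled, unlabeled, fit_midpoint, cluster_by_mean,
                    lambda t, m: t & m, batch=2)
print(model((6.0,)), model((4.0,)))
```

Keeping the classifier and the clustering routine as parameters mirrors the text, which allows a neural network, naive Bayes or regression model for the former and K-means or hierarchical clustering for the latter.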

The invention also correspondingly provides a disease prediction method based on semi-supervised learning, which comprises the steps in the disease prediction model building method based on semi-supervised learning and the subsequent disease prediction step. In the disease prediction step, the disease data is processed by using the final classification model established by the disease prediction model establishing method based on semi-supervised learning to obtain a disease prediction result.

Please refer to fig. 3, which is a block diagram of a semi-supervised learning based disease prediction model building apparatus according to the present invention. As shown in fig. 3, the semi-supervised learning based disease prediction model creation apparatus 300 includes:

a first processing unit 301, configured to classify the labeled data to obtain a basic classification model of the labeled data, where the total amount of labeled data is q₁. Let the total amount of unlabeled data be q₂; the total amount of labeled and unlabeled data is then q = q₁ + q₂. Preferably, the labeled data and the unlabeled data are both medical data, i.e. medical data of a certain disease, including but not limited to heart disease, cancer, cerebrovascular disease, etc. The first processing unit classifies the labeled data by any one of the following classification methods: neural networks, naive Bayes, or multivariate linear regression analysis.

The second processing unit 302 is configured to select a part of the non-tag data from all the non-tag data for subsequent modeling processing, that is, select a certain amount of non-tag data each time for subsequent modeling processing.

In general, the total amount of unlabeled data is equal to or greater than the total amount of labeled data, i.e. q₂ ≥ q₁. Thus, in a more preferred embodiment of the invention, if q₂ > 10q₁, the number of unlabeled data selected each time is a × q₂, where a = 20%; if q₁ ≤ q₂ ≤ 10q₁, the number is b × q₁, where b = 50%. That is, 10q₁ serves as the criterion for "far greater": when the total amount of unlabeled data is far greater than that of the labeled data, 20% of q₂ unlabeled data are selected each time for the subsequent modeling processing. When q₂ ≥ q₁ and q₂ is within 10 times q₁, 50% of q₁ unlabeled data are selected each time for the subsequent modeling processing. These selection proportions are the optimal proportions obtained after a large number of experiments and accumulated experience, and give a better data modeling effect.

A third processing unit 303, configured to classify the part of the unlabeled data selected by the second processing unit 302 by a clustering method to obtain a clustering result M₁ of the unlabeled data. Preferably, the clustering method used by the third processing unit 303 is K-means or hierarchical clustering. Meanwhile, the basic classification model is used to mark the part of the unlabeled data selected by the second processing unit 302 to obtain a prediction result T₁; a labeling result C of the unlabeled data is then obtained according to the clustering result M₁ and the prediction result T₁.

Preferably, the third processing unit 303 calculates the labeling result C of the non-tag data using the following linear formula (1):

C＝αT₁+βM₁； (1)

The third processing unit 303 further performs the following operations: if C > 1.5q₁/(q₁+q₂), the labeling result C takes the value 1, representing true; if C ≤ 1.5q₁/(q₁+q₂), the labeling result C takes the value 0, representing false. The clustering result M₁, the prediction result T₁ and the labeling result C are all 0/1 values.

The fourth processing unit 304 is configured to combine the labeling result C of the non-label data with the labeled data for classification to obtain an updated basic classification model, and restart the second processing unit 302 to continue to select a part of non-label data from the remaining non-label data for modeling, so as to iterate until all the non-label data are processed, and obtain a final classification model.

The invention also correspondingly provides a disease prediction device based on semi-supervised learning, which comprises the semi-supervised learning based disease prediction model building apparatus 300 described above and a disease prediction unit connected thereto. The disease prediction model building apparatus 300 is used for obtaining a final classification model, and the disease prediction unit is used for processing disease data by using the final classification model to obtain a disease prediction result.

The disease prediction effects of the conventional method and the method of the invention were compared experimentally. A neural network was used as the basic data model for classifying the labeled data and K-means as the clustering algorithm, and the data model was obtained after 2 iterations. The experimental data are heart disease data. The total sample size used in the experiment is 689: the test set comprises 300 data (100 labeled and 200 unlabeled), and the verification set comprises 389 data. The procedure was as follows:

1. Modeling the 100 labeled data with a neural network to form a classification model;

2. Classifying 100 (50%) of the 200 unlabeled data using the classification model;

3. Clustering the same 100 unlabeled data using K-means;

4. Combining the classification and clustering results according to formula (1) to form C;

5. Adding the 100 data marked with the labeling result C to the training set and continuing training to form a new classification model;

6. Repeating from step 2 for the other 100 unlabeled data to obtain the final model.

Please refer to fig. 4, which is a graph comparing the disease prediction effect of the conventional method and the method of the present invention. The results including accuracy, error rate, precision, recall and correlation are compared, and the numerical results are shown in table 1.

Table 1
Method	Accuracy	Error rate	Precision	Recall	Correlation
Conventional method	0.945026178	0.054973822	0.846846847	0.959183673	0.82071
The invention	0.971204188	0.028795812	0.930693069	0.959183673	0.915200021

Therefore, compared with the conventional method, the method of the invention has a higher accuracy (0.971 versus 0.945) and a lower error rate (0.029 versus 0.055), improving the accuracy by about 3%.
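The size of the improvement can be checked directly from the values in Table 1. The short sketch below does only that arithmetic; the dictionary keys and the function name `improvement` are naming assumptions, and the third metric column of Table 1 is read as precision.

```python
# Values copied from Table 1 (conventional method vs. the invention).
conventional = {
    "accuracy": 0.945026178,
    "error_rate": 0.054973822,
    "precision": 0.846846847,
    "recall": 0.959183673,
}
invention = {
    "accuracy": 0.971204188,
    "error_rate": 0.028795812,
    "precision": 0.930693069,
    "recall": 0.959183673,
}


def improvement(metric: str) -> float:
    """Absolute change (invention minus conventional) for one metric."""
    return invention[metric] - conventional[metric]


for metric in conventional:
    print(f"{metric}: {improvement(metric):+.4f}")
```

The accuracy difference comes to about 0.026, i.e. roughly 2.6 percentage points, which the text rounds to 3%. Precision rises by about 0.084, while recall is the same for both methods.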

In conclusion, the invention provides an improved disease prediction model, which utilizes a semi-supervised learning method to model unlabeled data, effectively utilizes the unlabeled data, further optimizes the prediction model, and helps to better improve the precision of model prediction, thereby better coping with the application scenes of the large-scale mass unlabeled data at present, and the precision can be improved by 3% according to the experimental result. According to the experimental result, the method can be effectively applied to the field of disease prediction, and can also be applied to other data models by fine-tuning parameters.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A disease prediction model building method based on semi-supervised learning is characterized by comprising the following steps:

S2, selecting part of the unlabeled data from the unlabeled data;

S4, combining the labeling result C of the unlabeled data with the labeled data for classification to obtain an updated basic classification model, returning to step S2 to select part of the unlabeled data from the remaining unlabeled data and execute steps S3 and S4, and iterating in this way until all the unlabeled data are processed to obtain a final classification model;

in step S2, if q₂ > 10q₁, where q₁ is the total amount of labeled data and q₂ is the total amount of unlabeled data, the number of unlabeled data selected is a × q₂, where a = 20%; if q₁ ≤ q₂ ≤ 10q₁, the number of unlabeled data selected is b × q₁, where b = 50%;

in step S3, the labeling result C of the unlabeled data is calculated by using the following linear formula:

C＝αT₁+βM₁；

wherein α and β are classification coefficients; α = 50% q₁/(q₁+q₂), β = q₁/(q₁+q₂);

step S3 further includes:

if C > 1.5q₁/(q₁+q₂), the labeling result C takes the value 1, representing true; if C ≤ 1.5q₁/(q₁+q₂), the labeling result C takes the value 0, representing false; wherein q₁ is the total amount of labeled data and q₂ is the total amount of unlabeled data.

2. The method for building a disease prediction model based on semi-supervised learning according to claim 1, wherein the labeled data is classified in the step S1 by any one of the following classification methods: neural networks, naive bayes, or multivariate linear regression analysis methods.

3. The method for building a disease prediction model based on semi-supervised learning according to claim 1, wherein the clustering method used in step S3 is K-means or hierarchical clustering method.

4. A disease prediction model building device based on semi-supervised learning is characterized by comprising:

a third processing unit for classifying the part of the unlabeled data selected by the second processing unit by a clustering method to obtain a clustering result M₁ of the unlabeled data, and marking the part of the unlabeled data selected by the second processing unit by using the basic classification model to obtain a prediction result T₁; and obtaining a labeling result C of the unlabeled data according to the clustering result M₁ and the prediction result T₁;

the fourth processing unit is used for combining the labeling result C of the non-label data and the labeled data for classification to obtain an updated basic classification model, then the second processing unit is started to continuously select part of non-label data from the rest non-label data for modeling, and the iteration is carried out until all the non-label data are processed to obtain a final classification model;

in the second processing unit, if q₂ > 10q₁, where q₁ is the total amount of labeled data and q₂ is the total amount of unlabeled data, the number of unlabeled data selected is a × q₂, where a = 20%; if q₁ ≤ q₂ ≤ 10q₁, the number of unlabeled data selected is b × q₁, where b = 50%;

the third processing unit calculates a labeling result C of the non-tag data using the following linear formula:

C＝αT₁+βM₁；

5. A disease prediction apparatus based on semi-supervised learning, comprising:

the semi-supervised learning based disease prediction model building apparatus of claim 4, for obtaining a final classification model; and

a disease prediction unit connected thereto, for processing the disease data by using the final classification model to obtain a disease prediction result.