CN116089801A - Medical data missing value repairing method based on multiple confidence degrees - Google Patents

Medical data missing value repairing method based on multiple confidence degrees


Publication number
CN116089801A
CN116089801A (application number CN202310031008.3A)
Authority
CN
China
Prior art keywords
attribute
sample
value
missing
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310031008.3A
Other languages
Chinese (zh)
Inventor
范科峰
曾登辉
杨磊
董建
方春燕
苗宗利
刘立新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
China Electronics Standardization Institute
Original Assignee
Guilin University of Electronic Technology
China Electronics Standardization Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology, China Electronics Standardization Institute filed Critical Guilin University of Electronic Technology
Priority to CN202310031008.3A priority Critical patent/CN116089801A/en
Publication of CN116089801A publication Critical patent/CN116089801A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients


Abstract

The invention discloses a method for batch repair of medical data missing values based on multiple confidence degrees, which comprises the following steps: updating the sample confidence degrees by using attribute weights, so that samples containing missing values can be introduced into model training; optimizing the loss function with the sample confidence degrees and filling the missing values of the data set. The sample confidence degrees are calculated from the attribute relations among samples: each sample is given a different confidence degree according to the data attribute to be predicted and the number of missing values in the sample, and the degree of influence of the sample on model training is adjusted by dynamically selecting the confidence degree. The model architecture is optimized so that the network can fill multidimensional missing values in one batch, the identity-mapping problem of the network transfer function is eliminated, and the cross-correlation among nodes is enhanced. The invention improves the utilization rate of the data and improves both the filling precision and the filling efficiency.

Description

Medical data missing value repairing method based on multiple confidence degrees
Technical Field
The invention relates to the intersection of medical health and information science, and in particular to a method for repairing medical data missing values based on multiple confidence degrees.
Background
With the vigorous development of the big data industry and the wide adoption of intelligent medical treatment across society, more and more medical data sets are used to assist medical diagnosis, and the quality of these data sets directly influences the diagnosis result. Owing to various unavoidable factors in collection, transmission and storage, medical data inevitably contain missing values. Missing data affects the authenticity of the data itself, reduces its validity, and degrades subsequent data analysis, so filling the missing data is highly necessary.
At present, researchers mostly address missing data through mean filling, regression filling, multiple imputation, nearest-neighbor filling and similar approaches; however, when the samples of a data set contain multi-dimensional missing attributes and the missing rate is large, these filling methods struggle to achieve accurate, effective and rapid filling.
Conventional statistical algorithms and machine learning algorithms mostly fill a single missing attribute at a time. When one attribute in the data set is filled, samples with other missing attributes are deleted, which wastes resources, discards the valuable information contained in the incomplete samples, and may affect the accuracy of subsequent result analysis.
Therefore, those skilled in the art are dedicated to developing a method for repairing missing values of medical data based on multiple confidence degrees, which fills the multidimensional missing values in the data set samples in batches to improve filling efficiency, reasonably adds the data samples containing missing values to the training model, and fully mines the information in the data set, so as to overcome the defects of the prior art.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the technical problems to be solved by the present invention are as follows: the filling methods disclosed in the prior art mostly fill a single missing attribute; when a certain attribute in the data set is filled, samples with other missing attributes are deleted, which wastes resources and discards the valuable information in the incomplete samples; as a result, the utilization rate of the original data set is low, and the efficiency and accuracy of data filling are limited.
To achieve the above object, the present invention provides a method for repairing missing values of medical data based on multiple confidence degrees. The method takes each attribute of each sample in the data set in turn as the target attribute to be filled, analyses the correlation between every attribute in the data set and the target attribute by a statistical method, and calculates the weight of each attribute relative to the target attribute from this correlation. The confidence degree of each data sample is updated from the attribute weights and the number of missing values in the sample; by dynamically adjusting the confidence degree, the degree of influence of the sample on the overall training model is changed, and every sample is added to the model training process. Missing values are filled in batches by a self-associative neural network model whose transmission paths are optimized, on the basis of the self-associative neural network, to eliminate the identity-mapping problem and enhance the cross-correlation among nodes. Classification errors and sample confidence degrees are introduced to improve the loss function, so as to improve the sample utilization rate and the filling accuracy.
further, the method for repairing the medical data missing values based on the multiple confidence degrees specifically comprises the following steps:
step 1, importing a missing data set;
step 2, eliminating the dimension influence among sample indexes and normalizing the data set, wherein the normalization formula of a data sample is as follows (1):

x_ij = (x - x_min) / (x_max - x_min)        (1)

wherein,
x_ij is the normalized value of the original data in the i-th row and j-th column;
x represents the data to be normalized in the sample;
x_max represents the maximum value of the data attribute;
x_min represents the minimum value of the data attribute;
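As an illustration of step 2, a minimal Python sketch of the min-max normalization in formula (1) might look as follows (the pandas-based data layout and the handling of missing entries are assumptions, not part of the original disclosure):

```python
import pandas as pd

def min_max_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Apply formula (1) column by column: x_ij = (x - x_min) / (x_max - x_min).

    Missing entries (NaN) are ignored when computing the column minimum and
    maximum, so still-missing values do not distort the scale.
    """
    normalized = df.copy()
    for col in df.columns:
        col_min = df[col].min(skipna=True)
        col_max = df[col].max(skipna=True)
        span = col_max - col_min
        # Guard against constant columns, where the span would be zero.
        normalized[col] = 0.0 if span == 0 else (df[col] - col_min) / span
    return normalized
```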
step 3, calculating the correlation matrix among all the attributes by a statistical method; the correlation coefficient between features is calculated as formula (2):

ρ_ij = cov(i, j) / sqrt(D_i · D_j)        (2)

wherein,
ρ_ij represents the correlation between attribute i and attribute j;
cov(i, j) represents the covariance of attribute i and attribute j;
D_i and D_j represent the variances of attribute i and attribute j;
step 4, updating the weight of each other attribute relative to the target attribute by using the correlation coefficients obtained in step 3; the target attributes are all attributes in the sample set; the specific weight calculation formula is as follows (3):

W_ij = |ρ_ij| / Σ_{k=1, k≠i}^{d} |ρ_ik|        (3)

wherein,
W_ij represents the weight of attribute j relative to attribute i;
d represents the total number of attributes of the sample;
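A possible realisation of steps 3 and 4 is sketched below: it computes the Pearson correlation matrix and then normalises the absolute correlations of each target attribute into weights, which is one way to obtain W_ij consistent with formula (3). The exact normalisation used in the original is an assumption.

```python
import numpy as np
import pandas as pd

def attribute_weights(df: pd.DataFrame) -> pd.DataFrame:
    """Return a d x d matrix W where W[i, j] is the weight of attribute j
    relative to target attribute i, built from the Pearson correlation matrix."""
    corr = df.corr(method="pearson").abs().to_numpy()   # |rho_ij|, formula (2)
    np.fill_diagonal(corr, 0.0)                         # an attribute gets no weight w.r.t. itself
    row_sums = corr.sum(axis=1, keepdims=True)
    weights = corr / row_sums                           # each row sums to 1, cf. formula (3)
    return pd.DataFrame(weights, index=df.columns, columns=df.columns)
```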
step 5, calculating the confidence degrees of each sample; specifically,
step 5-1, calculating the weight of the other attributes in the sample relative to the target attribute in the manner of step 4, and calculating the multiple confidence degrees of the sample according to formula (4), wherein the degree to which a sample is damaged is obtained by adding the weights of all its missing-value attributes:

R_ki = 1 - Σ_{j∈ms} W_ij        (4)

wherein,
ms represents the set of missing attributes;
R_ki represents the confidence of the k-th sample when predicting its i-th attribute value;
step 5-2, filling all missing values through the self-associative neural network model at one time; without fixing in advance which attributes of a sample are missing, a predicted value is output for every attribute of the sample; when each attribute is predicted, the sample receives a new confidence degree, that is, a sample has as many confidence degrees as it has dimensions;
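The multiple confidence degrees of step 5 can be assembled as in the following sketch, where each sample receives one confidence per target attribute according to formula (4) (variable names and data structures are illustrative assumptions):

```python
import numpy as np
import pandas as pd

def multiple_confidence(df: pd.DataFrame, weights: pd.DataFrame) -> pd.DataFrame:
    """R[k, i] = 1 - sum of W[i, j] over the attributes j missing in sample k."""
    n_samples, n_attrs = df.shape
    confidence = np.ones((n_samples, n_attrs))
    missing_mask = df.isna().to_numpy()          # True where a value is missing
    w = weights.to_numpy()
    for k in range(n_samples):
        missing_attrs = np.where(missing_mask[k])[0]
        for i in range(n_attrs):
            # Weights of the missing attributes relative to target attribute i,
            # excluding the target attribute itself (formula (4)).
            others = [j for j in missing_attrs if j != i]
            confidence[k, i] = 1.0 - w[i, others].sum()
    return pd.DataFrame(confidence, columns=df.columns)
```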
step 6, marking the missing values of the data set, returning the coordinate position information of the missing values in the data set, and pre-filling with the mean, mode or median according to the characteristics of the missing attribute;
step 7, dividing the data set into a training set and a testing set;
step 8, constructing a neural network and optimizing its transmission paths; the model predicts all attribute values of a sample, so the number of outputs equals the number of inputs; the predicted classification results are added to the output layer of the neural network; when a certain missing value is filled, the corresponding input quantity is removed from the training model; the specific network transfer functions are as follows:
the transfer formula from the input layer to the next layer is as follows (5):

Y_hj = g( Σ_{l=1, l≠j}^{d} W_lh · X_il + b_h )        (5)

the transfer formula between hidden layers is as follows (6):

Y_h'j = g( Σ_{h=1}^{n} W_hh' · Y_hj + b_h' )        (6)

when the predicted value is a continuous value, the output value of the network is as shown in formula (7):

Y_j = Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j        (7)

when the predicted value is a classification attribute, the output transfer function of the network is formula (8); when the predicted value is a multi-classification attribute, the output transfer function of the network is formula (9):

Y_j = f( Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j )        (8)

Y_j = h( Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j )        (9)

in formulas (5), (6), (7), (8) and (9),
R_ij is the confidence of an incomplete sample, representing the confidence assigned to the sample when predicting the j-th attribute of the i-th sample;
g() represents the relu activation function;
f() represents the sigmoid activation function;
h() represents the softmax activation function;
Y_hj represents the output of the h-th neuron of the first hidden layer when predicting the j-th attribute;
Y_h'j represents the output of the h'-th neuron of the second hidden layer when predicting attribute j;
Y_j represents the output value of the network model;
W_lh represents the transfer weight between the l-th attribute of the input sample and the h-th neuron of the second layer of the network;
W_hh' represents the transfer weight between the h-th neuron of the first hidden layer and the h'-th neuron of the second hidden layer;
W_h'j represents the transfer weight between the h'-th neuron of the second hidden layer and the attribute j to be predicted in the output layer;
X_il represents the l-th attribute of the i-th input sample;
b_h, b_h' and b_j represent the bias terms of the network model;
n represents the number of neurons in a hidden layer;
d represents the number of attributes of the sample;
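A compact sketch of the network construction in step 8 is given below, using PyTorch-style modules (the use of PyTorch, the hidden-layer width and the single-target formulation are assumptions for illustration only). Masking the target attribute out of the input reproduces the removal of the identity mapping described above; the batch version described in the patent would instead expose one output unit per attribute, with a linear head for continuous attributes (formula (7)) and a sigmoid or softmax head for classification attributes (formulas (8) and (9)).

```python
import torch
import torch.nn as nn

class SelfAssociativeNet(nn.Module):
    """Predict one attribute of a sample from all the other attributes."""

    def __init__(self, n_attrs: int, hidden: int = 32):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(n_attrs, hidden), nn.ReLU(),   # formula (5)
            nn.Linear(hidden, hidden), nn.ReLU(),    # formula (6)
        )
        self.out = nn.Linear(hidden, 1)              # formula (7): linear head for a continuous attribute

    def forward(self, x: torch.Tensor, target_idx: int) -> torch.Tensor:
        masked = x.clone()
        masked[:, target_idx] = 0.0                  # the target attribute never feeds its own prediction
        return self.out(self.hidden(masked)).squeeze(-1)
```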
step 9, performing one-hot coding on the classification attribute in the data set;
step 10, optimizing the loss function; the calculated multiple confidence degrees are applied to the loss function of the neural network model, and the degree of influence of each sample on model training is distinguished by adjusting the sample confidence degrees;
the expression of the loss function is as follows (10):

loss = Σ_{j∈cont} R_ij (x_ij - y_ij)^2 - Σ_{j∈class} R_ij · z_ij · p_ij        (10)

wherein,
R_ij represents the confidence of the i-th sample when filling its j-th missing attribute;
cont indicates that the attribute value is a continuous numerical variable;
class represents a classification attribute;
x_ij represents the j-th attribute of the i-th sample (a continuous variable);
y_ij represents the predicted value of the j-th attribute of the i-th sample;
z_ij represents the classification result of the j-th attribute of the i-th sample (expressed as a one-hot code);
p_ij represents the prediction result of the j-th classification attribute of the i-th sample;
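The confidence-weighted loss of formula (10) can be expressed, for a mini-batch, roughly as in the following sketch (tensor shapes and the framework are assumptions; r holds the confidence R_ij of each sample for the attribute currently being filled):

```python
import torch

def confidence_weighted_loss(r, x_true, y_pred, z_onehot, p_pred, is_continuous):
    """Formula (10): confidence-weighted squared error on continuous attributes,
    minus the confidence-weighted product of one-hot targets and predicted
    probabilities on classification attributes.

    r            : (batch,) confidence R_ij for the attribute being filled
    x_true/y_pred: (batch,) true and predicted continuous values
    z_onehot     : (batch, n_classes) one-hot coding of the accurate class
    p_pred       : (batch, n_classes) predicted class probabilities
    is_continuous: True when the target attribute is continuous
    """
    if is_continuous:
        return torch.sum(r * (x_true - y_pred) ** 2)
    return -torch.sum(r * torch.sum(z_onehot * p_pred, dim=1))
```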
an early-stopping strategy is introduced to maximize the prediction accuracy;
the early-stopping strategy is as follows: in the model training process, the optimal number of training iterations must be determined in order to obtain the best result, because too few training iterations cause under-fitting and too many cause over-fitting; to solve this problem, the early-stopping strategy is introduced: after each epoch, a test result is obtained on a validation set, and when, as the number of epochs increases to a certain value, the validation error changes from a decreasing trend to an increasing trend, training is stopped; the epoch at that moment is the optimal number of training iterations;
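A minimal training loop with this early-stopping strategy might look like the following sketch (stopping at the first upward turn of the validation error mirrors the criterion above; the model, optimizer and the train/validate helpers are assumed to exist and are not specified by the patent):

```python
def train_with_early_stopping(model, optimizer, loss_fn, train_step, validate, max_epochs=500):
    """Stop training as soon as the validation error starts to rise."""
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_step(model, optimizer, loss_fn)    # one pass over the training set
        val_error = validate(model)              # error on the validation set
        if val_error < best_error:
            best_error, best_epoch = val_error, epoch
        else:
            break                                # error turned upward: stop here
    return best_epoch, best_error
```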
by adopting the scheme, the medical data missing value repairing method based on the multiple confidence degrees has the following advantages:
(1) According to the medical data missing value restoration method based on multiple confidence degrees, a statistical method and a machine learning method are combined, the number of dimensions of a sample is taken as a judgment basis, and multiple confidence degrees are given to the sample; the loss function of the self-association neural network model is optimized through dynamic selection of the confidence coefficient, incomplete samples are introduced into training of the model, and the utilization rate of data is improved;
(2) According to the medical data missing value restoration method based on the multiple confidence degrees, the transmission path of the self-association neural network model is optimized, the self-mapping problem from the input node to the output node is removed (when a certain attribute is predicted, the attribute does not participate in the input of the network model), and the cross-correlation among the nodes is enhanced;
(3) According to the medical data missing value restoration method based on multiple confidence degrees, the missing values of classification attributes and continuous attributes are predicted synchronously, which improves the filling efficiency of the model; in addition, the transmission path of the self-associative neural network model is optimized, and the self-mapping problem from the input node to the output node is eliminated;
in summary, according to the medical data missing value repairing method based on multiple confidence degrees, disclosed by the invention, the multidimensional data missing values in the data set samples are filled in batches, so that the filling efficiency is improved, the data samples with the multidimensional missing values are reasonably added into the training model, the data information in the data set is fully mined, the utilization rate of the data in the original data set is improved, the data filling efficiency is accelerated, and the data filling accuracy is improved;
the conception, specific technical scheme, and technical effects produced by the present invention will be further described in conjunction with the specific embodiments below to fully understand the objects, features, and effects of the present invention.
Drawings
FIG. 1 is a flow chart of a method of medical data missing value restoration based on multiple confidence levels of the present invention;
FIG. 2 is a network architecture diagram of a method of medical data missing value restoration based on multiple confidence levels in accordance with the present invention;
Detailed Description
The following describes several preferred embodiments of the present invention to make its technical content clearer and easier to understand. The invention may be embodied in many different forms; the embodiments described herein are exemplary, and the scope of the invention is not limited to these embodiments.
Example 1, method of medical data missing value repair based on multiple confidence
The data set is the heart disease data set from the official Kaggle website;
In Example 1, according to the attribute whose missing value is being filled, the weight of each attribute relative to the predicted attribute is calculated; multiple confidence degrees are assigned to each sample in combination with the number of missing values in the sample; and the influence of each sample on model training when each attribute is filled is distinguished by adjusting the sample confidence degrees; specifically,
step 1, loading the heart disease data set from the official Kaggle website; randomly deleting values from the heart disease data set, and storing the relative position coordinates of the missing values in the data set locally, recorded as (a, b); on the data set, the missing values of classification attributes are pre-filled with the mode, and the missing values of continuous data are pre-filled with the mean;
step 2, eliminating the dimension influence among sample indexes and normalizing the data set, wherein the normalization formula of a data sample is as follows (1):

x_ij = (x - x_min) / (x_max - x_min)        (1)

wherein,
x_ij is the normalized value of the original data in the i-th row and j-th column;
x represents the data to be normalized in the sample;
x_max represents the maximum value of the data attribute;
x_min represents the minimum value of the data attribute;
step 3, calculating the correlation matrix among all the attributes by a statistical method; the correlation coefficient between features is calculated as formula (2):

ρ_ij = cov(i, j) / sqrt(D_i · D_j)        (2)

wherein,
ρ_ij represents the correlation between attribute i and attribute j;
cov(i, j) represents the covariance of attribute i and attribute j;
D_i and D_j represent the variances of attribute i and attribute j;
step 4, updating the weight of each other attribute relative to the target attribute by using the correlation coefficients obtained in step 3; the target attributes are all attributes in the sample set; the specific weight calculation formula is as follows (3):

W_ij = |ρ_ij| / Σ_{k=1, k≠i}^{d} |ρ_ik|        (3)

wherein,
W_ij represents the weight of attribute j relative to attribute i;
d represents the total number of attributes of the sample;
step 5, calculating the confidence degrees of each sample; specifically,
step 5-1, calculating the weight of the other attributes in the sample relative to the target attribute in the manner of step 4, and calculating the multiple confidence degrees of the sample according to formula (4), wherein the degree to which a sample is damaged is obtained by adding the weights of all its missing-value attributes:

R_ki = 1 - Σ_{j∈ms} W_ij        (4)

wherein,
ms represents the set of missing attributes;
R_ki represents the confidence of the k-th sample when predicting its i-th attribute value;
step 5-2, filling all missing values through the self-associative neural network model at one time; without fixing in advance which attributes of a sample are missing, a predicted value is output for every attribute of the sample; when each attribute is predicted, the sample receives a new confidence degree, that is, a sample has as many confidence degrees as it has dimensions;
In step 5, the confidence is reset for each sample. The data set has fourteen attributes, and for each attribute a confidence R_ij is assigned to each sample; a sample therefore has fourteen confidence degrees. Assuming the i-th sample lacks the first, third and fifth attributes, its confidence degrees can be expressed as follows: R_i1 = 1 - W_13 - W_15, where W_13 represents the weight of attribute three relative to attribute one when filling attribute one, and W_15 represents the weight of attribute five relative to attribute one when filling attribute one;
similarly, R_i2 = 1 - W_21 - W_23 - W_25 and R_i3 = 1 - W_31 - W_35, and in the same way the values up to R_i14 are calculated;
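As a purely illustrative numeric example (the weight values are assumptions, not taken from the patent), if W_13 = 0.10 and W_15 = 0.08, then the confidence of this sample when filling attribute one is R_i1 = 1 - 0.10 - 0.08 = 0.82, while a sample with no missing attributes other than the target keeps a confidence of 1.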
step 6, marking the missing values of the data set, returning the coordinate position information of the missing values in the data set, and pre-filling with the mean, mode or median according to the characteristics of the missing attribute;
in step 6, the position coordinates (A, B) of the data to be filled in the training set and the test set are updated again according to the relative position coordinates (a, b) recorded in step 1;
step 7, dividing the data set into a training set and a testing set;
in step 7, eighty percent of the data set is used as the training set and twenty percent as the test set;
step 8, constructing a neural network and optimizing its transmission paths; the model predicts all attribute values of a sample, so the number of outputs equals the number of inputs; the predicted classification results are added to the output layer of the neural network; when a certain missing value is filled, the corresponding input quantity is removed from the training model; the specific network transfer functions are as follows:
the transfer formula from the input layer to the next layer is as follows (5):

Y_hj = g( Σ_{l=1, l≠j}^{d} W_lh · X_il + b_h )        (5)

the transfer formula between hidden layers is as follows (6):

Y_h'j = g( Σ_{h=1}^{n} W_hh' · Y_hj + b_h' )        (6)

when the predicted value is a continuous value, the output value of the network is as shown in formula (7):

Y_j = Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j        (7)

when the predicted value is a classification attribute, the output transfer function of the network is formula (8); when the predicted value is a multi-classification attribute, the output transfer function of the network is formula (9):

Y_j = f( Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j )        (8)

Y_j = h( Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j )        (9)

in formulas (5), (6), (7), (8) and (9),
R_ij is the confidence of an incomplete sample, representing the confidence assigned to the sample when predicting the j-th attribute of the i-th sample;
g() represents the relu activation function;
f() represents the sigmoid activation function;
h() represents the softmax activation function;
Y_hj represents the output of the h-th neuron of the first hidden layer when predicting the j-th attribute;
Y_h'j represents the output of the h'-th neuron of the second hidden layer when predicting attribute j;
Y_j represents the output value of the network model;
W_lh represents the transfer weight between the l-th attribute of the input sample and the h-th neuron of the second layer of the network;
W_hh' represents the transfer weight between the h-th neuron of the first hidden layer and the h'-th neuron of the second hidden layer;
W_h'j represents the transfer weight between the h'-th neuron of the second hidden layer and the attribute j to be predicted in the output layer;
X_il represents the l-th attribute of the i-th input sample;
b_h, b_h' and b_j represent the bias terms of the network model;
n represents the number of neurons in a hidden layer;
d represents the number of attributes of the sample;
in the step 8, a specific network structure diagram is shown in fig. 2;
In the heart disease data set, when the first output of the prediction output layer is produced, the first input of the data set does not participate in training; similarly, when the second output of the prediction output layer is produced, the second input of the data set does not participate in the training of the model, and so on until the last predicted value is output;
step 9, performing one-hot coding on the classification attributes in the data set; the prediction outputs of the classification attributes are added to the output layer of the neural network, a set of probability values is output by the softmax activation function, and the product of the one-hot coding of the accurate value and the predicted probabilities is taken as a part of the loss function;
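The classification branch of step 9 can be illustrated with the following sketch (array-based NumPy code is an assumption; the patent does not specify the encoding utility):

```python
import numpy as np

def one_hot(labels: np.ndarray, n_classes: int) -> np.ndarray:
    """One-hot encode integer class labels (step 9)."""
    encoded = np.zeros((labels.shape[0], n_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

def classification_term(z_onehot: np.ndarray, p_pred: np.ndarray) -> float:
    """Product of the one-hot coding of the accurate value and the predicted
    probabilities; this is the term subtracted in the loss function of step 10."""
    return float(np.sum(z_onehot * p_pred, axis=1).sum())
```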
step 10, optimizing a loss function; applying the calculated multiple confidence coefficient to a loss function of the neural network model, and distinguishing the influence degree of each sample on model training through adjusting the confidence coefficient of the sample;
the expression of the loss function is as follows (10):

loss = Σ_{j∈cont} R_ij (x_ij - y_ij)^2 - Σ_{j∈class} R_ij · z_ij · p_ij        (10)

wherein,
R_ij represents the confidence of the i-th sample when filling its j-th missing attribute;
cont indicates that the attribute value is a continuous numerical variable;
class represents a classification attribute;
x_ij represents the j-th attribute of the i-th sample (a continuous variable);
y_ij represents the predicted value of the j-th attribute of the i-th sample;
z_ij represents the classification result of the j-th attribute of the i-th sample (expressed as a one-hot code);
p_ij represents the prediction result of the j-th classification attribute of the i-th sample;
In step 10, the regression error (x_ij - y_ij)^2 and the classification error z_ij · p_ij are combined and multiplied by the confidence degree of the corresponding sample for the missing value to be filled, forming the final loss function; the number of training iterations follows the early-stopping strategy to obtain the optimal filling effect; the early-stopping strategy is as described above: after each epoch, a test result is obtained on the validation set, and when the validation error changes from a decreasing trend to an increasing trend as the number of epochs increases, training is stopped, and the epoch at that moment is the optimal number of training iterations;
step 11, comparing the filled values with the accurate values in the original data according to the missing value coordinates (A, B) recorded in step 6, and calculating the percentage error rate of continuous-value filling, MAPE = (1/n) Σ_i |x_i - x̂_i| / x_i (where x_i is the accurate value and x̂_i is the filled value), and the accuracy of classification attribute filling, ACC = (1/n) Σ_i I_i (I_i = 1 when the filling is correct and I_i = 0 otherwise);
Comparative Example 2, training the samples directly without setting sample confidence;
step 1, loading the heart disease data set from the official Kaggle website; randomly deleting values from the heart disease data set, and storing the relative position coordinates of the missing values in the data set locally, recorded as (a, b); on the data set, the missing values of classification attributes are pre-filled with the mode, and the missing values of continuous data are pre-filled with the mean;
step 2, eliminating the dimension influence among sample indexes and normalizing the data set, wherein the normalization formula of a data sample is as follows (1):

x_ij = (x - x_min) / (x_max - x_min)        (1)

wherein,
x_ij is the normalized value of the original data in the i-th row and j-th column;
x represents the data to be normalized in the sample;
x_max represents the maximum value of the data attribute;
x_min represents the minimum value of the data attribute;
step 3, marking the missing values of the data set, returning the coordinate position information of the missing values in the data set, and pre-filling with the mean, mode or median according to the characteristics of the missing attribute;
in step 3, the position coordinates (A, B) of the data to be filled in the training set and the test set are updated again according to the relative position coordinates (a, b) recorded in step 1;
step 4, dividing the data set into a training set and a testing set;
in step 4, eighty percent of the data set is used as the training set and twenty percent as the test set;
step 5, building a neural network and optimizing its transmission paths; the model predicts all attribute values of a sample, so the number of outputs equals the number of inputs; the predicted classification results are added to the output layer of the neural network; when a certain missing value is filled, the corresponding input quantity is removed from the training model; the specific network transfer functions are as follows:
the transfer formula from the input layer to the next layer is as follows (5):

Y_hj = g( Σ_{l=1, l≠j}^{d} W_lh · X_il + b_h )        (5)

the transfer formula between hidden layers is as follows (6):

Y_h'j = g( Σ_{h=1}^{n} W_hh' · Y_hj + b_h' )        (6)

when the predicted value is a continuous value, the output value of the network is as shown in formula (7):

Y_j = Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j        (7)

when the predicted value is a classification attribute, the output transfer function of the network is formula (8); when the predicted value is a multi-classification attribute, the output transfer function of the network is formula (9):

Y_j = f( Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j )        (8)

Y_j = h( Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j )        (9)

in formulas (5), (6), (7), (8) and (9),
g() represents the relu activation function;
f() represents the sigmoid activation function;
h() represents the softmax activation function;
Y_hj represents the output of the h-th neuron of the first hidden layer when predicting the j-th attribute;
Y_h'j represents the output of the h'-th neuron of the second hidden layer when predicting attribute j;
Y_j represents the output value of the network model;
W_lh represents the transfer weight between the l-th attribute of the input sample and the h-th neuron of the second layer of the network;
W_hh' represents the transfer weight between the h-th neuron of the first hidden layer and the h'-th neuron of the second hidden layer;
W_h'j represents the transfer weight between the h'-th neuron of the second hidden layer and the attribute j to be predicted in the output layer;
X_il represents the l-th attribute of the i-th input sample;
b_h, b_h' and b_j represent the bias terms of the network model;
n represents the number of neurons in a hidden layer;
d represents the number of attributes of the sample;
step 6, performing one-hot coding on the classification attributes in the data set; the prediction outputs of the classification attributes are added to the output layer of the neural network, a set of probability values is output by the softmax activation function, and the product of the one-hot coding of the accurate value and the predicted probabilities is taken as a part of the loss function;
step 7, optimizing the loss function; the expression of the loss function is as follows (10):

loss = Σ_{j∈cont} (x_ij - y_ij)^2 - Σ_{j∈class} z_ij · p_ij        (10)

wherein,
cont indicates that the attribute value is a continuous numerical variable;
class represents a classification attribute;
x_ij represents the j-th attribute of the i-th sample (a continuous variable);
y_ij represents the predicted value of the j-th attribute of the i-th sample;
z_ij represents the classification result of the j-th attribute of the i-th sample (expressed as a one-hot code);
p_ij represents the prediction result of the j-th classification attribute of the i-th sample;
In step 7, the regression error (x_ij - y_ij)^2 and the classification error z_ij · p_ij are combined to form the final loss function (no sample confidence is applied in this comparative example); the number of training iterations follows the early-stopping strategy to obtain the optimal filling effect;
step 8, comparing the filled values with the accurate values in the original data according to the missing value coordinates (A, B) recorded in step 3, and calculating the percentage error rate of continuous-value filling, MAPE = (1/n) Σ_i |x_i - x̂_i| / x_i (where x_i is the accurate value and x̂_i is the filled value), and the accuracy of classification attribute filling, ACC = (1/n) Σ_i I_i (I_i = 1 when the filling is correct and I_i = 0 otherwise);
Comparative Example 3, calculating the sample confidence according to the number of missing values, and thereby changing the degree of influence of each sample on model training;
step 1, loading the heart disease data set from the official Kaggle website; randomly deleting values from the heart disease data set, and storing the relative position coordinates of the missing values in the data set locally, recorded as (a, b); on the data set, the missing values of classification attributes are pre-filled with the mode, and the missing values of continuous data are pre-filled with the mean;
step 2, eliminating the dimension influence among sample indexes and normalizing the data set, wherein the normalization formula of a data sample is as follows (1):

x_ij = (x - x_min) / (x_max - x_min)        (1)

wherein,
x_ij is the normalized value of the original data in the i-th row and j-th column;
x represents the data to be normalized in the sample;
x_max represents the maximum value of the data attribute;
x_min represents the minimum value of the data attribute;
step 3, counting the number of missing values and the total number of attributes of each data sample, wherein the sample confidence = 1 - (number of missing values in the sample / total number of sample attributes); each sample has only one confidence degree, recorded as R_i for the i-th sample;
step 4, marking the missing values of the data set, returning the coordinate position information of the missing values in the data set, and pre-filling with the mean, mode or median according to the characteristics of the missing attribute;
in step 4, the position coordinates (A, B) of the data to be filled in the training set and the test set are updated again according to the relative position coordinates (a, b) recorded in step 1;
step 5, dividing the data set into a training set and a testing set;
in step 5, eighty percent of the data set is used as the training set and twenty percent as the test set;
step 6, constructing a neural network and optimizing its transmission paths; the model predicts all attribute values of a sample, so the number of outputs equals the number of inputs; the predicted classification results are added to the output layer of the neural network; when a certain missing value is filled, the corresponding input quantity is removed from the training model; the specific network transfer functions are as follows:
the transfer formula from the input layer to the next layer is as follows (5):

Y_hj = g( Σ_{l=1, l≠j}^{d} W_lh · X_il + b_h )        (5)

the transfer formula between hidden layers is as follows (6):

Y_h'j = g( Σ_{h=1}^{n} W_hh' · Y_hj + b_h' )        (6)

when the predicted value is a continuous value, the output value of the network is as shown in formula (7):

Y_j = Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j        (7)

when the predicted value is a classification attribute, the output transfer function of the network is formula (8); when the predicted value is a multi-classification attribute, the output transfer function of the network is formula (9):

Y_j = f( Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j )        (8)

Y_j = h( Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j )        (9)

in formulas (5), (6), (7), (8) and (9),
g() represents the relu activation function;
f() represents the sigmoid activation function;
h() represents the softmax activation function;
Y_hj represents the output of the h-th neuron of the first hidden layer when predicting the j-th attribute;
Y_h'j represents the output of the h'-th neuron of the second hidden layer when predicting attribute j;
Y_j represents the output value of the network model;
W_lh represents the transfer weight between the l-th attribute of the input sample and the h-th neuron of the second layer of the network;
W_hh' represents the transfer weight between the h-th neuron of the first hidden layer and the h'-th neuron of the second hidden layer;
W_h'j represents the transfer weight between the h'-th neuron of the second hidden layer and the attribute j to be predicted in the output layer;
X_il represents the l-th attribute of the i-th input sample;
b_h, b_h' and b_j represent the bias terms of the network model;
n represents the number of neurons in a hidden layer;
d represents the number of attributes of the sample;
step 7, performing one-hot coding on the classification attributes in the data set; the prediction outputs of the classification attributes are added to the output layer of the neural network, a set of probability values is output by the softmax activation function, and the product of the one-hot coding of the accurate value and the predicted probabilities is taken as a part of the loss function;
step 8, optimizing the loss function; the calculated sample confidence is applied to the loss function of the neural network model, and the degree of influence of each sample on model training is distinguished by adjusting the sample confidence;
the expression of the loss function is as follows (10):

loss = Σ_{j∈cont} R_i (x_ij - y_ij)^2 - Σ_{j∈class} R_i · z_ij · p_ij        (10)

wherein,
R_i represents the confidence of the i-th sample;
cont indicates that the attribute value is a continuous numerical variable;
class represents a classification attribute;
x_ij represents the j-th attribute of the i-th sample (a continuous variable);
y_ij represents the predicted value of the j-th attribute of the i-th sample;
z_ij represents the classification result of the j-th attribute of the i-th sample (expressed as a one-hot code);
p_ij represents the prediction result of the j-th classification attribute of the i-th sample;
In step 8, the regression error (x_ij - y_ij)^2 and the classification error z_ij · p_ij are combined and multiplied by the confidence degree of the corresponding sample to form the final loss function; the number of training iterations follows the early-stopping strategy to obtain the optimal filling effect;
step 9, comparing the filled values with the accurate values in the original data according to the missing value coordinates (A, B) recorded in step 4, and calculating the percentage error rate of continuous-value filling, MAPE = (1/n) Σ_i |x_i - x̂_i| / x_i (where x_i is the accurate value and x̂_i is the filled value), and the accuracy of classification attribute filling, ACC = (1/n) Σ_i I_i (I_i = 1 when the filling is correct and I_i = 0 otherwise);
Comparative Example 4, filling the missing value attributes with a random forest algorithm;
The rest is the same as in Example 1, except that a random forest algorithm is used when filling the missing value attributes;
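A straightforward realisation of the random-forest baseline of Comparative Example 4 is sketched below (using scikit-learn, which is an assumption; the patent does not specify the implementation): for each attribute with missing values, a forest is trained on the rows where that attribute is observed and used to predict the missing entries, after a simple mean/mode pre-fill of the predictors.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def random_forest_fill(df: pd.DataFrame, categorical: set) -> pd.DataFrame:
    """Fill each attribute that has missing values with a random forest trained
    on the rows where that attribute is observed."""
    filled = df.copy()
    # Pre-fill the predictors so the forests receive no NaN inputs
    # (mode for classification attributes, mean for continuous ones).
    prefilled = df.copy()
    for col in df.columns:
        if col in categorical:
            prefilled[col] = prefilled[col].fillna(prefilled[col].mode().iloc[0])
        else:
            prefilled[col] = prefilled[col].fillna(prefilled[col].mean())
    for col in df.columns:
        missing = df[col].isna()
        if not missing.any():
            continue
        features = prefilled.drop(columns=[col])
        model = (RandomForestClassifier(n_estimators=100)
                 if col in categorical else RandomForestRegressor(n_estimators=100))
        model.fit(features[~missing], df.loc[~missing, col])
        filled.loc[missing, col] = model.predict(features[missing])
    return filled
```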
Test 5, comparison of the filling results of Example 1, Comparative Example 2, Comparative Example 3 and Comparative Example 4
Setting evaluation indexes:
according to the data type, dividing the data into continuous data and classified data; when calculating the continuous data filling accuracy, taking average absolute percentage error (MAPE) as an index for evaluating the data filling quality; in evaluating the filling quality of classified data, the filling accuracy thereof is described by the filling Accuracy (ACC); the calculation formulas of the two are as follows:
MAPE = (1/n) Σ_{i=1}^{n} |x_i - x̂_i| / x_i

ACC = (1/n) Σ_{i=1}^{n} I_i

wherein,
x̂_i represents the predicted value of the i-th sample;
x_i represents the accurate value of the original sample;
n represents the number of samples in the data set; when classified data are predicted, I_i = 1 when the predicted value coincides with the original data value, and I_i = 0 otherwise;
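The two evaluation indexes can be computed as in the following sketch (a direct transcription of the formulas above; array-based NumPy code is an assumption):

```python
import numpy as np

def mape(true_values: np.ndarray, filled_values: np.ndarray) -> float:
    """Mean absolute percentage error for continuous attributes (lower is better)."""
    return float(np.mean(np.abs(true_values - filled_values) / true_values))

def acc(true_labels: np.ndarray, filled_labels: np.ndarray) -> float:
    """Filling accuracy for classification attributes (higher is better)."""
    return float(np.mean(true_labels == filled_labels))
```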
the final results are shown in Table 1 below:
Table 1. Comparison of the four filling methods
As can be seen, compared with Comparative Examples 2, 3 and 4, the filling results of the method for repairing medical data missing values based on multiple confidence degrees achieve the highest filling accuracy (ACC) and the lowest mean absolute percentage error (MAPE), i.e. the best effect;
In summary, the technical scheme fills the multidimensional missing values in the data set samples in batches, which improves filling efficiency; the data samples with multidimensional missing values are reasonably added to the training model, the data information in the data set is fully mined, the utilization rate of the data in the original data set is improved, and both the speed and the accuracy of data filling are improved;
the foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention without requiring creative effort by one of ordinary skill in the art. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by a person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims (7)

1. A method for repairing missing values of medical data based on multiple confidence levels, comprising the steps of:
step 1, importing a missing data set;
step 2, eliminating dimension influence among sample indexes, and carrying out normalization processing on the data set;
step 3, calculating an association relation matrix among all the attributes by a statistical method;
step 4, updating the weight of each other attribute relative to the target attribute by using the correlation coefficient obtained in the step 3; the target attributes are all attributes in the sample set;
step 5, calculating the confidence degrees of each sample; specifically comprising the following steps:
step 5-1, calculating the weight of the other attributes in the sample relative to the target attribute in the manner of step 4, and calculating the multiple confidence degrees of the sample according to formula (4), wherein the degree to which a sample is damaged is obtained by adding the weights of all its missing-value attributes:

R_ki = 1 - Σ_{j∈ms} W_ij        (4)

wherein,
ms represents the set of missing attributes;
R_ki represents the confidence of the k-th sample when predicting its i-th attribute value;
step 5-2, filling all missing values through the self-associative neural network model at one time; without fixing in advance which attributes of a sample are missing, a predicted value is output for every attribute of the sample; when each attribute is predicted, the sample receives a new confidence degree, that is, a sample has as many confidence degrees as it has dimensions;
step 6, marking the missing values of the data set, returning the coordinate position information of the missing values in the data set, and pre-filling with the mean, mode or median according to the characteristics of the missing attribute;
step 7, dividing the data set into a training set and a testing set;
step 8, constructing a neural network and optimizing its transmission paths; the model predicts all attribute values of a sample, so the number of outputs equals the number of inputs; the predicted classification results are added to the output layer of the neural network; when a certain missing value is filled, the corresponding input quantity is removed from the training model;
step 9, performing one-hot coding on the classification attribute in the data set;
step 10, optimizing a loss function; applying the calculated multiple confidence coefficient to a loss function of the neural network model, and distinguishing the influence degree of each sample on model training through adjusting the confidence coefficient of the sample; and an early stop strategy is introduced, so that the prediction accuracy is maximized.
2. The method for repairing missing medical data values based on multiple confidence levels according to claim 1, wherein in the step 2,
the normalization formula of a data sample is as follows (1):

x_ij = (x - x_min) / (x_max - x_min)        (1)

wherein,
x_ij is the normalized value of the original data in the i-th row and j-th column;
x represents the data to be normalized in the sample;
x_max represents the maximum value of the data attribute;
x_min represents the minimum value of the data attribute.
3. The method for repairing missing medical data values based on multiple confidence levels according to claim 1, wherein in the step 3,
the correlation coefficient between features is calculated according to the following formula (2):

ρ_ij = cov(i, j) / sqrt(D_i · D_j)        (2)

wherein,
ρ_ij represents the correlation between attribute i and attribute j;
cov(i, j) represents the covariance of attribute i and attribute j;
D_i and D_j represent the variances of attribute i and attribute j.
4. The method for repairing missing medical data values based on multiple confidence levels according to claim 1, wherein in the step 4,
the specific weight calculation formula is as follows (3):

W_ij = |ρ_ij| / Σ_{k=1, k≠i}^{d} |ρ_ik|        (3)

wherein,
W_ij represents the weight of attribute j relative to attribute i;
d represents the total number of attributes of the sample.
5. The method for repairing missing medical data values based on multiple confidence levels as set forth in claim 1, wherein in step 8,
the specific network transfer functions are as follows:
the transfer formula from the input layer to the next layer is as follows (5):

Y_hj = g( Σ_{l=1, l≠j}^{d} W_lh · X_il + b_h )        (5)

the transfer formula between hidden layers is as follows (6):

Y_h'j = g( Σ_{h=1}^{n} W_hh' · Y_hj + b_h' )        (6)

when the predicted value is a continuous value, the output value of the network is as shown in formula (7):

Y_j = Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j        (7)

when the predicted value is a classification attribute, the output transfer function of the network is formula (8); when the predicted value is a multi-classification attribute, the output transfer function of the network is formula (9):

Y_j = f( Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j )        (8)

Y_j = h( Σ_{h'=1}^{n} W_h'j · Y_h'j + b_j )        (9)

in formulas (5), (6), (7), (8) and (9),
R_ij is the confidence of an incomplete sample, representing the confidence assigned to the sample when predicting the j-th attribute of the i-th sample;
g() represents the relu activation function;
f() represents the sigmoid activation function;
h() represents the softmax activation function;
Y_hj represents the output of the h-th neuron of the first hidden layer when predicting the j-th attribute;
Y_h'j represents the output of the h'-th neuron of the second hidden layer when predicting attribute j;
Y_j represents the output value of the network model;
W_lh represents the transfer weight between the l-th attribute of the input sample and the h-th neuron of the second layer of the network;
W_hh' represents the transfer weight between the h-th neuron of the first hidden layer and the h'-th neuron of the second hidden layer;
W_h'j represents the transfer weight between the h'-th neuron of the second hidden layer and the attribute j to be predicted in the output layer;
X_il represents the l-th attribute of the i-th input sample;
b_h, b_h' and b_j represent the bias terms of the network model;
n represents the number of neurons in a hidden layer;
d represents the number of attributes of the sample.
6. The method for repairing missing medical data values based on multiple confidence levels as set forth in claim 1, wherein in step 10,
the expression of the loss function is as follows (10):

loss = Σ_{j∈cont} R_ij (x_ij - y_ij)^2 - Σ_{j∈class} R_ij · z_ij · p_ij        (10)

wherein,
R_ij represents the confidence of the i-th sample when filling its j-th missing attribute;
cont indicates that the attribute value is a continuous numerical variable;
class represents a classification attribute;
x_ij represents the j-th attribute of the i-th sample, the attribute being a continuous variable;
y_ij represents the predicted value of the j-th attribute of the i-th sample;
z_ij represents the classification result of the j-th attribute of the i-th sample, expressed as a one-hot code;
p_ij represents the prediction result of the j-th classification attribute of the i-th sample.
7. The method for repairing missing medical data values based on multiple confidence levels as set forth in claim 1, wherein in step 10,
the early-stopping strategy is as follows: in the model training process, the optimal number of training iterations must be determined in order to obtain the best result, because too few training iterations cause under-fitting and too many cause over-fitting; to solve this problem, the early-stopping strategy is introduced: after each epoch, a test result is obtained on a validation set, and when, as the number of epochs increases to a certain value, the validation error changes from a decreasing trend to an increasing trend, training is stopped; the epoch at that moment is the optimal number of training iterations.
CN202310031008.3A 2023-01-10 2023-01-10 Medical data missing value repairing method based on multiple confidence degrees Pending CN116089801A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310031008.3A CN116089801A (en) 2023-01-10 2023-01-10 Medical data missing value repairing method based on multiple confidence degrees

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310031008.3A CN116089801A (en) 2023-01-10 2023-01-10 Medical data missing value repairing method based on multiple confidence degrees

Publications (1)

Publication Number Publication Date
CN116089801A true CN116089801A (en) 2023-05-09

Family

ID=86211607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310031008.3A Pending CN116089801A (en) 2023-01-10 2023-01-10 Medical data missing value repairing method based on multiple confidence degrees

Country Status (1)

Country Link
CN (1) CN116089801A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117421548A (en) * 2023-12-18 2024-01-19 四川互慧软件有限公司 Method and system for treating loss of physiological index data based on convolutional neural network
CN117421548B (en) * 2023-12-18 2024-03-12 四川互慧软件有限公司 Method and system for treating loss of physiological index data based on convolutional neural network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination