CN113611411B - Body examination aid decision-making system based on false negative sample identification - Google Patents
- Publication number
- CN113611411B (application number CN202111175001.6A)
- Authority
- CN
- China
- Prior art keywords
- physical examination
- sample
- samples
- false negative
- data
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/50—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Biomedical Technology (AREA)
- Public Health (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Pathology (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Theoretical Computer Science (AREA)
- Primary Health Care (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The invention discloses a physical examination assistant decision-making system based on false negative sample identification, comprising a data acquisition module, a data preprocessing module, a basic feature analysis module, a false negative sample identification module, a prediction model construction module and an assistant decision-making module. By simulating a general clinical diagnosis process, the invention analyzes how missing diagnoses arise in the data and models that process. The approach is therefore more consistent with clinical logic, can better discover false negative samples in real-world medical data, and improves the usefulness of real-world medical data for building physical examination assistant decision models and for clinical assistant decisions. The invention needs no additional data during modeling or clinical assistant decision-making, and it embeds the general clinical decision-making process into the model's design logic, so it generalizes well without introducing extra medical knowledge for each application case.
Description
Technical Field
The invention belongs to the technical field of medical health information, and particularly relates to a physical examination assistant decision-making system based on false negative sample identification.
Background
Retrospective clinical medical research and clinical decision support based on real-world clinical data (represented by electronic medical record data) have become common and important tools in current medical informatics research. Compared with prospective randomized controlled trials (RCTs), informatics modeling on retrospective real-world data offers larger data volumes, complete clinical scenarios and a patient distribution that closely resembles real practice, so it lies closer to actual diagnosis and treatment and has better clinical application value.
Physical examination is an important means of finding potential diseases, and test indexes such as routine blood and urine tests carry a large amount of health status information. Current physical examination procedures, however, can prompt screening for only a small fraction of diseases. Retrospective modeling based on electronic medical record data can greatly improve the ability of physical examination data to recognize diseases outside the current scope of physical examination findings, and so increases the health value of a single physical examination.
However, because real-world medical data come from complex sources, the accuracy and completeness of the data can be affected by the clinical process at the time each record is entered. One typical form of incompleteness is a missing positive label in the real diagnostic labels (that is, a false negative sample), which can strongly affect subsequent predictive modeling and clinical application. Reasons that may lead to a missing positive label include: 1) at the visit, other indexes or diseases unrelated to the target disease subjectively receive more of the recording doctor's attention; 2) at the visit, the registration department or the reason for seeking care does not match the target disease; 3) the doctor omits the disease when entering the record.
Because false negative samples are prevalent in real-world data, many studies have taken this issue into account. The technical schemes closest to this application are as follows. The first is positive and unlabeled learning (PU learning), which treats samples without a positive label as unlabeled samples that may be either positive or negative. Jinbo Chen et al. [1] eliminated the effect of false negative samples on the overall model by adjusting the sample weights: on the basis of a logistic regression algorithm, the overall positive sample proportion is taken as an additional unknown parameter, and its optimal value for the data set is obtained by maximizing a likelihood function containing the overall positive sample proportion and a weight matrix, so that the model's predicted values are corrected and the final prediction result is obtained. The second is characterization learning: Kavishwar B. Wagholikar et al. [2] and Yoni Halpern et al. [3] manually or semi-automatically construct a code set associated with the target diagnosis and screen the sample's additional associated data (such as text data and omics data) against that code set, so that unlabeled samples with a high probability of being positive are marked as positive and the overall influence of false negative samples on the modeling process is reduced.
Prior art similar to technical scheme 1 corrects the final model parameters by adjusting the loss function, the sample weights and the like during modeling. When setting the adjustment parameters, this technology simply assumes that the false negative samples in the data set are a random subset of the positive samples. It does not consider how false negative samples, that is, patients who actually have the target disease but were not diagnosed or did not seek diagnosis, arise in real medical scenarios. In fact, the distribution of false negative samples often differs greatly from a random distribution. The randomness assumption is inconsistent with how false negative samples actually arise, and this degrades actual clinical prediction performance.
Prior art similar to technical scheme 2 supplements the positive samples by means of characterization learning. However, characterization learning usually requires constructing a disease-specific term set, which demands considerable medical knowledge and hinders general use of the technology. This scheme also needs a large amount of additional medical data to discover false negative samples. For single-visit patient cases, which account for the majority of real-world data, there is not enough additional data, so characterization-learning-based methods cannot solve the false negative problem in real-world medical data.
[1] Zhang L, Ding X, Ma Y, et al. A maximum likelihood approach to electronic health record phenotyping using positive and unlabeled patients[J]. Journal of the American Medical Informatics Association, 2020, 27(1): 119-126.
[2] Wagholikar K B, Estiri H, Murphy M, et al. Polar labeling: silver standard algorithm for training disease classifiers[J]. Bioinformatics, 2020, 36(10): 3200-3206.
[3] Halpern Y, Horng S, Choi Y, et al. Electronic medical record phenotyping using the anchor and learn framework[J]. Journal of the American Medical Informatics Association, 2016, 23(4): 731-740.
Disclosure of Invention
Building on the basic setting of PU learning, the invention analyzes the general logic by which false negative samples arise in real-world medical data and proposes a feature-granularity hypothesis: the feature dimensions of physical examination data fall into two classes, directly related dimensions and competitive dimensions, which behave differently at the data level. This hypothesis replaces the prior art's default data-set-granularity assumption that false negative samples are randomly distributed, and it resolves the mismatch between that assumption and the distribution of real-world medical data in PU learning modeling. As a result, real-world data can be used more effectively, and the accuracy and scope of assistant decisions on potential diseases based on physical examination data improve. The invention determines, adaptively and in a data-driven manner, the influence of each clinical feature dimension on clinical disease diagnosis and on the recording of physical examination results. It is general across different target physical examination results and does not depend on a prior medical knowledge system, so it can be applied to the many diseases that can be preliminarily diagnosed from basic physiological indexes and routine test indexes, and it is particularly suitable for large-scale physical examination scenarios. The identification of false negative samples does not rely on an additional characterization mining process, so the analysis results are not affected when the medical data lack additional associated data.
The purpose of the invention is realized by the following technical scheme: a physical examination aid decision-making system based on false negative sample identification, comprising:
a data acquisition module: acquires a real-world physical examination data set and matrixes it into an original data set comprising an input feature matrix and real diagnostic labels, where the real diagnostic label is the physical examination result and a sample with a negative physical examination result is regarded as an unlabeled sample;
a data preprocessing module: forms a standardized data set by unifying the standard deviation and mean of each feature component in the original data set; separates the positive and negative half-axis components of each feature component in the standardized data set, and adds corresponding trainable upper and lower limit values to each positive and negative half-axis component to form an expanded data set;
a basic feature analysis module: uses a logistic regression model and treats the unlabeled samples as negative samples, training to obtain the weight of each feature dimension in producing the real diagnostic label when false negative samples are not considered;
a false negative sample identification module: divides the feature dimensions into directly related dimensions and competitive dimensions, where a directly related dimension directly influences the judgment of the target physical examination result from a medical perspective, and a competitive dimension does not directly influence that judgment but competes with the target physical examination result for attention, so that the target result may go unrecorded and a false negative sample is produced; constructs two logistic regression models and a joint loss function and trains them jointly, using the joint loss function to screen true negative and false negative samples, so that the directly related dimensions distinguish the positive samples from the screened suspected true negative samples as far as possible and the competitive dimensions distinguish the positive samples from the screened suspected false negative samples as far as possible; and indicates, through a false negative index, the likelihood that a sample is a false negative sample;
a prediction model construction module: constructs a multilayer neural network and a loss function that introduces the false negative index, and trains a physical examination assistant decision model based on the standardized data set and the false negative index;
an assistant decision module: obtains a standardized feature vector from an examinee's physical examination data through the data preprocessing module, obtains a prediction result through the physical examination assistant decision model, and outputs the prediction result to a clinician as the physical examination assistant decision result.
Further, in the data acquisition module, the characteristic dimensions of the physical examination data comprise basic physiological indexes and routine test indexes, the basic physiological indexes comprise height, weight, BMI, systolic pressure and diastolic pressure, and the routine test indexes comprise blood routine and urine routine; the real diagnosis label is a physical examination result.
Further, in the data acquisition module, the physical examination data set is matrixed into a raw data set $D = \{X, Y\}$, where $X = [x_1, \ldots, x_n]^{\top} \in \mathbb{R}^{n \times m}$ is the input feature matrix, $n$ is the sample size, $m$ is the total number of physical examination indexes, $x_1$ to $x_n$ represent the individual samples, $x^{(1)}$ to $x^{(m)}$ are the feature components of the original data set in each feature dimension, and $\top$ denotes the transpose; $Y = [y_1, \ldots, y_n]^{\top}$ holds the true diagnostic labels of the $n$ samples, where $y_i = 1$ means the $i$-th sample is a positive sample and $y_i = 0$ means the $i$-th sample is a true negative or false negative sample and is regarded as an unlabeled sample; the positive sample set is denoted $P$, the unlabeled sample set $U$, the true negative sample set $U_{TN}$ and the false negative sample set $U_{FN}$, with $U = U_{TN} \cup U_{FN}$ and $U_{TN} \cap U_{FN} = \emptyset$; the specific sample composition of $P$ and $U$ is known, while that of $U_{TN}$ and $U_{FN}$ is unknown.
Further, in the data preprocessing module, each feature component of $X$ is standardized so that the standard deviation of all physical examination data on each feature component is 1 and the mean is 0; the standardized feature matrix is denoted $\tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_n]^{\top}$, where $\tilde{x}_i$ is the $i$-th standardized sample and $\tilde{x}^{(j)}$ is the standardized $j$-th feature component; $\tilde{X}$ and $Y$ form the standardized data set $\tilde{D}$;
the positive and negative half-axis components of each standardized feature component are then separated and offset by trainable limit values, $\hat{X} = [\tilde{X}^{+}, \tilde{X}^{-}] + b$, where the $i$-th row $\hat{x}_i$ of $\hat{X}$ is the $i$-th sample after data expansion, and $\tilde{X}^{+}$ and $\tilde{X}^{-}$ are the positive and negative half-axis components of $\tilde{X}$, respectively; $b \in \mathbb{R}^{2m}$ is the offset vector formed by the trainable upper and lower limit values on each component, and the addition is performed by a broadcast mechanism; the trainable feature matrix $\hat{X}$ and $Y$ form the expanded data set $\hat{D}$.
Further, in the basic feature analysis module, the unlabeled samples are regarded as negative samples and a logistic regression model $f_0$ with loss function $L_0$ is constructed on the expanded data set $\hat{D}$, defined over the following quantities:
$w$ is the trainable feature weight vector and $c$ is the trainable intercept value; $\sigma(\cdot)$ is the sigmoid function; $d(\hat{x}_i) = w^{\top}\hat{x}_i + c$ is the decision function, whose value is the decision value; and $p_i = \sigma(d(\hat{x}_i))$ is the output probability of the logistic regression model $f_0$ after normalization by the sigmoid function.
Further, the false negative sample identification module comprises:
taking the feature weight vector $w$ obtained by training in the basic feature analysis module, and setting trainable non-negative matrices $M_A$ and $M_B$ whose sum is the identity matrix, $M_A + M_B = I$;
constructing two logistic regression models $f_A$ and $f_B$, with feature weight coefficients $M_A w$ and $M_B w$ and trainable intercept values $c_A$ and $c_B$, respectively; the output probabilities of the two logistic regression models after normalization by the sigmoid function are $p_A(\hat{x}_i) = \sigma((M_A w)^{\top}\hat{x}_i + c_A)$ and $p_B(\hat{x}_i) = \sigma((M_B w)^{\top}\hat{x}_i + c_B)$;
the joint training additionally uses a sample class weight and a screening coefficient, and three of the quantities involved do not take part in gradient back-propagation during model training;
for each sample $\hat{x}_i$ in the unlabeled sample set $U$, the models $f_A$ and $f_B$ give the direct probability $p_A(\hat{x}_i)$ and the attention probability $p_B(\hat{x}_i)$, respectively, and a false negative index $\theta_i$ indicates the probability that sample $\hat{x}_i$ is a false negative.
Further, in the false negative sample identification module, the logistic regression model $f_B$ uses a multiplication term to screen the unlabeled samples whose output probability $p_A$ predicted by $f_A$ is close to 1, and the screened unlabeled sample set is denoted $\hat{U}_{FN}$. $\hat{U}_{FN}$ and the positive sample set $P$ differ in the competitive dimension class $B$ features and should show no significant difference in the directly related dimension class $A$ features. By training the model $f_B$ with $P$ as the positive class and $\hat{U}_{FN}$ as the negative class, the class $B$ features among the feature dimensions are identified. Training optimizes the model parameters so that $P$ and $\hat{U}_{FN}$ are optimally distinguished: for samples in $\hat{U}_{FN}$ the attention probability $p_B$ tends to 0, and for samples in $P$ the attention probability $p_B$ tends to 1.
Further, in the false negative sample identification module, the logistic regression model $f_A$ uses a multiplication term to screen the unlabeled samples whose attention probability $p_B$ predicted by $f_B$ is close to 1, and the screened unlabeled sample set is denoted $\hat{U}_{TN}$. $\hat{U}_{TN}$ and the positive sample set $P$ differ in the directly related dimension class $A$ features and should show no significant difference in the competitive dimension class $B$ features. By training the model $f_A$ with $P$ as the positive class and $\hat{U}_{TN}$ as the negative class, the class $A$ features among the feature dimensions are identified. Training optimizes the model parameters so that $P$ and $\hat{U}_{TN}$ are optimally distinguished: for samples in $\hat{U}_{TN}$ the direct probability $p_A$ tends to 0, and for samples in $P$ the direct probability $p_A$ tends to 1.
Further, the prediction model construction module, based on the standardized data set $\tilde{D}$ and the false negative index $\theta_i$ of each sample, constructs a multilayer neural network $g$ whose input layer has $m$ nodes, whose output layer has 1 node with a sigmoid activation function, and whose set of inter-layer transfer matrices is $W$; the output for sample $\tilde{x}_i$ after the operation of $g$ is defined as $g(\tilde{x}_i)$, and the optimal parameters of $g$ are obtained by minimizing a loss function that introduces the false negative index;
the network $g$ with these optimal parameters then constitutes the physical examination assistant decision model optimized with the false negative index.
Further, in the assistant decision module, a single examinee's $m$ physical examination indexes corresponding to the feature dimensions are obtained through physical examination, and the data preprocessing module yields the standardized feature vector $\tilde{x}$; $\tilde{x}$ is input into the physical examination assistant decision model constructed by the prediction model construction module, which outputs the prediction result $g(\tilde{x})$; when $g(\tilde{x})$ tends to 1 the physical examination result tends to be positive, and when $g(\tilde{x})$ tends to 0 the physical examination result tends to be negative; the prediction result is provided to a clinician as the physical examination assistant decision result.
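The prediction step described above can be sketched as a forward pass through a small fully connected network with a single sigmoid output node. This is only an illustration under stated assumptions: the hidden-layer size, the `tanh` hidden activation, the random (untrained) weights and every name in the code are hypothetical, and the loss that introduces the false negative index is not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x_std, weights, biases):
    """Forward pass: hidden layers (tanh, an assumption) then one sigmoid output node."""
    h = x_std
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)
    # Output layer has a single node, so the result lies strictly between 0 and 1.
    return sigmoid(h @ weights[-1] + biases[-1])

rng = np.random.default_rng(0)
m, hidden = 6, 4  # hypothetical sizes: m standardized indexes, one hidden layer
weights = [rng.normal(size=(m, hidden)), rng.normal(size=(hidden, 1))]
biases = [np.zeros(hidden), np.zeros(1)]

x_new = rng.normal(size=(1, m))             # one examinee's standardized feature vector
score = forward(x_new, weights, biases)[0, 0]
tends_positive = score > 0.5                # 0.5 cut-off is an illustrative choice only
```

A score near 1 would be read as a result that tends to be positive, and a score near 0 as one that tends to be negative.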
The invention has the beneficial effects that:
1. Existing positive and unlabeled learning techniques treat missing clinical diagnoses as randomly occurring events. By simulating a general clinical diagnosis process, the invention analyzes how missing diagnoses arise in the data and models that process, which is more consistent with clinical logic, better discovers false negative samples in real-world medical data, and improves the usefulness of real-world medical data for building physical examination assistant decision models and for clinical assistant decisions.
2. Existing characterization learning techniques need a large amount of additional data and a certain amount of medical expertise to support the characterization mining process, and so generalize poorly. The invention needs no additional data during modeling or clinical assistant decision-making, and it embeds the general clinical decision-making process into the model's design logic, so it generalizes well without introducing extra medical knowledge for each application case.
Drawings
FIG. 1 is a block diagram of a physical examination assistant decision system based on false negative sample identification according to an embodiment of the present invention;
FIG. 2 is a flow chart of false negative sample identification provided by an embodiment of the present invention;
FIG. 3 is a flow chart of the construction of a physical examination assistant decision model optimized with the false negative index according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.
As shown in FIG. 1, an embodiment of the invention provides a physical examination assistant decision-making system based on false negative sample identification, which comprises a data acquisition module, a data preprocessing module, a basic feature analysis module, a false negative sample identification module, a prediction model construction module and an assistant decision-making module. The implementation of each module is explained in detail below.
Firstly, a data acquisition module: acquires a real-world physical examination data set and matrixes it into an original data set comprising an input feature matrix and real diagnostic labels, where the real diagnostic label is the physical examination result and a sample with a negative physical examination result is regarded as an unlabeled sample;
specifically, the data acquisition module acquires a real-world physical examination data set stored in a CSV file, including the feature dimensions and the real diagnostic labels. The feature dimensions of the physical examination data comprise basic physiological indexes and routine test indexes; the basic physiological indexes comprise height, weight, BMI, systolic pressure and diastolic pressure; the routine test indexes comprise routine blood tests (total protein, albumin, globulin, albumin/globulin ratio, glutamic-pyruvic transaminase, glutamic-oxaloacetic transaminase, alkaline phosphatase, cholinesterase, total bile acid, total bilirubin, direct bilirubin, indirect bilirubin, adenosine deaminase, glutamyl transpeptidase, glomerular filtration rate, creatinine, urea, uric acid, cystatin C, triglycerides, total cholesterol, high density lipoprotein-C, low density lipoprotein-C, very low density lipoprotein-C, fasting plasma glucose, potassium, sodium, chloride, total calcium, inorganic phosphorus, glycylproline dipeptidyl aminopeptidase, alpha-fucosidase) and routine urine tests (urine protein, urine ketones, urine glucose, urine bilirubin, urine sediment white cells, urine sediment red cells, urobilinogen, urine pH); the real diagnostic label is the physical examination result, for example the result of a diabetes diagnosis.
The physical examination data set is matrixed into a raw data set $D = \{X, Y\}$, where $X = [x_1, \ldots, x_n]^{\top} \in \mathbb{R}^{n \times m}$ is the input feature matrix; $n$ is the sample size and $m$ is the total number of physical examination indexes (its value in the example is set by the indexes used); $x_1$ to $x_n$ represent the individual samples, each embodied as a feature vector; $x^{(1)}$ to $x^{(m)}$ are the feature components of the original data set in each feature dimension; $\top$ denotes the transpose. $Y = [y_1, \ldots, y_n]^{\top}$ holds the true diagnostic labels of the $n$ samples, i.e. the target labels: $y_i = 1$ means the physical examination result of the $i$-th sample is positive, i.e. the sample is a positive sample; $y_i = 0$ means the physical examination result of the $i$-th sample is negative, in which case the sample may be a true negative or a false negative and is regarded as an unlabeled sample. The positive sample set, denoted $P$, includes all samples with $y_i = 1$; the unlabeled sample set, denoted $U$, includes all samples with $y_i = 0$; the true negative sample set is denoted $U_{TN}$ and the false negative sample set $U_{FN}$, with $U = U_{TN} \cup U_{FN}$ and $U_{TN} \cap U_{FN} = \emptyset$. The specific sample composition of $P$ and $U$ is known, while that of $U_{TN}$ and $U_{FN}$ is unknown.
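As a minimal sketch of this step, the code below splits a feature matrix and its recorded labels into the positive set and the unlabeled set. The function name, variable names and toy values are hypothetical; only the rule "label 1 is positive, label 0 is unlabeled" comes from the text.

```python
import numpy as np

def split_positive_unlabeled(X, y):
    """Return row indexes of the positive set and of the unlabeled set."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y must have the same number of samples")
    positive_idx = np.flatnonzero(y == 1)   # recorded positive result
    unlabeled_idx = np.flatnonzero(y == 0)  # true or false negative, not known which
    return positive_idx, unlabeled_idx

# Toy data: 4 samples, 3 indexes (values are made up for illustration).
X_toy = np.array([[170.0, 65.0, 5.1],
                  [160.0, 80.0, 7.9],
                  [175.0, 70.0, 5.4],
                  [165.0, 90.0, 8.3]])
y_toy = np.array([0, 1, 0, 0])
P_idx, U_idx = split_positive_unlabeled(X_toy, y_toy)
```

Which of the unlabeled rows are true negatives and which are false negatives cannot be read from the labels; estimating that is the job of the later modules.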
Secondly, a data preprocessing module: forms a standardized data set by unifying the standard deviation and mean of each feature component in the original data set; separates the positive and negative half-axis components of each feature component in the standardized data set, and adds corresponding trainable upper and lower limit values to each positive and negative half-axis component to form an expanded data set, as follows:
each feature component $x^{(j)}$ of $X$ is standardized component-wise so that the standard deviation of all physical examination data on that component is 1 and the mean is 0; the standardized feature matrix is denoted $\tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_n]^{\top}$, where $\tilde{x}_i$ is the $i$-th standardized sample, embodied as a feature vector; $\tilde{X}$ and $Y$ form the standardized data set $\tilde{D}$:
$$\tilde{x}^{(j)} = \frac{x^{(j)} - \mu_j}{\sigma_j}$$
where $\tilde{x}^{(j)}$ is the standardized $j$-th feature component, $\mu_j$ is the mean of the samples on component $x^{(j)}$, and $\sigma_j$ is the standard deviation of the samples on component $x^{(j)}$.
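The standardization step can be sketched in a few lines of NumPy. The function and variable names are hypothetical, and two details are assumptions not stated in the text: the population standard deviation (`ddof=0`) is used, and constant columns are guarded against division by zero.

```python
import numpy as np

def standardize(X):
    """Scale every column of X to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)                      # population std (assumption)
    sigma = np.where(sigma == 0, 1.0, sigma)   # guard for constant columns (assumption)
    return (X - mu) / sigma, mu, sigma

X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])
X_std, mu, sigma = standardize(X_demo)
```

Keeping `mu` and `sigma` lets the same transform be applied later to a new examinee's indexes.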
Because physical examination indexes in practice usually provide assistant decision information in the form "above the normal upper limit" or "below the normal lower limit", and the physical examination results suggested by these two kinds of information are not simply opposites, the data preprocessing of the invention treats positive and negative values separately and adds a trainable offset vector, so that the feature matrix of the expanded data set is closer to the clinical usage scenario. Specifically, the positive and negative half-axis components of each feature component $\tilde{x}^{(j)}$ of $\tilde{X}$ are separated to model the difference between the two kinds of assistant decision information, and an offset vector $b$ is added to model the normal upper and lower limits of the physical examination indexes:
$$\hat{X} = [\tilde{X}^{+}, \tilde{X}^{-}] + b$$
where $\tilde{X}^{+}$ and $\tilde{X}^{-}$ are the positive and negative half-axis components of $\tilde{X}$, respectively; $b \in \mathbb{R}^{2m}$ is the offset vector formed by the trainable upper and lower limit values on each component, and the addition is performed by a broadcast mechanism (broadcasting).
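One plausible reading of the half-axis separation is sketched below: the positive part keeps values above zero, the negative part keeps values below zero, and a length-2m offset is added to every row by broadcasting. The exact definition of the half-axis components and the sign convention of the offset are assumptions, and in the described system the offset would be a trainable parameter rather than a fixed array.

```python
import numpy as np

def expand(X_std, b):
    """Split standardized features into positive/negative parts and add an offset.

    X_std: array of shape (n, m); b: array of shape (2 * m,), broadcast over rows.
    """
    pos = np.maximum(X_std, 0.0)   # positive half-axis component (assumed form)
    neg = np.minimum(X_std, 0.0)   # negative half-axis component (assumed form)
    return np.concatenate([pos, neg], axis=1) + b

X_std_demo = np.array([[-1.0, 0.5],
                       [2.0, -0.3]])
b_zero = np.zeros(4)
b_demo = np.array([0.1, 0.1, -0.1, -0.1])   # illustrative values only
X_hat_zero = expand(X_std_demo, b_zero)
X_hat = expand(X_std_demo, b_demo)
```

With a zero offset the two halves add back up to the standardized matrix, which is a quick check that no information is lost by the split.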
Of the preprocessed data sets, the expanded data set $\hat{D}$ is used by the basic feature analysis module and the false negative sample identification module, and the standardized data set $\tilde{D}$ is used by the prediction model construction module and the assistant decision module.
Thirdly, a basic feature analysis module: using a logistic regression model and treating the unlabeled samples as negative samples, training yields the weight of each feature dimension in producing the real diagnostic label when false negative samples are not considered, as follows:
all unlabeled samples are regarded as negative samples, and a logistic regression model $f_0$ with loss function $L_0$ is constructed on the preprocessed expanded data set $\hat{D}$, defined over the following quantities:
$w$ is the trainable feature weight vector and $c$ is the trainable intercept value; $\hat{x}_i$ is the $i$-th sample after data expansion, embodied as a feature vector, and $y_i$ is the true diagnostic label of the $i$-th sample; $\sigma(\cdot)$ is the sigmoid function; $d(\hat{x}_i) = w^{\top}\hat{x}_i + c$ is the decision function, whose value is the decision value; and $p_i = \sigma(d(\hat{x}_i))$ is the output probability of the logistic regression model $f_0$ after normalization by the sigmoid function, i.e. the probability predicted by $f_0$ that sample $\hat{x}_i$ is positive. In the example, the model is trained by mini-batch gradient descent with a batch size of 500 samples.
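A minimal sketch of this baseline follows, assuming the standard binary cross-entropy loss for logistic regression (the loss formula itself is not reproduced in this text) and holding the input features fixed, whereas the described system also trains the offset vector. Only the batch size of 500 comes from the example; the learning rate, epoch count, names and synthetic data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, batch_size=500, epochs=50, lr=0.1, seed=0):
    """Mini-batch gradient descent on binary cross-entropy (assumed loss)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, c = np.zeros(d), 0.0
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            p = sigmoid(X[idx] @ w + c)   # output probability per sample
            err = p - y[idx]              # gradient of the loss w.r.t. the decision value
            w -= lr * X[idx].T @ err / len(idx)
            c -= lr * err.mean()
    return w, c

# Synthetic data: label 0 plays the role of "unlabeled treated as negative".
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = (X_demo[:, 0] > 0).astype(float)
w, c = train_logistic(X_demo, y_demo)
accuracy = np.mean((sigmoid(X_demo @ w + c) > 0.5) == (y_demo == 1))
```

The learned `w` is the kind of per-feature weight vector that the next module decomposes into two parts.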
Fourthly, a false negative sample identification module: divides the feature dimensions into directly related dimensions and competitive dimensions, where a directly related dimension directly influences the judgment of the target physical examination result from a medical perspective, and a competitive dimension does not directly influence that judgment but competes with the target physical examination result for attention, so that the target result may go unrecorded and a false negative sample is produced; constructs two logistic regression models and a joint loss function and trains them jointly, using the joint loss function to screen true negative and false negative samples, so that the directly related dimensions distinguish the positive samples from the screened suspected true negative samples as far as possible and the competitive dimensions distinguish the positive samples from the screened suspected false negative samples as far as possible; and indicates, through a false negative index, the likelihood that a sample is a false negative sample, as follows:
dividing the characteristic dimension into directly related dimensions based on the generation logic of physical examination results in physical examination clinical practiceClass and competition dimensionsAnd the two types are classified. It is defined as: dimension of direct correlationThe characteristics in the class directly influence the judgment of the target physical examination result from the medical perspective; dimension of competitionThe features in the class do not directly affect the determination of the target physical examination result from a medical perspective, but compete with the target physical examination result for attention, and thus may cause the target physical examination result to be missing and generate a false negative sample. From generating logically, feature weight vectorsIs generated under the combined action of the two characteristics. The core idea of the false negative sample identification module is to identify through data inductionAndthese two types of features allow the assessment of the likelihood that an unlabeled sample is false negative.
Take the feature weight vector obtained by training in the basic feature analysis module, and set two trainable non-negative matrices whose sum is the identity matrix. Then:
where the decision value contributed by the direct correlation class features distinguishes, to the maximum extent, the positive sample set from the true negative sample set, and the decision value contributed by the competition class features distinguishes, to the maximum extent, the positive sample set from the false negative sample set.
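The idea of splitting one weight vector between the two feature classes can be illustrated with a small sketch. It assumes, purely for illustration, that the two non-negative matrices are diagonal, so each feature has a share `a[j]` in the direct correlation class and `1 - a[j]` in the competition class; the names `split_weights`, `a`, `w_direct` and `w_compete` are hypothetical, and the intercept and offset terms are left out.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def split_weights(w, a):
    """Split w into a direct-correlation part and a competition part.

    Assumption: a[j] in [0, 1] is the j-th diagonal entry of the first
    non-negative matrix and 1 - a[j] that of the second, so the two
    matrices sum to the identity matrix.
    """
    if any(not 0.0 <= aj <= 1.0 for aj in a):
        raise ValueError("each share must lie in [0, 1]")
    w_direct = [aj * wj for aj, wj in zip(a, w)]
    w_compete = [(1.0 - aj) * wj for aj, wj in zip(a, w)]
    return w_direct, w_compete


w = [0.8, -1.2, 0.5]   # hypothetical trained feature weights
a = [1.0, 0.25, 0.0]   # hypothetical shares assigned to the direct class
x = [2.0, 1.0, -4.0]   # hypothetical sample

w_direct, w_compete = split_weights(w, a)
total = dot(w, x)
direct_part = dot(w_direct, x)
compete_part = dot(w_compete, x)
```

Because the two matrices sum to the identity, the decision values contributed by the two classes always add up to the original decision value, which is why the constraint ties the two models together during training.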
Based on the above recognition, the false negative sample identification module performs the following steps:
Two logistic regression models are constructed, each with its own feature weight coefficients and its own trainable intercept value. The output probabilities of the two logistic regression models after normalization by the sigmoid function are respectively expressed as:
Under the optimal feature classification, one model distinguishes the positive sample set from the true negative sample set to the maximum extent, and the other distinguishes the positive sample set from the false negative sample set to the maximum extent. The optimal values of the trainable parameters are therefore obtained by minimizing the joint loss function on the extended data set. The offset vector of the extended data set keeps the optimized value obtained when the basic feature analysis module was trained and is not trained further.
Here the sample class weight adjusts the proportion of the different classes of samples in training, and the screening coefficient controls the screening strength: the larger it is, the more strongly the joint loss function separates the unlabeled samples into false negative and true negative samples, but the lower the diversity of the screened samples. Specific values of both are set in the example. Some terms of the joint loss function are excluded from gradient back-propagation during model training. In the example, the two models are trained jointly using mini-batch gradient descent with a batch size of 500.
(1) For the model that outputs the attention probability, a multiplication term screens the unlabeled samples with a higher predicted output probability, and the screened set of unlabeled samples is recorded. Compared with the whole unlabeled sample set, the screened set contains a larger proportion of false negative samples. The screened set differs from the positive sample set in the features of the competition class, but there should be no significant difference in the features of the direct correlation class. The model can therefore be trained with the positive sample set as the positive class and the screened set as the negative class, so that it identifies which feature dimensions belong to the competition class. The training process simultaneously optimizes the parameters so that the two sets are optimally distinguished: for samples in the screened set the attention probability tends towards 0, and for samples in the positive sample set it tends towards 1.
(2) For the model that outputs the direct probability, a multiplication term screens the unlabeled samples with a higher predicted attention probability, and the screened set of unlabeled samples is recorded. Compared with the whole unlabeled sample set, the screened set contains a larger proportion of true negative samples. The screened set differs from the positive sample set in the features of the direct correlation class, but there should be no significant difference in the features of the competition class. The model can therefore be trained with the positive sample set as the positive class and the screened set as the negative class, so that it identifies which feature dimensions belong to the direct correlation class. The training process simultaneously optimizes the parameters so that the two sets are optimally distinguished: for samples in the screened set the direct probability tends towards 0, and for samples in the positive sample set it tends towards 1.
(3) Because a constraint links the two models during training, each parameter must be optimized by training the two models jointly with the joint loss function.
After the optimal parameters are obtained, each sample is passed through the two models to obtain its direct probability and its attention probability. If the sample is a false negative, its direct probability should tend towards 1 and its attention probability towards 0. A false negative index is used to indicate the probability that each sample is a false negative.
The flow of false negative sample identification is shown in FIG. 2.
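How the two probabilities might be combined into a single index can be sketched as follows. The exact formula of the false negative index is not recoverable from this text, so the product used here is an assumption chosen only because it is large when the direct probability is near 1 and the attention probability is near 0; the function name and the sample values are hypothetical.

```python
def false_negative_index(p_direct, p_attention):
    """Assumed index: high when the direct probability is high and the
    attention probability is low. This product form is an illustration,
    not the formula of the patent."""
    for p in (p_direct, p_attention):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_direct * (1.0 - p_attention)


# Hypothetical unlabeled samples: (direct probability, attention probability).
samples = {
    "likely_false_negative": (0.9, 0.1),
    "likely_true_negative": (0.1, 0.9),
    "positive_like": (0.9, 0.9),
}
indices = {name: false_negative_index(*p) for name, p in samples.items()}
ranked = sorted(indices, key=indices.get, reverse=True)
```

Any other function that increases with the direct probability and decreases with the attention probability would serve the same illustrative purpose.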
Fifth, the prediction model construction module: a multilayer neural network is constructed with a loss function that incorporates the false negative index, and a physical examination assistant decision model is trained on the standardized data set and the false negative index. The training process is as follows:
Based on the standardized data set and the false negative index of each sample, a multilayer neural network is built whose input layer is sized to the standardized feature vector, whose output layer has 1 node with a sigmoid activation function, and whose layers are connected by a set of transfer matrices. For each sample, the output of the neural network is defined, and all outputs together form a vector. The optimal parameters of the network are then obtained by minimizing a loss function that incorporates the false negative index.
The resulting network is the physical examination assistant decision model, optimized with the false negative index. The construction process of the physical examination assistant decision model is shown in FIG. 3.
In the example, a three-layer neural network is constructed: the input layer is sized to the standardized feature vector, the output layer has 1 node, and the intermediate layer has 20 nodes. The set of transfer matrices consists of the transfer matrix from the input layer to the intermediate layer and the transfer matrix from the intermediate layer to the output layer, and the activation functions between layers are {ReLU, sigmoid}. The model was trained using mini-batch gradient descent with a batch size of 500.
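The forward pass of the three-layer network in the example can be sketched as below. The layer sizes and activations (20 intermediate nodes, ReLU, then a sigmoid output) follow the text; the random initialization, the absence of bias terms and the names `init_network` and `forward` are assumptions made for this illustration. The loss that incorporates the false negative index is not sketched, since its form is not recoverable from this text.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def init_network(n_inputs, n_hidden=20, seed=0):
    """Create the two transfer matrices with small random values (assumed scheme)."""
    rng = random.Random(seed)
    s1 = 1.0 / math.sqrt(n_inputs)
    s2 = 1.0 / math.sqrt(n_hidden)
    W1 = [[rng.uniform(-s1, s1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-s2, s2) for _ in range(n_hidden)]
    return W1, W2


def forward(x, W1, W2):
    """Input layer -> ReLU intermediate layer -> single sigmoid output."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    z = sum(w * h for w, h in zip(W2, hidden))
    return sigmoid(z), hidden


# Hypothetical standardized feature vector with 5 feature dimensions.
x = [0.3, -1.2, 0.0, 2.1, -0.4]
W1, W2 = init_network(n_inputs=len(x))
prob, hidden = forward(x, W1, W2)
```

With untrained weights the output is only a placeholder probability; training would adjust `W1` and `W2` by mini-batch gradient descent as in the example.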
Sixth, the assistant decision module: based on the physical examination data of an examinee, a standardized feature vector is obtained through the data preprocessing module, a prediction result is obtained through the physical examination assistant decision model, and the prediction result is output to a clinician as the physical examination assistant decision result. The details are as follows:
The physical examination indexes obtained for a single examinee, one per feature dimension, are passed through the data preprocessing module to obtain the standardized feature vector. This vector is then input into the physical examination assistant decision model built by the prediction model construction module, which outputs the prediction result. When the prediction result tends to 1, the physical examination result tends to be positive; when it tends to 0, the physical examination result tends to be negative. The prediction result is provided to a clinician as the physical examination assistant decision result.
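The standardization applied to a new examinee before prediction can be sketched as follows. The sketch assumes that the mean and standard deviation of each feature component are computed once on the training data and reused at decision time; the function names and the toy height and weight values are hypothetical.

```python
import math


def fit_standardizer(rows):
    """Compute the per-feature mean and (population) standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(d)
    ]
    return means, stds


def standardize(x, means, stds):
    """Map a raw feature vector to mean 0 and standard deviation 1 per feature.

    A constant feature (standard deviation 0) is mapped to 0.0.
    """
    return [(xj - m) / s if s > 0 else 0.0 for xj, m, s in zip(x, means, stds)]


# Hypothetical training data: height (cm) and weight (kg).
train_rows = [[170.0, 60.0], [180.0, 80.0], [160.0, 70.0]]
means, stds = fit_standardizer(train_rows)

# A new examinee is standardized with the training statistics.
new_examinee = standardize([170.0, 70.0], means, stds)
train_std = [standardize(r, means, stds) for r in train_rows]
```

The standardized vector is what would be passed to the trained decision model, whose output near 1 or near 0 is then read as tending positive or negative.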
The foregoing is only a preferred embodiment of the present invention. Although the present invention has been disclosed through the preferred embodiments, they are not intended to limit it. Using the methods and techniques disclosed above, those skilled in the art can make many possible variations and modifications to the technical solution of the present invention, or modify it into equivalent embodiments, without departing from its scope. Therefore, any simple modification, equivalent change or adaptation made to the above embodiments according to the technical essence of the present invention remains within the scope of protection of the technical solution of the present invention, provided it does not depart from the content of that technical solution.
Claims (10)
1. A physical examination assistant decision system based on false negative sample identification, comprising:
a data acquisition module: acquiring a real-world physical examination data set and matrixing it into an original data set comprising an input feature matrix and true diagnostic labels, wherein a sample with a negative physical examination result is regarded as an unlabeled sample;
a data preprocessing module: forming a standardized data set by unifying the standard deviation and the mean value of each characteristic component in the original data set; separating positive and negative half-axis components of each characteristic component in the standardized data set, and adding corresponding trainable upper and lower limit values to each positive and negative half-axis component to form an expanded data set;
a basic feature analysis module: using a logistic regression model, regarding the unlabeled samples as negative samples, and training to obtain the feature weight of each feature dimension for generating the true diagnostic label without considering false negative samples;
a false negative sample identification module: the characteristic dimensions are divided into a direct correlation dimension and a competitive dimension, wherein the direct correlation dimension directly influences the judgment of the target physical examination result from a medical perspective, and the competitive dimension does not directly influence the judgment of the target physical examination result from a medical perspective but can compete with the target physical examination result for attention, so that the target physical examination result is lost, and a false negative sample is generated; constructing two logistic regression models and a joint loss function, performing joint training, screening true negative samples and false negative samples by using the joint loss function, enabling direct correlation dimensions to distinguish positive samples and screened suspected true negative samples to the maximum extent, and enabling competitive dimensions to distinguish positive samples and screened suspected false negative samples to the maximum extent; indicating, by the false negative indicator, a likelihood that the sample is a false negative sample;
a prediction model construction module: constructing a multilayer neural network and introducing a loss function of a false negative index, and training a physical examination assistant decision model based on a standardized data set and the false negative index;
an assistant decision module: based on physical examination data of a physical examination person, a standardized feature vector is obtained through a data preprocessing module, a prediction result is obtained through a physical examination assistant decision model, and the prediction result is output to a clinician as a physical examination assistant decision result.
2. A physical examination assistant decision system based on false negative sample identification as claimed in claim 1, characterized in that in the data acquisition module, the characteristic dimension of the physical examination data comprises basic physiological indexes and routine test indexes, the basic physiological indexes comprise height, weight, BMI, systolic pressure and diastolic pressure, and the routine test indexes comprise blood routine and urine routine; the real diagnosis label is a physical examination result.
3. The system of claim 1, wherein the data acquisition module matrixes the physical examination data set into an original data set consisting of an input feature matrix and true diagnostic labels; the size of the input feature matrix is given by the sample size and the total number of physical examination indexes; each sample is represented by the feature components of the original data set in each feature dimension, written with a transpose; each sample has a true diagnostic label, one value of which indicates a positive sample and the other a sample that is a true negative or a false negative and is regarded as an unlabeled sample; the samples form a positive sample set, an unlabeled sample set, a true negative sample set and a false negative sample set, the unlabeled sample set being made up of the true negative and false negative sample sets; the specific sample composition of the positive and unlabeled sample sets is known, and that of the true negative and false negative sample sets is unknown.
4. The system of claim 3, wherein the data preprocessing module standardizes each feature component so that, over all physical examination data, each feature component has a standard deviation of 1 and a mean of 0; the feature matrix after standardization is recorded, each of its samples being a standardized sample made up of the standardized feature components in each dimension, and the standardized feature matrix and the true diagnostic labels form the standardized data set;
wherein each sample after data expansion is formed from the positive and negative half-axis components of the standardized feature components, together with an offset vector formed from the trainable upper and lower limit values on each component, the addition being accomplished by a broadcast mechanism; the trainable feature matrix and the true diagnostic labels form the extended data set.
5. The system of claim 4, wherein the basic feature analysis module regards the unlabeled samples as negative samples and constructs a logistic regression model on the extended data set, the loss function of which is as follows:
wherein the trainable parameters are the feature weight vector and the intercept value; the decision function yields a decision value, and the sigmoid function normalizes it into the output probability of the logistic regression model.
6. The physical examination assistant decision system based on false negative sample identification as claimed in claim 5, wherein the false negative sample identification module comprises:
taking the feature weight vector obtained by training in the basic feature analysis module, and setting two trainable non-negative matrices whose sum is the identity matrix;
constructing two logistic regression models, each with its own feature weight coefficients and its own trainable intercept value, the output probabilities of the two logistic regression models after normalization by the sigmoid function being respectively expressed as:
wherein the joint loss function includes a sample class weight and a screening coefficient, and some of its terms are excluded from gradient back-propagation during model training;
7. The system of claim 6, wherein, for the logistic regression model that outputs the attention probability, a multiplication term screens the unlabeled samples whose predicted output probability approaches 1, and the screened set of unlabeled samples is recorded; the screened set differs from the positive sample set in the features of the competition class and should show no significant difference in the features of the direct correlation class; the model is trained with the positive sample set as the positive class and the screened set as the negative class, so that it identifies the feature dimensions belonging to the competition class; the training process optimizes the parameters so that the two sets are optimally distinguished, the attention probability tending towards 0 for samples in the screened set and towards 1 for samples in the positive sample set.
8. The system of claim 6, wherein, for the logistic regression model that outputs the direct probability, a multiplication term screens the unlabeled samples whose predicted attention probability approaches 1, and the screened set of unlabeled samples is recorded; the screened set differs from the positive sample set in the features of the direct correlation class and should show no significant difference in the features of the competition class; the model is trained with the positive sample set as the positive class and the screened set as the negative class, so that it identifies the feature dimensions belonging to the direct correlation class; the training process optimizes the parameters so that the two sets are optimally distinguished, the direct probability tending towards 0 for samples in the screened set and towards 1 for samples in the positive sample set.
9. The system of claim 6, wherein the prediction model construction module, based on the standardized data set and the false negative index of each sample, builds a multilayer neural network whose input layer is sized to the standardized feature vector, whose output layer has 1 node with a sigmoid activation function, and whose layers are connected by a set of transfer matrices; the output of the network is defined for each sample, and the optimal parameters of the network are obtained by minimizing a loss function that incorporates the false negative index;
10. The physical examination assistant decision system based on false negative sample identification as claimed in claim 9, wherein in the assistant decision module the physical examination indexes obtained for a single examinee, one per feature dimension, are passed through the data preprocessing module to obtain the standardized feature vector; this vector is input into the physical examination assistant decision model built by the prediction model construction module, which outputs the prediction result; when the prediction result tends to 1, the physical examination result tends to be positive, and when it tends to 0, the physical examination result tends to be negative; the prediction result is provided to a clinician as the physical examination assistant decision result.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111175001.6A CN113611411B (en) | 2021-10-09 | 2021-10-09 | Body examination aid decision-making system based on false negative sample identification |
PCT/CN2022/123731 WO2023056918A1 (en) | 2021-10-09 | 2022-10-07 | False negative sample recognition-based physical examination assistant decision-making system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111175001.6A CN113611411B (en) | 2021-10-09 | 2021-10-09 | Body examination aid decision-making system based on false negative sample identification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113611411A CN113611411A (en) | 2021-11-05 |
CN113611411B true CN113611411B (en) | 2021-12-31 |
Family
ID=78343379
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111175001.6A Active CN113611411B (en) | 2021-10-09 | 2021-10-09 | Body examination aid decision-making system based on false negative sample identification |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113611411B (en) |
WO (1) | WO2023056918A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113611411B (en) * | 2021-10-09 | 2021-12-31 | 浙江大学 | Body examination aid decision-making system based on false negative sample identification |
CN113990494B (en) * | 2021-12-24 | 2022-03-25 | 浙江大学 | Tic disorder auxiliary screening system based on video data |
CN117150369B (en) * | 2023-10-30 | 2024-01-26 | 恒安标准人寿保险有限公司 | Training method of overweight prediction model and electronic equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106383853A (en) * | 2016-08-30 | 2017-02-08 | 刘勇 | Realization method and system for electronic medical record post-structuring and auxiliary diagnosis |
CN109830303A (en) * | 2019-02-01 | 2019-05-31 | 上海众恒信息产业股份有限公司 | Clinical data mining analysis and aid decision-making method based on internet integration medical platform |
CN110084374A (en) * | 2019-04-24 | 2019-08-02 | 第四范式(北京)技术有限公司 | Construct method, apparatus and prediction technique, device based on the PU model learnt |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107887036A (en) * | 2017-11-09 | 2018-04-06 | 北京纽伦智能科技有限公司 | Construction method, device and the clinical decision accessory system of clinical decision accessory system |
CN107798390B (en) * | 2017-11-22 | 2023-03-21 | 创新先进技术有限公司 | Training method and device of machine learning model and electronic equipment |
US20210174448A1 (en) * | 2019-12-04 | 2021-06-10 | Michael William Kotarinos | Artificial intelligence decision modeling processes using analytics and data shapely for multiple stakeholders |
CN111180068A (en) * | 2019-12-19 | 2020-05-19 | 浙江大学 | Chronic disease prediction system based on multi-task learning model |
CN111312401B (en) * | 2020-01-14 | 2021-12-17 | 之江实验室 | After-physical-examination chronic disease prognosis system based on multi-label learning |
CN113611411B (en) * | 2021-10-09 | 2021-12-31 | 浙江大学 | Body examination aid decision-making system based on false negative sample identification |
Non-Patent Citations (2)
Title |
---|
A maximum likelihood approach to electronic health; Lingjiao Zhang et al.; Journal of the American Medical Informatics Association; 2019-11-13; full text * |
False positive rate control for positive unlabeled learning; Shuchen Kong et al.; Neurocomputing; 2019-08-08; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN113611411A (en) | 2021-11-05 |
WO2023056918A1 (en) | 2023-04-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||