CN113611411B - Body examination aid decision-making system based on false negative sample identification - Google Patents

Info

Publication number
CN113611411B
Authority
CN
China
Prior art keywords
physical examination
sample
samples
false negative
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111175001.6A
Other languages
Chinese (zh)
Other versions
CN113611411A (en)
Inventor
李劲松
吴承凯
周天舒
田雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202111175001.6A
Publication of CN113611411A
Application granted
Publication of CN113611411B
Priority to PCT/CN2022/123731 (WO2023056918A1)
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Theoretical Computer Science (AREA)
  • Primary Health Care (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a physical examination assistant decision-making system based on false negative sample identification, comprising a data acquisition module, a data preprocessing module, a basic feature analysis module, a false negative sample identification module, a prediction model construction module and an assistant decision-making module. By simulating a general clinical diagnosis process, the invention analyzes how missing diagnoses arise in the data and models that process; this is more consistent with clinical logic, better discovers false negative samples in real-world medical data, and improves the usefulness of real-world medical data for building physical examination assistant decision models and for clinical decision support. The invention requires no additional data during modeling or clinical decision support, and it embeds the general clinical decision-making process into the design logic of the model, so it generalizes well without introducing extra medical knowledge specific to each application case.

Description

Body examination aid decision-making system based on false negative sample identification
Technical Field
The invention belongs to the technical field of medical health information, and particularly relates to a physical examination assistant decision-making system based on false negative sample identification.
Background
Retrospective clinical research and clinical decision support based on real-world clinical data (typified by electronic medical record data) have become common and important tools in current medical informatics research. Compared with prospective randomized controlled trials (RCTs), informatics modeling on retrospective real-world data offers large data volumes, complete clinical scenarios and patient distributions close to those seen in practice, so it is closer to actual diagnosis and treatment settings and has better clinical application value.
Physical examination is an important means of finding potential diseases; test indexes such as routine blood and routine urine tests carry a large amount of health status information. However, current physical examination procedures can screen for only a small fraction of diseases. Retrospective modeling based on electronic medical record data can greatly improve the ability of physical examination data to identify diseases outside the current detection range, increasing the health value of a single physical examination.
However, because real-world medical data come from complex sources, their accuracy and completeness are affected by the clinical process at the time of data entry. One typical form of incompleteness is a missing positive label in the true diagnostic labels (i.e., a false negative sample), which can strongly affect subsequent predictive modeling and clinical application. Reasons a positive label may be missing include: 1) at the time of the visit, the doctor entering the record was subjectively more concerned with other, unrelated indexes or diseases; 2) the department the patient registered with, or the reason for the visit, did not match the target disease; 3) the doctor omitted the disease when entering diagnoses.
Because false negative samples are prevalent in real-world data, many studies have taken this issue into account. The technical solutions closest to the present application are as follows. First, positive and unlabeled learning (PU learning), which treats samples without a positive label as unlabeled samples that may be either positive or negative. Jinbo Chen et al. [1] eliminated the effect of false negative samples on the overall model by adjusting sample weights: on top of a logistic regression algorithm, the overall proportion of positive samples is treated as an additional unknown parameter, and its optimal value for the data set is obtained by maximizing a likelihood function that contains this proportion and a weight matrix; the model's predicted values are then corrected to give the final prediction. Second, characterization learning: Kavishwar B. Wagholikar et al. [2] and Yoni Halpern et al. [3] manually or semi-automatically constructed a code set associated with the target diagnosis and used it to screen additional associated data for each sample (such as text data and omics data), so that unlabeled samples with a high probability of being positive are marked as positive, reducing the overall influence of false negative samples on the modeling process.
Prior art similar to technical solution 1 corrects the final model parameters by adjusting the loss function, sample weights and the like during modeling. When setting the adjustment parameters, these techniques simply assume that the false negative samples in the data set are a random subset of the positive samples; they do not consider why false negative samples (patients who actually have the target disease but did not seek or did not receive a diagnosis for it) arise in real medical scenarios. In practice, the distribution of false negative samples often differs greatly from a random one. The randomness assumption is inconsistent with how false negative samples actually arise, which degrades clinical prediction performance.
Prior art similar to technical solution 2 supplements the positive samples by means of characterization learning. However, characterization learning often requires constructing, for each specific disease, a term set that demands a high level of medical knowledge, which hinders general use of the technique. It also needs a large amount of additional medical data in order to discover false negative samples. For single-visit patient cases, which account for the majority of real-world data, there is not enough additional data, so characterization-learning-based approaches cannot solve the false negative problem in real-world medical data.
[1] Zhang L, Ding X, Ma Y, et al. A maximum likelihood approach to electronic health record phenotyping using positive and unlabeled patients[J]. Journal of the American Medical Informatics Association, 2020, 27(1): 119-126.
[2] Wagholikar K B, Estiri H, Murphy M, et al. Polar labeling: silver standard algorithm for training disease classifiers[J]. Bioinformatics, 2020, 36(10): 3200-3206.
[3] Halpern Y, Horng S, Choi Y, et al. Electronic medical record phenotyping using the anchor and learn framework[J]. Journal of the American Medical Informatics Association, 2016, 23(4): 731-740.
Disclosure of Invention
Building on the basic setting of PU learning, the invention analyzes the general logic by which false negative samples arise in real-world medical data and proposes a feature-granularity hypothesis: the feature dimensions of physical examination data are divided into two classes, directly related dimensions and competitive dimensions, which behave differently at the data level. This hypothesis replaces the data-set-granularity default assumption of the prior art that false negative samples are randomly distributed, and resolves the mismatch between that assumption and the distribution of real-world medical data in PU learning. It thereby improves the ability to exploit real-world data and improves the accuracy and scope of the decision support that physical examination data can provide for potential diseases. In a data-driven manner, the invention adaptively determines how the data on each clinical feature dimension influence clinical disease diagnosis and the entry of physical examination results. It is general across different target physical examination results and does not depend on a prior medical knowledge system, so it can be applied to the many diseases that can be preliminarily diagnosed from basic physiological indexes and routine laboratory indexes, and it is particularly suitable for large-scale physical examination scenarios. Identification of false negative samples does not depend on an additional characterization mining process, so the analysis is not affected when the medical data lack additional associated data.
The purpose of the invention is realized by the following technical scheme: a physical examination aid decision-making system based on false negative sample identification, comprising:
a data acquisition module: acquires a real-world physical examination data set and matrixes it into an original data set comprising an input feature matrix and true diagnostic labels, where the true diagnostic label is the physical examination result and samples with a negative physical examination result are regarded as unlabeled samples;
a data preprocessing module: forms a standardized data set by unifying the standard deviation and mean of each feature component in the original data set; separates the positive and negative half-axis components of each feature component in the standardized data set, and adds corresponding trainable upper and lower limit values to each positive and negative half-axis component to form an extended data set;
a basic feature analysis module: using a logistic regression model and regarding the unlabeled samples as negative samples, trains to obtain the feature weight with which each feature dimension contributes to the true diagnostic label when false negative samples are not considered;
a false negative sample identification module: divides the feature dimensions into directly related dimensions and competitive dimensions, where a directly related dimension directly influences the judgment of the target physical examination result from a medical perspective, and a competitive dimension does not directly influence that judgment but can compete with the target physical examination result for attention, causing the target result to be omitted and a false negative sample to be generated; constructs two logistic regression models and a joint loss function and trains them jointly, using the joint loss function to screen true negative and false negative samples, so that the directly related dimensions maximally distinguish the positive samples from the screened suspected true negative samples and the competitive dimensions maximally distinguish the positive samples from the screened suspected false negative samples; and uses a false negative index to indicate the likelihood that a sample is a false negative sample;
a prediction model construction module: constructs a multilayer neural network and a loss function that introduces the false negative index, and trains a physical examination assistant decision model on the standardized data set and the false negative index;
an assistant decision module: from an examinee's physical examination data, obtains a standardized feature vector through the data preprocessing module, obtains a prediction result through the physical examination assistant decision model, and outputs the prediction result to a clinician as the physical examination assistant decision result.
Further, in the data acquisition module, the characteristic dimensions of the physical examination data comprise basic physiological indexes and routine test indexes, the basic physiological indexes comprise height, weight, BMI, systolic pressure and diastolic pressure, and the routine test indexes comprise blood routine and urine routine; the real diagnosis label is a physical examination result.
Further, in the data acquisition module, the physical examination data set is matrixed into an original data set $D = \{X, Y\}$, where $X = [x_1, x_2, \ldots, x_n]^T = [x^{(1)}, x^{(2)}, \ldots, x^{(m)}]$ is the input feature matrix, $n$ is the sample size, $m$ is the total number of physical examination indexes, $x_1$ to $x_n$ represent the individual samples, $x^{(1)}$ to $x^{(m)}$ are the feature components of the original data set in each feature dimension, and $T$ denotes the transpose; $Y = [y_1, y_2, \ldots, y_n]^T$ contains the true diagnostic labels of the $n$ samples, where $y_i = 1$ indicates that the $i$-th sample is a positive sample and $y_i = 0$ indicates that the $i$-th sample is a true negative or false negative sample and is regarded as an unlabeled sample. The positive sample set is denoted $P$, the unlabeled sample set $U$, the true negative sample set $N$ and the false negative sample set $FN$, with $U = N \cup FN$. The specific sample composition of $P$ and $U$ is known; that of $N$ and $FN$ is unknown.
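As a minimal illustration of the labeling convention above, the following Python sketch splits sample indices into the positive set and the unlabeled set from a 0/1 label vector. The function name `split_positive_unlabeled` and the toy labels are assumptions made for illustration; they do not come from the patent.

```python
def split_positive_unlabeled(labels):
    """Return (positive_idx, unlabeled_idx) for a list of 0/1 diagnostic labels.

    Label 1 means a recorded positive result; label 0 means "unlabeled",
    i.e. either a true negative or a false negative (not distinguishable here).
    """
    positive = [i for i, y in enumerate(labels) if y == 1]
    unlabeled = [i for i, y in enumerate(labels) if y == 0]
    return positive, unlabeled


# Toy labels for five samples (hypothetical data).
labels = [1, 0, 0, 1, 0]
positive_idx, unlabeled_idx = split_positive_unlabeled(labels)
```

Only the membership of the positive and unlabeled sets is observable at this stage; which unlabeled samples are false negatives is what the later modules try to estimate.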
Further, in the data preprocessing module, each feature component of $X$ is standardized so that, over all physical examination data, each feature component has standard deviation 1 and mean 0. The standardized feature matrix is denoted $\tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_n]^T = [\tilde{x}^{(1)}, \ldots, \tilde{x}^{(m)}]$, where $\tilde{x}_i$ is the $i$-th sample after standardization and $\tilde{x}^{(j)}$ is the standardized $j$-th feature component; $\tilde{X}$ and $Y$ form the standardized data set $\tilde{D} = \{\tilde{X}, Y\}$.
$\tilde{X}$ is expanded to form a trainable feature matrix $X^{e} = [x^{e}_1, \ldots, x^{e}_n]^T$:
[equation image not recovered]
where $x^{e}_i$ is the $i$-th sample after data expansion, and $\tilde{x}^{(j)+}$ and $\tilde{x}^{(j)-}$ are the positive and negative half-axis components of $\tilde{x}^{(j)}$, respectively; $c$ is the offset vector formed by the trainable upper and lower limit values on each component, and the addition of $c$ is carried out by a broadcast mechanism. The trainable feature matrix $X^{e}$ and $Y$ form the extended data set $D^{e} = \{X^{e}, Y\}$.
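The preprocessing steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the positive and negative half-axis components are taken to be max(z, 0) and min(z, 0), the standard deviation is the population standard deviation, and the offsets are fixed numbers here, whereas the patent describes them as trainable upper and lower limit values. The names `standardize` and `expand` are hypothetical.

```python
import math


def standardize(X):
    """Column-wise z-score so each feature has mean 0 and standard deviation 1."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    # Population standard deviation; a constant column falls back to 1.0.
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
            for j in range(m)]
    Z = [[(row[j] - means[j]) / stds[j] for j in range(m)] for row in X]
    return Z, means, stds


def expand(Z, offsets):
    """Split every feature into (positive part, negative part) and add offsets.

    Assumption: positive part = max(z, 0), negative part = min(z, 0).
    `offsets` has length 2 * m and is added to every row (broadcasting).
    """
    expanded = []
    for row in Z:
        halves = []
        for z in row:
            halves.extend([max(z, 0.0), min(z, 0.0)])
        expanded.append([h + b for h, b in zip(halves, offsets)])
    return expanded


# Toy data: three examinees, two indexes (hypothetical values).
X = [[170.0, 60.0], [180.0, 80.0], [160.0, 70.0]]
Z, means, stds = standardize(X)
offsets = [0.0] * (2 * len(X[0]))  # would be learned during training
X_expanded = expand(Z, offsets)
```

Splitting each feature into two half-axes lets a linear model assign different weights to abnormally high and abnormally low values of the same index.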
Further, in the basic feature analysis module, the unlabeled samples are regarded as negative samples, and a logistic regression model $f_0$ is constructed on the extended data set $D^{e}$. The loss function $L_0$ of $f_0$ is:
[equation image not recovered]
where $w$ is the trainable feature weight vector and $b$ is the trainable intercept; $\sigma(\cdot)$ is the sigmoid function, $z_i = w^T x^{e}_i + b$ is the decision function, and $p_0(x^{e}_i) = \sigma(z_i)$ is the output probability of the logistic regression model $f_0$ after normalization by the sigmoid function.
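The baseline step, in which unlabeled samples are simply treated as negatives, can be illustrated with a small logistic regression trained by gradient descent. This is a sketch only: it assumes the standard binary cross-entropy loss, which may differ from the loss in the patent's unrecovered equation image, and the function names, learning rate and toy data are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and intercept b by batch gradient descent on cross-entropy."""
    n, m = len(X), len(X[0])
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * m, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for j in range(m):
                grad_w[j] += (p - yi) * xi[j] / n
            grad_b += (p - yi) / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Output probability: sigmoid of the linear decision value."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# One standardized feature; label 0 means "unlabeled, treated as negative".
X = [[-1.2], [-0.8], [-0.1], [0.9], [1.2]]
y = [0, 0, 0, 1, 1]
w, b = train_logistic(X, y)
```

The learned weights indicate how strongly each feature dimension is associated with the recorded label before any correction for false negatives.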
Further, the false negative sample identification module operates as follows:
the feature weight vector $w$ obtained by training in the basic feature analysis module is taken, and trainable non-negative matrices $M_d$ and $M_c$ are set whose sum is the identity matrix, $M_d + M_c = I$ (the remaining constraints appear in equation images that were not recovered). Two logistic regression models $f_d$ and $f_c$ are constructed, with feature weight coefficients formed from $w$ and $M_d$, $M_c$ respectively, and with trainable intercepts $b_d$ and $b_c$ respectively. The output probabilities of the two logistic regression models after normalization by the sigmoid function are respectively expressed as:
[equation image not recovered]
where $p_d$ is the direct probability and $p_c$ is the attention probability;
using the extended data set $D^{e}$, the joint loss function $L_J$ is minimized to obtain the optimal parameters:
[equation image not recovered]
where $\alpha$ is the sample class weight and $\beta$ is the screening coefficient; three terms of the loss, given in equation images that were not recovered, do not take part in gradient back-propagation during model training;
for a sample $x^{e}_i$ in the unlabeled sample set, the direct probability $p_d(x^{e}_i)$ and the attention probability $p_c(x^{e}_i)$ are obtained through the models $f_d$ and $f_c$ respectively, and a false negative index $s_i$ [defining equation image not recovered] is used to indicate the probability that the sample $x^{e}_i$ is a false negative.
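One way to satisfy the constraint that two non-negative matrices sum to the identity is to make both diagonal and tie their diagonals through a gate between 0 and 1. The sketch below illustrates that idea only; the diagonal parameterization, the sigmoid gate and the name `split_weights` are assumptions for illustration and are not taken from the patent, whose exact matrices and formulas are in equation images not reproduced in this text.

```python
import math


def split_weights(w, gate_logits):
    """Split a weight vector into a 'direct' share and a 'competitive' share.

    Each gate g_j = sigmoid(logit_j) lies in (0, 1). Using diag(g) and
    diag(1 - g) as the two matrices keeps both non-negative and makes
    them sum to the identity, so the two shares always add back to w.
    """
    gates = [1.0 / (1.0 + math.exp(-t)) for t in gate_logits]
    w_direct = [g * wj for g, wj in zip(gates, w)]
    w_compete = [(1.0 - g) * wj for g, wj in zip(gates, w)]
    return gates, w_direct, w_compete


# Baseline weights for three features and hypothetical gate parameters.
w = [1.5, -0.7, 0.2]
gates, w_direct, w_compete = split_weights(w, [2.0, -2.0, 0.0])
```

With this parameterization, a gate near 1 assigns a feature mostly to the directly related model and a gate near 0 assigns it mostly to the competitive (attention) model; in training, the gate parameters would be learned jointly with the two intercepts.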
Further, in the false negative sample identification module, the logistic regression model $f_c$ uses a multiplicative term [equation image not recovered] to screen the unlabeled samples whose output probability $p_0$ predicted by $f_0$ is close to 1; the screened unlabeled sample set is denoted $U_c$. $U_c$ and the positive sample set $P$ differ on the features of the competitive-dimension class, denoted $\mathcal{C}$, and should show no significant difference on the features of the directly related class, denoted $\mathcal{R}$. By training the model $f_c$ with $P$ as the positive class and $U_c$ as the negative class, the features belonging to class $\mathcal{C}$ are identified among the feature dimensions. The training process optimizes the trainable parameters of $f_c$ to obtain the best distinction between $U_c$ and $P$: for samples in $U_c$ the attention probability $p_c$ tends towards 0, and for samples in $P$ the attention probability $p_c$ tends towards 1.
Further, in the false negative sample identification module, the logistic regression model $f_d$ uses a multiplicative term [equation image not recovered] to screen the unlabeled samples whose attention probability $p_c$ predicted by $f_c$ is close to 1; the screened unlabeled sample set is denoted $U_d$. $U_d$ and the positive sample set $P$ differ on the features of the directly related class $\mathcal{R}$ and should show no significant difference on the features of the competitive-dimension class $\mathcal{C}$. By training the model $f_d$ with $P$ as the positive class and $U_d$ as the negative class, the features belonging to class $\mathcal{R}$ are identified among the feature dimensions. The training process optimizes the trainable parameters of $f_d$ to obtain the best distinction between $U_d$ and $P$: for samples in $U_d$ the direct probability $p_d$ tends towards 0, and for samples in $P$ the direct probability $p_d$ tends towards 1.
Further, the prediction model construction module, based on the standardized data set $\tilde{D}$ and the false negative index $s_i$ of each sample, constructs a multilayer neural network $F$ whose input layer has $m$ nodes, whose output layer has 1 node with a sigmoid activation function, and whose set of inter-layer transfer matrices is $\Theta$. For a sample $\tilde{x}_i$, the output after the operation of $F$ is defined as $F(\tilde{x}_i)$. The optimal parameters of $F$ are obtained by minimizing the loss function $L_F$ that introduces the false negative index:
[equation image not recovered]
$F$ then constitutes the physical examination assistant decision model optimized by introducing the false negative index.
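To make the idea of a loss that "introduces the false negative index" concrete, the sketch below uses one plausible form: a cross-entropy in which each unlabeled sample receives a soft target equal to its false negative index instead of a hard 0. This specific form is an assumption for illustration; the patent's actual loss is given in an equation image that is not reproduced in this text. The name `fn_weighted_loss` and the toy numbers are hypothetical.

```python
import math


def fn_weighted_loss(probs, labels, fn_index, eps=1e-12):
    """Mean cross-entropy with soft targets for unlabeled samples.

    probs    -- model outputs in (0, 1)
    labels   -- 1 for recorded positives, 0 for unlabeled samples
    fn_index -- estimated probability that an unlabeled sample is a false
                negative (ignored for recorded positives)
    """
    total = 0.0
    for p, y, s in zip(probs, labels, fn_index):
        target = 1.0 if y == 1 else s  # soft target replaces the hard 0
        total -= target * math.log(p + eps) + (1.0 - target) * math.log(1.0 - p + eps)
    return total / len(probs)


# Sample 2 is unlabeled but looks positive (high model output, high index).
probs = [0.9, 0.8, 0.1]
labels = [1, 0, 0]
loss_with_index = fn_weighted_loss(probs, labels, fn_index=[0.0, 0.9, 0.05])
loss_plain = fn_weighted_loss(probs, labels, fn_index=[0.0, 0.0, 0.0])
```

Under this form, a model that assigns a high probability to a likely false negative is penalized far less than it would be if every unlabeled sample were treated as a certain negative.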
Further, in the assistant decision module, the $m$ physical examination indexes of a single examinee that correspond to the feature dimensions are obtained through physical examination, and the standardized feature vector $\tilde{x}$ is obtained by the data preprocessing module. $\tilde{x}$ is input into the physical examination assistant decision model constructed by the prediction model construction module, which outputs the prediction result $\hat{y}$. When $\hat{y}$ tends towards 1, the physical examination result tends to be positive; when $\hat{y}$ tends towards 0, the physical examination result tends to be negative. The prediction result is provided to a clinician as the physical examination assistant decision result.
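The inference path for a single examinee can be sketched as follows: standardize the raw indexes with the means and standard deviations stored from training, pass the vector through the trained model, and report the tendency. The 0.5 cut-off, the function `assist_decision` and the stand-in `toy_model` are assumptions made for illustration; the patent only states that outputs near 1 tend to be positive and outputs near 0 tend to be negative.

```python
import math


def assist_decision(raw_features, means, stds, model, threshold=0.5):
    """Standardize one examinee's indexes and return (probability, tendency)."""
    z = [(x - mu) / sd for x, mu, sd in zip(raw_features, means, stds)]
    p = model(z)
    return p, ("positive" if p >= threshold else "negative")


def toy_model(z):
    """Hypothetical stand-in for the trained decision model (not the patent's network)."""
    return 1.0 / (1.0 + math.exp(-(2.0 * z[0] - 1.0 * z[1])))


# Two made-up indexes with made-up training statistics.
prob, tendency = assist_decision(
    [7.8, 1.0], means=[5.5, 1.2], stds=[1.0, 0.4], model=toy_model
)
```

The probability itself, rather than only the thresholded tendency, is what would be shown to the clinician as decision support.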
The invention has the beneficial effects that:
1. Existing positive and unlabeled (PU) learning techniques treat missing clinical diagnoses as randomly occurring events. By simulating a general clinical diagnosis process, the invention analyzes how missing diagnoses arise in the data and models that process, which is more consistent with clinical logic, better discovers false negative samples in real-world medical data, and improves the usefulness of real-world medical data for building physical examination assistant decision models and for clinical decision support.
2. Existing characterization learning techniques need a large amount of additional data and a certain amount of medical expertise to support the characterization mining process, and they generalize poorly. The invention requires no additional data during modeling or clinical decision support, and it embeds the general clinical decision-making process into the design logic of the model, so it generalizes well without introducing extra medical knowledge specific to each application case.
Drawings
FIG. 1 is a block diagram of a physical examination assistant decision system based on false negative sample identification according to an embodiment of the present invention;
FIG. 2 is a flowchart of false negative sample identification according to an embodiment of the present invention;
FIG. 3 is a flowchart of constructing the physical examination assistant decision model optimized by introducing the false negative index according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. The invention may, however, be practiced in ways other than those described here, and those of ordinary skill in the art can make similar generalizations without departing from its spirit; the invention is therefore not limited to the specific embodiments disclosed below.
As shown in FIG. 1, an embodiment of the invention provides a physical examination assistant decision-making system based on false negative sample identification, comprising a data acquisition module, a data preprocessing module, a basic feature analysis module, a false negative sample identification module, a prediction model construction module and an assistant decision-making module. The implementation of each module is explained in detail below.
First, the data acquisition module: acquires a real-world physical examination data set and matrixes it into an original data set comprising an input feature matrix and true diagnostic labels, where the true diagnostic label is the physical examination result and samples with a negative physical examination result are regarded as unlabeled samples.
Specifically, the data acquisition module acquires a real-world physical examination data set stored in a csv file, including the feature dimensions and the true diagnostic labels. The feature dimensions of the physical examination data comprise basic physiological indexes and routine laboratory indexes. The basic physiological indexes are height, weight, BMI, systolic pressure and diastolic pressure. The routine laboratory indexes include routine blood tests (total protein, albumin, globulin, albumin/globulin ratio, alanine aminotransferase, aspartate aminotransferase, alkaline phosphatase, cholinesterase, total bile acid, total bilirubin, direct bilirubin, indirect bilirubin, adenosine deaminase, glutamyl transpeptidase, glomerular filtration rate, creatinine, urea, uric acid, cystatin C, triglycerides, total cholesterol, high density lipoprotein-C, low density lipoprotein-C, very low density lipoprotein-C, fasting plasma glucose, potassium, sodium, chloride, total calcium, inorganic phosphorus, glycylproline dipeptidyl aminopeptidase, alpha-fucosidase) and routine urine tests (urine protein, urine ketone, urine glucose, urine bilirubin, urine sediment white blood cells, urine sediment red blood cells, urobilinogen, urine pH). The true diagnostic label is the physical examination result, for example the result of a diabetes diagnosis.
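As an illustration of reading such a csv file into a feature matrix and a label vector, the following sketch uses Python's standard `csv` module. The column names, the label column `diabetes` and the in-memory sample text are hypothetical; a real file would have one column per physical examination index listed above.

```python
import csv
import io


def load_exam_csv(text, label_column):
    """Parse csv text into (feature_names, feature_rows, labels)."""
    reader = csv.DictReader(io.StringIO(text))
    feature_names = [c for c in reader.fieldnames if c != label_column]
    X, y = [], []
    for row in reader:
        X.append([float(row[c]) for c in feature_names])
        y.append(int(row[label_column]))
    return feature_names, X, y


# Tiny in-memory example standing in for the csv file (made-up values).
sample = "height,weight,fasting_glucose,diabetes\n170,65,5.1,0\n165,80,7.9,1\n"
names, X, y = load_exam_csv(sample, "diabetes")
```

For a file on disk, the same parsing can be applied to the file's text content, for example after reading it with `open(path, newline="")`.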
The physical examination data set is matrixed into a raw data set $D = \{X, Y\}$, where $X = [x_1, \dots, x_n]^{T} = [f_1, \dots, f_m]$ is the input feature matrix; $n$ is the number of samples and $m$ is the total number of physical examination indexes (specific values are used in the example); $x_1$ to $x_n$ represent the samples, each embodied as a feature vector; $f_1$ to $f_m$ are the feature components of the raw data set in each feature dimension; and $T$ denotes the transpose. $Y = [y_1, \dots, y_n]^{T}$ holds the true diagnosis labels, i.e. the target labels, of the $n$ samples: $y_i = 1$ indicates that the physical examination result of the $i$-th sample is positive, i.e. the sample is a positive sample; $y_i = 0$ indicates that the physical examination result of the $i$-th sample is negative, in which case the sample may be a true negative sample or a false negative sample and is regarded as an unlabeled sample. The set of positive samples is denoted $P$ and contains all samples with $y_i = 1$; the set of unlabeled samples is denoted $U$ and contains all samples with $y_i = 0$; the set of true negative samples is denoted $TN$ and the set of false negative samples $FN$, with $U = TN \cup FN$. The sample composition of $P$ and $U$ is known, whereas that of $TN$ and $FN$ is unknown.
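The data acquisition step can be illustrated with a minimal sketch. The CSV layout, the column names and the helper names `load_dataset` and `split_positive_unlabeled` are assumptions made for illustration; the source only states that the data are stored in a CSV file and that samples with a negative result are treated as unlabeled.

```python
import csv
import io

# Hypothetical CSV layout: one row per examinee, feature columns followed by
# a 0/1 "diagnosis" column. Column names are illustrative assumptions.
SAMPLE_CSV = """height,weight,systolic,diagnosis
170,65,120,0
165,80,145,1
180,90,150,0
158,52,110,0
"""

def load_dataset(text, label_column="diagnosis"):
    """Return (X, y, feature_names) parsed from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    feature_names = [c for c in reader.fieldnames if c != label_column]
    X, y = [], []
    for row in reader:
        X.append([float(row[name]) for name in feature_names])
        y.append(int(row[label_column]))
    return X, y, feature_names

def split_positive_unlabeled(y):
    """Indices of positive samples (y = 1) and unlabeled samples (y = 0)."""
    positive = [i for i, label in enumerate(y) if label == 1]
    unlabeled = [i for i, label in enumerate(y) if label == 0]
    return positive, unlabeled

X, y, names = load_dataset(SAMPLE_CSV)
P, U = split_positive_unlabeled(y)
```

Only the membership of the positive and unlabeled sets is available at this stage; which unlabeled samples are true negatives and which are false negatives remains unknown.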
Secondly, a data preprocessing module: forms a standardized data set by normalizing each feature component of the original data set to zero mean and unit standard deviation; separates the positive and negative half-axis components of each feature component in the standardized data set, and adds a corresponding trainable upper or lower limit value to each half-axis component to form an expanded data set, as follows:
Each feature component $f_j$ of $X$ is standardized component by component, so that all physical examination data have a standard deviation of 1 and a mean of 0 on that component. The standardized feature matrix is denoted $\tilde{X} = [\tilde{x}_1, \dots, \tilde{x}_n]^{T} = [\tilde{f}_1, \dots, \tilde{f}_m]$, where $\tilde{x}_i$ is the $i$-th standardized sample, embodied as a feature vector; $\tilde{X}$ and $Y$ form the standardized data set $\tilde{D} = \{\tilde{X}, Y\}$, with
$$\tilde{f}_j = \frac{f_j - \mu_j}{\sigma_j},$$
where $\tilde{f}_j$ is the standardized $j$-th feature component, $\mu_j$ is the mean of the $n$ samples on component $f_j$, and $\sigma_j$ is the standard deviation of the $n$ samples on component $f_j$.
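The standardization described above can be sketched as follows. The function name `standardize` is hypothetical, and the use of the population standard deviation (division by the number of samples) is an assumption, since the source does not state which convention it uses.

```python
import math

def standardize(X):
    """Column-wise z-score; returns (X_std, means, stds).

    Assumption: population standard deviation (divide by n).
    """
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n)
        for j in range(m)
    ]
    # Guard against constant columns to avoid division by zero.
    safe = [s if s > 0 else 1.0 for s in stds]
    X_std = [[(row[j] - means[j]) / safe[j] for j in range(m)] for row in X]
    return X_std, means, stds

X = [[170.0, 65.0], [160.0, 75.0], [180.0, 85.0]]
X_std, means, stds = standardize(X)
```

The returned means and standard deviations would need to be kept, because the same statistics have to be applied to a new examinee at decision time.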
In practical use, physical examination indexes usually provide assistant decision information in the form of "above the upper limit of normal" or "below the lower limit of normal", and the physical examination results suggested by these two types of information are not simply opposites of each other. The data preprocessing of the invention therefore treats positive and negative values separately and adds a trainable offset vector, so that the feature matrix of the expanded data set is closer to the clinical usage scenario. Specifically, the positive and negative half-axis components of each feature component $\tilde{f}_j$ of $\tilde{X}$ are separated to model the difference between the two types of assistant decision information, and an offset vector $\beta$ is added to model the upper and lower limits of normal of the physical examination indexes.
On this basis, $\tilde{X}$ is expanded into a trainable feature matrix $Z$: [formula]. Here $\tilde{f}_j^{+}$ and $\tilde{f}_j^{-}$ are the positive and negative half-axis components of $\tilde{f}_j$, respectively, and $\beta$ is the offset vector formed by the trainable upper and lower limit values on each component; its addition is carried out through a broadcasting mechanism.
The trainable feature matrix $Z$ and $Y$ form the expanded data set $D_Z = \{Z, Y\}$. Of the preprocessed data sets, the expanded data set $D_Z$ is used by the basic feature analysis module and the false negative sample identification module, and the standardized data set $\tilde{D}$ is used by the prediction model construction module and the assistant decision module.
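The half-axis expansion can be sketched as below. The column layout (positive part followed by negative part for each feature), the function names and the zero-initialized offsets are assumptions for illustration; the source only states that each standardized component is split into positive and negative half-axis parts and that a trainable offset vector is added by broadcasting.

```python
def expand(x_std, offsets):
    """Split each standardized value into positive and negative half-axis
    parts and add a per-column offset.

    Layout assumption: [f1+, f1-, f2+, f2-, ...]; `offsets` has length 2*m.
    """
    z = []
    for j, value in enumerate(x_std):
        z.append(max(value, 0.0) + offsets[2 * j])      # positive half-axis
        z.append(min(value, 0.0) + offsets[2 * j + 1])  # negative half-axis
    return z

def expand_matrix(X_std, offsets):
    # The same offset vector is applied to every row (broadcast over samples).
    return [expand(row, offsets) for row in X_std]

offsets = [0.0] * 4  # would be trainable; zero-initialized in this sketch
Z = expand_matrix([[1.5, -0.5], [-2.0, 0.0]], offsets)
```

With this layout, an index that is high and an index that is low activate different columns, so a downstream linear model can weight the two directions independently.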
Thirdly, a basic feature analysis module: using a logistic regression model and regarding unlabeled samples as negative samples, trains the feature weight with which each feature dimension contributes to the true diagnosis label, without considering false negative samples, as follows:
All unlabeled samples are regarded as negative samples, and a logistic regression model $M_0$ is constructed on the preprocessed expanded data set $D_Z$. The loss function $L_0$ of $M_0$ is: [formula], where $w$ is the trainable feature weight vector and $b$ is the trainable intercept; $z_i$ is the $i$-th sample after data expansion, embodied as a feature vector, and $y_i$ is the true diagnosis label of the $i$-th sample; $\sigma(\cdot)$ is the sigmoid function; $d(z_i) = w^{T} z_i + b$ is the decision function, whose value is the decision value; and $p_0(z_i) = \sigma(d(z_i))$ is the output probability of $M_0$ after normalization by the sigmoid function, i.e. the probability predicted by $M_0$ that sample $z_i$ is positive. In the example, the model is trained by mini-batch gradient descent with a batch size of 500.
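A minimal sketch of such a baseline model follows. It assumes the usual mean binary cross-entropy as the loss (the exact loss formula is not recoverable from this text) and uses hypothetical names (`train_logistic`, `predict_proba`), a hypothetical learning rate and a toy data set; only the mini-batch training scheme and the treatment of unlabeled samples as negatives are taken from the description.

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def predict_proba(w, b, z):
    """Sigmoid of the decision value w.z + b."""
    return sigmoid(sum(wj * zj for wj, zj in zip(w, z)) + b)

def train_logistic(Z, y, batch_size=500, epochs=200, lr=0.5, seed=0):
    """Mini-batch gradient descent on the mean binary cross-entropy."""
    rng = random.Random(seed)
    w, b = [0.0] * len(Z[0]), 0.0
    order = list(range(len(Z)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad_w, grad_b = [0.0] * len(w), 0.0
            for i in batch:
                # Derivative of the loss w.r.t. the decision value.
                err = predict_proba(w, b, Z[i]) - y[i]
                for j, zj in enumerate(Z[i]):
                    grad_w[j] += err * zj
                grad_b += err
            w = [wj - lr * g / len(batch) for wj, g in zip(w, grad_w)]
            b -= lr * grad_b / len(batch)
    return w, b

# Toy data: unlabeled samples (y = 0) are treated as negatives.
Z = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(Z, y, batch_size=3)
```

The learned weight vector plays the role of the feature weights that the next module splits between directly related and competing dimensions.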
Fourthly, a false negative sample identification module: divides the feature dimensions into directly related dimensions and competing dimensions, wherein a directly related dimension directly influences the determination of the target physical examination result from a medical perspective, whereas a competing dimension does not directly influence that determination but competes with the target physical examination result for attention, so that the target physical examination result may be missed and a false negative sample generated; constructs two logistic regression models and a joint loss function and trains them jointly, using the joint loss function to screen suspected true negative samples and suspected false negative samples, so that the directly related dimensions distinguish the positive samples from the screened suspected true negative samples to the greatest extent and the competing dimensions distinguish the positive samples from the screened suspected false negative samples to the greatest extent; and indicates, by a false negative index, the likelihood that a sample is a false negative sample; as follows:
Based on the logic by which physical examination results are generated in clinical practice, the feature dimensions are divided into two classes: directly related dimensions (class $R$) and competing dimensions (class $C$). They are defined as follows: features in the directly related class $R$ directly influence the determination of the target physical examination result from a medical perspective; features in the competing class $C$ do not directly influence that determination, but compete with the target physical examination result for attention and may therefore cause the target physical examination result to be missed and a false negative sample to be generated. In terms of generation logic, the feature weight vector $w$ is produced by the combined action of the two classes of features. The core idea of the false negative sample identification module is to identify the class $R$ and class $C$ features by data induction, so that the likelihood that an unlabeled sample is a false negative can be assessed.
Take the feature weight vector $w$ obtained by training in the basic feature analysis module and set trainable non-negative matrices $A_R$ and $A_C$ whose sum $A_R + A_C$ is the identity matrix $I$; then: [formula]. The decision value contributed by the class $R$ features, [formula], distinguishes the positive sample set $P$ from the true negative sample set $TN$ to the greatest extent; the decision value contributed by the class $C$ features, [formula], distinguishes the positive sample set $P$ from the false negative sample set $FN$ to the greatest extent.
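The idea of splitting the trained weights between the two feature classes can be sketched as follows. This is an illustration under stated assumptions, not the patented formula: the two non-negative matrices are assumed to be diagonal and are parametrized through a sigmoid so that they are non-negative and sum to the identity by construction. The names `split_weights` and `theta` are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def split_weights(w, theta):
    """Split w into a directly related part and a competing part.

    Assumption: the two non-negative matrices are diagonal, with entries
    a_j = sigmoid(theta_j) and 1 - a_j, so that they sum to the identity.
    """
    a = [sigmoid(t) for t in theta]
    w_direct = [aj * wj for aj, wj in zip(a, w)]
    w_compete = [(1.0 - aj) * wj for aj, wj in zip(a, w)]
    return w_direct, w_compete

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

w = [0.8, -1.2, 0.3]      # weights from the baseline logistic regression
theta = [2.0, -2.0, 0.0]  # hypothetical trainable split parameters
z = [1.0, 0.5, -1.0]      # one expanded sample
w_direct, w_compete = split_weights(w, theta)
total = dot(w, z)
parts = dot(w_direct, z) + dot(w_compete, z)
```

Because the two parts always add up to the original weights, the baseline decision value is preserved while its contributions are attributed to the two classes.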
Based on the above, the false negative sample identification module performs the following steps:
Two logistic regression models $M_R$ and $M_C$ are constructed, each with its own feature weight coefficients [formula] and its own trainable intercept [formula]. The output probabilities of the two logistic regression models after normalization by the sigmoid function are expressed as: [formula]. $p_R$ is called the direct probability and $p_C$ the attention probability.
Under the optimal feature classification, $M_R$ distinguishes the positive sample set $P$ from the true negative sample set $TN$ to the greatest extent, and $M_C$ distinguishes the positive sample set $P$ from the false negative sample set $FN$ to the greatest extent. The trainable parameters therefore include [formula]. The optimal parameters are obtained by minimizing the joint loss function $L_J$ on the expanded data set $D_Z$; the offset vector $\beta$ of $D_Z$ takes the optimized value obtained when $M_0$ was trained in the basic feature analysis module and is not trained further.
The joint loss function $L_J$ is: [formula], where $\alpha$ is the sample class weight, used to adjust the proportion of the different sample classes in training (a specific value is used in the example), and $\gamma$ is the screening coefficient: the larger $\gamma$ is, the more strongly the joint loss function screens the unlabeled samples into false negative and true negative samples, but the less diverse the screened samples become (a specific value is used in the example). The quantities [formula] do not take part in gradient back-propagation during model training. In the example, the joint training of $M_R$ and $M_C$ uses mini-batch gradient descent with a batch size of 500.
The construction logic of the joint loss function $L_J$ is as follows:
(1) For the model $M_C$, the multiplicative term [formula] screens the unlabeled samples for which the output probability $p_0$ predicted by $M_0$ is high; the screened set of unlabeled samples is denoted $U_F$. Compared with the whole unlabeled sample set $U$, $U_F$ contains a larger proportion of false negative samples. $U_F$ and the positive sample set $P$ differ in the features of the competing class $C$ but should show no significant difference in the features of the directly related class $R$; therefore, by training the model $M_C$ with $P$ as the positive class and $U_F$ as the negative class, the features that belong to the competing class $C$ can be identified. The training process simultaneously optimizes [formula] so that $U_F$ and $P$ are optimally distinguished: for samples in $U_F$, the attention probability $p_C$ tends towards 0, and for samples in $P$, the attention probability $p_C$ tends towards 1.
(2) For the model $M_R$, the multiplicative term [formula] screens the unlabeled samples for which the attention probability $p_C$ predicted by $M_C$ is high; the screened set of unlabeled samples is denoted $U_T$. Compared with the whole unlabeled sample set $U$, $U_T$ contains a larger proportion of true negative samples. $U_T$ and the positive sample set $P$ differ in the features of the directly related class $R$ but should show no significant difference in the features of the competing class $C$; therefore, by training the model $M_R$ with $P$ as the positive class and $U_T$ as the negative class, the features that belong to the directly related class $R$ can be identified. The training process simultaneously optimizes [formula] so that $U_T$ and $P$ are optimally distinguished: for samples in $U_T$, the direct probability $p_R$ tends towards 0, and for samples in $P$, the direct probability $p_R$ tends towards 1.
(3) Because the constraint [formula] holds during model training, the parameters must be optimized by jointly training the models $M_R$ and $M_C$ with the joint loss function $L_J$.
After the optimal parameters are obtained, for each sample in the unlabeled sample set $U$, the direct probability $p_R$ and the attention probability $p_C$ are obtained through the models $M_R$ and $M_C$, respectively. If the sample is a false negative, its $p_R$ should tend towards 1 and its $p_C$ towards 0; the false negative index [formula] is used to indicate the likelihood that each sample is a false negative.
The flow of false negative sample identification is shown in FIG. 2.
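How the two probabilities might be combined into a single score can be illustrated with a short sketch. The product form used here is purely an assumption chosen to match the stated behaviour (direct probability near 1 and attention probability near 0 for a false negative); the actual index formula is not recoverable from this text, and the function names are hypothetical.

```python
def false_negative_index(p_direct, p_attention):
    """Hypothetical index: high when p_direct -> 1 and p_attention -> 0."""
    return p_direct * (1.0 - p_attention)

def rank_unlabeled(probabilities):
    """Sort unlabeled sample ids by descending index.

    `probabilities` maps sample id -> (p_direct, p_attention).
    """
    scored = {k: false_negative_index(*v) for k, v in probabilities.items()}
    return sorted(scored, key=scored.get, reverse=True)

probs = {"a": (0.9, 0.1), "b": (0.9, 0.9), "c": (0.2, 0.1)}
ranking = rank_unlabeled(probs)
```

In this toy case sample "a" resembles the positives on the directly related features while receiving little attention, so it is ranked as the most likely false negative.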
Fifthly, a prediction model construction module: constructs a multilayer neural network and a loss function that incorporates the false negative index, and trains a physical examination assistant decision model on the standardized data set and the false negative index, as follows:
Based on the standardized data set $\tilde{D}$ and the false negative index of each sample, a multilayer neural network $M_N$ is built in which the number of input-layer nodes equals the number of feature dimensions $m$, the number of output-layer nodes is 1, the activation function of the output layer is the sigmoid function, and the set of transfer matrices between layers is $W$. For a sample $\tilde{x}_i$, the output of the neural network $M_N$ is denoted $\hat{y}_i$, and the vector formed by all outputs is denoted $\hat{Y}$. The optimal parameters of $M_N$ are then obtained by minimizing the loss function $L_N$ that incorporates the false negative index: [formula].
$M_N$ is then the constructed physical examination assistant decision model optimized with the false negative index. The construction process of the physical examination assistant decision model is shown in Fig. 3.
In the example, a three-layer neural network $M_N$ is constructed: the number of input-layer nodes equals the number of feature dimensions, the number of output-layer nodes is 1, the number of intermediate-layer nodes is 20, and the set of transfer matrices between layers is $W = \{W_1, W_2\}$, where $W_1$ is the transfer matrix from the input layer to the intermediate layer and $W_2$ is the transfer matrix from the intermediate layer to the output layer; the activation functions between layers are {ReLU, sigmoid}. Model training uses mini-batch gradient descent with a batch size of 500.
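The forward pass of the three-layer network in the example can be sketched as below. The random initialization range, the omission of bias terms (the text mentions only transfer matrices) and the function names are assumptions; the layer sizes and activation functions follow the example. The loss that incorporates the false negative index is not shown because its formula is not recoverable from this text.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def init_network(n_inputs, n_hidden=20, seed=0):
    """Random transfer matrices W1 (hidden x input) and W2 (1 x hidden).

    Bias terms are omitted; this is an assumption of the sketch.
    """
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
          for _ in range(n_hidden)]
    W2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]]
    return W1, W2

def forward(network, x_std):
    """Input -> ReLU hidden layer -> sigmoid output probability."""
    W1, W2 = network
    hidden = relu(matvec(W1, x_std))
    return sigmoid(matvec(W2, hidden)[0])

net = init_network(n_inputs=5)
prob = forward(net, [0.2, -1.0, 0.5, 0.0, 1.3])
```

The single sigmoid output can be read as the predicted probability that the examinee's result is positive.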
Sixthly, an assistant decision module: based on the physical examination data of an examinee, obtains a standardized feature vector through the data preprocessing module, obtains a prediction through the physical examination assistant decision model, and outputs the prediction to the clinician as the physical examination assistant decision result, as follows:
For a single examinee, the $m$ physical examination indexes corresponding to the feature dimensions are obtained by physical examination, and the standardized feature vector $\tilde{x}$ is obtained through the data preprocessing module. $\tilde{x}$ is then input into the physical examination assistant decision model built by the prediction model construction module, which outputs the prediction $\hat{y}$. When $\hat{y}$ tends towards 1, the physical examination result tends to be positive; when $\hat{y}$ tends towards 0, the physical examination result tends to be negative. The prediction is provided to the clinician as the physical examination assistant decision result.
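The decision step can be sketched end to end as follows. The function names, the stand-in model and the 0.5 cut-off used for the tendency label are illustrative assumptions; the source only states that outputs near 1 suggest a positive result and outputs near 0 a negative one.

```python
def standardize_examinee(raw, means, stds):
    """Apply training-set statistics to one examinee's raw index values."""
    return [(v - mu) / (sd if sd > 0 else 1.0)
            for v, mu, sd in zip(raw, means, stds)]

def assist_decision(raw, means, stds, model):
    """Return the model output and a coarse tendency label.

    `model` is any callable mapping a standardized vector to a probability;
    the 0.5 cut-off is an illustrative assumption.
    """
    prob = model(standardize_examinee(raw, means, stds))
    return prob, ("positive" if prob >= 0.5 else "negative")

# Stand-in model for demonstration only.
def toy_model(x_std):
    return 0.9 if sum(x_std) > 0 else 0.1

prob, tendency = assist_decision(
    [150.0, 95.0], [120.0, 80.0], [15.0, 10.0], toy_model
)
```

In practice the probability itself, rather than only the label, would be shown to the clinician, since the output is meant to assist the decision rather than replace it.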
The foregoing is only a preferred embodiment of the present invention. Although the invention has been disclosed by way of preferred embodiments, these are not intended to limit it. Using the methods and technical content disclosed above, those skilled in the art can make many possible variations and modifications to the technical solution of the invention, or modify it into equivalent embodiments, without departing from its scope. Therefore, any simple modification, equivalent change or adaptation made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the present invention.

Claims (10)

1. A physical examination assistant decision system based on false negative sample identification, comprising:
a data acquisition module: acquires a real-world physical examination data set and matrixes it into an original data set comprising an input feature matrix and true diagnosis labels, wherein a sample with a negative physical examination result is regarded as an unlabeled sample;
a data preprocessing module: forming a standardized data set by unifying the standard deviation and the mean value of each characteristic component in the original data set; separating positive and negative half-axis components of each characteristic component in the standardized data set, and adding corresponding trainable upper and lower limit values to each positive and negative half-axis component to form an expanded data set;
a basic feature analysis module: using a logistic regression model and regarding unlabeled samples as negative samples, trains the feature weight with which each feature dimension contributes to the true diagnosis label, without considering false negative samples;
a false negative sample identification module: the characteristic dimensions are divided into a direct correlation dimension and a competitive dimension, wherein the direct correlation dimension directly influences the judgment of the target physical examination result from a medical perspective, and the competitive dimension does not directly influence the judgment of the target physical examination result from a medical perspective but can compete with the target physical examination result for attention, so that the target physical examination result is lost, and a false negative sample is generated; constructing two logistic regression models and a joint loss function, performing joint training, screening true negative samples and false negative samples by using the joint loss function, enabling direct correlation dimensions to distinguish positive samples and screened suspected true negative samples to the maximum extent, and enabling competitive dimensions to distinguish positive samples and screened suspected false negative samples to the maximum extent; indicating, by the false negative indicator, a likelihood that the sample is a false negative sample;
a prediction model construction module: constructing a multilayer neural network and introducing a loss function of a false negative index, and training a physical examination assistant decision model based on a standardized data set and the false negative index;
an assistant decision module: based on physical examination data of a physical examination person, a standardized feature vector is obtained through a data preprocessing module, a prediction result is obtained through a physical examination assistant decision model, and the prediction result is output to a clinician as a physical examination assistant decision result.
2. A physical examination assistant decision system based on false negative sample identification as claimed in claim 1, characterized in that in the data acquisition module, the characteristic dimension of the physical examination data comprises basic physiological indexes and routine test indexes, the basic physiological indexes comprise height, weight, BMI, systolic pressure and diastolic pressure, and the routine test indexes comprise blood routine and urine routine; the real diagnosis label is a physical examination result.
3. The system of claim 1, wherein, in the data acquisition module, the physical examination data set is matrixed into a raw data set $D = \{X, Y\}$; $X = [x_1, \dots, x_n]^{T} = [f_1, \dots, f_m]$ is the input feature matrix, $n$ is the number of samples, $m$ is the total number of physical examination indexes, $x_1$ to $x_n$ represent the samples, $f_1$ to $f_m$ are the feature components of the raw data set in each feature dimension, and $T$ denotes the transpose; $Y = [y_1, \dots, y_n]^{T}$ holds the true diagnosis labels of the $n$ samples, $y_i = 1$ indicates that the $i$-th sample is a positive sample, and $y_i = 0$ indicates that the $i$-th sample is a true negative sample or a false negative sample and is regarded as an unlabeled sample; the positive sample set is denoted $P$, the unlabeled sample set $U$, the true negative sample set $TN$ and the false negative sample set $FN$, with $U = TN \cup FN$; the sample composition of $P$ and $U$ is known, and that of $TN$ and $FN$ is unknown.
4. The system of claim 3, wherein, in the data preprocessing module, each feature component of $X$ is standardized so that all physical examination data have a standard deviation of 1 and a mean of 0 on each feature component; the standardized feature matrix is denoted $\tilde{X} = [\tilde{x}_1, \dots, \tilde{x}_n]^{T} = [\tilde{f}_1, \dots, \tilde{f}_m]$, where $\tilde{x}_i$ is the $i$-th standardized sample and $\tilde{f}_j$ is the standardized $j$-th feature component; $\tilde{X}$ and $Y$ form the standardized data set $\tilde{D} = \{\tilde{X}, Y\}$; $\tilde{X}$ is expanded into a trainable feature matrix $Z$: [formula], where $z_i$ is the $i$-th sample after data expansion, $\tilde{f}_j^{+}$ and $\tilde{f}_j^{-}$ are the positive and negative half-axis components of $\tilde{f}_j$, respectively, and $\beta$ is the offset vector formed by the trainable upper and lower limit values on each component, the addition being carried out through a broadcasting mechanism; the trainable feature matrix $Z$ and $Y$ form the expanded data set $D_Z = \{Z, Y\}$.
5. The system of claim 4, wherein the basic feature analysis module regards unlabeled samples as negative samples and constructs a logistic regression model $M_0$ on the expanded data set $D_Z$; the loss function $L_0$ of $M_0$ is: [formula], where $w$ is the trainable feature weight vector and $b$ is the trainable intercept; $\sigma(\cdot)$ is the sigmoid function, $d(z_i) = w^{T} z_i + b$ is the decision function, whose value is the decision value, and $p_0(z_i) = \sigma(d(z_i))$ is the output probability of the logistic regression model $M_0$ after normalization by the sigmoid function.
6. The physical examination assistant decision system based on false negative sample identification as claimed in claim 5, wherein the false negative sample identification module comprises:
taking the feature weight vector $w$ obtained by training in the basic feature analysis module, and setting trainable non-negative matrices $A_R$ and $A_C$ whose sum $A_R + A_C$ is the identity matrix $I$;
constructing two logistic regression models $M_R$ and $M_C$, each with its own feature weight coefficients [formula] and its own trainable intercept [formula], the output probabilities of the two logistic regression models after normalization by the sigmoid function being expressed as: [formula], where $p_R$ is the direct probability and $p_C$ is the attention probability;
minimizing the joint loss function $L_J$ on the expanded data set $D_Z$ to obtain the optimal parameters: [formula], where $\alpha$ is the sample class weight, $\gamma$ is the screening coefficient, and the quantities [formula] do not take part in gradient back-propagation during model training;
for each sample in the unlabeled sample set, obtaining the direct probability $p_R$ and the attention probability $p_C$ through the models $M_R$ and $M_C$, respectively, and using the false negative index [formula] to indicate the likelihood that the sample is a false negative.
7. The system of claim 6, wherein, in the false negative sample identification module, the logistic regression model
Figure 992776DEST_PATH_IMAGE050
uses the multiplication term
Figure DEST_PATH_IMAGE067
to screen out the unlabeled samples for which the model
Figure 190539DEST_PATH_IMAGE036
predicts an output probability
Figure DEST_PATH_IMAGE068
approaching 1; the screened unlabeled sample set is denoted
Figure DEST_PATH_IMAGE069
Figure DEST_PATH_IMAGE070
and the positive sample set
Figure DEST_PATH_IMAGE071
differ in features of the competitive class
Figure DEST_PATH_IMAGE072
but should not differ significantly in features of the direct correlation class
Figure DEST_PATH_IMAGE073
so that, by training with
Figure 892522DEST_PATH_IMAGE071
as the positive class and
Figure 859341DEST_PATH_IMAGE069
as the negative class, the model
Figure 614676DEST_PATH_IMAGE050
identifies which feature dimensions belong to the competitive class
Figure 299736DEST_PATH_IMAGE072
and the training process optimizes
Figure DEST_PATH_IMAGE074
to obtain the optimal distinction between
Figure 385503DEST_PATH_IMAGE069
and
Figure 472408DEST_PATH_IMAGE071
so that for samples
Figure DEST_PATH_IMAGE075
the attention probability
Figure 883798DEST_PATH_IMAGE065
tends towards 0, and for samples
Figure DEST_PATH_IMAGE076
the attention probability
Figure 541306DEST_PATH_IMAGE065
tends towards 1.
8. The system of claim 6, wherein, in the false negative sample identification module, the logistic regression model
Figure 696344DEST_PATH_IMAGE049
uses the multiplication term
Figure DEST_PATH_IMAGE077
to screen out the unlabeled samples for which the model
Figure 372176DEST_PATH_IMAGE050
predicts an attention probability
Figure 220046DEST_PATH_IMAGE065
approaching 1; the screened unlabeled sample set is denoted
Figure DEST_PATH_IMAGE078
Figure 863386DEST_PATH_IMAGE078
and the positive sample set
Figure 822115DEST_PATH_IMAGE071
differ in features of the direct correlation class
Figure 352453DEST_PATH_IMAGE073
but should not differ significantly in features of the competitive class
Figure 371225DEST_PATH_IMAGE072
so that, by training with
Figure 518173DEST_PATH_IMAGE071
as the positive class and
Figure 280592DEST_PATH_IMAGE078
as the negative class, the model
Figure 196596DEST_PATH_IMAGE049
identifies which feature dimensions belong to the direct correlation class
Figure 602913DEST_PATH_IMAGE073
and the training process optimizes
Figure DEST_PATH_IMAGE079
to obtain the optimal distinction between
Figure 971577DEST_PATH_IMAGE078
and
Figure 272109DEST_PATH_IMAGE071
so that for samples
Figure DEST_PATH_IMAGE080
the direct probability
Figure DEST_PATH_IMAGE081
tends towards 0, and for samples
Figure DEST_PATH_IMAGE082
the direct probability
Figure 495148DEST_PATH_IMAGE081
tends towards 1.
9. The system of claim 6, wherein the prediction model construction module, based on the standardized data set
Figure DEST_PATH_IMAGE083
and the false negative index of each sample
Figure DEST_PATH_IMAGE084
constructs, with the number of input layer nodes being
Figure DEST_PATH_IMAGE085
the number of output layer nodes being 1, the activation function of the output layer being the sigmoid function, and the set of transfer matrices between layers being
Figure DEST_PATH_IMAGE086
a multi-layer neural network
Figure DEST_PATH_IMAGE087
for a sample
Figure DEST_PATH_IMAGE088
the output after computation by
Figure 481821DEST_PATH_IMAGE087
is defined as
Figure DEST_PATH_IMAGE089
by minimizing the loss function that introduces the false negative index
Figure DEST_PATH_IMAGE090
the optimal parameters of
Figure 72203DEST_PATH_IMAGE087
are obtained;
Figure DEST_PATH_IMAGE091
then
Figure 645266DEST_PATH_IMAGE087
is the physical examination assistant decision model constructed after optimization with the false negative index.
10. The physical examination assistant decision system based on false negative sample identification as claimed in claim 9, wherein the assistant decision module obtains, for a single physical examinee, the
Figure DEST_PATH_IMAGE092
physical examination indexes corresponding to the feature dimensions, and obtains, through the data preprocessing module, the standardized feature vector
Figure DEST_PATH_IMAGE093
the vector
Figure DEST_PATH_IMAGE094
is input into the physical examination assistant decision model constructed by the prediction model construction module, which outputs the prediction result
Figure DEST_PATH_IMAGE095
when
Figure DEST_PATH_IMAGE096
tends towards 1, the physical examination result tends to be positive, and when
Figure 926075DEST_PATH_IMAGE096
tends towards 0, the physical examination result tends to be negative; the prediction result is provided to a clinician as the physical examination assistant decision result.
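The two-model construction of claim 6 (two non-negative matrices summing to the identity, each scaling a shared feature weight vector for its own logistic regression) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the claim's formulas are not reproduced in this text, so the diagonal form of the matrices, the `mask_logits` parametrisation, the function names, and the assignment of each matrix to the direct and attention models are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic function used to normalise decision values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def split_weights(w, mask_logits):
    """Split a weight vector w with two complementary non-negative matrices.

    Assumption: the matrices are diagonal. M1 = diag(sigmoid(mask_logits))
    has entries in (0, 1), and M2 = I - M1, so both are non-negative and
    M1 + M2 is the identity matrix.
    """
    m = sigmoid(mask_logits)
    M1 = np.diag(m)
    M2 = np.eye(len(w)) - M1
    return M1 @ w, M2 @ w


def dual_probabilities(X, w, mask_logits, b_direct, b_attention):
    """Return (direct probability, attention probability) for each row of X.

    Assumption: M1 feeds the "direct" model and M2 the "attention" model;
    each model has its own intercept.
    """
    w_direct, w_attention = split_weights(w, mask_logits)
    p_direct = sigmoid(X @ w_direct + b_direct)
    p_attention = sigmoid(X @ w_attention + b_attention)
    return p_direct, p_attention


if __name__ == "__main__":
    X = np.array([[1.0, 2.0]])            # one standardized sample, two features
    w = np.array([2.0, -4.0])             # weights from a previously trained model
    p_d, p_a = dual_probabilities(X, w, np.zeros(2), 0.0, 0.0)
    print(p_d, p_a)
```

With `mask_logits` at zero each matrix is half the identity, so the two models start from identical weights; training would then move individual mask entries towards 0 or 1 to assign each feature to one model or the other.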
CN202111175001.6A 2021-10-09 2021-10-09 Body examination aid decision-making system based on false negative sample identification Active CN113611411B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111175001.6A CN113611411B (en) 2021-10-09 2021-10-09 Body examination aid decision-making system based on false negative sample identification
PCT/CN2022/123731 WO2023056918A1 (en) 2021-10-09 2022-10-07 False negative sample recognition-based physical examination assistant decision-making system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111175001.6A CN113611411B (en) 2021-10-09 2021-10-09 Body examination aid decision-making system based on false negative sample identification

Publications (2)

Publication Number Publication Date
CN113611411A CN113611411A (en) 2021-11-05
CN113611411B true CN113611411B (en) 2021-12-31

Family

ID=78343379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111175001.6A Active CN113611411B (en) 2021-10-09 2021-10-09 Body examination aid decision-making system based on false negative sample identification

Country Status (2)

Country Link
CN (1) CN113611411B (en)
WO (1) WO2023056918A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113611411B (en) * 2021-10-09 2021-12-31 浙江大学 Body examination aid decision-making system based on false negative sample identification
CN113990494B (en) * 2021-12-24 2022-03-25 浙江大学 Tic disorder auxiliary screening system based on video data
CN117150369B (en) * 2023-10-30 2024-01-26 恒安标准人寿保险有限公司 Training method of overweight prediction model and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis
CN109830303A (en) * 2019-02-01 2019-05-31 上海众恒信息产业股份有限公司 Clinical data mining analysis and aid decision-making method based on internet integration medical platform
CN110084374A (en) * 2019-04-24 2019-08-02 第四范式(北京)技术有限公司 Construct method, apparatus and prediction technique, device based on the PU model learnt

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107887036A (en) * 2017-11-09 2018-04-06 北京纽伦智能科技有限公司 Construction method, device and the clinical decision accessory system of clinical decision accessory system
CN107798390B (en) * 2017-11-22 2023-03-21 创新先进技术有限公司 Training method and device of machine learning model and electronic equipment
US20210174448A1 (en) * 2019-12-04 2021-06-10 Michael William Kotarinos Artificial intelligence decision modeling processes using analytics and data shapely for multiple stakeholders
CN111180068A (en) * 2019-12-19 2020-05-19 浙江大学 Chronic disease prediction system based on multi-task learning model
CN111312401B (en) * 2020-01-14 2021-12-17 之江实验室 After-physical-examination chronic disease prognosis system based on multi-label learning
CN113611411B (en) * 2021-10-09 2021-12-31 浙江大学 Body examination aid decision-making system based on false negative sample identification

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A maximum likelihood approach to electronic health; Lingjiao Zhang et al.; Journal of the American Medical Informatics Association; 20191113; full text *
False positive rate control for positive unlabeled learning; Shuchen Kong et al.; Neurocomputing; 20190808; full text *

Also Published As

Publication number Publication date
CN113611411A (en) 2021-11-05
WO2023056918A1 (en) 2023-04-13

Similar Documents

Publication Publication Date Title
CN113611411B (en) Body examination aid decision-making system based on false negative sample identification
Singh et al. Stacking-based multi-objective evolutionary ensemble framework for prediction of diabetes mellitus
Casiraghi et al. Explainable machine learning for early assessment of COVID-19 risk prediction in emergency departments
Ali et al. An approach based on mutually informed neural networks to optimize the generalization capabilities of decision support systems developed for heart failure prediction
CN109544518B (en) Method and system applied to bone maturity assessment
CN111312401B (en) After-physical-examination chronic disease prognosis system based on multi-label learning
CN111009321A (en) Application method of machine learning classification model in juvenile autism auxiliary diagnosis
CN109558896A (en) Disease intelligent analysis method and system based on ultrasound omics and deep learning
Ferrante et al. Artificial intelligence in the diagnosis of pediatric allergic diseases
Zhou et al. Cohesive multi-modality feature learning and fusion for COVID-19 patient severity prediction
JP2023184468A (en) Passage abnormality detection system based on adaptive resampling deep encoder network
CN113610118A (en) Fundus image classification method, device, equipment and medium based on multitask course learning
Ha et al. Fine-grained interactive attention learning for semi-supervised white blood cell classification
CN115798730A (en) Method, apparatus and medium for circular RNA-disease association prediction based on weighted graph attention and heterogeneous graph neural networks
CN117591953A (en) Cancer classification method and system based on multiple groups of study data and electronic equipment
CN111047590A (en) Hypertension classification method and device based on fundus images
CN114334162A (en) Intelligent prognosis prediction method and device for disease patient, storage medium and equipment
Maurya et al. Computer-aided diagnosis of auto-immune disease using capsule neural network
CN114300126A (en) Cancer prediction system based on early cancer screening questionnaire and feed-forward neural network
US20210158967A1 (en) Method of prediction of potential health risk
Mellal et al. CNN Models Using Chest X-Ray Images for COVID-19 Detection: A Survey.
Peng et al. Multi-view weighted feature fusion using cnn for pneumonia detection on chest x-rays
Reeha et al. Alzheimers Disease Detection Using MIC and MLP
Shaheen et al. Hi-Le and HiTCLe: Ensemble Learning Approaches for Early Diabetes Detection using Deep Learning and eXplainable Artificial Intelligence
Rabie et al. Diseases diagnosis based on artificial intelligence and ensemble classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant