CN113611411B

CN113611411B - Physical examination assistant decision-making system based on false negative sample identification

Info

Publication number: CN113611411B
Application number: CN202111175001.6A
Authority: CN
Inventors: 李劲松; 吴承凯; 周天舒; 田雨
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2021-10-09
Filing date: 2021-10-09
Publication date: 2021-12-31
Anticipated expiration: 2041-10-09
Also published as: CN113611411A; WO2023056918A1

Abstract

The invention discloses a physical examination assistant decision-making system based on false negative sample identification, comprising a data acquisition module, a data preprocessing module, a basic feature analysis module, a false negative sample identification module, a prediction model construction module and an assistant decision module. By simulating a general clinical diagnosis process, the invention analyzes how missing diagnoses arise in the data and models that process; it is therefore more consistent with clinical logic, can better discover false negative samples in real-world medical data, and improves the usability of real-world medical data for building physical examination assistant decision models and for clinical assistant decisions. The invention needs no additional data for modeling or for clinical assistant decision-making, and embeds the general clinical decision process into the development logic of the model, so it generalizes well without introducing additional medical knowledge specific to each application case.

Description

Physical examination assistant decision-making system based on false negative sample identification

Technical Field

The invention belongs to the technical field of medical health information, and particularly relates to a physical examination assistant decision-making system based on false negative sample identification.

Background

Retrospective clinical research and clinical decision support based on real-world clinical data (typified by electronic medical record data) have become common and important tools in current medical informatics research. Compared with prospective randomized controlled trials (RCTs), informatics modeling on retrospective real-world data benefits from large data volumes, complete clinical scenarios and a patient distribution close to that of actual practice, so it is closer to real diagnosis and treatment settings and has greater clinical application value.

Physical examination is an important means of discovering latent disease, and test indicators such as routine blood and routine urine tests carry a large amount of health status information. Current physical examination procedures, however, can prompt screening for only a small fraction of diseases. Retrospective modeling on electronic medical record data can greatly improve the ability of physical examination data to identify diseases outside the current scope of physical examination findings, and thereby increase the health value of a single examination.

However, because real-world medical data come from complex sources, the accuracy and completeness of the recorded data can be affected by the clinical process at the time of entry. A typical form of incompleteness is a positive label missing from a sample's true diagnosis labels (i.e., a false negative sample), which can strongly affect subsequent predictive modeling and clinical application. Possible reasons for a missing positive label include: 1) at the time of the visit, other unrelated indicators or diseases drew more of the recording physician's subjective attention; 2) the department of registration or the reason for the visit did not match the target disease; 3) the physician omitted the disease when entering diagnoses.

Because false negative samples are widespread in real-world data, many studies have addressed this problem. The technical solutions closest to the present application are as follows. First, positive and unlabeled learning (PU learning), which treats the samples without a positive label as unlabeled samples that may be either positive or negative. Jinbo Chen et al. [1] eliminated the influence of false negative samples on the overall model by adjusting sample weights: building on the logistic regression algorithm, they take the overall proportion of positive samples as an additional unknown parameter, obtain its optimal value for the data set by maximizing a likelihood function containing that proportion and a weight matrix, and then correct the model's predicted values to obtain the final prediction. Second, characterization (representation) learning: Kavishwar B. Wagholikar et al. [2] and Yoni Halpern et al. [3] manually or semi-automatically construct a code set associated with the target diagnosis and use it to screen additional data associated with each sample (such as text data and omics data), so that unlabeled samples with a high probability of being positive are marked as positive, reducing the overall influence of false negative samples on modeling.

The prior art of technical solution 1 corrects the final model parameters by adjusting the loss function, the sample weights and the like during modeling. When setting these adjustment parameters, such techniques simply assume that the false negative samples in the data set are a random subset of the positive samples; they do not consider the actual reasons why, in a real medical scenario, a patient who is truly positive for the target disease goes undiagnosed or unrecorded. In practice, the distribution of false negative samples often differs greatly from a random one. The randomness assumption is therefore inconsistent with how false negative samples actually arise, which degrades clinical prediction performance.

The prior art of technical solution 2 supplements the positive samples by means of characterization learning. However, characterization learning often requires constructing, for each specific disease, a term set that demands considerable medical knowledge, which hinders general use of the technique. This solution also needs a large amount of additional medical data in order to discover false negative samples. For single-visit patient cases, which make up the majority of real-world data, sufficient additional data are lacking, so characterization-learning-based methods cannot solve the false negative problem in real-world medical data.

[1]Zhang L, Ding X, Ma Y, et al. A maximum likelihood approach to electronic health record phenotyping using positive and unlabeled patients[J]. Journal of the American Medical Informatics Association, 2020, 27(1): 119-126.

[2]Wagholikar K B, Estiri H, Murphy M, et al. Polar labeling: silver standard algorithm for training disease classifiers[J]. Bioinformatics, 2020, 36(10): 3200-3206.

[3]Halpern Y, Horng S, Choi Y, et al. Electronic medical record phenotyping using the anchor and learn framework[J]. Journal of the American Medical Informatics Association, 2016, 23(4): 731-740.

Disclosure of Invention

Building on the basic setting of PU learning, the invention analyzes the general logic by which false negative samples arise in real-world medical data and proposes a feature-granularity hypothesis: the feature dimensions of physical examination data fall into two classes, directly related dimensions and competitive dimensions, which manifest differently at the data level. This hypothesis replaces the default data-set-granularity assumption of the prior art that false negative samples are randomly distributed, and resolves the mismatch between that assumption and the actual distribution of real-world medical data in PU learning modeling. It thereby improves the usability of real-world data and the accuracy and scope of assistant decisions on latent diseases from physical examination data. In a data-driven manner, the invention adaptively determines the influence of each clinical feature dimension on clinical disease diagnosis and on the recording of physical examination results. It is general across different target physical examination results and does not depend on a prior medical knowledge system, so it can be applied to the many diseases that can be preliminarily diagnosed from basic physiological indicators and routine test indicators, and is particularly suited to large-scale physical examination scenarios. The identification of false negative samples does not depend on an additional characterization mining process, so the analysis is unaffected when the medical data lack additional associated data.

The purpose of the invention is realized by the following technical scheme: a physical examination aid decision-making system based on false negative sample identification, comprising:

a data acquisition module: acquires a real-world physical examination data set and matrixes it into an original data set comprising an input feature matrix and true diagnosis labels, wherein the true diagnosis label is the physical examination result, and a sample with a negative physical examination result is regarded as an unlabeled sample;

a data preprocessing module: forms a standardized data set by unifying the standard deviation and the mean of each feature component in the original data set; separates the positive and negative half-axis components of each feature component in the standardized data set, and adds corresponding trainable upper and lower limit values to each positive and negative half-axis component to form an expanded data set;

a basic feature analysis module: using a logistic regression model and regarding the unlabeled samples as negative samples, trains to obtain the weight of each feature dimension on the generation of the true diagnosis label, without considering false negative samples;

a false negative sample identification module: divides the feature dimensions into directly related dimensions and competitive dimensions, wherein a directly related dimension directly influences the judgment of the target physical examination result from a medical perspective, and a competitive dimension does not directly influence that judgment but competes with the target physical examination result for attention, so that the target result may be missing and a false negative sample generated; constructs two logistic regression models and a joint loss function and trains them jointly, using the joint loss function to screen true negative and false negative samples, so that the directly related dimensions maximally distinguish positive samples from the screened suspected true negative samples and the competitive dimensions maximally distinguish positive samples from the screened suspected false negative samples; and indicates, by a false negative index, the likelihood that a sample is a false negative sample;

a prediction model construction module: constructs a multilayer neural network and a loss function that incorporates the false negative index, and trains a physical examination assistant decision model on the standardized data set and the false negative index;

an assistant decision module: from the physical examination data of an examinee, obtains a standardized feature vector through the data preprocessing module, obtains a prediction through the physical examination assistant decision model, and outputs the prediction to a clinician as the physical examination assistant decision result.

Further, in the data acquisition module, the characteristic dimensions of the physical examination data comprise basic physiological indexes and routine test indexes, the basic physiological indexes comprise height, weight, BMI, systolic pressure and diastolic pressure, and the routine test indexes comprise blood routine and urine routine; the real diagnosis label is a physical examination result.

Further, in the data acquisition module, the physical examination data set is matrixed into a raw data set $\mathcal{D}=\{X,Y\}$. $X=[x_1,\dots,x_N]^{T}=[c_1,\dots,c_M]$ is the input feature matrix, $N$ is the sample size, $M$ is the total number of physical examination indicators, $x_1$ to $x_N$ represent the individual samples, $c_1$ to $c_M$ are the feature components of the raw data set in each feature dimension, and $T$ denotes transposition. $Y=[y_1,\dots,y_N]^{T}$, where $y_i$ is the true diagnosis label of the $i$-th sample: $y_i=1$ means the $i$-th sample is a positive sample, and $y_i=0$ means the $i$-th sample is a true negative or false negative sample and is regarded as an unlabeled sample. The positive sample set is denoted $P$, the unlabeled sample set $U$, the true negative sample set $TN$ and the false negative sample set $FN$, with $U=TN\cup FN$ and $TN\cap FN=\varnothing$. The specific sample composition of $P$ and $U$ is known; that of $TN$ and $FN$ is unknown.

Further, in the data preprocessing module, each feature component $c_j$ of $X$ is standardized so that the standard deviation of all physical examination data on each feature component is 1 and the mean is 0. The standardized feature matrix is denoted $\tilde{X}=[\tilde{x}_1,\dots,\tilde{x}_N]^{T}=[\tilde{c}_1,\dots,\tilde{c}_M]$, where $\tilde{x}_i$ is the $i$-th standardized sample and $\tilde{c}_j$ is the standardized $j$-th feature component; $\tilde{X}$ and $Y$ form the standardized data set $\tilde{\mathcal{D}}=\{\tilde{X},Y\}$.

$\tilde{X}$ is expanded to form the trainable feature matrix $X^{e}$:

$$X^{e}=[x^{e}_1,\dots,x^{e}_N]^{T}=[\tilde{c}_1^{+},\tilde{c}_1^{-},\dots,\tilde{c}_M^{+},\tilde{c}_M^{-}]+\beta$$

where $x^{e}_i$ is the $i$-th sample after data expansion, $\tilde{c}_j^{+}$ and $\tilde{c}_j^{-}$ are respectively the positive and negative half-axis components of $\tilde{c}_j$, $\beta$ is the offset vector formed by the trainable upper and lower limit values on each component, and the addition is performed by a broadcast mechanism. The trainable feature matrix $X^{e}$ and $Y$ form the expanded data set $\mathcal{D}^{e}=\{X^{e},Y\}$.

Further, in the basic feature analysis module, the unlabeled samples are regarded as negative samples and a logistic regression model $M_0$ is constructed on the expanded data set $\mathcal{D}^{e}$. The loss function $L_0$ of $M_0$ is defined over the following quantities (the formula itself is not preserved in this text): $w$ is the trainable feature weight vector and $b$ is the trainable intercept; $\sigma(\cdot)$ is the sigmoid function; $f(x^{e}_i)=w^{T}x^{e}_i+b$ is the decision function, whose value is the decision value; and $p_i=\sigma(f(x^{e}_i))$ is the output probability of the logistic regression model $M_0$ after normalization by the sigmoid function.

Further, the false negative sample identification module performs the following:

taking the feature weight vector $w$ obtained by training in the basic feature analysis module, and setting trainable non-negative matrices $A_{dir}$ and $A_{cmp}$ whose sum is the identity matrix, $A_{dir}+A_{cmp}=I$;

constructing two logistic regression models $M_{dir}$ and $M_{cmp}$, with feature weight coefficients $w_{dir}=A_{dir}w$ and $w_{cmp}=A_{cmp}w$ and trainable intercepts $b_{dir}$ and $b_{cmp}$ respectively, whose output probabilities after normalization by the sigmoid function $\sigma$ are:

$$p^{dir}_i=\sigma\big(w_{dir}^{T}x^{e}_i+b_{dir}\big),\qquad p^{att}_i=\sigma\big(w_{cmp}^{T}x^{e}_i+b_{cmp}\big)$$

where $p^{dir}_i$ is the direct probability and $p^{att}_i$ is the attention probability;

using the expanded data set $\mathcal{D}^{e}$, minimizing the joint loss function $L_{joint}$ to obtain the optimal parameters (the formula of $L_{joint}$ is not preserved in this text); the joint loss function contains a sample class weight and a screening coefficient, and certain quantities in it do not take part in gradient backpropagation during model training;

for a sample $x^{e}_i$ in the unlabeled sample set $U$, obtaining the direct probability $p^{dir}_i$ and the attention probability $p^{att}_i$ through the models $M_{dir}$ and $M_{cmp}$ respectively, and using a false negative index $r_i$ to indicate the probability that sample $x^{e}_i$ is a false negative.

Further, in the false negative sample identification module, for the logistic regression model $M_{cmp}$, a multiplication term screens out the unlabeled samples whose output probability $p^{dir}$ predicted by $M_{dir}$ is close to 1; the screened set of unlabeled samples is denoted $U_{cmp}$. $U_{cmp}$ and the positive sample set $P$ differ in the features of the competitive dimension class and should show no significant difference in the features of the directly related dimension class. By training with $P$ as the positive class and $U_{cmp}$ as the negative class, the model $M_{cmp}$ identifies which feature dimensions belong to the competitive dimension class. Training optimizes the parameters of $M_{cmp}$ so that $P$ and $U_{cmp}$ are optimally distinguished: for samples in $U_{cmp}$ the attention probability $p^{att}$ tends to 0, and for samples in $P$ the attention probability $p^{att}$ tends to 1.

For the logistic regression model $M_{dir}$, a multiplication term screens out the unlabeled samples whose attention probability $p^{att}$ predicted by $M_{cmp}$ is close to 1; the screened set of unlabeled samples is denoted $U_{dir}$. $U_{dir}$ and the positive sample set $P$ differ in the features of the directly related dimension class and should show no significant difference in the features of the competitive dimension class. By training with $P$ as the positive class and $U_{dir}$ as the negative class, the model $M_{dir}$ identifies which feature dimensions belong to the directly related dimension class. Training optimizes the parameters of $M_{dir}$ so that $P$ and $U_{dir}$ are optimally distinguished: for samples in $U_{dir}$ the direct probability $p^{dir}$ tends to 0, and for samples in $P$ the direct probability $p^{dir}$ tends to 1.

Further, the prediction model construction module, based on the standardized data set $\tilde{\mathcal{D}}$ and the false negative index $r_i$ of each sample, builds a multilayer neural network $G$ whose input layer has $M$ nodes and whose output layer has 1 node with a sigmoid activation function, the set of inter-layer transfer matrices being $\Theta$. The output of sample $\tilde{x}_i$ after the operation of $G$ is denoted $\hat{y}_i$. The optimal parameters of $\Theta$ are obtained by minimizing a loss function that incorporates the false negative index (the formula is not preserved in this text). $G$ with the optimal parameters then constitutes the physical examination assistant decision model optimized with the false negative index.

Further, in the assistant decision module, the physical examination indicators of a single examinee, whose items correspond to the feature dimensions, are obtained through physical examination, and a standardized feature vector $\tilde{x}$ is obtained through the data preprocessing module. $\tilde{x}$ is input into the physical examination assistant decision model constructed by the prediction model construction module, which outputs the prediction $\hat{y}$. When $\hat{y}$ tends to 1, the physical examination result tends to be positive; when $\hat{y}$ tends to 0, the physical examination result tends to be negative. The prediction is provided to a clinician as the physical examination assistant decision result.
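The inference path described above (standardize a new examinee's indicators, pass them through the trained network, read a probability between 0 and 1) can be sketched as follows. This is only an illustrative sketch: the function name `predict`, the layer sizes, the ReLU hidden activation, the random weights and the example mean and standard deviation values are all assumptions, not details given in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x_raw, mu, sigma, layers):
    """Standardize one examinee's indicators and run a forward pass.

    layers is a list of (weight, bias) pairs; the last pair is the
    single-node output layer with sigmoid activation (as in the text).
    The ReLU used for hidden layers is an assumption of this sketch.
    """
    h = (x_raw - mu) / sigma                 # same statistics as used in training
    for W, b in layers[:-1]:
        h = np.maximum(h @ W + b, 0.0)       # hidden layer (assumed ReLU)
    W_out, b_out = layers[-1]
    return float(sigmoid(h @ W_out + b_out)[0])

# Hypothetical setup: 5 indicators, one hidden layer of 8 nodes, random weights.
rng = np.random.default_rng(0)
M, H = 5, 8
layers = [(rng.normal(size=(M, H)), np.zeros(H)),
          (rng.normal(size=(H, 1)), np.zeros(1))]
mu = np.array([168.0, 70.0, 24.0, 125.0, 80.0])    # illustrative training means
sigma = np.array([8.0, 12.0, 4.0, 15.0, 10.0])     # illustrative training std devs
x_new = np.array([172.0, 85.0, 28.7, 142.0, 91.0])

y_hat = predict(x_new, mu, sigma, layers)

# With all-zero weights the network is uninformative and outputs exactly 0.5.
zero_layers = [(np.zeros((M, H)), np.zeros(H)), (np.zeros((H, 1)), np.zeros(1))]
y_zero = predict(x_new, mu, sigma, zero_layers)
```

A value of `y_hat` near 1 would be read as a result tending to positive and a value near 0 as tending to negative; any decision threshold is left to the clinician.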

The invention has the beneficial effects that:

1. Existing positive-unlabeled learning techniques treat missing clinical diagnoses as randomly occurring. By simulating a general clinical diagnosis process, the invention analyzes how missing diagnoses arise in the data and models that process; it is therefore more consistent with clinical logic, better discovers false negative samples in real-world medical data, and improves the usability of real-world medical data for building physical examination assistant decision models and for clinical assistant decisions.

2. Existing characterization learning techniques need a large amount of additional data and a certain amount of medical expertise to support the characterization mining process, and therefore generalize poorly. The invention needs no additional data for modeling or for clinical assistant decision-making, and embeds the general clinical decision process into the development logic of the model, so it generalizes well without introducing additional medical knowledge specific to each application case.

Drawings

FIG. 1 is a block diagram of a physical examination assistant decision-making system based on false negative sample identification according to an embodiment of the present invention;

FIG. 2 is a flowchart of false negative sample identification according to an embodiment of the present invention;

FIG. 3 is a flowchart of the construction of the physical examination assistant decision model optimized with the false negative index according to an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.

The embodiment of the invention provides a physical examination assistant decision-making system based on false negative sample identification, which comprises a data acquisition module, a data preprocessing module, a basic feature analysis module, a false negative sample identification module, a prediction model construction module and an assistant decision-making module, and the implementation process of each module is explained in detail below, as shown in fig. 1.

First, the data acquisition module: acquires a real-world physical examination data set and matrixes it into an original data set comprising an input feature matrix and true diagnosis labels, wherein the true diagnosis label is the physical examination result, and a sample with a negative physical examination result is regarded as an unlabeled sample;

Specifically, the data acquisition module acquires a real-world physical examination data set stored in a CSV file, comprising the feature dimensions and the true diagnosis labels. The feature dimensions of the physical examination data comprise basic physiological indicators and routine test indicators. The basic physiological indicators comprise height, weight, BMI, systolic pressure and diastolic pressure. The routine test indicators comprise routine blood tests (total protein, albumin, globulin, albumin/globulin ratio, alanine aminotransferase, aspartate aminotransferase, alkaline phosphatase, cholinesterase, total bile acid, total bilirubin, direct bilirubin, indirect bilirubin, adenosine deaminase, glutamyl transpeptidase, glomerular filtration rate, creatinine, urea, uric acid, cystatin C, triglycerides, total cholesterol, high density lipoprotein-C, low density lipoprotein-C, very low density lipoprotein-C, fasting plasma glucose, potassium, sodium, chloride, total calcium, inorganic phosphorus, glycylproline dipeptidyl aminopeptidase, alpha-fucosidase) and routine urine tests (urine protein, urine ketones, urine glucose, urine bilirubin, urine sediment white blood cells, urine sediment red blood cells, urobilinogen, urine pH). The true diagnosis label is the physical examination result, for example a diabetes diagnosis result.

The physical examination data set is matrixed into a raw data set $\mathcal{D}=\{X,Y\}$, where $X=[x_1,\dots,x_N]^{T}=[c_1,\dots,c_M]$ is the input feature matrix, $N$ is the sample size and $M$ is the total number of physical examination indicators (the values of $N$ and $M$ used in the embodiment are not preserved in this text). $x_1$ to $x_N$ represent the individual samples, each in the form of a feature vector; $c_1$ to $c_M$ are the feature components of the raw data set in each feature dimension; $T$ denotes transposition. $Y=[y_1,\dots,y_N]^{T}$, where $y_i$ is the true diagnosis label, i.e. the target label, of the $i$-th sample: $y_i=1$ means that the physical examination result of the $i$-th sample is positive, i.e. the sample is a positive sample; $y_i=0$ means that the physical examination result of the $i$-th sample is negative, in which case the sample may be a true negative sample or a false negative sample and is regarded as an unlabeled sample. The positive sample set is denoted $P$ and contains all samples with $y_i=1$; the unlabeled sample set is denoted $U$ and contains all samples with $y_i=0$; the true negative sample set is denoted $TN$ and the false negative sample set $FN$, with $U=TN\cup FN$ and $TN\cap FN=\varnothing$. The specific sample composition of $P$ and $U$ is known; that of $TN$ and $FN$ is unknown.
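As an illustration of the matrixing step, the following sketch reads a small CSV table into a feature matrix and a label vector and splits the sample indices into the positive set and the unlabeled set. The column names, the `result` label column, the 0/1 label encoding and the function name `load_dataset` are hypothetical choices made for this example; the actual file layout is not specified in the text.

```python
import csv
import io
import numpy as np

# Hypothetical CSV layout: one row per examinee, indicator columns first,
# then a 0/1 column holding the physical examination result.
SAMPLE_CSV = """height,weight,bmi,systolic,diastolic,result
170,65,22.5,120,80,0
160,80,31.2,145,95,1
175,70,22.9,118,76,0
165,90,33.1,150,98,1
"""

def load_dataset(text, label_column="result"):
    """Return (X, y, feature_names) from CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    feature_names = [k for k in rows[0] if k != label_column]
    X = np.array([[float(r[k]) for k in feature_names] for r in rows])
    y = np.array([int(r[label_column]) for r in rows])
    return X, y, feature_names

X, y, feature_names = load_dataset(SAMPLE_CSV)

# Positive set: recorded positive result. Unlabeled set: negative result,
# which may hide false negatives, so it is not treated as confirmed negative.
positive_idx = np.flatnonzero(y == 1)
unlabeled_idx = np.flatnonzero(y == 0)
```

Only the membership of the positive and unlabeled sets can be read from the data; which unlabeled samples are true negatives and which are false negatives remains unknown at this stage.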

Second, the data preprocessing module: forms a standardized data set by unifying the standard deviation and the mean of each feature component in the original data set; separates the positive and negative half-axis components of each feature component in the standardized data set, and adds corresponding trainable upper and lower limit values to each positive and negative half-axis component to form an expanded data set, specifically as follows:

Each feature component $c_j$ of $X$ is standardized on the basis of that component, so that the standard deviation of all physical examination data on the component is 1 and the mean is 0. The standardized feature matrix is denoted $\tilde{X}=[\tilde{x}_1,\dots,\tilde{x}_N]^{T}=[\tilde{c}_1,\dots,\tilde{c}_M]$, where $\tilde{x}_i$ is the $i$-th standardized sample in the form of a feature vector; $\tilde{X}$ and $Y$ form the standardized data set $\tilde{\mathcal{D}}=\{\tilde{X},Y\}$:

$$\tilde{c}_j=\frac{c_j-\mu_j}{s_j}$$

where $\tilde{c}_j$ is the standardized $j$-th feature component, $\mu_j$ is the mean of the samples of $X$ on component $c_j$, and $s_j$ is the standard deviation of the samples of $X$ on component $c_j$.
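The standardization step can be sketched with NumPy as below. The function name `standardize`, the use of the population standard deviation and the guard for constant columns are assumptions of this sketch; the text only requires that each component end up with mean 0 and standard deviation 1.

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: returns (X_std, mu, s).

    mu and s are returned so that the same statistics can later be
    applied to a new examinee's indicators.
    """
    mu = X.mean(axis=0)
    s = X.std(axis=0)                  # population std (assumption)
    s = np.where(s == 0, 1.0, s)       # guard: leave constant columns unscaled
    return (X - mu) / s, mu, s

# Illustrative values only: four examinees, two indicators.
X_demo = np.array([[170.0, 65.0],
                   [160.0, 80.0],
                   [175.0, 70.0],
                   [165.0, 90.0]])
X_std, mu, s = standardize(X_demo)
```

Keeping `mu` and `s` matters for the assistant decision stage, where a single new feature vector has to be placed on the same scale as the training data.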

Because physical examination indicators in practice usually provide assistant decision information in the form of "above the upper limit of normal" or "below the lower limit of normal", and the physical examination results indicated by these two kinds of information are not simply opposites, the data preprocessing of the invention treats positive and negative data separately and adds a trainable offset vector, so that the feature matrix of the expanded data set is closer to the clinical usage scenario. Specifically, the positive and negative half-axis components of each feature component $\tilde{c}_j$ of the standardized feature matrix $\tilde{X}$ are separated to model the difference between the two kinds of assistant decision information, and an offset vector $\beta$ is added to model the upper and lower limits of normal of the physical examination indicators.

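On this basis, the standardized feature matrix $\tilde{X}$ is expanded to form the trainable feature matrix $X^{e}$:

$$X^{e}=[x^{e}_1,\dots,x^{e}_N]^{T}=[\tilde{c}_1^{+},\tilde{c}_1^{-},\dots,\tilde{c}_M^{+},\tilde{c}_M^{-}]+\beta$$

where $\tilde{c}_j^{+}$ and $\tilde{c}_j^{-}$ are respectively the positive and negative half-axis components of $\tilde{c}_j$; $\beta\in\mathbb{R}^{2M}$ is the offset vector formed by the trainable upper and lower limit values on each component; the addition is performed by a broadcast mechanism (broadcasting).

The trainable feature matrix $X^{e}$ and $Y$ form the expanded data set $\mathcal{D}^{e}=\{X^{e},Y\}$.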
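A minimal sketch of the expansion step follows, assuming that the positive half-axis component is `max(x, 0)`, the negative half-axis component is `min(x, 0)`, and that the expanded columns are laid out as all positive parts followed by all negative parts. The column order, the sign convention and the function name `expand` are assumptions of this sketch. The offset vector is trainable in the described system; here it is just a fixed array to show the broadcast addition.

```python
import numpy as np

def expand(X_std, beta):
    """Split each standardized column into positive/negative parts and add beta.

    X_std: (N, M) standardized features; beta: (2M,) offset vector.
    The (N, 2M) + (2M,) addition relies on NumPy broadcasting.
    """
    pos = np.maximum(X_std, 0.0)   # "above the mean" part, zero elsewhere
    neg = np.minimum(X_std, 0.0)   # "below the mean" part, zero elsewhere
    return np.concatenate([pos, neg], axis=1) + beta

X_std = np.array([[1.0, -2.0],
                  [0.0,  3.0]])

# With a zero offset the expansion just separates the two half-axes.
X_exp_zero = expand(X_std, np.zeros(4))

# A non-zero offset shifts every row by the same per-column amount.
beta = np.array([0.5, 0.5, -0.5, -0.5])
X_exp = expand(X_std, beta)
```

Splitting the half-axes lets a downstream linear model assign different weights to "too high" and "too low" values of the same indicator, which a single standardized column cannot express.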

Of the preprocessed data sets, the expanded data set is used by the basic feature analysis module and the false negative sample identification module, and the standardized data set is used by the prediction model construction module and the assistant decision module.

Third, the basic feature analysis module: using a logistic regression model and regarding the unlabeled samples as negative samples, trains to obtain the weight of each feature dimension on the generation of the true diagnosis label, without considering false negative samples, specifically as follows:

All unlabeled samples are regarded as negative samples, and a logistic regression model $M_0$ is constructed on the preprocessed expanded data set $\mathcal{D}^{e}$. The loss function $L_0$ of $M_0$ is defined over the following quantities (the formula itself is not preserved in this text): $w$ is the trainable feature weight vector and $b$ is the trainable intercept; $x^{e}_i$ is the $i$-th sample after data expansion, in the form of a feature vector; $y_i$ is the true diagnosis label of the $i$-th sample; $\sigma(\cdot)$ is the sigmoid function; $f(x^{e}_i)=w^{T}x^{e}_i+b$ is the decision function, whose value is the decision value; and $p_i=\sigma(f(x^{e}_i))$ is the output probability of the logistic regression model $M_0$ after normalization by the sigmoid function, i.e. the probability predicted by $M_0$ that sample $x^{e}_i$ is positive. In the embodiment, the model is trained by mini-batch gradient descent with a batch size of 500.
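The training procedure can be illustrated with a plain NumPy logistic regression trained by mini-batch gradient descent with a batch size of 500. This sketch assumes the standard binary cross-entropy loss, trains only the weight vector and the intercept (the trainable offset vector of the expanded data set is left out), and uses synthetic data; the learning rate, the number of epochs and the function name `train_logreg` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, batch_size=500, lr=0.1, epochs=200, seed=0):
    """Mini-batch gradient descent on binary cross-entropy (assumed loss)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)   # feature weight vector
    b = 0.0           # intercept
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            p = sigmoid(X[idx] @ w + b)          # output probability
            err = p - y[idx]                     # gradient of the loss w.r.t. the decision value
            w -= lr * (X[idx].T @ err) / len(idx)
            b -= lr * err.mean()
    return w, b

# Synthetic data: only the first feature drives the label; unlabeled = 0.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(2000, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=2000) > 1.0).astype(float)

w, b = train_logreg(X_demo, y_demo)
p_demo = sigmoid(X_demo @ w + b)
accuracy = float(np.mean((p_demo > 0.5) == y_demo))
```

In this toy setting the learned weight on the first feature dominates the others, which mirrors how the trained weight vector is later read as a per-dimension measure of influence on the recorded label.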

Fourth, the false negative sample identification module: divides the feature dimensions into directly related dimensions and competitive dimensions, wherein a directly related dimension directly influences the judgment of the target physical examination result from a medical perspective, and a competitive dimension does not directly influence that judgment but competes with the target physical examination result for attention, so that the target result may be missing and a false negative sample generated; constructs two logistic regression models and a joint loss function and trains them jointly, using the joint loss function to screen true negative and false negative samples, so that the directly related dimensions maximally distinguish positive samples from the screened suspected true negative samples and the competitive dimensions maximally distinguish positive samples from the screened suspected false negative samples; and indicates, by a false negative index, the likelihood that a sample is a false negative sample; specifically as follows:

Based on the logic by which physical examination results are generated in clinical practice, the feature dimensions are divided into two classes: the directly related dimension class $F_{dir}$ and the competitive dimension class $F_{cmp}$. They are defined as follows: features in $F_{dir}$ directly influence the judgment of the target physical examination result from a medical perspective; features in $F_{cmp}$ do not directly influence that judgment, but compete with the target physical examination result for attention and may therefore cause the target result to be missing, generating a false negative sample. In terms of generation logic, the feature weight vector $w$ obtained by the basic feature analysis module arises from the combined action of these two classes of features. The core idea of the false negative sample identification module is to identify $F_{dir}$ and $F_{cmp}$ by data induction, which makes it possible to assess the likelihood that an unlabeled sample is a false negative.

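Trainable non-negative matrices $A_{dir}$ and $A_{cmp}$ are set whose sum is the identity matrix, $A_{dir}+A_{cmp}=I$. Then, for the feature weight vector $w$ and an expanded sample $x^{e}_i$:

$$w^{T}x^{e}_i=(A_{dir}w)^{T}x^{e}_i+(A_{cmp}w)^{T}x^{e}_i$$

where $(A_{dir}w)^{T}x^{e}_i$ is the decision value contributed by the directly related dimension class features, which maximally distinguishes the positive sample set $P$ from the true negative sample set $TN$, and $(A_{cmp}w)^{T}x^{e}_i$ is the decision value contributed by the competitive dimension class features, which maximally distinguishes the positive sample set $P$ from the false negative sample set $FN$.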

Based on this understanding, the false negative sample identification module performs the following steps:

Two logistic regression models are constructed, each with its own feature weight coefficients and its own trainable intercept value. The output probabilities of the two logistic regression models after normalization by the sigmoid function are respectively expressed as:

These outputs are called the direct probability and the attention probability, respectively.
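As an illustration only, the two output probabilities can be sketched in Python as follows. The formulas are not reproduced in this text, so the sketch assumes that each model applies the sigmoid function to a linear decision value, that the two non-negative matrices are diagonal with entries m and 1 - m, and that they act on a shared feature weight vector w. The names split_probabilities, m, b_direct and b_compete are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def split_probabilities(x, w, m, b_direct, b_compete):
    """Return (direct probability, attention probability) for one sample.

    x: feature vector of the sample.
    w: shared feature weight vector (assumed fixed).
    m: per-feature share in [0, 1] given to the direct correlation class;
       the competition class receives 1 - m, so the two assumed diagonal
       matrices are non-negative and sum to the identity matrix.
    """
    if not all(0.0 <= mi <= 1.0 for mi in m):
        raise ValueError("each entry of m must lie in [0, 1]")
    z_direct = sum(mi * wi * xi for mi, wi, xi in zip(m, w, x)) + b_direct
    z_compete = sum((1.0 - mi) * wi * xi for mi, wi, xi in zip(m, w, x)) + b_compete
    return sigmoid(z_direct), sigmoid(z_compete)
```

With m = [1, 0], for example, the first feature contributes only to the direct probability and the second only to the attention probability.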

Under the optimal feature classification, the direct probability distinguishes the positive sample set from the true negative sample set to the greatest extent, and the attention probability distinguishes the positive sample set from the false negative sample set to the greatest extent. The extended data set is therefore used to minimize the joint loss function and obtain the optimal values of the trainable parameters. The offset vector of the extended data set uses the optimized result obtained from training in the basic feature analysis module and is not trained further.

Wherein the sample class weight adjusts the proportion of the different classes of samples in training, and a fixed value is used in the example. The larger the screening coefficient, the more strongly the joint loss function screens the unlabeled samples into false negative and true negative samples, but the less diverse the screened samples become; a fixed value is likewise used in the example. Some of the quantities in the joint loss function do not participate in gradient back-propagation during model training. In the example, modeling uses the mini-batch gradient descent method, and the joint training of the two logistic regression models uses 500 samples per batch.
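A mini-batch schedule of this kind can be sketched as follows. Only the batch size of 500 comes from the example; the shuffling, the seed and the function name minibatches are assumptions made for illustration.

```python
import random


def minibatches(n_samples, batch_size=500, seed=0):
    """Yield lists of sample indices that together cover one shuffled epoch."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: reshuffle once per epoch
    for start in range(0, n_samples, batch_size):
        # The last batch may be smaller than batch_size.
        yield indices[start:start + batch_size]
```

Each yielded index list would select the rows of the extended data set used for one gradient step of the joint training.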

The joint loss function is constructed according to the following logic:

(1) For the model that outputs the attention probability, a multiplication term screens the unlabeled samples with a higher predicted output probability, and the screened samples are recorded as a set. Relative to the whole unlabeled sample set, the screened set contains a larger proportion of false negative samples. Compared with the positive sample set, the screened set differs in the features of the competition class, but should show no significant difference in the features of the direct correlation class. A model can therefore be trained with the positive sample set as the positive class and the screened set as the negative class, so that the model identifies which feature dimensions belong to the competition class. The training process simultaneously optimizes the parameters so that the two sets are distinguished optimally: for samples in the screened set the attention probability tends toward 0, and for samples in the positive sample set the attention probability tends toward 1.

(2) For the model that outputs the direct probability, a multiplication term screens the unlabeled samples according to the predicted attention probability. Relative to the whole unlabeled sample set, the screened set contains a larger proportion of true negative samples. Compared with the positive sample set, the screened set differs in the features of the direct correlation class, but should show no significant difference in the features of the competition class. A model can therefore be trained with the positive sample set as the positive class and the screened set as the negative class, so that the model identifies which feature dimensions belong to the direct correlation class. The training process simultaneously optimizes the parameters so that the two sets are distinguished optimally: for samples in the screened set the direct probability tends toward 0, and for samples in the positive sample set the direct probability tends toward 1.

(3) Because a constraint applies during model training, the parameters of the two logistic regression models must be optimized together, by joint training with the joint loss function.

After the optimal parameters are obtained, the direct probability and the attention probability of each sample are obtained through the two models respectively. If a sample is a false negative, its direct probability should tend toward 1 and its attention probability should tend toward 0. A false negative index is used to indicate the probability that each sample is a false negative.

The flow of false negative sample identification is shown in FIG. 2.
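The formula of the false negative index is not reproduced in this text. Purely as a hypothetical illustration of the stated trend (direct probability toward 1, attention probability toward 0), the sketch below scores each sample with the product of the direct probability and one minus the attention probability. This product, and the names false_negative_index and rank_suspected_false_negatives, are assumptions rather than the formula of the invention.

```python
def false_negative_index(p_direct: float, p_attention: float) -> float:
    """Hypothetical index: high when p_direct is near 1 and p_attention near 0."""
    if not (0.0 <= p_direct <= 1.0 and 0.0 <= p_attention <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    return p_direct * (1.0 - p_attention)


def rank_suspected_false_negatives(probabilities):
    """Return sample positions ordered from most to least suspicious.

    probabilities: list of (direct probability, attention probability) pairs.
    """
    scores = [false_negative_index(pd, pa) for pd, pa in probabilities]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Under this assumed index, a sample with direct probability 0.9 and attention probability 0.1 ranks ahead of one with 0.6 and 0.5.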

Fifthly, the prediction model construction module: a multilayer neural network and a loss function that introduces the false negative index are constructed, and the physical examination assistant decision model is trained on the standardized data set and the false negative index. The training process is as follows:

Based on the standardized data set

And false negative index of each sample

Building the number of nodes of the input layer as

Multi-layer neural network of

To sample

Neural network

The output after the operation is defined as

The vector formed by all the outputs is recorded as

The optimal parameters of the neural network are then obtained by minimizing a loss function that introduces the false negative index.

The neural network with these optimal parameters is the constructed physical examination assistant decision model, optimized by introducing the false negative index. The construction process of the physical examination assistant decision model is shown in FIG. 3.
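The loss formula itself is not reproduced in this text. As a purely hypothetical illustration of how a false negative index could enter a loss function, the sketch below uses the index as a soft target for unlabeled samples in a binary cross-entropy. The soft-target scheme and the names soft_label_cross_entropy and fn_index are assumptions and should not be read as the loss function of the invention.

```python
import math


def soft_label_cross_entropy(outputs, labels, fn_index, eps=1e-12):
    """Mean binary cross-entropy with soft targets for unlabeled samples.

    outputs:  network outputs in (0, 1), one per sample.
    labels:   1 for a positive sample, 0 for an unlabeled sample.
    fn_index: false negative index in [0, 1], one per sample.
    """
    total = 0.0
    for y_hat, y, r in zip(outputs, labels, fn_index):
        # Assumed scheme: positives keep target 1; an unlabeled sample
        # uses its false negative index as the target.
        target = 1.0 if y == 1 else r
        y_hat = min(max(y_hat, eps), 1.0 - eps)  # guard against log(0)
        total -= target * math.log(y_hat) + (1.0 - target) * math.log(1.0 - y_hat)
    return total / len(outputs)
```

Under this assumption, an unlabeled sample with an index of 1 is penalized exactly like a positive sample, and one with an index of 0 exactly like an ordinary negative sample.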

In the example, a three-layer neural network is constructed

，

The number of nodes of the input layer is

The output layer has 1 node and the intermediate layer has 20 nodes. The set of transfer matrices between layers comprises the transfer matrix from the input layer to the intermediate layer and the transfer matrix from the intermediate layer to the output layer, and the activation functions between layers are {ReLU, sigmoid}. Model training uses the mini-batch gradient descent method with 500 samples per batch.
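The forward pass of such a three-layer network can be sketched as follows. The layer sizes (20 intermediate nodes, 1 output node) and the activation functions {ReLU, sigmoid} follow the example; the random initialization, the absence of bias terms, and the names init_network and forward are assumptions made for illustration.

```python
import math
import random


def init_network(n_inputs, n_hidden=20, seed=0):
    """Create the two transfer matrices with small random entries (assumed)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    return w1, w2


def forward(x, w1, w2):
    """Return the network output in (0, 1) for one standardized feature vector."""
    # Input layer -> intermediate layer, followed by ReLU.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    # Intermediate layer -> single output node, followed by sigmoid.
    z = sum(w * h for w, h in zip(w2, hidden))
    return 1.0 / (1.0 + math.exp(-z))
```

Because no bias terms are assumed, an all-zero input produces an output of exactly 0.5 before any training.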

Sixthly, the assistant decision-making module: based on the physical examination data of a physical examinee, a standardized feature vector is obtained through the data preprocessing module, a prediction result is obtained through the physical examination assistant decision model, and the prediction result is output to the clinician as the physical examination assistant decision result. The process is as follows:

For a single physical examinee, the physical examination indexes corresponding to the feature dimensions are obtained through physical examination, and the data preprocessing module converts them into the standardized feature vector

. Then, will

When is coming into contact with

The foregoing is only a preferred embodiment of the present invention. Although the present invention has been disclosed by way of preferred embodiments, these are not intended to limit it. Those skilled in the art can, using the methods and technical content disclosed above, make many possible variations and modifications to the technical solution of the present invention, or modify it into equivalent embodiments, without departing from the scope of that technical solution. Therefore, any simple modification, equivalent change or refinement made to the above embodiments in accordance with the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the present invention.

Claims

1. A physical examination assistant decision system based on false negative sample identification, comprising:

2. A physical examination assistant decision system based on false negative sample identification as claimed in claim 1, characterized in that in the data acquisition module, the characteristic dimension of the physical examination data comprises basic physiological indexes and routine test indexes, the basic physiological indexes comprise height, weight, BMI, systolic pressure and diastolic pressure, and the routine test indexes comprise blood routine and urine routine; the real diagnosis label is a physical examination result.

3. The system of claim 1, wherein the data acquisition module is configured to matrix the physical examination dataset into a raw dataset

，

In order to input the feature matrix, the feature matrix is input,

in order to be the amount of the sample,

the total number of physical examination indexes is shown,

to

A representation of each of the samples is shown,

to

For the feature components of the original data set in each feature dimension,

representing a transpose;

is composed of

The true diagnostic label of an individual sample,

represents the first

One of the samples was a positive sample,

represents the first

Set of unlabeled exemplars as

The set of true negative samples was scored as

The false negative sample set was scored as

Is provided with

And is and

，

is known for the particular sample composition of (a),

，

is unknown.

4. The system of claim 3, wherein the data preprocessing module is configured to perform on-line analysis of the sample data for the determination of the false negative sample

，

Is shown as

The number of samples after the normalization is determined,

is the normalized first

The dimensional feature component is a component of the feature,

and

forming a standardized data set

；

Will be provided with

Expansion to form trainable feature matrices

：

Wherein

Is shown as

The number of samples after the data expansion is one,

are respectively as

Positive and negative half-axis components of (a);

And

forming extended data sets

。

5. The system of claim 4, wherein the basic feature analysis module considers unlabeled samples as negative samples and is based on an expanded data set

Constructing a logistic regression model

，

Loss function of

Comprises the following steps:

wherein

For a vector of weights of features that is trainable,

is a trainable intercept value;

in order to be a sigmoid function,

is a decision function, the value of which is a decision value,

is the output probability of the logistic regression model after normalization by the sigmoid function.

6. The physical examination assistant decision system based on false negative sample identification as claimed in claim 5, wherein the false negative sample identification module comprises:

Setting a trainable non-negative matrix

Satisfy the following requirements

、

The sum matrix of (A) is an identity matrix

；

Construction of two logistic regression models

And

respectively having a characteristic weight coefficient

Respectively having trainable intercept values

The output probabilities of the two logistic regression models after normalization by the sigmoid function are respectively expressed as:

wherein

In order to be a direct probability,

is the attention probability;

utilizing extended data sets

Minimizing joint loss function

Obtaining an optimal parameter;

wherein

is a sample class weight;

is a screening coefficient;

，

and

gradient back propagation in the model training process is not involved;

for a sample in the unlabeled sample set

Respectively through the model

And

obtaining direct probabilities

And attention probability

Using false negative indicators

Indicating sample

The probability of false negatives.

7. The system of claim 6, wherein the false negative sample identification module is configured to identify the logistic regression model

By multiplication terms

Screening channel

Predicted output probability

Unlabeled samples approaching 1 are screened, and the screened set of unlabeled samples is recorded as

，

And positive sample set

In the dimension of competition

Is of positive type, with

Models being negative classes

Identification of the dimensions of the features belonging to the competition

Class characteristics, training process optimization

To obtain

And

to make an optimal distinction between the samples

Attention probability

Trend towards 0, for samples

Attention probability

Tending towards 1.

8. The system of claim 6, wherein the false negative sample identification module is configured to identify the logistic regression model

By multiplication terms

Screening channel

Predicted attention probability

，

And positive sample set

In the direct correlation dimension

Is of positive type, with

Models being negative classes

Identifying ones of the feature dimensions that belong to a direct correlation

Class characteristics, training process optimization

To obtain

And

to make an optimal distinction between the samples

Direct probability

Trend towards 0, for samples

Direct probability

Tending towards 1.

9. The system of claim 6, wherein the predictive model building module is based on a standardized data set

And false negative index of each sample

Building the number of nodes of the input layer as

Multi-layer neural network of

To sample

Through

The output after the operation is defined as

By minimizing the loss function introducing false negative indicators

To obtain

The optimum parameter of (2);

then

10. The physical examination assistant decision system based on false negative sample identification as claimed in claim 9, wherein the assistant decision module is used for obtaining the physical examination of a single physical examiner

Will be

When is coming into contact with