Embodiment one
Referring to Figs. 1-4, the present invention proposes a multi-model-based intelligent diagnosis method for nosocomial infection, comprising the following steps:
Step S1: obtain a plurality of medical record data related to nosocomial infection;
The medical record data includes course-of-disease information and examination and test information. The course-of-disease information includes the text data describing the case in the admission record, the first progress note, the attending physician's ward-round records, the discharge record, and so on. The examination and test information includes imaging information, sign information, test result data, physical examination data, body temperature information, and so on. The infections are screened by the number of patients corresponding to each infection: infections with fewer than 500 corresponding patients are rejected, and the data are divided into a training set and a test set.
Step S2: preprocess the medical record data to obtain a plurality of discrete word lists, one corresponding to each medical record;
Negative-phrase filtering is applied to the text data to delete negative findings such as "no rales heard" and "lymph nodes not enlarged". Different segmentation methods are then designed according to the syntactic structure of the different parts of the electronic medical record, namely plain text, noun + numeral-quantifier phrases, and noun + adjective phrases. Plain text is segmented directly with an open-source word segmentation tool (such as Ansj); because medical records contain a large number of specialized medical terms, a professional domain dictionary must be constructed so that the open-source tool can segment these terms effectively. Noun + numeral-quantifier phrases in the electronic medical record usually record the patient's sign information and test results; the numerical value is first extracted and compared with a preset threshold, and the phrase is converted into a word feature according to the comparison result. Noun + adjective phrases usually appear in the physical examination part of the record, for example "clear mind" and "cooperative on examination"; such phrases are handled by key-value conversion, in which the noun describing the patient attribute is converted into a key and the adjective describing the state of that attribute is converted into a value, so that "clear mind", for example, becomes "mind"-"clear". After preprocessing, the original medical record data has been converted from continuous text into discrete word lists.
Step S3: divide all word lists into a training set and a test set in proportion; obtain an optimal feature set for each infection type from the training set;
From the word lists obtained in step S2, the candidate feature set of each infection is collected. The chi-square test and the feature selection method based on class discrimination are used to select the N most representative features of each infection as that infection's feature set. The value of N is determined by experiment.
Step S4: tune the parameters of two or more base models separately, select the optimal parameters to obtain two or more optimal base models, and fuse all optimal base models to obtain the diagnostic model;
Step S5: test the diagnostic model with the test set and analyze the performance of the diagnostic model.
Base models are trained using RandomForest, XGBoost, GradientBoosting, ExtraTrees and SVC, and these five base models are fused with the stacking model fusion method to obtain the final classification model. When the base models are trained, grid search (GridSearch) is used to find the optimal parameters of each base model.
In the multi-model-based intelligent diagnosis method and system for nosocomial infection provided by the invention, the medical record data of multiple patients are preprocessed to obtain a plurality of discrete word lists, and most of the word lists are used as the training set. For each infection type, the most strongly associated elements are selected from the word lists by calculation to form the optimal feature set. Base models are trained with two or more algorithms using the selected parameters, the trained base models are fused into a diagnostic model, and the diagnostic model is finally tested on the word lists of the test set so that its performance can be analyzed. A diagnostic model with good performance is then used for intelligent diagnosis of nosocomial infection: it can give early warning of nosocomial infection, diagnose a patient's infection at an early stage, and help medical staff analyze the patient's infection more comprehensively, accurately and efficiently. It can give a comprehensive analysis of complicated infections and, because machine learning can find information in data and mine features, it can more effectively distinguish infections with very similar clinical manifestations and reach a more accurate diagnosis. In addition, the multi-model fusion method overcomes problems of traditional expert systems: the machine learning classification model is trained on historical patient data, so a new round of training can be run whenever new patient data are obtained, and the model can be updated continuously. Once training is complete, only the model parameters need to be saved, and an unknown sample is tested simply by calling the model with those parameters to obtain the infection type of the sample. Furthermore, compared with a single diagnostic model, this scheme has higher accuracy and effectively reduces the miss rate.
Preferably, as one embodiment of the data preprocessing, the step of preprocessing the medical record data to obtain a plurality of discrete word lists corresponding to each medical record includes:
Step S21a: split the text data describing the case in the medical record data, and filter the phrases formed by the splitting, removing every phrase that contains a negation word;
Step S22a: connect the retained phrases with a preset connector to form a case description segment;
Step S23a: segment the medical terms contained in the case description segment, using a professional domain dictionary built from known drug name lists and disease name lists.
Chinese medical records contain a large number of negative phrases, such as "no rales heard", "lymph nodes not enlarged" and "denies history of hepatitis". Such phrases usually contribute little to diagnosing the disease and can be removed directly as noise. In a Chinese electronic medical record, a case description usually consists of several sentences, which are separated by the sentence-ending symbols ".", "!", ";" and "?". A sentence generally contains several phrases, usually separated by "," (words joined by symbols such as the enumeration comma "、" can generally be regarded as belonging to the same phrase). Following this separation scheme, negative-phrase filtering first obtains the patient's whole case description, then uses ".", "!", ";" and "?" to divide the paragraph into sentences, then uses "," to cut each sentence into phrases, and finally traverses the phrases and checks whether each contains a negation word from the negation word list: if it does, the phrase is deleted; otherwise it is retained. All retained phrases are then joined with the preset connector to obtain the filtered case description segment. Because electronic medical records contain a large number of specialized medical terms, and so that the open-source segmentation tool can segment them effectively, the Chinese names of common drugs were crawled from the official website of the China Food and Drug Administration, the Chinese names of common diseases were obtained from the ICD-10 (International Classification of Diseases) system, and the professional domain dictionary was built from these drug and disease names.
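The negative-phrase filtering described above can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: the function name `filter_negative_phrases`, the contents of `NEGATION_WORDS` and the English sample text are assumptions made for readability, and Chinese text would need substring matching against a Chinese negation word list instead of whitespace tokenization.

```python
import re

# Hypothetical negation cues; the text refers to a negation word list
# without enumerating it, so these entries are illustrative only.
NEGATION_WORDS = {"no", "not", "denies", "without"}


def filter_negative_phrases(text, connector=","):
    """Delete phrases that contain a negation word and rejoin the rest."""
    kept = []
    # Split the description into sentences on sentence-ending symbols
    # (ASCII and full-width forms). Note that "." would also split
    # decimals, so numeric phrases are assumed to be handled separately.
    for sentence in re.split(r"[.!;?。！；？]", text):
        # Split each sentence into phrases on commas.
        for phrase in re.split(r"[,，]", sentence):
            phrase = phrase.strip()
            if not phrase:
                continue
            words = set(phrase.lower().split())
            # Retain the phrase only if it contains no negation word.
            if words.isdisjoint(NEGATION_WORDS):
                kept.append(phrase)
    # Join the retained phrases with the preset connector.
    return connector.join(kept)


description = ("Fever for three days, no rales heard; "
               "lymph nodes not enlarged, cough with sputum.")
filtered = filter_negative_phrases(description)
```

With the sample description, only the two phrases without a negation word remain in `filtered`.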
Preferably, as another embodiment of the data preprocessing, the step of preprocessing the medical record data to obtain a plurality of discrete word lists corresponding to each medical record includes:
Step S21b: for the sign information and test result data recorded in the medical record data, extract the noun part and the numeral-quantifier part of each item, and connect them with a preset connector to form a phrase;
Step S22b: define threshold ranges for the different signs or tests according to medical standards, and compare the numerical value of the numeral-quantifier part with the threshold range that applies to the noun connected to it;
Step S23b: convert the phrase into a word feature according to the comparison result.
The word segmentation tool used here is the AnsjSeg word segmentation system, a Java Chinese word segmenter based on n-gram + CRF + HMM that was developed from ICTCLAS, the word segmentation system of the Institute of Computing Technology, Chinese Academy of Sciences. Noun + numeral-quantifier phrases in the electronic medical record usually record the patient's sign information and test results, for example the sign data recording the patient's body temperature (body temperature 38.2 °C) and the test result recording the patient's hemoglobin level (HGB: 118 g/L). For such phrases, the numerical value is first extracted and compared with a preset threshold, and the phrase is converted into a word feature according to the comparison result. The conversion process for a patient's temperature data (for example, body temperature 38.2 °C) is as follows:
Separate the noun part from the numeral-quantifier part. For "body temperature 38.2 °C" the result is: body temperature (38.2 °C);
Define threshold ranges for the different signs or tests according to the relevant standards. For example, the normal range of oral temperature is 36.3 °C to 37.2 °C and the normal range of axillary temperature is 36.1 °C to 37 °C. The patient's temperature information is converted according to the threshold range; 38.2 °C exceeds the normal range, so it can be converted into "body temperature (high)" or "body temperature (elevated)";
Convert the result into key-value form; for body temperature the key-value form is: body temperature (decreased, normal, elevated).
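The three conversion steps above can be illustrated with a short sketch. The normal ranges are the oral and axillary values given in the text; the function name `numeric_phrase_to_feature`, the regular expression and the state labels "decreased", "normal" and "elevated" are assumptions chosen for this example only.

```python
import re

# Normal ranges (in °C) as stated in the text; other signs or tests
# would need their own entries taken from the relevant standards.
NORMAL_RANGES = {
    "oral temperature": (36.3, 37.2),
    "axillary temperature": (36.1, 37.0),
}

# Noun part, optional colon, then the first number in the phrase.
PHRASE_PATTERN = re.compile(r"\s*(.+?)\s*:?\s*(-?\d+(?:\.\d+)?)")


def numeric_phrase_to_feature(phrase):
    """Convert a noun + number phrase into a 'noun(state)' word feature."""
    match = PHRASE_PATTERN.match(phrase)
    if match is None:
        raise ValueError(f"no numeric value found in {phrase!r}")
    # Step 1: separate the noun part from the numeric part.
    noun, value = match.group(1), float(match.group(2))
    # Step 2: compare the value with the threshold range for that noun.
    low, high = NORMAL_RANGES[noun]
    if value < low:
        state = "decreased"
    elif value > high:
        state = "elevated"
    else:
        state = "normal"
    # Step 3: emit the key-value style feature.
    return f"{noun}({state})"


feature = numeric_phrase_to_feature("oral temperature 38.2 °C")
```

Keeping the ranges in a dictionary keyed by the noun means that a new sign or test is supported by adding one entry rather than changing the conversion logic.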
Preferably, as yet another embodiment of the data preprocessing, the step of preprocessing the medical record data to obtain a plurality of discrete word lists corresponding to each medical record includes:
Step S21c: for the physical examination data recorded in the medical record data, extract the noun part and the adjective part of each item, and connect them with a preset connector to form a phrase;
Step S22c: using key-value conversion, convert the noun describing the patient attribute in the phrase into a key and the adjective describing the state of that noun into a value, and connect the key and the value with a preset connector to form a key-value feature.
Step S23c: noun + adjective phrases in the electronic medical record usually appear in the patient's physical examination part, for example "clear mind" and "cooperative on examination". Key-value conversion handles such phrases by converting the noun describing the patient attribute into a key and the adjective describing the state of that attribute into a value; "clear mind", for example, is converted into "mind"-"clear". The same key may correspond to more than one value: "mind" can be described not only as "clear" but also by words such as "blurred", "unclear" and "in a trance". During conversion, the different values that correspond to the same key therefore need to be collected, for example "mind": "clear", "in a trance".
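A minimal sketch of the key-value conversion follows. It assumes that the noun and adjective of each phrase have already been separated (for example by part-of-speech tagging, which is not shown), and the names `to_feature` and `collect_key_values` are hypothetical.

```python
from collections import defaultdict


def to_feature(noun, adjective, connector="-"):
    """Join the key (noun) and value (adjective) with a preset connector."""
    return f"{noun}{connector}{adjective}"


def collect_key_values(pairs):
    """Collect the distinct values observed for each key, in first-seen order."""
    values_by_key = defaultdict(list)
    for noun, adjective in pairs:
        if adjective not in values_by_key[noun]:
            values_by_key[noun].append(adjective)
    return dict(values_by_key)


# Pre-separated (noun, adjective) pairs from several hypothetical records.
pairs = [
    ("mind", "clear"),
    ("examination", "cooperative"),
    ("mind", "in a trance"),
    ("mind", "clear"),
]
vocabulary = collect_key_values(pairs)
features = [to_feature(noun, adjective) for noun, adjective in pairs]
```

Here `vocabulary` records that the key "mind" takes the values "clear" and "in a trance", which matches the example given in step S23c.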
Preferably, the step of dividing all word lists into a training set and a test set in proportion and obtaining an optimal feature set for each infection type from the training set includes:
Step S31: divide all word lists into a training set and a test set in a ratio of 7:3 or 8:2;
Step S32: in the training set, obtain the candidate feature set of each infection type from the word list corresponding to each medical record;
Step S33: select the N most representative features of each infection type as the optimal feature set, using the chi-square test feature selection method or the feature selection method based on class discrimination; the value of N is determined by experiment.
After data preprocessing, the word list corresponding to each medical record is available. Grouping the word lists by the infection of the corresponding patient gives the feature set of each infection. The specific feature selection methods are the chi-square test and feature selection based on class discrimination.
Preferably, the step of selecting the N most representative features of each infection type as the optimal feature set by the chi-square test feature selection method, with the value of N determined by experiment, includes:
Step S331a: assume that the feature and the infection are independent, and obtain the deviation between the observed value and the theoretical value;
Step S332a: take the N features with the largest deviations, in descending order of deviation, as the optimal feature set;
Step S333a: the value of N is determined by experiment.
The basic idea of the chi-square test is to judge whether a hypothesis is correct from the deviation between the observed values and the theoretical values. In feature selection, the null hypothesis is that the feature and the infection are independent; the larger the computed chi-square value, the larger the deviation from the null hypothesis, and the more strongly the feature is associated with the infection.
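As an illustration of chi-square feature ranking, the sketch below scores each word with the standard chi-square statistic for a 2x2 contingency table (word present or absent against target infection or other) and keeps the N highest-scoring words. The text does not state which form of the statistic is used, so the 2x2 form, the function names and the toy records are assumptions.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 table.

    a: records with the word and the target infection
    b: records with the word and another infection
    c: records without the word and with the target infection
    d: records without the word and with another infection
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # The word or the class never varies; no evidence.
    return n * (a * d - b * c) ** 2 / denominator


def top_n_features(records, labels, target, n):
    """Rank the words of all records by chi-square against one infection."""
    vocabulary = sorted({word for words in records for word in words})
    scores = {}
    for word in vocabulary:
        a = b = c = d = 0
        for words, label in zip(records, labels):
            present, positive = word in words, label == target
            if present and positive:
                a += 1
            elif present:
                b += 1
            elif positive:
                c += 1
            else:
                d += 1
        scores[word] = chi_square(a, b, c, d)
    # Highest score first; ties keep alphabetical order (stable sort).
    return sorted(vocabulary, key=lambda word: -scores[word])[:n]


# Toy word lists labelled with hypothetical infection types "A" and "B".
records = [["fever", "cough"], ["fever", "dysuria"], ["cough"], ["dysuria"]]
labels = ["A", "B", "A", "B"]
selected = top_n_features(records, labels, target="A", n=2)
```

In the toy data "fever" appears equally often in both classes and scores zero, so it is not selected. Note that the statistic measures the strength of association only, which is why a word that indicates the absence of the target infection can score as high as one that indicates its presence.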
Preferably, the step of selecting the N most representative features of each infection type as the optimal feature set by the feature selection method based on class discrimination, with the value of N determined by experiment, includes:
Step S331b: compute the representativeness of each feature for the infection type and sort the features by representativeness from high to low; a larger representativeness means that the feature is more strongly correlated with the infection;
Step S332b: select the top n features as the optimal feature set of each infection; the value of n is determined by experiment, and the experimental evaluation criteria include accuracy and miss rate.
Feature selection based on class discrimination exploits the composition and distribution of infection descriptions in medical records: (1) the keywords of an infection are rarely repeated within a medical record; (2) the key symptom words in the electronic medical records of patients with similar infections overlap heavily; (3) the key symptom words of different infections are mutually exclusive. From these characteristics a representativeness score of each feature for an infection is computed, and the features are sorted by representativeness; a larger representativeness means that the feature is more strongly correlated with the infection. After the ranking is obtained, the top n features are selected as the optimal feature set of each infection. The value of n cannot be set by hand and can only be determined by experiment; the experimental evaluation criteria include accuracy and miss rate.
Preferably, the step of tuning the parameters of two or more base models separately, selecting the optimal parameters to obtain two or more optimal base models, and fusing all optimal base models to obtain the diagnostic model includes:
Step S41: the base models include RandomForest, XGBoost, GradientBoosting, ExtraTrees and SVC; the optimal parameters of each model are found with the grid search algorithm GridSearch;
Before the models are trained, the training set is first divided at random into a partial training set and a validation set. When the base models are trained, their parameters are tuned to obtain the optimal parameters of each model.
For RandomForest, the parameters n_estimators, bootstrap, criterion and min_samples_leaf are tuned and selected to obtain the optimal RandomForest model, where n_estimators is the number of decision trees, bootstrap indicates whether sampling is done with replacement, criterion is the evaluation criterion used when splitting a node (Gini impurity gini or information gain entropy), and min_samples_leaf is the minimum number of samples at a leaf node;
For XGBoost, the parameters booster, eta, min_child_weight, gamma and objective are tuned and selected to obtain the optimal XGBoost model, where booster is the model used in each iteration, either the tree-based model gbtree or the linear model gblinear; eta is the learning rate; min_child_weight is the minimum sum of sample weights in a leaf node; gamma specifies the minimum reduction of the loss function required to split a node; and objective defines the loss function to be minimized, common choices being binary:logistic and multi:softmax;
For GradientBoosting, the parameters loss, learning_rate, n_estimators and max_depth are tuned and selected to obtain the optimal GradientBoosting model, where loss is the loss function, learning_rate is the learning rate, n_estimators is the number of weak learners, and max_depth is the maximum depth of each weak learner, which limits the number of nodes in the regression tree;
For ExtraTrees and SVC, parameters are tuned and selected to obtain the optimal ExtraTrees model and the optimal SVC model respectively. The parameters listed, C, kernel, degree, gamma and coef0, are those of SVC: C is the penalty coefficient applied to the slack variables, that is, to misclassification; kernel is the kernel function type, including the linear kernel linear, the Gaussian kernel RBF, the polynomial kernel poly and the sigmoid kernel; degree is the degree of the polynomial when the kernel is poly; and gamma is the parameter of the kernel function when the kernel is Gaussian, which implicitly determines the distribution of the data after they are mapped into the new feature space;
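The grid search used for parameter tuning can be sketched independently of any particular library: every combination of candidate values is scored, and the best-scoring combination is kept. In the sketch below the parameter names are the RandomForest parameters listed above, while the candidate values and the scoring function `toy_score` are hypothetical stand-ins; in practice the score would be the validation-set performance of a model trained with the given parameters.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Exhaustively score every parameter combination and keep the best."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        # Strict comparison: on a tie the first combination seen is kept.
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Candidate values are illustrative, not taken from the source.
param_grid = {
    "n_estimators": [50, 100, 200],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 5],
}


def toy_score(params):
    """Hypothetical stand-in for accuracy on the validation set."""
    return -abs(params["n_estimators"] - 100) - params["min_samples_leaf"]


best_params, best_score = grid_search(param_grid, toy_score)
```

The cost grows with the product of the candidate-list lengths (24 combinations here), which is one reason to keep each candidate list short.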
Step S42: fuse the above models according to the stacking algorithm: the outputs of the above models are used as a new data set, a new model is retrained on this new data set, and linear regression is used as the final classification model to produce the output.
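The stacking step can be illustrated with a deliberately small sketch in which the base-model outputs become the input features of a linear regression meta-model. The 0/1 base outputs, the labels, the gradient-descent fitting routine and the 0.5 decision threshold are all assumptions made for this example; the base outputs should come from data not used to fit the base models, such as the validation set mentioned above.

```python
def fit_linear_regression(rows, targets, lr=0.1, epochs=2000):
    """Fit y ~ w.x + b by batch gradient descent on mean squared error."""
    n, dim = len(rows), len(rows[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        # Residuals are computed once per epoch, before any update.
        errors = [sum(w * x for w, x in zip(weights, row)) + bias - y
                  for row, y in zip(rows, targets)]
        for j in range(dim):
            gradient = 2.0 * sum(e * row[j] for e, row in zip(errors, rows)) / n
            weights[j] -= lr * gradient
        bias -= lr * 2.0 * sum(errors) / n
    return weights, bias


def stack_predict(base_outputs, weights, bias, threshold=0.5):
    """Combine the base-model outputs for one sample into a 0/1 decision."""
    score = sum(w * x for w, x in zip(weights, base_outputs)) + bias
    return 1 if score >= threshold else 0


# Hypothetical 0/1 outputs of three base models on eight held-out samples,
# with the true label of each sample (1 = infection present).
meta_rows = [
    [0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0],
    [0, 1, 1], [1, 0, 1], [1, 1, 0], [1, 1, 1],
]
meta_labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = fit_linear_regression(meta_rows, meta_labels)
decision = stack_predict([1, 1, 0], weights, bias)
```

Because the meta-model is trained on the base-model outputs rather than on the word features, it learns how much to trust each base model instead of relearning the diagnosis itself.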
Table 1 details the infection data used in the embodiment of the present invention.
Table 1
Because the size of the data set strongly affects the final result of training, the collected data must be filtered: infections with fewer than 500 patients are rejected. The remaining infection types are clinical sepsis, primary superficial incisional infection and urethral infection, and the data for these 3 remaining infections are then divided into a training set and a test set at 7:3.
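The filtering and splitting just described can be sketched as follows. The sketch assumes one record per patient, so that counting records counts patients, and it shuffles with a fixed seed before splitting; the function name `filter_and_split`, the seed and the toy data are hypothetical.

```python
import random
from collections import Counter


def filter_and_split(records, min_patients=500, train_ratio=0.7, seed=0):
    """Drop infections with too few patients, then split into train and test.

    records: list of (word_list, infection_type) pairs, one per patient.
    """
    counts = Counter(label for _, label in records)
    # Keep only infections that reach the minimum patient count.
    kept = [record for record in records if counts[record[1]] >= min_patients]
    # Shuffle reproducibly so the split does not depend on input order.
    random.Random(seed).shuffle(kept)
    cut = round(len(kept) * train_ratio)
    return kept[:cut], kept[cut:]


# Toy data with a lowered threshold so that the example stays small.
toy_records = ([(["fever"], "A")] * 5
               + [(["dysuria"], "B")] * 5
               + [(["rash"], "C")] * 2)
train_set, test_set = filter_and_split(toy_records, min_patients=5)
```

In the toy data infection "C" has only 2 records and is rejected, and the remaining 10 records are split 7:3.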
Table 2 lists the top 5 features of the 3 infections obtained in the present invention by the chi-square test and by feature selection based on class discrimination; feature A denotes the features obtained by feature selection based on class discrimination, and feature B denotes the features obtained by the chi-square test.
Table 2
The content of a medical record mainly comprises the series of records made from the patient's admission to discharge, such as the admission record, the attending physician's ward-round records and the discharge record. After the medical record data are preprocessed, the word list corresponding to each record is obtained, and the data are then divided in a certain ratio into a training set and a test set. Feature selection is performed on the training set, using the chi-square test and feature selection based on class discrimination to obtain the optimal feature set of each infection; the dimension n of the feature set is determined by experiment. When the intelligent diagnosis model is built, the parameters of the 5 base models are tuned, the optimal parameters are selected to obtain the optimal base models, and the 5 base models are then fused to obtain the final diagnostic model, which is finally tested with the test set to analyze its performance.