CN109036577A - Diabetic complication analysis method and device - Google Patents
Diabetic complication analysis method and device
- Publication number
- CN109036577A (application number CN201810844798.6A)
- Authority
- CN
- China
- Prior art keywords
- case history
- theme
- progress note
- document
- disease
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Landscapes
- Engineering & Computer Science (AREA)
- Medical Informatics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Public Health (AREA)
- Pathology (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Medical Treatment And Welfare Office Work (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a diabetic complication analysis method and device. The method includes: obtaining a case history document set, where the set includes a first quantity of case history documents and each case history document includes at least one progress note; obtaining the progress note-topic distribution of the at least one progress note to obtain the progress note vector of each case history document; obtaining the classification labels of the progress note vector; and training a disease discovery model with the multiple classification labels of each case history document to obtain the final disease discovery model. The invention can thus use the disease development actually recorded in the progress notes to detect complications that were not found in the admission diagnosis, which helps improve the accuracy of diagnosis and treatment for subsequent patients.
Description
Technical field
The present invention relates to the technical field of data mining, and in particular to a diabetic complication analysis method and device.
Background art
When a patient is admitted to hospital, the doctor makes an admission diagnosis and, during the subsequent visits, provides a treatment plan based on that diagnosis so that the patient's suffering is better relieved. Admission diagnosis is therefore a very important task, and one of its key points is disease discovery.
Currently, after long-term research and development, disease discovery methods have become an important research direction in fields such as medical data mining. Traditional disease discovery mainly relies on association rule discovery, classification analysis, cluster analysis and the like, and takes structured data as its research basis. However, most medical information is stored in the information systems of medical institutions as text in HTML form, which requires complex data structuring work. In addition, the characteristic attributes of different diseases are numerous and varied, and the structured data still contains a large amount of noise, which can significantly affect the accuracy of complication discovery. In summary, data structuring and feature engineering greatly increase the preliminary workload of traditional medical data mining.
Summary of the invention
In view of the defects in the prior art, the present invention provides a diabetic complication analysis method and device for solving the technical problems in the related art.
In a first aspect, an embodiment of the present invention provides a diabetic complication analysis method, the method including:
obtaining a case history document set; the case history document set includes a first quantity of case history documents; each case history document includes at least one progress note;
obtaining the progress note-topic distribution of the at least one progress note to obtain the progress note vector of each case history document;
obtaining the classification labels of the progress note vector;
training a disease discovery model with the multiple classification labels of each case history document to obtain the final disease discovery model.
Optionally, obtaining the progress note-topic distribution of the at least one progress note to obtain the progress note vector of each case history document includes:
obtaining the multidimensional time-series topics of the case history document according to the progress note-topic distribution of the at least one progress note;
performing feature extraction on the multidimensional time-series topics by singular value decomposition, and taking the singular values on the diagonal as the progress note vector of the case history document.
Optionally, obtaining the classification labels of the progress note vector includes:
obtaining the disease set corresponding to the case history document set; the disease set includes multiple disease labels;
selecting any disease label from the disease set and, using the BP binary classification method, adding the disease label to the progress note vectors that include the disease, to obtain the multiple classification labels corresponding to each case history document.
Optionally, obtaining the progress note-topic distribution of the at least one progress note to obtain the progress note vector of each case history document includes:
calculating the similarity between any two case history documents in the case history document set, and obtaining the similarity-constrained case history set formed by the case history documents whose similarity is greater than or equal to a similarity threshold;
inputting each case history document in the similarity-constrained case history set into a preset LDA model in turn, and inferring the document-topic distribution and the topic-word distribution of each case history document with the preset LDA model;
constructing the progress note vector of each case history document according to the document-topic distribution and the topic-word distribution.
Optionally, calculating the similarity between any two case history documents among the initial case histories includes:
obtaining multiple similarity calculation factors of a case history and the weight of each similarity calculation factor;
calculating the value of any two case history documents for each similarity calculation factor; the similarity calculation factors include the gender-attribute distance, the distance between the age brackets, and the distance between the diagnostic results;
calculating the similarity of the two case history documents from the value and the weight of each similarity calculation factor.
Optionally, inferring the document-topic distribution and the topic-word distribution of each case history document with the preset LDA model includes:
randomly assigning a topic number z to each word of each case history document in the similarity-constrained case history set;
rescanning the similarity-constrained case history set and resampling the topic of each word according to the sampling formula, until the new topics satisfy Gibbs sampling convergence;
counting the topic-word co-occurrence frequency matrix of the corpus, and calculating the document-topic distribution and the topic-word distribution.
Optionally, the preset LDA model includes:
the similarity constraint between any two case history documents, expressed by the topic distribution distance $dis(\theta r_m, \theta r_n)$, whose formula is:
where $\theta r_m = \{\theta_{m,1}, \theta_{m,2}, \ldots, \theta_{m,L_m}\}$, indicating that each case history document includes $L_m$ progress notes; $\theta_{m,L_m}$ denotes the topic of the $L_m$-th progress note; $d(\theta_{m,L_m}, \theta_{n,L_n})$ denotes the Euclidean distance between the topic vectors of two progress notes;
the preset LDA model further includes a Gibbs-EM iteration function:
in which the count term represents the number of words i assigned to topic k in the similarity-constrained case history set.
In a second aspect, an embodiment of the present invention provides a diabetic complication analysis device, the device including:
a case history set obtaining module, for obtaining a case history document set; the case history document set includes a first quantity of case history documents; each case history document includes at least one progress note;
a vector space obtaining module, for obtaining the progress note-topic distribution of the at least one progress note to obtain the progress note vector of each case history document;
a classification label obtaining module, for obtaining the classification labels of the progress note vector;
a discovery model obtaining module, for training a disease discovery model with the multiple classification labels of each case history document to obtain the final disease discovery model.
As can be seen from the above technical solution, in the embodiment of the present invention a case history document set is obtained, where the set includes a first quantity of case history documents and each case history document includes at least one progress note; the progress note-topic distribution of the at least one progress note is then obtained, giving the progress note vector of each case history document; next, the classification labels of the progress note vector are obtained; finally, a disease discovery model is trained with the multiple classification labels of each case history document to obtain the final disease discovery model. The invention can thus use the disease development actually recorded in the progress notes to detect complications that were not found in the admission diagnosis, which helps improve the accuracy of diagnosis and treatment for subsequent patients.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the diabetic complication analysis method provided by one embodiment of the present invention;
Fig. 2 shows a progress note in a case history document;
Fig. 3 shows the distribution of the number of diabetic complications in male patients;
Fig. 4 shows the distribution of the number of diabetic complications in female patients;
Fig. 5 is a schematic diagram of the relationship between the number of topics and the similarity constraint index SIM when the similarity threshold is 0.5 and 0.6 respectively;
Fig. 6 is a schematic diagram of the relationship between the number of topics and the similarity constraint index SIM when the similarity threshold is 0.7 and 0.8 respectively;
Fig. 7 shows the relationship between the number of topics and the mutual information.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Currently, after long-term research and development, disease discovery methods have become an important research direction in fields such as medical data mining. Traditional disease discovery mainly relies on association rule discovery, classification analysis, cluster analysis and the like, and takes structured data as its research basis. However, most medical information is stored in the information systems of medical institutions as text in XML form, which requires complex data structuring work. In addition, the characteristic attributes of different diseases are numerous and varied, and the structured data still contains a large amount of noise, which can significantly affect the accuracy of complication discovery. In summary, data structuring and feature engineering greatly increase the preliminary workload of traditional medical data mining.
To this end, an embodiment of the present invention provides a diabetic complication analysis method. Fig. 1 is a flow diagram of the diabetic complication analysis method provided by one embodiment of the present invention. Referring to Fig. 1, the diabetic complication analysis method includes:
101. obtaining a case history document set; the case history document set includes a first quantity of case history documents; each case history document includes at least one progress note;
102. obtaining the progress note-topic distribution of the at least one progress note to obtain the progress note vector of each case history document;
103. obtaining the classification labels of the progress note vector;
104. training a disease discovery model with the multiple classification labels of each case history document to obtain the final disease discovery model.
Each step of the diabetic complication analysis method is described in detail below with reference to the drawings and embodiments.
First, step 101 is introduced: obtaining the case history document set.
During hospitalization a patient generates various records, such as admission records, discharge records, progress notes and consultation notes. Computing the similarity between these records directly would greatly increase the amount of computation. For ease of description, the records before processing are referred to as initial case histories in the present embodiment.
To balance timeliness and computation, the embodiment of the present invention obtains a first quantity of case history documents to form the case history document set. It will be appreciated that each case history document may include at least one progress note.
The first quantity can be set as required, for example 1000 or 10000, and is not limited here.
Second, step 102 is introduced: obtaining the progress note-topic distribution of the at least one progress note to obtain the progress note vector of each case history document.
The present embodiment uses a preset LDA model (hereinafter called Medical Record Similarity based Latent Dirichlet Allocation, MRS-LDA) to obtain the progress note-topic distribution of the at least one progress note in each case history document.
Only the similarity of the admission diagnosis part of the case history documents is considered in the present embodiment. Similarity here means computing the distance between any two initial case histories, and constructing the case history similarity constraint can be understood as collecting the case histories whose pairwise distance is less than some threshold.
In practice, an initial case history may also include multiple complications of a disease; for example, diabetes can lead to a variety of complications, as shown in Table 1.
Table 1: Examples of complications in diabetic patients
Table 1 shows that patients of different age groups differ in how diabetes and its complications present. In addition, patients of different ages tolerate drugs differently, so symptoms, medication and so on may differ during clinical diagnosis and treatment. The basic information of the patient therefore needs to be considered when computing the similarity of case history documents, and the present embodiment includes the patient's gender and age in the similarity calculation factors of the case history documents.
In one embodiment, the gender-attribute distance is set to 1 between patients of the same gender and to 0 between patients of different genders, as in the following formula:
$$d(sex_i, sex_j) = \begin{cases} 1, & sex_i = sex_j \\ 0, & sex_i \neq sex_j \end{cases}$$
where $sex_i$ and $sex_j$ denote the genders of two different patients.
In one embodiment, ages are divided into four brackets according to the international age composition of the population: juvenile, 0 to 17 years old, denoted 1; youth, 18 to 45 years old, denoted 2; middle age, 46 to 59 years old, denoted 3; old age, over 59 years old, denoted 4. The present embodiment can then calculate the distance between the age brackets of two patients, as in the following formula:
where $age_i$ and $age_j$ denote the ages of two different patients, and $flag_i$ and $flag_j$ denote the brackets to which the two ages belong. The closer the two brackets are, the smaller the distance; the further apart they are, the larger the distance.
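The bracket assignment can be sketched in Python as follows. This is a minimal illustration only: the function names are hypothetical, the 46 to 59 boundary for middle age is an assumption chosen so that the four brackets do not overlap, and `bracket_gap` is just the raw difference of bracket codes, not the distance formula of this embodiment.

```python
def age_bracket(age: int) -> int:
    """Map an age in years to the bracket code 1-4 described above."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 17:
        return 1  # juvenile, 0-17
    if age <= 45:
        return 2  # youth, 18-45
    if age <= 59:
        return 3  # middle age (assumed here to be 46-59)
    return 4      # old age, over 59


def bracket_gap(age_i: int, age_j: int) -> int:
    """Number of brackets separating two ages (0 means same bracket).

    Illustrative helper only; any normalisation applied by the actual
    distance formula is not specified here.
    """
    return abs(age_bracket(age_i) - age_bracket(age_j))
```

Two patients aged 30 and 40 fall into the same bracket, so their gap is 0, while a 10-year-old and a 70-year-old are three brackets apart.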
Since the initial case histories are described in discrete free text, the present embodiment uses the Jaccard distance to calculate the distance between the diagnostic results of different initial case histories, as in the following formula:
$$d(dia_i, dia_j) = \frac{|dia_i \cap dia_j|}{|dia_i \cup dia_j|}$$
where $dia_i$ and $dia_j$ denote the discharge diagnosis Boolean vector spaces of case history i and case history j; the conditions considered here are mostly diabetic complications.
For example, if $dia_i = \{1, 2, 3\}$ and $dia_j = \{2, 3, 4\}$, then $dia_i \cap dia_j = \{2, 3\}$ and $dia_i \cup dia_j = \{1, 2, 3, 4\}$, so $d(dia_i, dia_j) = 2/4 = 0.5$.
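The worked example can be reproduced with Python sets. This is a small sketch; the function name is hypothetical, and returning 0.0 for two empty diagnosis sets is an assumption, since that case is not discussed in the text.

```python
def diagnosis_overlap(dia_i: set, dia_j: set) -> float:
    """Size of the intersection divided by size of the union of two diagnosis sets."""
    union = dia_i | dia_j
    if not union:
        return 0.0  # assumption: two empty diagnosis sets share nothing
    return len(dia_i & dia_j) / len(union)


# The example from the text: {1, 2, 3} and {2, 3, 4} share 2 of 4 codes.
print(diagnosis_overlap({1, 2, 3}, {2, 3, 4}))  # 0.5
```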
It should be noted that the present embodiment only considers the case where the similarity calculation factors are the gender-attribute distance, the distance between age brackets and the distance between diagnostic results. When the application scenario of the text topic analysis method changes, the composition of the similarity calculation factors can be adjusted accordingly, and the adjusted scheme likewise falls within the protection scope of the present application.
After the similarity calculation factors are determined, the weight adjustment parameters $\mu_1$, $\mu_2$, $\mu_3$ are set, and the similarity between any two initial case histories is calculated as follows:
$$sim(T_i, T_j) = \mu_1 \, d(sex_i, sex_j) + \mu_2 \, d(age_i, age_j) + \mu_3 \, d(dia_i, dia_j) \quad (3)$$
$$\mu_1 + \mu_2 + \mu_3 = 1 \quad (4)$$
$$0 \le \mu_1, \mu_2, \mu_3 \le 1 \quad (5)$$
The similarity is compared with the similarity threshold $\tau$, the initial case histories whose similarity is greater than or equal to the threshold are selected, and the similarity-constrained case history set formed by these initial case histories is obtained, denoted $D = \{(T_i, T_j) \mid i, j \in [1, M]\}$.
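Formulas (3) to (5) and the threshold filter can be sketched as follows. The function names, the default weights and the `component_fn` callback (which returns the three factor values for a pair of records) are assumptions made for illustration; the text does not prescribe particular weight values.

```python
from itertools import combinations


def weighted_similarity(d_sex, d_age, d_dia, mu=(0.2, 0.2, 0.6)):
    """Formula (3): weighted sum of the three factor values.

    The default weights are illustrative only; they must satisfy (4) and (5).
    """
    if not all(0.0 <= m <= 1.0 for m in mu) or abs(sum(mu) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    mu1, mu2, mu3 = mu
    return mu1 * d_sex + mu2 * d_age + mu3 * d_dia


def similarity_constrained_pairs(records, component_fn, tau, mu):
    """Return index pairs (i, j) whose similarity is at least tau.

    component_fn(rec_i, rec_j) is a hypothetical callback returning the
    tuple (d_sex, d_age, d_dia) for two records.
    """
    pairs = []
    for (i, rec_i), (j, rec_j) in combinations(enumerate(records), 2):
        d_sex, d_age, d_dia = component_fn(rec_i, rec_j)
        if weighted_similarity(d_sex, d_age, d_dia, mu) >= tau:
            pairs.append((i, j))
    return pairs
```

Passing the factor computation in as a callback keeps the sketch independent of how each individual factor is defined, so the composition of factors can change without touching the filter.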
In the present embodiment, the preset LDA model is obtained by improving the existing LDA model. To help those skilled in the art better understand the preset LDA model, the basic principle of the LDA model is described first:
Latent Dirichlet Allocation (LDA) is a topic model whose purpose is to discover the topics of documents. It has a three-layer structure of documents, topics and words: each document has its own probability distribution over topics, and the words in a document are sampled from the distributions of different topics, as shown in formula (6):
$$p(word \mid document) = \sum_{topic} p(word \mid topic) \cdot p(topic \mid document) \quad (6)$$
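As a numerical illustration of formula (6), the word distribution of one document is the topic-weighted mixture of the topic-word distributions. The values below are made up for the example, and the function name is hypothetical.

```python
def word_given_document(theta_d, phi):
    """Formula (6) for one document.

    theta_d[k] = p(topic k | document); phi[k][w] = p(word w | topic k).
    Returns the list p(word w | document) for every word w.
    """
    n_topics = len(theta_d)
    n_words = len(phi[0])
    return [
        sum(theta_d[k] * phi[k][w] for k in range(n_topics))
        for w in range(n_words)
    ]


# Two topics, three words (illustrative numbers only).
theta_d = [0.75, 0.25]
phi = [[0.5, 0.5, 0.0],
       [0.0, 0.5, 0.5]]
p_word = word_given_document(theta_d, phi)
```

Because each row of `phi` and `theta_d` itself sum to 1, the resulting word probabilities also sum to 1.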
When the LDA model is used to model case history documents, let the total number of case history documents be M, let the m-th case history document contain $N_m$ clinical description words, and let each word be denoted $\omega_{m,n}$. Under the bag-of-words model, documents and words are expressed as a document-topic distribution and a topic-word distribution. A topic can be understood as a general term for the clinical care measures in the case history text, such as medication, observation, symptoms and surgery. Each case history text is a multinomial distribution over multiple topics; that is, each case history text is composed of multiple steps of the clinical care process.
In the related art, the steps by which the LDA model generates a case history text are shown in Table 2.
It will be appreciated that, since each topic is a multinomial distribution over multiple words, each clinical care step contains multiple clinical practice operations, and the document-topic and topic-word distributions follow Dirichlet priors with parameters α and β. The LDA model can therefore simulate well the thinking process by which a doctor writes the case history text during diagnosis and treatment.
Based on the above analysis, the purpose of LDA inference is to compute the unknown parameters of the LDA model from the current test document set, and to compute the topic-word distribution and the document-topic distribution from them. In fact, the topic-word distribution and the document-topic distribution can be derived directly during the computation, without computing those parameters explicitly.
In practice, the parameter inference algorithms for the LDA model include Gibbs sampling and variational EM. Both methods are described below.
First, the core idea of Gibbs sampling is the Markov chain Monte Carlo (MCMC) method: each iteration changes the value of only one dimension of the parameters, until convergence yields the parameter values to be estimated. From Dirichlet parameter estimation, inference gives:
In these expressions, one term denotes the document-topic distribution, one the topic-word distribution and one the probability that a word is assigned to topic k; i is the index pair (m, n), denoting the n-th word in the m-th document.
Since there are K topics in total, K iterations are required, using the training steps shown in Table 3:
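For orientation, a standard collapsed Gibbs sampler for plain LDA can be sketched as below. This illustrates the general technique only and is not necessarily the exact procedure of Table 3 or of the preset MRS-LDA model; the function name, the hyperparameter defaults and the fixed iteration count (used instead of a convergence test) are assumptions.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA.

    docs is a list of documents, each a list of integer word ids in
    range(vocab_size). Returns (theta, phi): document-topic and
    topic-word distributions estimated from the final counts.
    """
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]   # topic-word counts
    n_k = [0] * n_topics                                 # words per topic
    z = []                                               # topic of each word

    # Step 1: assign a random topic to every word.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    # Step 2: rescan the corpus and resample each word's topic.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Step 3: turn the co-occurrence counts into distributions.
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha)
              for k in range(n_topics)] for d, doc in enumerate(docs)]
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)
            for w in range(vocab_size)] for k in range(n_topics)]
    return theta, phi
```

The three commented steps correspond to random topic assignment, resampling while rescanning the corpus, and counting the topic-word co-occurrences to obtain the two distributions.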
Second, the variational EM algorithm looks for suitable parameters that maximize the probability of the topic-word distribution observed in the text set, which is similar to a maximum likelihood estimation problem. The variational EM algorithm alternates between two steps:
Variational E-step: since the posterior probability p(w | α, β) is difficult to derive, variational parameters are introduced to obtain an approximate posterior probability distribution.
Variational M-step: the approximating function is maximized with respect to the variational parameters from the variational E-step. Here the Dirichlet prior parameters (α, β) determine the topic-word distribution and the document-topic distribution θ; w denotes a word and z denotes a topic.
Since the iteration objective of the LDA model is to maximize the word occurrence probability p(Z, W | α, β), it can fit the data characteristics of diabetic progress notes well. However, it also leads to large differences between the topic distributions of similar case histories, so that case histories cannot be effectively analyzed statistically by their topic distributions.
To build a topic model that satisfies the case history similarity constraint, the present embodiment changes the convergence condition strategy of Gibbs sampling.
Since each case history may contain multiple progress notes ordered in time, the similarity calculation for case history documents must consider the similarity between the progress note sets of different case history documents; that is, the document-topic distributions of the progress note sets of the case history documents in the similarity-constrained case history set D should be as similar as possible.
Let $T_m$ denote the case history numbered m, which contains $L_m$ progress notes, and let the topic set of its progress notes be $\theta r_m = \{\theta_{m,1}, \theta_{m,2}, \ldots, \theta_{m,L_m}\}$. Given the progress note topic sets $\theta r_m$ and $\theta r_n$ of two case history documents, the case history similarity constraint can be calculated as the mean of the pairwise topic distribution distances, as follows:
where $d(\theta_{m,L_m}, \theta_{n,L_n})$ denotes the Euclidean distance between the topic vectors of two progress notes; the larger $dis(\theta r_m, \theta r_n)$ is, the lower the similarity.
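One plausible reading of this constraint, assuming the mean is taken over all $L_m \times L_n$ pairs of progress notes, is sketched below. The exact normalisation is an assumption, and the function name is hypothetical.

```python
import math


def topic_set_distance(theta_m, theta_n):
    """Mean pairwise Euclidean distance between two sets of topic vectors.

    theta_m and theta_n are lists of topic-distribution vectors, one per
    progress note. Averaging over every cross pair is an assumed reading
    of "mean of the pairwise topic distribution distances".
    """
    if not theta_m or not theta_n:
        raise ValueError("each case history needs at least one progress note")
    total = sum(math.dist(a, b) for a in theta_m for b in theta_n)
    return total / (len(theta_m) * len(theta_n))
```

Under this reading, two case histories whose progress notes all share the same topic vector have distance 0, and the value grows as the notes of the two histories drift apart in topic space.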
Maximum target function can be modified are as follows:
LDA model inference is carried out using Gibbs-EM alternative manner in the present embodiment, by it by document-theme distribution αm
It is revised as normal distribution μm, obtain default LDA model:
Wherein, μmkIt represents case history document m and belongs to the probability of theme k, since thinking μmStandardized normal distribution is obeyed, then is changed
It is expressed as follows into maximum target function:
In addition, first fixing document subject matter in the present embodiment in sampling process is distributed αm, then Gibbs-EM iteration function
Expression formula are as follows:
Wherein,Theme in similarity constraint case history collection is represented as the quantity of the word i of k, due to using normal state point
Cloth replaces original α, so formula (14) can be derived with stochastic gradient descent method, model training process such as table
4:
This completes the construction of the preset LDA (MRS-LDA) model in the embodiment of the present invention. On the basis of analyzing the influence of text mining on medical diagnosis, and the modeling process and inference method of the latent Dirichlet topic model, the embodiment designs a preset LDA model based on the case history similarity constraint. The preset LDA model not only considers the similarity constraint between different case history documents, but also determines the medical text topic modeling objective, the inference process and the model relevance metrics. The preset LDA model can therefore clearly reflect the focus of each diagnosis and treatment stage and the evolution of the condition, which helps improve the soundness, validity and accuracy of case history topic mining.
Next, each case history document in the similarity-constrained case history set is input into the preset LDA model in turn. The preset LDA model infers the document-topic distribution and the topic-word distribution of the at least one progress note in each case history document, from which the multidimensional time-series topics of each case history document are obtained.
In the embodiment of the present invention, singular value decomposition is used to extract features from the multidimensional time-series topics, mapping them into a subspace represented by the features with the most influential singular values. Unlike eigenvector decomposition, singular value decomposition does not require the decomposed matrix to be square. For a matrix A of size M × N, the singular value decomposition is expressed as:
$$A = U \Sigma V^T \quad (15)$$
where U and V are unitary matrices, U is an M × M matrix and V is an N × N matrix, satisfying $U^T U = I$ and $V^T V = I$; $\Sigma$ is an M × N matrix with $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$, where $r = \mathrm{rank}(\Sigma)$ is the largest order of the nonzero minors of $\Sigma$. All entries off the diagonal are 0, and the singular values $\{\sigma_1, \ldots, \sigma_r\}$ on the diagonal form the feature vector of the progress notes.
In the present embodiment, singular value decomposition is used for matrix decomposition of the time-series topics. Because the length of hospitalization differs between patients, the lengths of the topic sequences of different patients' case histories also differ. However, owing to the disease discovery model, the topic dimension of the topic sequences of different case histories is the same, so it is feasible to use singular value decomposition to map the multidimensional time-series topics into a subspace of length r.
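A minimal NumPy sketch of this feature extraction is given below. The function name is hypothetical, and zero-padding or truncating the singular values to a fixed `length` is an assumption made so that case histories with different numbers of progress notes yield vectors of equal size.

```python
import numpy as np


def progress_note_vector(topic_sequence, length=None):
    """Singular values of a progress-note topic matrix.

    topic_sequence has shape (L_m, K): one row per progress note and one
    column per topic. Singular values are returned in descending order.
    If length is given, the result is zero-padded or truncated to that
    size (an assumption, not stated in the text).
    """
    a = np.asarray(topic_sequence, dtype=float)
    s = np.linalg.svd(a, compute_uv=False)  # min(L_m, K) singular values
    if length is None:
        return s
    out = np.zeros(length)
    n = min(length, s.size)
    out[:n] = s[:n]
    return out
```

For instance, a case history with a single progress note yields one nonzero singular value; with `length=3` the vector is padded with two zeros.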
Third, step 103 is introduced: obtaining the classification labels of the progress note vector.
A patient may suffer from several different diseases at the same time, such as the diabetic complications shown in Table 1. The disease discovery model in the present embodiment must therefore take into account that a patient can have multiple diseases simultaneously, and the case history document of one patient needs multi-label classification. Multi-label classification means that a sample has multiple labels at once and that the labels may be correlated. For example, a film can be both an art film and a romance film, and art and romance are related to some degree; such a classification problem is called multi-label classification.
In practice, there are two main kinds of solution to the multi-label classification problem: improving the classifier and transforming the model.
Improving the classifier means changing the classification algorithm so that it meets the multi-label classification requirement without changing the data structure. Common multi-label classification algorithms include the Boosting algorithm, BP neural networks, decision trees and support vector machines. The advantage of improving the classifier is that it adapts to the data structure, but it usually makes the solution logic complex and increases algorithm complexity.
The purpose of model transformation is to change the data set so that existing single-label classification algorithms can be applied. Existing strategies include the BP binary relation method, the RPC pairwise comparison ranking method and the LP label power set method.
The present embodiment uses ECC (Ensembles of Classifier Chains), an improvement on the BP binary relation method; that is, the multi-label classification problem is converted into multiple binary classification problems. First, the disease set corresponding to the case history document set is obtained; the disease set contains multiple disease labels. Then one disease label is selected from the disease set, the case history documents belonging to that disease label are placed in one class and the remaining case history documents in the other class, and the label is added to the feature data of the case history documents. In this way the multiple labels of each case history document are obtained.
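The conversion into binary problems can be sketched as follows. The function name and the disease names in the usage comment are illustrative assumptions; the indicator slot of the target disease is kept at 0 so that the target is never leaked into its own features, which is a design choice of this sketch.

```python
def build_binary_tasks(vectors, label_sets, diseases):
    """Turn a multi-label data set into one binary task per disease.

    vectors:    progress note vectors, one per case history document
    label_sets: set of disease labels for each document
    diseases:   ordered list of all disease labels (the disease set)

    Returns {disease: (rows, targets)}. Each row is the progress note
    vector followed by 0/1 indicators for the other diseases; the target
    is 1 (positive class) if the document has that disease, else 0.
    """
    tasks = {}
    for target in diseases:
        rows, targets = [], []
        for vec, labels in zip(vectors, label_sets):
            flags = [1 if (d in labels and d != target) else 0
                     for d in diseases]
            rows.append(list(vec) + flags)
            targets.append(1 if target in labels else 0)
        tasks[target] = (rows, targets)
    return tasks


# Illustrative use with made-up vectors and labels:
# build_binary_tasks([[0.1, 0.2]], [{"nephropathy"}], ["nephropathy", "retinopathy"])
```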
Finally, step 104 is introduced: training a disease discovery model with the multiple classification labels of each case history document to obtain the final disease discovery model.
In the present embodiment, the disease discovery model is trained with the multiple classification labels of each case history document, which yields the final disease discovery model.
For example, suppose there are 100 case history documents, each containing one or more discharge diagnosis results. Following the BP binary classification idea, a complete disease set is first established and one disease (for example, diabetic nephropathy) is taken. For each case history document in the case history document set, a document whose discharge diagnosis results include this disease is assigned to the positive class and all other case history documents to the negative class. At the same time, the other diseases are added to the feature data according to the actual discharge diagnosis results: the corresponding feature is marked 1 if the disease is present and 0 otherwise. The set formed by the positive-class case history documents serves as the test set for that disease. Next, another disease is chosen as the positive-class label and a test set is constructed by the same steps, and so on until every disease in the disease set has served as the positive class and has its own test set. Classification training is then carried out with algorithms such as k-nearest neighbors, support vector machines and random forests to build multiple classifiers. In the prediction stage, each item of prediction data is assumed not yet to have any disease, so all disease features are initially marked 0. The classifiers are then applied one after another, and each classification result is written into the disease features used by the next classifier (marked 1 where a disease is predicted) until all classifiers have been traversed. In this way, this embodiment obtains the trained disease discovery model.
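The prediction stage described above, in which all disease features start at 0 and each classification result is passed on to the next classifier, can be sketched as follows. This is an illustrative sketch only: `chain_predict` is a hypothetical name, and the two rule-based functions are stand-ins for trained classifiers, not models from the embodiment.

```python
from typing import Callable, List, Sequence

Classifier = Callable[[Sequence[float]], int]


def chain_predict(theme_features: Sequence[float], chain: List[Classifier]) -> List[int]:
    """Apply the classifiers in order, feeding earlier predictions to later ones."""
    disease_flags = [0] * len(chain)  # no disease assumed at the start
    for i, clf in enumerate(chain):
        # Features = theme features + flags of all diseases except the one being predicted.
        others = disease_flags[:i] + disease_flags[i + 1:]
        disease_flags[i] = clf(list(theme_features) + others)
    return disease_flags


# Stand-in classifiers (hypothetical rules, not trained models).
def nephropathy_clf(x: Sequence[float]) -> int:
    return 1 if x[0] > 0.5 else 0


def retinopathy_clf(x: Sequence[float]) -> int:
    # x[2] is the flag of the other disease, filled in by the previous classifier.
    return 1 if x[1] > 0.5 or x[2] == 1 else 0


with_complications = chain_predict([0.7, 0.1], [nephropathy_clf, retinopathy_clf])
without_complications = chain_predict([0.2, 0.1], [nephropathy_clf, retinopathy_clf])
```

In an ensemble of classifier chains, several such chains with different label orders would typically be trained and their predictions combined; the sketch shows a single chain only.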
In this embodiment, a case history document is input into the disease discovery model to find diseases that may have been missed in the admission diagnosis, which helps improve diagnostic efficiency.
The effectiveness and advantages of the diabetic complication analysis method provided by the embodiment of the present invention are illustrated below by comparative experiments.
This embodiment uses inpatient case histories from the Department of Endocrinology of the First Affiliated Hospital of Anhui Medical University, comprising a total of 1,294 inpatient records of diabetic patients from 2015 to 2017. Each case history document mainly includes admission records, progress notes (as shown in Figure 2), consultation notes and discharge records. The ratio of male to female patient case history documents is 648:646, which is approximately equal.
Referring to Fig. 3 and Fig. 4, the case history texts of these diabetic patients are used as the raw data in this embodiment. The number of progress notes in a patient's case history usually corresponds to the patient's length of stay; the specific distribution is shown in Figure 3.
Considering that different patients are similar in the complications they suffer from and in other respects, the MRS-LDA model of step 102 is used to mine the theme features of the progress notes of different patients, with the theme quantity K = 15 and the case history similarity constraint threshold τ = 0.5. This yields multidimensional time-series theme data based on the progress notes, which are mapped to a lower-dimensional feature space by singular value decomposition. Diabetic complication discovery is also a multi-label classification problem, so the data set must be processed further before traditional classification methods can be applied; in this embodiment, the binary relevance method is used to process it into sample data sets suitable for binary classification.
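The singular value decomposition step can be sketched with NumPy as follows. This is a sketch under assumptions rather than the method of the embodiment: the function name `progress_note_vector` is hypothetical, the random theme distributions are placeholder data, and zero-padding the singular values to a fixed length (so that records with different numbers of progress notes yield vectors of equal size) is an assumption that the source does not state.

```python
import numpy as np


def progress_note_vector(theme_series: np.ndarray, size: int) -> np.ndarray:
    """Singular values of an (n_notes x n_themes) matrix, zero-padded or truncated to `size`."""
    # compute_uv=False returns only the singular values, in descending order.
    singular_values = np.linalg.svd(theme_series, compute_uv=False)
    vector = np.zeros(size)
    k = min(size, singular_values.shape[0])
    vector[:k] = singular_values[:k]
    return vector


# Placeholder data: 7 progress notes, each a distribution over K = 15 themes.
rng = np.random.default_rng(0)
theme_series = rng.dirichlet(np.ones(15), size=7)
vector = progress_note_vector(theme_series, size=15)
```

A record with 7 progress notes has at most 7 non-zero singular values, so the remaining entries of `vector` are zero under the padding assumption.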
Different classifiers perform differently in diabetic complication discovery under different theme mining methods. To find a classifier suited to theme-based diabetic complication discovery, k-nearest neighbors, random forests, logistic regression and support vector machines are selected for classification training. In this experiment the theme number parameter K is set to 15, and each classification algorithm is run with both the traditional LDA model and the model of this embodiment. The experimental results are as follows.
Fig. 4(a) and Fig. 4(b) show that, when the case history documents are classified with the different classification models, the average classification accuracy rises and falls slightly as the theme quantity increases and finally settles between 0.8 and 0.82. When the theme quantity is 7, the average classification accuracy fluctuates markedly and rises sharply, reaching 0.948, whereas the highest accuracy of the conventional method is 0.9. It can be seen that, compared with the traditional LDA model, the ensemble classifier chain in this embodiment performs better in average classification accuracy, and the support vector machine and logistic regression models also perform better in classification.
Fig. 5(a) and Fig. 5(b) show how the average classification specificity fluctuates as the theme quantity increases when the case history documents are classified with the different classification models; there is a sharp rise when the theme quantity is 15. Compared with the traditional LDA model, the ensemble classifier chain in this embodiment performs better in average classification specificity, which can reach 1, and the support vector machine and logistic regression models also perform better in complication discovery classification.
Fig. 6(a) and Fig. 6(b) show how the average classification sensitivity of the ensemble classifier chain of this embodiment and of the LDA model changes as the theme quantity increases. Compared with the traditional LDA model, the ensemble classifier chain in this embodiment performs well on metrics such as average accuracy, average specificity and average sensitivity. This is because the similar case history constraint is taken into account when calculating the document-theme distribution, so that case histories with similar diagnosis results are close in their document-theme distributions and are therefore assigned the same classification label during classifier training.
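The three metrics compared in Figs. 4 to 6 can be computed for each disease label from confusion-matrix counts and then averaged over the labels. The sketch below assumes the standard definitions of accuracy, sensitivity and specificity and an unweighted mean over labels; the source does not state how the averages were formed, and the function names and toy labels are hypothetical.

```python
from typing import Dict, List, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity for one disease label (1 = disease present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


def average_metrics(per_label: List[Dict[str, float]]) -> Dict[str, float]:
    """Unweighted mean of each metric over all disease labels."""
    return {k: sum(m[k] for m in per_label) / len(per_label) for k in per_label[0]}


# Toy labels for two diseases over four records (hypothetical values).
nephropathy = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
retinopathy = binary_metrics([1, 0, 0, 0], [1, 1, 0, 0])
averaged = average_metrics([nephropathy, retinopathy])
```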
Through the comparative experiments, this embodiment explains the practical significance of theme-based diabetic complication discovery in terms of the inaccuracy of the admission diagnoses of diabetic patients and the completeness of the discharge diagnoses. It also specifies the data feature extraction method of the multidimensional time-series theme model and uses the binary relevance method to simplify the experimental scheme for discovering multiple diabetic complications. The scheme makes effective use of the disease development actually recorded in the progress notes and of clinical diagnosis and treatment data, detects complications not found in the admission diagnosis well, and confirms the scientific soundness and importance of theme-based diabetic complication discovery.
In a second aspect, an embodiment of the present invention provides a diabetic complication analysis device. Referring to Fig. 7, the device includes:
a case history set obtaining module 701, configured to obtain a case history document set; the case history document set includes a first quantity of case history documents; each case history document includes at least one progress note;
a vector space obtaining module 702, configured to obtain the progress note-theme distribution of the at least one progress note and obtain the progress note vector of each case history document;
a classification label obtaining module 703, configured to obtain the classification labels of the progress note vector;
a discovery model obtaining module 704, configured to train the disease discovery model using the multiple classification labels of each case history document to obtain the final disease discovery model.
It should be noted that the diabetic complication analysis device provided by the embodiment of the present invention corresponds one-to-one to the above method. The implementation details of the method apply equally to the device, so the device is not described in further detail here.
In the specification of the present invention, numerous specific details are set forth. It should be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this specification.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and they shall all fall within the scope of the claims and the description of the present invention.
Claims (8)
1. A diabetic complication analysis method, characterized in that the method comprises:
obtaining a case history document set; the case history document set includes a first quantity of case history documents; each case history document includes at least one progress note;
obtaining the progress note-theme distribution of the at least one progress note to obtain the progress note vector of each case history document;
obtaining the classification labels of the progress note vector;
training a disease discovery model using the multiple classification labels of each case history document to obtain a final disease discovery model.
2. The method according to claim 1, characterized in that obtaining the progress note-theme distribution of the at least one progress note to obtain the progress note vector of each case history document comprises:
obtaining the multidimensional time-series themes of the case history document according to the progress note-theme distribution of the at least one progress note;
performing feature extraction on the multidimensional time-series themes using singular value decomposition, and taking the singular value parameters at the diagonal positions as the progress note vector of the case history document.
3. The method according to claim 1, characterized in that obtaining the classification labels of the progress note vector comprises:
obtaining the disease set corresponding to the case history document set; the disease set includes multiple disease labels;
selecting any one disease label from the disease set, and using the BP binary classification method to add the disease label to the progress note vector of each case history document that includes the disease.
4. The method according to claim 1, characterized in that obtaining the progress note-theme distribution of the at least one progress note to obtain the progress note vector of each case history document comprises:
calculating the similarity between any two case history documents in the case history document set, and obtaining a similarity constraint case history set formed by the case history documents whose similarity is greater than or equal to a similarity threshold;
inputting each case history document in the similarity constraint case history set in turn into a preset LDA model, and inferring the document-theme distribution and the theme-word distribution of each case history document with the preset LDA model;
constructing the progress note vector of each case history document according to the document-theme distribution and the theme-word distribution.
5. The method according to claim 4, characterized in that calculating the similarity between any two case history documents in the initial case histories comprises:
obtaining multiple similarity calculation factors of the case histories and the weight of each similarity calculation factor;
calculating the value of each similarity calculation factor for any two case history documents; the similarity calculation factors include: the distance between gender attributes, the distance between the age segments to which the ages belong, and the distance between diagnosis results;
calculating the similarity of any two case history documents according to the value of each similarity calculation factor and the weight of each similarity calculation factor.
6. The method according to claim 4, characterized in that inferring the document-theme distribution and the theme-word distribution of each case history document with the preset LDA model comprises:
randomly assigning a theme number z to each word in each case history document in the similarity constraint case history set;
Rescan the similarity constraint case history collection, to each word according toResampling master
Topic, the new theme made meet Gibbs Sampling convergence;
counting the theme-word co-occurrence frequency matrix in the corpus, and calculating the document-theme distribution and the theme-word distribution.
7. The method according to claim 5, characterized in that the preset LDA model includes:
the similarity constraint between any two case history documents, expressed by the theme distribution distance $dis(\theta_{r_m}, \theta_{r_n})$, with the formula:
where $\theta_{r_m} = \{\theta_{m,1}, \theta_{m,2}, \ldots, \theta_{m,L_m}\}$ indicates that each case history document includes $L_m$ progress notes; $\theta_{m,L_m}$ denotes the theme of the $L_m$-th progress note; $d(\theta_{m,L_m}, \theta_{n,L_n})$ denotes the Euclidean distance between the theme vectors of two progress notes;
the preset LDA model further includes a Gibbs-EM iteration function:
in which the count term represents the number of occurrences of word i with theme k in the similarity constraint case history set.
8. A diabetic complication analysis device, characterized in that the device comprises:
a case history set obtaining module, configured to obtain a case history document set; the case history document set includes a first quantity of case history documents; each case history document includes at least one progress note;
a vector space obtaining module, configured to obtain the progress note-theme distribution of the at least one progress note and obtain the progress note vector of each case history document;
a classification label obtaining module, configured to obtain the classification labels of the progress note vector;
a discovery model obtaining module, configured to train the disease discovery model using the multiple classification labels of each case history document to obtain the final disease discovery model.
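For illustration only, the weighted similarity of claim 5 and the threshold selection of claim 4 could be sketched as below. Everything concrete here is an assumption: the source names the three factors and the use of weights but does not give the distance definitions or the combining formula, so the 10-year age segments, the Jaccard distance on diagnoses, the weighted sum of (1 - distance), the weights and all identifiers are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Record:
    record_id: str
    gender: str
    age: int
    diagnoses: FrozenSet[str]


def similarity(a: Record, b: Record, weights: Dict[str, float]) -> float:
    """Weighted similarity in [0, 1] built from three assumed distance factors."""
    gender_d = 0.0 if a.gender == b.gender else 1.0
    # Assumed: 10-year age segments, distance capped at 5 segments.
    age_d = min(abs(a.age // 10 - b.age // 10) / 5.0, 1.0)
    union = a.diagnoses | b.diagnoses
    # Assumed: Jaccard distance between the diagnosis sets.
    diag_d = 1.0 - len(a.diagnoses & b.diagnoses) / len(union) if union else 0.0
    distances = {"gender": gender_d, "age": age_d, "diagnosis": diag_d}
    total = sum(weights.values())
    return sum(weights[k] * (1.0 - distances[k]) for k in distances) / total


def constrained_pairs(records: List[Record], weights: Dict[str, float], tau: float) -> List[Tuple[str, str]]:
    """Pairs of records whose similarity is greater than or equal to the threshold tau."""
    return [
        (a.record_id, b.record_id)
        for a, b in combinations(records, 2)
        if similarity(a, b, weights) >= tau
    ]


weights = {"gender": 0.2, "age": 0.3, "diagnosis": 0.5}  # hypothetical weights
r1 = Record("r1", "F", 63, frozenset({"diabetic nephropathy", "diabetic retinopathy"}))
r2 = Record("r2", "F", 68, frozenset({"diabetic nephropathy"}))
r3 = Record("r3", "M", 25, frozenset({"diabetic ketoacidosis"}))
pairs = constrained_pairs([r1, r2, r3], weights, tau=0.5)
```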
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810844798.6A CN109036577B (en) | 2018-07-27 | 2018-07-27 | Diabetes complication analysis method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109036577A true CN109036577A (en) | 2018-12-18 |
CN109036577B CN109036577B (en) | 2021-10-22 |
Family
ID=64646314
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810844798.6A Active CN109036577B (en) | 2018-07-27 | 2018-07-27 | Diabetes complication analysis method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109036577B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN109036577B (en) | 2021-10-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |