CN109036577A

CN109036577A - Diabetic complication analysis method and device

Info

Publication number: CN109036577A
Application number: CN201810844798.6A
Authority: CN
Inventors: 丁帅; 杨善林; 金行; 俞尧
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2018-07-27
Filing date: 2018-07-27
Publication date: 2018-12-18
Anticipated expiration: 2038-07-27
Also published as: CN109036577B

Abstract

The present invention provides a diabetic complication analysis method and device. The method includes: obtaining a case history document set, where the set includes a first quantity of case history documents and each case history document includes at least one progress note; obtaining the progress note-topic distribution of the at least one progress note to obtain the progress note vector of each case history document; obtaining the classification labels of the progress note vector; and training a disease discovery model using the multiple classification labels of each case history document to obtain the final disease discovery model. The invention can thus use the disease development actually recorded in the progress notes to detect complications that were not found at admission diagnosis, which helps improve the diagnostic and treatment accuracy for subsequent patients.

Description

Diabetic complication analysis method and device

Technical field

The present invention relates to the technical field of data mining, and in particular to a diabetic complication analysis method and device.

Background technique

When a patient is admitted to hospital, the doctor makes an admission diagnosis, and the treatment plan for the subsequent stay is based on that diagnosis so that the patient's suffering can be better relieved. Admission diagnosis is therefore a very important task, and one of its key points is disease discovery.

At present, after long-term development and research, disease discovery methods have become an important research direction in fields such as medical data mining. Traditional disease discovery mainly relies on association rule discovery, classification analysis and cluster analysis, and is mainly based on structured data. However, most medical information is stored in the information systems of medical institutions as text in HTML form, and requires complicated data structuring work. In addition, the feature attributes of different diseases are numerous and varied, and the structured data still contains a great deal of noise, which can significantly affect the accuracy of complication discovery. In summary, data structuring and feature engineering greatly increase the preliminary work of traditional medical data mining.

Summary of the invention

In view of the defects in the prior art, the present invention provides a diabetic complication analysis method and device for solving the technical problems existing in the related art.

In a first aspect, an embodiment of the present invention provides a diabetic complication analysis method, the method comprising:

obtaining a case history document set, where the case history document set includes a first quantity of case history documents and each case history document includes at least one progress note;

obtaining the progress note-topic distribution of the at least one progress note to obtain the progress note vector of each case history document;

obtaining the classification labels of the progress note vector;

training a disease discovery model using the multiple classification labels of each case history document to obtain the final disease discovery model.

Optionally, obtaining the progress note-topic distribution of the at least one progress note to obtain the progress note vector of each case history document includes:

obtaining the multidimensional time-series topics of the case history document according to the progress note-topic distribution of the at least one progress note;

performing feature extraction on the multidimensional time-series topics using singular value decomposition, and taking the singular values on the diagonal as the progress note vector of the case history document.

Optionally, obtaining the classification labels of the progress note vector includes:

obtaining the disease set corresponding to the case history document set, where the disease set includes multiple disease labels;

selecting any one disease label from the disease set, and using the BP binary classification method to add the disease label to the progress note vectors that include the disease, to obtain the multiple classification labels corresponding to each case history document;

calculating the similarity between any two case history documents in the case history document set, and obtaining a similarity constraint case history set made up of the case history documents whose similarity is greater than or equal to a similarity threshold;

inputting each case history document in the similarity constraint case history set into a preset LDA model in turn, and inferring the document-topic distribution and topic-word distribution of each case history document with the preset LDA model;

constructing the progress note vector of each case history document according to the document-topic distribution and the topic-word distribution.

Optionally, calculating the similarity between any two case history documents among the initial case histories includes:

obtaining multiple similarity calculation factors of the case histories and the weight of each similarity calculation factor;

calculating the value of any two case history documents for each similarity calculation factor, where the similarity calculation factors include the distance of the gender attribute, the distance of the age segment, and the distance of the diagnostic result;

calculating the similarity of the two case history documents from the value of each similarity calculation factor and the weight of each similarity calculation factor.

Optionally, inferring the document-topic distribution and topic-word distribution of each case history document with the preset LDA model includes:

randomly assigning a topic number z to each word in each case history document in the similarity constraint case history set;

rescanning the similarity constraint case history set and resampling the topic of each word according to the sampling formula, until the newly sampled topics satisfy Gibbs sampling convergence;

counting the topic-word co-occurrence frequency matrix of the corpus, and calculating the document-topic distribution and the topic-word distribution.

Optionally, the preset LDA model includes:

the similarity constraint between any two case history documents is expressed by the topic distribution distance $dis(\theta_r^m, \theta_r^n)$, with the formula:

where $\theta_r^m = \{\theta_{m,1}, \theta_{m,2}, \ldots, \theta_{m,L_m}\}$ indicates that each case history document includes $L_m$ progress notes; $\theta_{m,L_m}$ denotes the topic of the $L_m$-th progress note; and $d(\theta_{m,L_m}, \theta_{n,L_n})$ is the Euclidean distance between the topic vectors of two progress notes;

the preset LDA model further includes a Gibbs-EM iteration function:

where the count term denotes the number of occurrences of word i assigned to topic k in the similarity constraint case history set.

In a second aspect, an embodiment of the present invention provides a diabetic complication analysis device, the device comprising:

a case history set obtaining module, configured to obtain a case history document set, where the case history document set includes a first quantity of case history documents and each case history document includes at least one progress note;

a vector space obtaining module, configured to obtain the progress note-topic distribution of the at least one progress note to obtain the progress note vector of each case history document;

a classification label obtaining module, configured to obtain the classification labels of the progress note vector;

a discovery model obtaining module, configured to train a disease discovery model using the multiple classification labels of each case history document to obtain the final disease discovery model.

As can be seen from the above technical solution, in the embodiment of the present invention a case history document set is obtained, where the set includes a first quantity of case history documents and each case history document includes at least one progress note; the progress note-topic distribution of the at least one progress note is then obtained to give the progress note vector of each case history document; the classification labels of the progress note vector are then obtained; and finally a disease discovery model is trained using the multiple classification labels of each case history document to obtain the final disease discovery model. The invention can thus use the disease development actually recorded in the progress notes to detect complications not found at admission diagnosis, which helps improve the diagnostic and treatment accuracy for subsequent patients.

Detailed description of the invention

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow diagram of the diabetic complication analysis method provided by an embodiment of the present invention;

Fig. 2 shows a progress note in a case history document;

Fig. 3 shows the distribution of the number of diabetic complications in male patients;

Fig. 4 shows the distribution of the number of diabetic complications in female patients;

Fig. 5 is a schematic diagram of the relationship between the number of topics and the similarity constraint index SIM when the similarity threshold is 0.5 and 0.6 respectively;

Fig. 6 is a schematic diagram of the relationship between the number of topics and the similarity constraint index SIM when the similarity threshold is 0.7 and 0.8 respectively;

Fig. 7 shows the relationship between the number of topics and mutual information.

Specific embodiment

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

At present, after long-term development and research, disease discovery methods have become an important research direction in fields such as medical data mining. Traditional disease discovery mainly relies on association rule discovery, classification analysis and cluster analysis, and is mainly based on structured data. However, most medical information is stored in the information systems of medical institutions as text in XML form, and requires complicated data structuring work. In addition, the feature attributes of different diseases are numerous and varied, and the structured data still contains a great deal of noise, which can significantly affect the accuracy of complication discovery. In summary, data structuring and feature engineering greatly increase the preliminary work of traditional medical data mining.

To this end, an embodiment of the present invention provides a diabetic complication analysis method. Fig. 1 is a flow diagram of the diabetic complication analysis method provided by an embodiment of the present invention. Referring to Fig. 1, the diabetic complication analysis method includes:

101. Obtain a case history document set, where the case history document set includes a first quantity of case history documents and each case history document includes at least one progress note;

102. Obtain the progress note-topic distribution of the at least one progress note to obtain the progress note vector of each case history document;

103. Obtain the classification labels of the progress note vector;

104. Train a disease discovery model using the multiple classification labels of each case history document to obtain the final disease discovery model.

Each step of the diabetic complication analysis method is described in detail below with reference to the drawings and embodiments.

First, step 101, obtaining the case history document set, is introduced.

During hospitalization a patient generates various records, such as admission records, discharge records, progress notes and consultation notes. If the similarity between these records were calculated directly, the amount of calculation would increase greatly. For ease of description, the records before processing are referred to as initial case histories in this embodiment.

To balance real-time performance and the amount of calculation, the embodiment of the present invention obtains a first quantity of case history documents to form the case history document set. It will be appreciated that each case history document may include at least one progress note.

The first quantity can be set according to the specific situation, for example 1000 or 10000, and is not limited here.

Second, step 102, obtaining the progress note-topic distribution of the at least one progress note to obtain the progress note vector of each case history document, is introduced.

This embodiment uses a preset LDA model (hereinafter called Medical Record Similarity based Latent Dirichlet Allocation, MRS-LDA) to obtain the progress note-topic distribution of the at least one progress note in each case history document.

Only the similarity of the admission diagnosis part of the case history documents is considered in this embodiment. Similarity here means calculating the distance between any two initial case histories, and constructing the case history similarity constraint can be regarded as collecting the case histories whose pairwise distance is less than some threshold.

In practical applications, an initial case history may also contain multiple complications of a disease; for example, diabetes can lead to a variety of complications, as shown in Table 1.

Table 1 Examples of complications of diabetic patients

Analysis of Table 1 shows that patients of different age groups differ in how diabetes and its complications present. In addition, patients of different age groups differ in their tolerance of medication, so presentation and medication can differ during clinical diagnosis and treatment. The basic information of the patient therefore has to be considered when calculating the similarity of case history documents, and in this embodiment the patient's gender and age are included among the similarity calculation factors of the case history documents.

In one embodiment, the distance of the gender attribute is set to 1 between people of the same gender and to 0 between people of different genders, as shown below:

$$d(sex_i, sex_j) = \begin{cases} 1, & sex_i = sex_j \\ 0, & sex_i \neq sex_j \end{cases}$$

where $sex_i$ and $sex_j$ denote the genders of two different people.

In one embodiment, age is divided into four segments according to the international population age classification: adolescence, 0 to 17 years, denoted 1; youth, 18 to 45 years, denoted 2; middle age, 46 to 59 years, denoted 3; and old age, over 59 years, denoted 4. In this way, this embodiment can calculate the distance between the age segments of two patients, as expressed by the following formula:

where $age_i$ and $age_j$ denote the ages of two different people, and $flag_i$ and $flag_j$ denote the segments to which those ages belong. The closer the two segments are, the smaller the distance; the farther apart they are, the larger the distance.

Since discrete text is used to describe the diagnosis in the initial case histories, this embodiment uses the Jaccard distance to calculate the distance between the diagnostic results of different initial case histories, as shown below:

$$d(dia_i, dia_j) = \frac{|dia_i \cap dia_j|}{|dia_i \cup dia_j|}$$

where $dia_i$ and $dia_j$ denote the Boolean vector spaces of the discharge diagnoses of case history i and case history j; the diseases considered here are mainly diabetic complications.

For example, if $dia_i = \{1, 2, 3\}$ and $dia_j = \{2, 3, 4\}$, then $dia_i \cap dia_j = \{2, 3\}$ and $dia_i \cup dia_j = \{1, 2, 3, 4\}$, so $d(dia_i, dia_j) = 2/4 = 0.5$.

It should be noted that this embodiment only considers the case where the similarity calculation factors include the distance of the gender attribute, the distance of the age segment and the distance of the diagnostic result. When the application scenario of the text topic analysis method changes, the specific composition of the similarity calculation factors can be adjusted accordingly, and the adjusted scheme equally falls within the protection scope of the present application.

After the similarity calculation factors are determined, the weight adjustment parameters μ₁, μ₂, μ₃ are set, and the similarity between any two initial case histories is calculated as follows:

$$sim(T_i, T_j) = \mu_1 \, d(sex_i, sex_j) + \mu_2 \, d(age_i, age_j) + \mu_3 \, d(dia_i, dia_j) \quad (3)$$

$$\mu_1 + \mu_2 + \mu_3 = 1 \quad (4)$$

$$0 \le \mu_1, \mu_2, \mu_3 \le 1 \quad (5)$$

The similarity is compared with the similarity threshold τ, the initial case histories whose similarity is greater than or equal to the threshold are selected, and the similarity constraint case history set formed by them is obtained, denoted $D = \{(T_i, T_j) \mid i, j \in [1, M]\}$.
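Formulas (3) to (5) and the threshold filter can be sketched as follows. This is an illustrative sketch only. The record layout, the weights `mu` and the function names are hypothetical. The age term is an assumption as well: the source formula for it is not reproduced in the text, so the sketch uses a score that is 1 for identical segments and falls linearly to 0 for the most distant ones, which keeps all three terms on the same "larger means more similar" scale as the gender and diagnosis terms.

```python
from itertools import combinations


def age_segment(age: int) -> int:
    """Map an age to one of the four segments (1..4) described in the text."""
    if age <= 17:
        return 1
    if age <= 45:
        return 2
    if age <= 59:  # assumption: middle age covers 46 to 59 years
        return 3
    return 4


def gender_term(sex_i, sex_j) -> float:
    # 1 for the same gender, 0 otherwise, as stated in the text.
    return 1.0 if sex_i == sex_j else 0.0


def age_term(age_i: int, age_j: int) -> float:
    # ASSUMPTION: not the source formula; 1 for equal segments, 0 for segments 1 vs 4.
    return 1.0 - abs(age_segment(age_i) - age_segment(age_j)) / 3.0


def diagnosis_term(dia_i: set, dia_j: set) -> float:
    union = dia_i | dia_j
    return len(dia_i & dia_j) / len(union) if union else 0.0


def similarity(rec_i, rec_j, mu=(0.2, 0.2, 0.6)) -> float:
    """Weighted sum of formula (3); mu must satisfy constraints (4) and (5)."""
    if abs(sum(mu) - 1.0) > 1e-9 or not all(0.0 <= m <= 1.0 for m in mu):
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    mu1, mu2, mu3 = mu
    return (mu1 * gender_term(rec_i["sex"], rec_j["sex"])
            + mu2 * age_term(rec_i["age"], rec_j["age"])
            + mu3 * diagnosis_term(rec_i["dia"], rec_j["dia"]))


def similarity_constrained_pairs(records, tau, mu=(0.2, 0.2, 0.6)):
    """Return index pairs (i, j) whose similarity is at least the threshold tau."""
    return [(i, j) for i, j in combinations(range(len(records)), 2)
            if similarity(records[i], records[j], mu) >= tau]


# Hypothetical records, for illustration only.
records = [
    {"sex": "M", "age": 52, "dia": {1, 2, 3}},
    {"sex": "M", "age": 57, "dia": {2, 3, 4}},
    {"sex": "F", "age": 16, "dia": {5}},
]
print(similarity_constrained_pairs(records, tau=0.5))  # [(0, 1)]
```

Only the first two records pass the filter here: they share gender and age segment and half of their diagnosis codes, giving a similarity of about 0.7.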

In this embodiment, the preset LDA model is obtained by improving the existing LDA model. To help those skilled in the art better understand the preset LDA model, the basic principle of the LDA model is described first:

Latent Dirichlet Allocation (LDA) is a topic model whose purpose is to discover the topics of documents. It has a three-layer structure of documents, topics and words: every document has its own probability distribution over topics, and the words in a document are sampled from the distributions of the different topics, as shown in formula (6):

$$p(\text{word} \mid \text{document}) = \sum_{\text{topic}} p(\text{word} \mid \text{topic}) \, p(\text{topic} \mid \text{document}) \quad (6)$$
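Formula (6) is a mixture over topics, which a small numerical example makes concrete. The following is a minimal sketch with made-up probabilities; the function name `word_given_document` and the two-topic, three-word setup are hypothetical.

```python
def word_given_document(theta_m, phi):
    """Mix topic-word distributions by the document's topic weights.

    theta_m[k] = p(topic k | document), phi[k][w] = p(word w | topic k).
    """
    n_words = len(phi[0])
    return [sum(theta_m[k] * phi[k][w] for k in range(len(theta_m)))
            for w in range(n_words)]


theta_m = [0.75, 0.25]          # document is 75% topic 0, 25% topic 1
phi = [[0.5, 0.5, 0.0],         # topic 0 over a 3-word vocabulary
       [0.0, 0.5, 0.5]]         # topic 1 over the same vocabulary
print(word_given_document(theta_m, phi))  # [0.375, 0.5, 0.125]
```

The result is again a probability distribution over words, since each row of `phi` and the vector `theta_m` sum to 1.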

When the LDA model is used to model case history documents, let the total number of case history documents be M and let the m-th case history document contain $N_m$ clinical description words, each denoted $\omega_{m,n}$. Under the bag-of-words model, the documents and words are represented as a document-topic distribution and a topic-word distribution. A topic can be understood as a general term for a clinical care measure in the case history text, such as medication, observation, symptom or operation. Each case history text is a multinomial distribution over multiple topics; that is, each case history text is composed of multiple steps of the clinical care process.

In the related art, the steps by which the LDA model generates a case history text are shown in Table 2.

It will be appreciated that, since each topic is a multinomial distribution over words, each clinical care step correspondingly comprises multiple clinical operations, and the document-topic distribution and topic-word distribution have Dirichlet priors with parameters α and β. The LDA model can therefore simulate well the thinking process by which a doctor writes a case history text during diagnosis and treatment.

From the above analysis, the purpose of LDA inference is to calculate the unknown parameters of the LDA model from the current document set and then to calculate the topic-word distribution and document-topic distribution from them. In fact, the topic-word distribution and document-topic distribution can be derived directly during the calculation, without computing those parameters explicitly.

In practical applications, there are two parameter inference algorithms for the LDA model: Gibbs sampling and variational EM. Both are described below.

First, the core idea of Gibbs sampling is the Markov chain Monte Carlo (MCMC) method: in each iteration only the value of one dimension is changed, until convergence, when the parameter values to be estimated are output. From Dirichlet parameter estimation, the following can be derived:

where the symbols in the formula denote, respectively, the document-topic distribution, the topic-word distribution, and the probability that a word is assigned to topic k; i is a data pair (m, n) denoting the n-th word in the m-th document.
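For orientation, the standard collapsed Gibbs sampler for plain LDA can be sketched as below. This is the textbook algorithm, not the MRS-LDA variant of this embodiment, and it is offered as an illustration under stated assumptions: documents are lists of integer word ids, the hyperparameters `alpha` and `beta` are symmetric, and a fixed number of sweeps stands in for a convergence check. All names are hypothetical.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA; returns (theta, phi)."""
    rng = random.Random(seed)
    n_mk = [[0] * n_topics for _ in docs]               # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                                # words per topic
    z = []                                              # topic of every word

    # Step 1: assign a random topic number to every word.
    for m, doc in enumerate(docs):
        z_m = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_m.append(k)
            n_mk[m][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_m)

    # Step 2: rescan the corpus and resample the topic of each word.
    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[m][n]
                n_mk[m][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Conditional probability of each topic given all other assignments.
                weights = [(n_mk[m][t] + alpha) * (n_kw[t][w] + beta)
                           / (n_k[t] + vocab_size * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[m][n] = k
                n_mk[m][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Step 3: turn the co-occurrence counts into the two distributions.
    theta = [[(n_mk[m][k] + alpha) / (len(doc) + n_topics * alpha)
              for k in range(n_topics)] for m, doc in enumerate(docs)]
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)
            for w in range(vocab_size)] for k in range(n_topics)]
    return theta, phi


# Toy corpus of word ids, for illustration only.
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 2, 3]]
theta, phi = lda_gibbs(docs, n_topics=2, vocab_size=4, n_iter=50)
print(theta[0], phi[0])
```

Each row of `theta` is a document-topic distribution and each row of `phi` a topic-word distribution; both are read directly off the count matrices, which matches the remark that they can be derived without estimating further parameters.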

Since there are K topics in total, K iterations are needed, using the training steps shown in Table 3:

Second, the variational EM algorithm looks for suitable parameters that maximize the probability of the topic-word distribution observed in the text set, which is similar to a maximum likelihood estimation problem. The variational EM algorithm has two iterative steps:

Variational E-step: since the posterior probability p(w | α, β) is difficult to derive, variational parameters are introduced to obtain an approximate posterior probability distribution.

Variational M-step: the approximate function is maximized according to the variational parameters of the variational E-step. Here the prior Dirichlet distribution parameters (α, β) determine the topic-word distribution and the document-topic distribution θ, w represents words, and z represents topics.

Since the iterative objective of the LDA model is to maximize the word occurrence probability p(Z, W | α, β), it can effectively fit the data characteristics of diabetic progress notes. However, it also causes the topic distributions of similar case histories to differ considerably, so that case histories cannot be effectively analyzed statistically according to their topic distributions.

To establish a topic model that satisfies the case history similarity constraint, this embodiment achieves the goal by changing the convergence condition strategy of Gibbs sampling.

Considering that each case history may contain multiple progress notes ordered by time, the calculation of case history document similarity should consider the similarity between the different progress note sets in the case history documents; that is, the document-topic distributions of the different progress note sets of the case history documents in the similarity constraint case history set D should be as similar as possible.

Let $T_m$ denote the case history numbered m, which includes $L_m$ progress notes, and let the topic set of its progress notes be $\theta_r^m = \{\theta_{m,1}, \theta_{m,2}, \ldots, \theta_{m,L_m}\}$. Given the progress note topic sets $\theta_r^m$ and $\theta_r^n$ of two case history documents, the case history similarity constraint can be calculated as the mean of the pairwise topic distribution distances, as follows:

where $d(\theta_{m,L_m}, \theta_{n,L_n})$ is the Euclidean distance between the topic vectors of two progress notes; the larger $dis(\theta_r^m, \theta_r^n)$ is, the lower the similarity.
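One way to read "the mean of the pairwise topic distribution distances" is sketched below. It assumes that the mean is taken over all $L_m \times L_n$ pairs of progress notes; the exact normalization of the source formula is not reproduced in the text, so this is an assumption, and the function name `topic_set_distance` is hypothetical.

```python
from math import dist


def topic_set_distance(theta_m, theta_n):
    """Mean Euclidean distance over all pairs of progress-note topic vectors.

    theta_m and theta_n are lists of topic vectors, one per progress note.
    ASSUMPTION: the mean runs over all len(theta_m) * len(theta_n) pairs.
    """
    total = sum(dist(a, b) for a in theta_m for b in theta_n)
    return total / (len(theta_m) * len(theta_n))


# Two progress notes in record m, one in record n (two topics each).
rec_m = [[1.0, 0.0], [0.0, 1.0]]
rec_n = [[1.0, 0.0]]
print(topic_set_distance(rec_m, rec_n))  # (0 + sqrt(2)) / 2
```

Because the two records may have different numbers of progress notes, averaging over pairs gives a single number regardless of length of stay.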

The maximization objective function can be modified as:

In this embodiment, LDA model inference is carried out with the Gibbs-EM iterative method, in which the document-topic distribution $\alpha_m$ is replaced with a normal distribution $\mu_m$ to obtain the preset LDA model:

where $\mu_{mk}$ represents the probability that case history document m belongs to topic k. Since $\mu_m$ is assumed to obey the standard normal distribution, the improved maximization objective function is expressed as follows:

In addition, in this embodiment the document-topic distribution $\alpha_m$ is first fixed during sampling, and the Gibbs-EM iteration function is then expressed as:

where the count term represents the number of words i whose topic is k in the similarity constraint case history set. Since the normal distribution replaces the original α, formula (14) can be derived with stochastic gradient descent. The model training process is shown in Table 4:

This completes the construction of the preset LDA (i.e. MRS-LDA) model in the embodiment of the present invention. In the embodiment, the preset LDA model based on the case history similarity constraint is designed on the basis of an analysis of the influence of text mining on medical diagnosis and of the modeling process and inference method of the latent Dirichlet allocation topic model. The preset LDA model not only considers the similarity constraint between different case history documents, but also determines the modeling objective, the inference process and the model evaluation indicators for medical text topics, so that the preset LDA model can clearly reflect the focus of each diagnosis and treatment stage and the evolution of the patient's condition, which helps improve the scientific soundness, validity and accuracy of case history topic mining.

Afterwards, each case history document in the similarity constraint case history set is input into the preset LDA model in turn, and the preset LDA model infers the document-topic distribution and topic-word distribution of the at least one progress note in each case history document, from which the multidimensional time-series topics of each case history document are obtained.

In the embodiment of the present invention, feature extraction is performed on the multidimensional time-series topics using singular value decomposition, which maps them to a subspace in which features are represented by singular values. Unlike eigenvector decomposition, singular value decomposition does not require the decomposed matrix to be square. Suppose there is an m×n matrix A; its singular value decomposition is expressed as:

$$A = U \Sigma V^T \quad (15)$$

where U and V are unitary matrices; U is an M×M matrix and V is an N×N matrix, satisfying $U^T U = I$ and $V^T V = I$; Σ is an M×N matrix with $\Sigma = \{\sigma_1, \ldots, \sigma_r\}$, where r = rank(Σ) is the largest order of a non-zero minor of Σ; all entries off the diagonal are 0, and the singular values on the diagonal $\{\sigma_1, \ldots, \sigma_r\}$ form the feature vector of the progress notes.

Singular value decomposition is used in this embodiment to decompose the time-series topic matrix. Because different patients have different lengths of hospital stay, the topic sequences of different patients' case histories differ in length. However, because of the disease discovery model in this embodiment, the topic dimension of the topic sequences of different case histories is the same, so it is feasible to use singular value decomposition to map the multidimensional time-series topics to a subspace of length r.
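The mapping from a variable-length topic sequence to a fixed-length vector can be sketched with NumPy. This is a minimal sketch under assumptions: each case history is an $L_m \times K$ matrix with one row of topic weights per progress note, and records with fewer notes than the target length are zero-padded. The padding rule and the function name `progress_note_vector` are assumptions, not taken from the source.

```python
import numpy as np


def progress_note_vector(topic_series, length):
    """Return the singular values of an (L_m x K) topic matrix as a fixed-length vector."""
    A = np.asarray(topic_series, dtype=float)
    # Singular values only, in descending order; A does not need to be square.
    sigma = np.linalg.svd(A, compute_uv=False)
    vec = np.zeros(length)
    n = min(length, sigma.size)
    vec[:n] = sigma[:n]  # assumption: zero-pad when there are fewer than `length` values
    return vec


# A short stay (2 notes) and a longer stay (4 notes), both with K = 3 topics.
short_stay = [[3.0, 0.0, 0.0],
              [0.0, 4.0, 0.0]]
long_stay = [[0.6, 0.3, 0.1],
             [0.5, 0.4, 0.1],
             [0.2, 0.7, 0.1],
             [0.1, 0.8, 0.1]]
print(progress_note_vector(short_stay, 3))  # [4. 3. 0.]
print(progress_note_vector(long_stay, 3).shape)  # (3,)
```

Both records end up as vectors of the same length even though the number of progress notes differs, which is the property the classifier in the next step relies on.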

Next, step 103, obtaining the classification labels of the progress note vector, is introduced.

Different patients may suffer from different diseases at the same time, such as the diabetic complications shown in Table 1. The disease discovery model in this embodiment therefore has to take into account that a patient may have several diseases at once, and multi-label classification has to be performed on the case history document of the same patient. Multi-label classification means that a sample has multiple labels at the same time and that the labels may be correlated. For example, a film can be both an art film and a romance film, and art and romance are related to some extent; such a classification problem is called multi-label classification.

In practical applications, there are mainly two approaches to the multi-label classification problem: improving the classifier and model transformation. Improving the classifier means changing the classification algorithm so that it satisfies the multi-label requirement without changing the data structure. Common multi-label classification algorithms include the Boosting algorithm, BP neural networks, decision trees and support vector machines. The advantage of improving the classifier is that it adapts to the data structure, but it usually leads to complicated solution logic and increases algorithm complexity. The purpose of model transformation is to change the data set so that existing single-label classification algorithms can be applied; existing strategies include the BP binary relation method, the RPC pairwise comparison ranking method and the LP label power set method.

This embodiment uses the ECC (Ensembles of Classifier Chains) ensemble classifier chain, which is an improvement on the BP binary relation method; that is, the multi-label classification problem is converted into multiple binary classification problems. First, the disease set corresponding to the case history document set is obtained; the disease set includes multiple disease labels. Then a disease label is selected from the disease set, the case history documents that belong to that disease label are placed in one class, the remaining case history documents are placed in another class, and the label is added to the feature data of the case history documents. In this way the multiple labels of each case history document are obtained.
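The conversion of one multi-label problem into several binary problems can be sketched as follows. This is an illustrative decomposition in the style of the binary relation method named in the text. The record layout, the disease names other than diabetic nephropathy, and the function name `binary_datasets` are hypothetical.

```python
def binary_datasets(records, disease_set):
    """Build one binary data set per disease label.

    records: list of (feature_vector, set_of_disease_labels) pairs.
    Returns {disease: [(feature_vector, 0 or 1), ...]}.
    """
    datasets = {}
    for disease in disease_set:
        datasets[disease] = [
            (features, 1 if disease in labels else 0)  # positive class if diagnosed
            for features, labels in records
        ]
    return datasets


# Hypothetical progress note vectors and discharge diagnoses.
records = [
    ([4.0, 3.0, 0.0], {"nephropathy", "retinopathy"}),
    ([2.5, 1.0, 0.2], {"nephropathy"}),
    ([1.0, 0.9, 0.1], set()),
]
datasets = binary_datasets(records, ["nephropathy", "retinopathy"])
print([label for _, label in datasets["nephropathy"]])   # [1, 1, 0]
print([label for _, label in datasets["retinopathy"]])   # [1, 0, 0]
```

Each disease gets its own positive and negative split over the same case history documents, so any single-label classifier can be trained on each split.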

Finally, step 104, training the disease discovery model using the multiple classification labels of each case history document to obtain the final disease discovery model, is introduced.

In this embodiment, the disease discovery model is trained using the multiple classification labels of each case history document, thereby obtaining the final disease discovery model.

For example, suppose there are 100 case history documents, each containing one or more discharge diagnosis results. Following the BP binary classification idea, a complete disease set is first established and one disease (such as diabetic nephropathy) is taken. For each case history document in the case history document set, the documents whose discharge diagnosis includes that disease are assigned to the positive class and the other case history documents to the negative class; at the same time the other diseases are added to the feature data according to the actual discharge diagnosis, marked 1 if present and 0 otherwise. The set of positive-class case history documents serves as the test set for that disease. Then another disease is chosen as the positive class label and a test set is built again by the same steps, and so on until every disease in the disease set has been used as the positive class to build a test set. Afterwards, algorithms such as k-nearest neighbors, support vector machines and random forests are used for classification training to build multiple classifiers. In the prediction stage, each sample to be predicted starts with no disease assigned, so its disease features are marked 0; after classification by one classifier, the result is written into the disease features used for the next classification (marked 1), until all classifiers have been traversed. In this way, this embodiment obtains the trained disease discovery model.
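The train-then-chain idea in the example above can be sketched with a single classifier chain. This is a simplified illustration, not the embodiment's implementation: it uses the standard classifier chain formulation, in which each classifier sees the features plus the labels of the diseases earlier in the chain, and a hand-written 1-nearest-neighbor classifier stands in for the k-nearest neighbor, support vector machine or random forest classifiers mentioned in the text. A full ECC would train several chains with different label orders and combine their votes. All names and data are hypothetical.

```python
from math import dist


class NearestNeighbour:
    """Minimal 1-nearest-neighbor classifier used as a stand-in base learner."""

    def fit(self, X, y):
        self.X, self.y = [list(row) for row in X], list(y)
        return self

    def predict_one(self, x):
        best = min(range(len(self.X)), key=lambda i: dist(self.X[i], x))
        return self.y[best]


def train_chain(X, Y, order):
    """Train one binary classifier per disease, following the given chain order."""
    chain = []
    for pos, disease in enumerate(order):
        earlier = order[:pos]
        # Features are extended with the true labels of the earlier diseases.
        rows = [list(x) + [1.0 if d in labels else 0.0 for d in earlier]
                for x, labels in zip(X, Y)]
        target = [1 if disease in labels else 0 for labels in Y]
        chain.append(NearestNeighbour().fit(rows, target))
    return chain


def predict_chain(chain, order, x):
    """Predict diseases one by one, feeding each result to the next classifier."""
    row, found = list(x), []
    for clf, disease in zip(chain, order):
        y = clf.predict_one(row)
        if y:
            found.append(disease)
        row.append(float(y))  # the prediction becomes a feature for the next step
    return found


# Hypothetical progress note vectors and discharge diagnoses.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
Y = [{"nephropathy", "retinopathy"}, {"nephropathy", "retinopathy"}, set(), set()]
order = ["nephropathy", "retinopathy"]
chain = train_chain(X, Y, order)
print(predict_chain(chain, order, [0.88, 0.12]))  # ['nephropathy', 'retinopathy']
print(predict_chain(chain, order, [0.10, 0.85]))  # []
```

Passing each prediction forward lets later classifiers exploit correlations between complications, which is the motivation the text gives for multi-label methods over independent binary classifiers.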

In this embodiment, inputting a medical record document into the disease discovery model can reveal diseases that may have been missed in the admission diagnosis, which helps improve diagnostic efficiency.

A comparative experiment is used below to illustrate the effectiveness and superiority of the diabetic complication analysis method provided by the embodiment of the present invention.

This embodiment uses inpatient records from the Department of Endocrinology of the First Affiliated Hospital of Anhui Medical University, comprising a total of 1294 inpatient records of diabetic patients from 2015 to 2017. Each medical record document mainly includes admission records, progress notes (as shown in Fig. 2), consultation records, discharge records, and so on. The ratio of male to female patient medical record documents is 648:646, which is roughly equal.

Referring to Fig. 3 and Fig. 4, this embodiment uses the medical record text of diabetic patients from the Department of Endocrinology of the First Affiliated Hospital of Anhui Medical University as raw data. The number of progress notes in a patient's medical record usually corresponds to the patient's length of stay; the details are shown in Fig. 3.

Considering that different patients are similar in the complications they suffer from and in other respects, the MRS-LDA model of step 102 is used to mine the topic features of the progress notes of different patients, with the number of topics K=15 and the medical record similarity constraint threshold τ=0.5. The experimental results yield multidimensional time-series topic data based on the progress notes, which after singular value decomposition are mapped into a feature space of lower dimension. At the same time, diabetic complication discovery is a multi-label classification problem, so the data set must be processed further before traditional classification methods can be applied. In this embodiment, the binary relevance method is used to process the data into sample data sets suitable for binary classification.
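
The singular value decomposition step can be written out as follows. This is a sketch under stated assumptions: the symbols $\Theta_m$, $U_m$, $\Sigma_m$, $V_m$, $r$ and $v_m$ are introduced here for illustration only, with $\theta_{m,l}$ denoting the K-dimensional topic vector of the $l$-th progress note of medical record $m$ and $L_m$ the number of progress notes in that record.

```latex
\[
\Theta_m =
\begin{bmatrix}
\theta_{m,1}^{\top} \\ \vdots \\ \theta_{m,L_m}^{\top}
\end{bmatrix}
\in \mathbb{R}^{L_m \times K},
\qquad
\Theta_m = U_m \Sigma_m V_m^{\top},
\qquad
\Sigma_m = \operatorname{diag}(\sigma_{m,1}, \ldots, \sigma_{m,r}),
\quad r = \min(L_m, K),
\]
\[
v_m = (\sigma_{m,1}, \ldots, \sigma_{m,r}),
\qquad
\sigma_{m,1} \ge \sigma_{m,2} \ge \cdots \ge \sigma_{m,r} \ge 0 .
\]
```

Under this reading, the progress note vector $v_m$ has at most K entries regardless of how many progress notes the record contains. How vectors of different length $r$ are aligned across records (for example by zero-padding to length K) is not stated in the text and would be a further assumption.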

Different classifiers perform differently at diabetic complication discovery under different topic mining methods. To find classifiers suited to topic-based diabetic complication discovery, k-nearest neighbors, random forest, logistic regression, support vector machine, and other algorithms are selected for classification training. In this experiment the topic number parameter K is adjusted to 15, and each classification algorithm is run in classification experiments with both the traditional LDA model and the model proposed here. The experimental results are as follows.

Fig. 4(a) and Fig. 4(b) show that, when medical record documents are classified with the different classification models, the average classification accuracy fluctuates slightly as the number of topics increases and eventually settles at 0.8 to 0.82. When the number of topics is 7, the average classification accuracy changes markedly and rises sharply, reaching a maximum of 0.948, whereas the highest accuracy of the traditional method is 0.9. It can be seen that the ensemble classifier chain of this embodiment outperforms the traditional LDA model in average classification accuracy, and that the support vector machine and logistic regression models also perform better in classification.

Fig. 5(a) and Fig. 5(b) show how, when medical record documents are classified with the different classification models, the average classification specificity fluctuates as the number of topics increases, with a relatively sharp rise when the number of topics is 15. The ensemble classifier chain of this embodiment outperforms the traditional LDA model in average classification specificity, which can reach 1, and the support vector machine and logistic regression models also perform better in complication discovery classification.

Fig. 6(a) and Fig. 6(b) show how the average classification sensitivity of the ensemble classifier chain of this embodiment and of the LDA model changes as the number of topics increases. The ensemble classifier chain of this embodiment performs better than the traditional LDA model on the metrics of average accuracy, average specificity, and average sensitivity. This is because the similar-record constraint is taken into account when the document-topic distribution is computed, which brings medical records with similar diagnostic results close together in document-topic distribution, so that they are assigned the same classification label during classifier training.
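
The three metrics compared above can be computed from the per-disease confusion counts. The sketch below is illustrative only: it assumes that "average" means the unweighted mean over the binary problems of the individual diseases, which the text does not state explicitly, and the names `binary_metrics` and `average_metrics` as well as the label lists are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for one disease (labels are 0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


def average_metrics(per_label):
    """Unweighted mean of each metric over all diseases (an assumption)."""
    names = ("accuracy", "sensitivity", "specificity")
    return {n: sum(m[n] for m in per_label) / len(per_label) for n in names}


# Invented labels for two diseases over five records.
per_label = [
    binary_metrics([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]),
    binary_metrics([0, 1, 0, 1, 0], [0, 1, 0, 1, 0]),
]
averages = average_metrics(per_label)
```

Sensitivity measures how many records with a complication are found, and specificity how many records without it are correctly left unflagged, which is why the text reports both alongside accuracy.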

Through the comparative experiment, this embodiment explains the practical significance of topic-based diabetic complication discovery in terms of features such as the inaccuracy of diabetic patients' admission diagnoses and the completeness of their discharge diagnoses. It also specifies the data feature extraction method of the multidimensional time-series topic model and uses the binary relevance method to simplify the experimental scheme for discovering multiple diabetic complications. The approach can make effective use of the disease progression and the clinical diagnosis and treatment data actually recorded in the progress notes, detects well the complications not found in the admission diagnosis, and confirms the scientific validity and importance of topic-based diabetic complication discovery.

In a second aspect, an embodiment of the present invention provides a diabetic complication analysis device. Referring to Fig. 7, the device comprises:

A medical record set acquisition module 701, configured to obtain a medical record document set; the medical record document set includes a first number of medical record documents; each medical record document includes at least one progress note;

A vector space acquisition module 702, configured to obtain the progress note-topic distribution of the at least one progress note and to obtain the progress note vector of each medical record document;

A classification label acquisition module 703, configured to obtain the classification labels of the progress note vector;

A discovery model acquisition module 704, configured to train the disease discovery model using the multiple classification labels of each medical record to obtain the final disease discovery model.
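
The data flow between the four modules can be sketched as a small pipeline. This is a hypothetical arrangement for illustration, not the structure of the claimed device: the class name `ComplicationAnalysisDevice`, its field names, and the stub functions passed to it are all assumptions, and each stub merely stands in for the processing performed by modules 701 to 704.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ComplicationAnalysisDevice:
    acquire_records: Callable[[], List[dict]]                   # module 701
    to_vectors: Callable[[List[dict]], List[List[float]]]       # module 702
    to_labels: Callable[[List[dict], List[List[float]]], List[List[int]]]  # module 703
    train_model: Callable[[List[List[float]], List[List[int]]], object]    # module 704

    def run(self):
        """Run the four modules in order and return the trained model."""
        records = self.acquire_records()
        vectors = self.to_vectors(records)
        labels = self.to_labels(records, vectors)
        return self.train_model(vectors, labels)


# Stub modules with invented behaviour, used only to show the wiring.
device = ComplicationAnalysisDevice(
    acquire_records=lambda: [{"notes": ["n1", "n2"], "diagnoses": ["A"]}],
    to_vectors=lambda records: [[float(len(r["notes"]))] for r in records],
    to_labels=lambda records, vectors: [
        [1 if "A" in r["diagnoses"] else 0] for r in records
    ],
    train_model=lambda vectors, labels: {"n_samples": len(vectors), "labels": labels},
)
model = device.run()
```

Passing the modules in as callables keeps each one replaceable, which mirrors the one-to-one correspondence between the device modules and the method steps.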

It should be noted that the diabetic complication analysis device provided by the embodiment of the present invention corresponds one-to-one with the method above. The implementation details of the method apply equally to the device, so the device is not described in further detail here.

Numerous specific details are set forth in the specification of the present invention. It is to be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and shall all be covered by the claims and the description of the present invention.

Claims

1. A diabetic complication analysis method, characterized in that the method comprises:

Obtaining a medical record document set; the medical record document set includes a first number of medical record documents; each medical record document includes at least one progress note;

Obtaining the progress note-topic distribution of the at least one progress note, and obtaining the progress note vector of each medical record document;

Obtaining the classification labels of the progress note vector;

Training a disease discovery model using the multiple classification labels of each medical record to obtain a final disease discovery model.

2. The method according to claim 1, characterized in that obtaining the progress note-topic distribution of the at least one progress note and obtaining the progress note vector of each medical record document comprises:

Obtaining the multidimensional time-series topics of the medical record document according to the progress note-topic distribution of the at least one progress note;

Performing feature extraction on the multidimensional time-series topics using singular value decomposition, and taking the singular value parameters at the diagonal positions as the progress note vector of the medical record document.

3. The method according to claim 1, characterized in that obtaining the classification labels of the progress note vector comprises:

Selecting any disease label from the disease set, and using the BP binary classification method to add the disease label to the progress note vectors that include the disease.

4. The method according to claim 1, characterized in that obtaining the progress note-topic distribution of the at least one progress note and obtaining the progress note vector of each medical record document comprises:

Calculating the similarity between any two medical record documents in the medical record document set, and obtaining a similarity-constrained medical record set composed of the medical record documents whose similarity is greater than or equal to a similarity threshold;

Sequentially inputting each medical record document in the similarity-constrained medical record set into a preset LDA model, and inferring the document-topic distribution and the topic-word distribution of each medical record document through the preset LDA model;

5. The method according to claim 4, characterized in that calculating the similarity between any two medical record documents in the initial medical records comprises:

Calculating, for any two medical record documents, the value of each similarity calculation factor; the similarity calculation factors include: the distance between gender attributes, the distance between the age segments to which the patients belong, and the distance between diagnostic results;

Calculating the similarity of the two medical record documents according to the value of each similarity calculation factor and the weight of each similarity calculation factor.

6. The method according to claim 4, characterized in that inferring the document-topic distribution and the topic-word distribution of each medical record document through the preset LDA model comprises:

Rescanning the similarity-constrained medical record set and resampling the topic of each word according to [formula], so that the newly obtained topics satisfy Gibbs sampling convergence;

Counting the topic-word co-occurrence frequency matrix in the corpus, and calculating the document-topic distribution and the topic-word distribution.

7. The method according to claim 5, characterized in that the preset LDA model comprises:

where θ r^m = {θ_{m,1}, θ_{m,2}, ..., θ_{m,L_m}} indicates that each medical record document contains L_m progress notes; θ_{m,L_m} denotes the topic of the L_m-th progress note; d(θ_{m,L_m}, θ_{n,L_n}) denotes the Euclidean distance between the topic vectors of two progress notes;

8. A diabetic complication analysis device, characterized in that the device comprises:

A medical record set acquisition module, configured to obtain a medical record document set; the medical record document set includes a first number of medical record documents; each medical record document includes at least one progress note;

A vector space acquisition module, configured to obtain the progress note-topic distribution of the at least one progress note and to obtain the progress note vector of each medical record document;

A classification label acquisition module, configured to obtain the classification labels of the progress note vector;

A discovery model acquisition module, configured to train a disease discovery model using the multiple classification labels of each medical record to obtain a final disease discovery model.