WO2024027438A1

WO2024027438A1 - Personalized state-space progression model-based assisted decision-making system for disease

Info

Publication number: WO2024027438A1
Application number: PCT/CN2023/105550
Authority: WO
Inventors: 田雨; 李劲松; 周天舒; 潘昌荣
Original assignee: 浙江大学
Priority date: 2022-08-01
Filing date: 2023-07-03
Publication date: 2024-02-08
Also published as: CN115019960A; CN115019960B

Abstract

Disclosed is a personalized state-space progression model-based assisted decision-making system for a disease. According to the present invention, patient clustering and disease progression trajectory identification are nested together, and are updated and iterated until convergence to obtain a personalized state-space progression model, and by iteratively clustering disease progression trajectories by means of establishing a model, patient data within the same category is fully utilized. Patients are grouped into a plurality of subtypes during the mining of the disease progression trajectories, and as the number of patients within a subtype increases, a disease progression trajectory of said subtype is continuously revised. Finally, a future disease progression of a patient is predicted on the basis of the personalized state-space progression model, thereby helping a clinical doctor to perform assisted decision-making. According to the present invention, a state-space model is used to focus on a potential state-space of a disease, so that a hidden state of the disease is effectively explained, thus providing an understandable disease progression model.

Description

A disease assisted decision-making system based on a personalized state-space progression model

Technical field

The invention relates to the field of medical information technology, and in particular to a disease assisted decision-making system based on a personalized state-space progression model.

Background art

Chronic diseases such as cardiovascular disease, diabetes, and Parkinson's disease progress slowly over a long course and often require lifelong treatment, placing a heavy burden on patients, medical staff, and the healthcare system. The emergence of modern electronic health records (EHRs) provides an opportunity to build disease progression models that can predict disease trajectories at the individual level and extract understandable, actionable representations of disease dynamics.

Building a disease progression model involves three main tasks: disease trajectory modeling, disease progression prediction, and explaining disease heterogeneity. Disease trajectory modeling aims to mine how a disease changes over time. Disease progression prediction aims to predict future changes in a patient's disease characteristics and to estimate mortality, readmission rates, adverse drug reactions, and so on. Explaining disease heterogeneity is a major difficulty in building a disease progression model. The main cause of this heterogeneity is that patients exhibit phenotypic heterogeneity (at the individual level) and are also at different stages of a dynamic disease process. As a result, disease biomarkers, including imaging volumetric measurements, tumor-specific protein levels, and behavioral assessment scores, differ significantly across patients and across disease stages. Building a disease progression model that separates phenotypic and temporal heterogeneity at the same time remains a challenge.

The prior art predicts disease progression for a specific patient based on the patient's characteristics, and then performs optimization based on the predicted progression to determine the best type and timing of therapy for that patient. The main steps are: 1. obtain time-series data; 2. build a dynamic system (that is, trajectory modeling, using dynamic system identification algorithms, neural networks, etc.); 3. cluster the trajectories; 4. predict. The prior art has the following shortcomings:

1. Ordinary recurrent neural networks (RNNs, etc.) used to predict disease progression are black-box models. They do not attend to the latent state space of the disease, cannot explain the hidden state of the disease, cannot provide an understandable disease progression model, and are limited to predicting a particular target. Although an RNN also has a hidden state, that state cannot be mapped to a clinically meaningful disease state.

2. The classic hidden Markov model is an oversimplified, memoryless probabilistic model: the state at the next time node depends only on the state at the current time node and is independent of the states at earlier times. It cannot correctly account for the heterogeneity in patients' disease progression trajectories caused by different clinical histories or clinical events.

3. A framework that performs trajectory modeling first and trajectory clustering afterwards cannot make full use of the data within the same subgroup during trajectory modeling (because the categories are not yet known), and the clustering is unreliable when outliers occur. Such a framework also cannot determine the number of clusters.

Summary of the invention

The purpose of the present invention is to address the shortcomings of the prior art by proposing a disease assisted decision-making system based on a personalized state-space progression model that can accomplish the three tasks of a disease progression model: disease trajectory modeling, disease progression prediction, and explaining disease heterogeneity. The goal of the invention is to divide patients into several classes that share the same disease trajectory, while obtaining both the patient subtypes and the hidden disease states. The invention uses a state-space model (a deep probabilistic model) for disease trajectory modeling, which can be mapped to clinically meaningful disease states and can predict disease progression. The invention divides patients into classes with the same disease trajectory based on the Chinese restaurant process. Patient clustering and trajectory modeling are performed simultaneously and continuously correct each other, so the number of classes does not need to be specified in advance.

The object of the present invention is achieved through the following technical solution: a disease assisted decision-making system based on a personalized state-space progression model, comprising a data acquisition module, a personalized state-space progression model module, and an assisted decision-making module;

The data acquisition module is used to obtain patients' electronic medical records, namely the baseline data obtained at the first visit and the follow-up record data obtained at subsequent visits;

The personalized state-space progression model module is used to nest patient clustering and disease progression trajectory identification together, updating and iterating until convergence to obtain the personalized state-space progression model; it includes a patient heterogeneity explanation submodule and a disease progression model construction submodule;

The patient heterogeneity explanation submodule is used to cluster patients with similar disease progression trajectories into the same subtype;

The disease progression model construction submodule is used to model the disease progression trajectory by constructing a state space, in which the state variable is the patient's health state and the observed variable is the patient's follow-up record data; the emission distribution of the disease progression model is obtained from the probability distribution of the observed variable given the patient's health state, and the state transition distribution of the disease progression model is obtained from the observed variables and state variables at all visits before the current time;

The assisted decision-making module is used to predict the patient's future disease progression based on the personalized state-space progression model and to help clinicians with assisted decision-making.

Further, the follow-up record data include demographic data, biomarkers, and clinical event information.

Further, the data obtained by the data acquisition module are preprocessed before being input into the personalized state-space progression model module; the preprocessing includes feature screening, time-series alignment, missing-value filling, and data standardization.

Further, the patient heterogeneity explanation submodule divides all patients into several subtypes. If a new patient is assigned to an existing subtype, the disease progression model parameters of that subtype are updated. Assuming that the disease progression model parameters follow a Gaussian distribution within each subtype, a Monte Carlo sampling method is used to calculate the probability that each patient belongs to a given subtype, which completes the patient clustering.

Further, the patient's follow-up record data are the observed variables of the state space and include continuous variables and categorical variables. Continuous variables are assumed to follow a Gaussian distribution and categorical variables a Bernoulli distribution; the emission distribution of the disease progression model is obtained from these two types of observed variables.

Further, the state transition distribution is obtained from weights computed by an attention mechanism. The attention weights model the influence of past states on future states through linear dynamics. The attention mechanism maps the patient's observed variables at each time step to a set of attention weights; the state transition distribution at time t in the hidden Markov model is multiplied by the mapped attention weights and summed to give the state transition distribution of the disease progression model.

Further, the attention mechanism is implemented by a sequence-to-sequence (Seq2Seq) model that uses an LSTM encoder-decoder architecture. The patient's observed variables at each time step are input into the LSTM encoder, and the final state and final output of the LSTM encoder are passed together to the LSTM decoder. In the LSTM decoder, the final state of the LSTM encoder is used as the initial state of the LSTM decoder, and the final output of the LSTM decoder is used as the input of the Seq2Seq model at the next time step. After the decoding iteration at time t-1, the attention weights for all times before t are collected through a softmax output layer.

Further, variational inference is used to obtain the posterior distribution of the disease progression model parameters, in order to learn those parameters and estimate the patient's real-time health state.

Further, variational inference is carried out by maximizing the evidence lower bound, which turns the posterior inference problem for the disease progression model parameters into an optimization problem; the model parameters of the optimization problem are learned using stochastic gradient descent.

Further, assisted decision-making includes predicting future changes in the patient's indicators based on the personalized state-space progression model, predicting the risk of future disease occurrence, recurrence, or death, and assisting clinicians in subtyping new patients so that patients of different subtypes receive different symptomatic treatments.

Beneficial effects of the present invention:

1) Instead of mining disease progression trajectories with a black-box neural network model that lacks interpretability, the invention uses a deep probabilistic model (a state-space model) that focuses on the latent state space of the disease, effectively explains the hidden state of the disease, and provides an understandable disease progression model.

2) The invention adopts an iterative clustering method that makes full use of the patient data within the same class: patients are divided into several subtypes while the disease progression trajectories are mined, and the disease progression trajectory of a subtype is continuously revised as the number of patients in that subtype grows.

3) The invention can directly estimate the optimal number of clusters based on the personalized state-space progression model.

Brief description of the drawings

Figure 1 is a block diagram of the overall system provided by an embodiment of the present invention.

Figure 2 is a schematic diagram of the process of constructing the personalized state-space progression model provided by an embodiment of the present invention.

Figure 3 is a structural diagram of the disease progression model provided by an embodiment of the present invention.

Detailed description

The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

As shown in Figure 1, the disease assisted decision-making system based on a personalized state-space progression model provided by the invention comprises three parts: a data acquisition module, a personalized state-space progression model module, and an assisted decision-making module.

The data acquisition module is used to obtain real-world electronic medical records. The set of longitudinal follow-up record data collected for all patients over multiple hospital visits is denoted $X$, and the follow-up record data of patient $s$ is denoted $x^{(s)} = \{x_t^{(s)}\}_{t=1}^{T}$, where $x_t^{(s)}$ is the follow-up record of the $t$-th visit of patient $s$ and $T$ is the total number of visits of patient $s$. A follow-up record $x_t^{(s)}$ is a multidimensional vector containing demographic data (age, gender, family history, etc.), biomarkers (various measurements at the cell, tissue, organ, and system levels), and clinical event information (ICD-10 diagnosis codes, treatments, etc.).

Specifically, taking Parkinson's disease as an example, the collected data set includes the baseline data obtained at the first visit and the follow-up record data obtained at subsequent visits. The follow-up record data include, for each visit, imaging report data obtained through MRI or PET/CT, periodic medication records, and on/off-period scale scores obtained through physician consultation or other means (Unified Parkinson's Disease Rating Scale, Epworth Sleepiness Scale, Geriatric Depression Scale, Montreal Cognitive Assessment, impulsive-compulsive disorders questionnaire, autonomic scale, Hoehn-Yahr staging scale, etc.). In addition to the follow-up examination items above, the baseline data also include demographic information such as the patient's age, gender, family history of Parkinson's disease, side of onset, and years of education.

The data obtained by the data acquisition module are preprocessed before being input into the personalized state-space progression model module. The preprocessing mainly includes feature screening, time-series alignment, missing-value filling, and data standardization.

Feature screening: because the included features are high-dimensional, feature screening is needed to reduce feature redundancy and exclude noise. Dimensionality reduction methods such as principal component analysis and latent variable models can be used to extract the useful information in the features.

Time-series alignment: follow-up frequency differs between patients, so the time series of all patients must be aligned to the same frequency. Taking Parkinson's disease as an example, patients are generally followed up every three months in the first year after diagnosis, every six months in the second year, and once a year thereafter.

Missing-value filling: a large number of missing values appear after time-series alignment; they can be filled by forward interpolation.

Data standardization: to bring all features to the same scale, Z-score standardization or range (min-max) standardization can be used.
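The missing-value filling and Z-score standardization steps can be sketched as follows. This is a minimal illustration in plain Python, not the patent's implementation: it assumes that each feature is a list already aligned to a common visit grid, that `None` marks a visit without a measurement, and that forward interpolation means carrying the last observed value forward. The function names and sample values are hypothetical.

```python
import math

def forward_fill(series):
    """Replace each None with the most recent observed value (stays None before the first observation)."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def z_score(series):
    """Standardize observed values to zero mean and unit (population) variance; None entries are kept."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    std = math.sqrt(sum((v - mean) ** 2 for v in observed) / len(observed))
    if std == 0.0:
        # A constant feature carries no information; map it to zero.
        return [None if v is None else 0.0 for v in series]
    return [None if v is None else (v - mean) / std for v in series]

# Hypothetical scale score of one patient on an aligned visit grid.
raw = [20.0, None, 24.0, None, None, 30.0]
filled = forward_fill(raw)
standardized = z_score(filled)
```

In practice the mean and standard deviation would usually be computed per feature over the whole cohort rather than per patient; the per-series version above only keeps the sketch short.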

The personalized state-space progression model module includes two parts, a patient heterogeneity explanation submodule and a disease progression model construction submodule, as shown in Figure 2.

In general, the two processes of patient clustering and disease progression trajectory identification are nested together and are updated and iterated until convergence. Specifically, starting from an empty clustering, the first patient is assigned to the first subtype with a random probability, and the disease progression model of that subtype is obtained. Building on this subtype, the cluster assignment of each subsequent patient depends on the similarity of disease progression trajectories; if a new patient is assigned to an existing subtype, the disease progression model parameters of that subtype are updated. The process ends once all patients have been clustered.

The patient heterogeneity explanation submodule is used to cluster patients with similar or identical disease progression trajectories into the same subtype. The invention uses the Chinese Restaurant Process (CRP) as the framework for constructing the personalized state-space progression model. The CRP is a discrete-event stochastic process obtained as an extension of the Dirichlet process. In this process, the probability that a customer sits at a given table is computed from the number of customers already seated at that table. The invention treats patient $s$ as a customer, with $N$ customers in total; $c_s$ denotes the table chosen by the $s$-th customer, and $n_k$ denotes the number of customers seated at the $k$-th table. Assuming that the first $s-1$ customers occupy $K$ tables in total, the probability of the $s$-th customer choosing a table can be described as:

$$P(c_s = k \mid c_1, \ldots, c_{s-1}) = \begin{cases} \dfrac{n_k}{s-1+a}, & k \le K \\[6pt] \dfrac{a}{s-1+a}, & k = K+1 \end{cases}$$
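The table-choice rule of the Chinese restaurant process can be sketched in a few lines. This is an illustrative helper under the standard CRP definition, in which an occupied table is chosen with probability proportional to its occupancy and a new table with probability proportional to the parameter a; the function name `crp_table_probabilities` is hypothetical and not taken from the patent.

```python
def crp_table_probabilities(table_counts, a):
    """Return the CRP probabilities for the next customer.

    table_counts[k] is n_k, the number of customers already seated at table k.
    The last entry of the result is the probability of opening a new table.
    """
    seated = sum(table_counts)      # the s - 1 customers seated so far
    denom = seated + a
    probs = [n_k / denom for n_k in table_counts]
    probs.append(a / denom)         # new table
    return probs

# Four patients already clustered into two subtypes of sizes 3 and 1.
probs = crp_table_probabilities([3, 1], a=1.0)
```

Because the new-table probability never reaches zero, the number of subtypes does not have to be fixed in advance, which matches the motivation stated in the summary.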

where $a$ is a given parameter. In the present invention, patients assigned to the same table share the same disease progression trajectory. Therefore, the probability that the $s$-th patient is assigned to the $k$-th subtype can be described as:

$$P(c_s = k \mid x^{(s)}, c_1, \ldots, c_{s-1}) \propto P(c_s = k \mid c_1, \ldots, c_{s-1})\, p(x^{(s)} \mid \theta^{(k)})$$

where $p(x^{(s)} \mid \theta^{(k)})$, the likelihood of the follow-up records $x^{(s)}$ of patient $s$, is obtained from the disease progression model and carries the disease progression trajectory information. By estimating $c_s$ for every patient, all patients can be divided into $q$ subtypes. Let $\theta$ be the set of all parameters of the disease progression model. Since the disease progression trajectories of different patients within the same subtype are the same or similar, let $\theta^{(k)}$ be the parameter set of the disease progression model of the $k$-th subtype. Considering the differences between subtypes and the similarity of patients within a subtype, the parameters are assumed to follow a Gaussian distribution within each subtype, that is,

$$\theta^{(k)} \sim \mathcal{N}(\mu_k, \sigma_k^2)$$

where $\mathcal{N}$ denotes the Gaussian distribution and $\mu_k$ and $\sigma_k^2$ denote the mean and variance over all patient data in the $k$-th subtype. The probability that each patient belongs to the $k$-th subtype can therefore be calculated as:

$$P(c_s = k \mid x^{(s)}, c_1, \ldots, c_{s-1}) \propto P(c_s = k \mid c_1, \ldots, c_{s-1}) \int p(x^{(s)} \mid \theta)\, \mathcal{N}(\theta;\, \mu_k, \sigma_k^2)\, d\theta$$

Since the integral above has no analytical solution, it can be computed with methods such as Monte Carlo sampling.
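A Monte Carlo estimate of such an integral can be sketched as follows: draw parameter samples from the subtype's Gaussian distribution and average the model likelihood over them. Everything concrete here is an assumption made for illustration: the parameter is a single scalar, `toy_likelihood` is a Gaussian stand-in for the disease progression model likelihood, and the subtype means, standard deviations, and counts are invented. The new-subtype term of the CRP is omitted for brevity.

```python
import math
import random

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def mc_marginal_likelihood(x, mu_k, sigma_k, likelihood, n_samples=5000, rng=None):
    """Estimate the integral of likelihood(x, theta) * N(theta; mu_k, sigma_k^2) by sampling theta."""
    if rng is None:
        rng = random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        theta = rng.gauss(mu_k, sigma_k)   # theta ~ N(mu_k, sigma_k^2)
        total += likelihood(x, theta)
    return total / n_samples

def toy_likelihood(x, theta):
    # Hypothetical stand-in for the disease progression model likelihood.
    return gaussian_pdf(x, theta, 1.0)

# Two hypothetical subtypes, each described by (mu_k, sigma_k) and a patient count n_k.
subtypes = [(0.0, 1.0), (4.0, 1.0)]
counts = [5, 5]
x = 0.5
rng = random.Random(0)
weights = [
    n_k * mc_marginal_likelihood(x, mu_k, sigma_k, toy_likelihood, rng=rng)
    for n_k, (mu_k, sigma_k) in zip(counts, subtypes)
]
posterior = [w / sum(weights) for w in weights]   # normalized over existing subtypes only
```

With these toy numbers the patient is assigned mostly to the first subtype, whose parameter distribution is centered near the observation.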

The disease progression model construction submodule models the progression trajectory of the target disease by constructing a state space. The $t$-th visit of a patient is treated as a time step; that is, the patient's health state at time $t$ is assumed to be $z_t \in Z$, where the state variable is reflected in the follow-up record data $x_t$, and $x_t$ denotes the follow-up record data of all patients at time $t$. As shown in Figure 3, the state space is the discrete set of all possible disease stages, $Z = \{1, \ldots, M\}$, where $M$ denotes the $M$-th disease stage. In general, the stages of the disease course correspond to different disease phenotypes. For example, the progression of Alzheimer's disease is generally divided into seven stages, each corresponding to a different degree of cognitive decline and dementia symptoms. Because the patient's true health state is unknown, $z_t$ is assumed to be a hidden state that is learned without supervision. The disease progression model is represented as the joint distribution of the state and observed variables,

$$p_\theta(\vec{x}_T, \vec{z}_T) = \prod_{t=1}^{T} p_\theta(x_t \mid z_t)\, p_\theta(z_t \mid \vec{x}_{t-1}, \vec{z}_{t-1})$$

with $\vec{x}_t = (x_1, \ldots, x_t)$ and $\vec{z}_t = (z_1, \ldots, z_t)$,

where $\vec{z}_T = (z_1, \ldots, z_T)$ denotes the set of all patient state variables. The emission distribution $p_\theta(x_t \mid z_t)$ of the disease progression model is the probability distribution of the observed variable (the follow-up record) $x_t$ at time $t$ when the patient's health state is $z_t$. The observed variables contain both continuous variables $x_t^{c}$ (such as biomarkers and age) and categorical variables $x_t^{b}$ (such as clinical events and ICD-10 codes). To cover both types of observed variables, the emission distribution is decomposed as:

$$p_\theta(x_t \mid z_t) = p_\theta(x_t^{c} \mid z_t)\, p_\theta(x_t^{b} \mid z_t)$$

where

$$x_t^{c} \mid z_t \sim \mathcal{N}(\mu_{z_t}, \Sigma_{z_t}), \qquad x_t^{b} \mid z_t \sim \mathrm{Bernoulli}(\pi_{z_t})$$

That is, the continuous variables are assumed to follow a Gaussian distribution, with $\mu_{z_t}$ and $\Sigma_{z_t}$ denoting the mean and variance when the patient's health state is $z_t$. The categorical variables are assumed to follow a Bernoulli distribution; a logistic distribution is used to express the probability $\pi_{z_t}$ that the Bernoulli variable takes the value 1, with a scale parameter that describes the spread of the logistic distribution when the patient's health state is $z_t$.
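A mixed emission distribution of this kind can be evaluated as a sum of log terms, one per observed feature. The sketch below is an illustration under simplifying assumptions that are not stated in the patent: features are treated as conditionally independent given the state, each continuous feature has its own scalar mean and variance, and the Bernoulli probability is supplied directly instead of being derived from a logistic distribution. The parameter layout and all numeric values are hypothetical.

```python
import math

def gaussian_log_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def bernoulli_log_pmf(b, p):
    return math.log(p) if b == 1 else math.log(1.0 - p)

def emission_log_prob(x_cont, x_cat, state_params):
    """log p(x_t | z_t) as a sum of independent Gaussian and Bernoulli log terms."""
    log_p = 0.0
    for x, (mean, var) in zip(x_cont, state_params["gaussian"]):
        log_p += gaussian_log_pdf(x, mean, var)
    for b, p in zip(x_cat, state_params["bernoulli"]):
        log_p += bernoulli_log_pmf(b, p)
    return log_p

# Hypothetical per-state parameters: one continuous feature and one binary feature.
params_by_state = {
    0: {"gaussian": [(0.0, 1.0)], "bernoulli": [0.1]},
    1: {"gaussian": [(2.0, 1.0)], "bernoulli": [0.7]},
}
x_cont, x_cat = [1.8], [1]
log_probs = {z: emission_log_prob(x_cont, x_cat, p) for z, p in params_by_state.items()}
```

Here the record is far more likely under state 1, which is how the emission distribution ties an unobserved disease stage to measurable follow-up data.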

The state transition distribution of the disease progression model, $p_\theta(z_t \mid \vec{x}_{t-1}, \vec{z}_{t-1})$ with $\vec{x}_{t-1} = (x_1, \ldots, x_{t-1})$ and $\vec{z}_{t-1} = (z_1, \ldots, z_{t-1})$, expresses that the state distribution at time $t$ is determined by the observed variables and state variables at all previous times. This state transition distribution could use the transition matrix of a hidden Markov model, in which the state at the current time depends only on the state at the previous time. The drawback of hidden Markov models (and their variants) is that they are memoryless, so the heterogeneity of patients' disease progression trajectories cannot be explained correctly. The invention instead obtains the state transition distribution from weights computed by an attention mechanism:

$$p_\theta(z_t \mid \vec{x}_{t-1}, \vec{z}_{t-1}) = p_\theta(z_t \mid \alpha_t, \vec{z}_{t-1})$$

where $\alpha_t = (\alpha_1^{t}, \ldots, \alpha_{t-1}^{t})$ are the attention weights of the state variable at time $t$:

$$\sum_{t'=1}^{t-1} \alpha_{t'}^{t} = 1, \qquad \alpha_{t'}^{t} \in [0, 1]$$

The attention weights model the influence of past states on future states through linear dynamics, so the state transition distribution can be expressed as:

$$p_\theta(z_t \mid \alpha_t, \vec{z}_{t-1}) = \sum_{t'=1}^{t-1} \alpha_{t'}^{t}\, P(z_{t'}, z_t)$$

where $P(z_{t'}, z_t)$ is the state transition distribution of the hidden Markov model; it is multiplied by the $t-1$ weights and summed to give the state transition distribution of the disease progression model. The attention mechanism $A$ assigns the attention weights of all states before time $t$, that is, $\alpha_t = A_t(\vec{x}_{t-1})$, where $\vec{x}_{t-1}$ denotes the set of all of the patient's observed variables before time $t$.
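The attention-weighted sum of hidden Markov transition rows can be sketched as follows. The transition matrix, the past state sequence, and the attention weights are hypothetical values chosen for illustration, and states are indexed from 0 in the code although the text numbers disease stages from 1.

```python
def attentive_transition(past_states, attention, P):
    """Distribution over the next state: the sum over past steps of attention weight times P[past state]."""
    n_states = len(P)
    dist = [0.0] * n_states
    for z_prev, alpha in zip(past_states, attention):
        for z_next in range(n_states):
            dist[z_next] += alpha * P[z_prev][z_next]
    return dist

# Hypothetical 3-stage baseline transition matrix (rows sum to one).
P = [
    [0.8, 0.2, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
past_states = [0, 0, 1]          # states at the three earlier visits
attention = [0.2, 0.3, 0.5]      # non-negative weights that sum to one
dist = attentive_transition(past_states, attention, P)

# Putting all weight on the most recent visit recovers the memoryless Markov transition.
markov = attentive_transition(past_states, [0.0, 0.0, 1.0], P)
```

Since the weights are non-negative and sum to one, the result is a convex combination of valid transition rows and is therefore itself a probability distribution.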

The attention mechanism $A$ is a deterministic algorithm that generates a sequence of functions $\{A_t\}_t$, each of which maps the set of the patient's observed variables before time $t$, $\vec{x}_{t-1}$, to a set of attention weights $\alpha_t$. Because the attention mechanism $A$ outputs a complete sequence of attention weights at every time step, the invention implements it with a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model can use an LSTM encoder-decoder architecture. For time step $t$, the set of the patient's observed variables before time $t$ is input into the LSTM encoder, and the final state and final output of the LSTM encoder are passed together to the LSTM decoder. In the LSTM decoder, the final state of the LSTM encoder is used as the initial state of the LSTM decoder, and the final output of the LSTM decoder is used as the input of the Seq2Seq model at the next time step. After the decoding iteration at time $t-1$, the attention weights for all times before $t$ are collected through a softmax output layer.
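The final softmax step, which turns decoder outputs into valid attention weights, can be sketched as follows. The sketch covers only the normalization and does not implement the LSTM encoder-decoder; the `decoder_scores` values are hypothetical stand-ins for the decoder outputs at the earlier time steps.

```python
import math

def softmax(scores):
    """Map real-valued scores to non-negative weights that sum to one."""
    shift = max(scores)                            # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

decoder_scores = [0.1, 1.2, 2.3]                   # one score per earlier visit
attention_weights = softmax(decoder_scores)
```

The softmax guarantees that the weights lie in [0, 1] and sum to one, which is what the attention-weighted transition distribution requires.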

To obtain the model parameters of the joint distribution of the state and observed variables (the disease progression model parameters), variational inference or another Bayesian inference algorithm is used to infer the posterior distribution $p_\theta(\vec{z}_T \mid \vec{x}_T)$, where $\vec{x}_T$ and $\vec{z}_T$ are the observed and state sequences over all $T$ visits; this yields the disease progression model parameter set $\theta$ and an estimate of the patient's health state $z_t$. Specifically, variational inference maximizes the evidence lower bound (ELBO) on the data likelihood:

$$\log p_\theta(\vec{x}_T) \ge \mathbb{E}_{q_\phi}\left[\log p_\theta(\vec{x}_T, \vec{z}_T) - \log q_\phi(\vec{z}_T \mid \vec{x}_T)\right]$$

where $q_\phi(\vec{z}_T \mid \vec{x}_T)$ is a variational distribution that approximates the posterior distribution $p_\theta(\vec{z}_T \mid \vec{x}_T)$, and $\mathbb{E}_{q_\phi}$ denotes the expectation under the distribution $q_\phi$.

The posterior inference problem is thereby turned into the following optimization problem, in which the variational distribution $q_\phi$ is modeled:

$$\theta^{*}, \phi^{*} = \arg\max_{\theta, \phi}\ \mathbb{E}_{q_\phi}\left[\log p_\theta(\vec{x}_T, \vec{z}_T) - \log q_\phi(\vec{z}_T \mid \vec{x}_T)\right]$$

where $\theta^{*}$ denotes the solved parameter set of the disease progression model and $\phi^{*}$ denotes the solved parameter set of the distribution $q_\phi$.

The model parameters in this optimization problem can be learned using stochastic gradient descent, whose basic steps can be summarized as:

1. Randomly sample from $q_\phi$ to obtain the patients' real-time health states under the distribution $q_\phi$, $\hat{z}^{(s)} \sim q_\phi(\vec{z}_T \mid x^{(s)})$, where $x^{(s)}$ denotes the follow-up records of patient $s$.

2. Estimate the ELBO value over the $N$ patients:

$$\hat{\mathcal{L}} = \frac{1}{N} \sum_{s=1}^{N} l_{\theta,\phi}\left(x^{(s)}, \hat{z}^{(s)}\right)$$

where $l_{\theta,\phi}$ denotes the objective function of the optimization problem, that is, $l_{\theta,\phi}(x, z) = \log p_\theta(x, z) - \log q_\phi(z \mid x)$.

3. Estimate the gradients with respect to the parameters $\theta$ and $\phi$, $\nabla_\theta \hat{\mathcal{L}}$ and $\nabla_\phi \hat{\mathcal{L}}$.

4. Update the parameters $\theta$ and $\phi$. The parameters can be updated using Adaptive Moment Estimation (ADAM) or another optimization algorithm.
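The sampling and estimation steps above can be illustrated on a deliberately tiny model: one discrete hidden state and one discrete observation. The sketch below only estimates the ELBO by sampling from a variational distribution; the gradient and update steps are left out, since in practice they would be handled by an automatic differentiation library. The `prior`, `emission`, and `q` values are hypothetical.

```python
import math
import random

def elbo_estimate(x, prior, emission, q, n_samples=2000, rng=None):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z)] for a single discrete latent state z."""
    if rng is None:
        rng = random.Random(0)
    states = list(range(len(q)))
    total = 0.0
    for _ in range(n_samples):
        z = rng.choices(states, weights=q)[0]                        # step 1: sample z from q
        total += math.log(prior[z] * emission[z][x]) - math.log(q[z])
    return total / n_samples                                         # step 2: average the objective

prior = [0.6, 0.4]                        # p(z)
emission = [[0.9, 0.1], [0.2, 0.8]]       # emission[z][x] = p(x | z)
x = 1

evidence = sum(prior[z] * emission[z][x] for z in range(2))
log_evidence = math.log(evidence)
exact_posterior = [prior[z] * emission[z][x] / evidence for z in range(2)]

elbo_at_posterior = elbo_estimate(x, prior, emission, exact_posterior)
elbo_at_uniform = elbo_estimate(x, prior, emission, [0.5, 0.5])
```

When the variational distribution equals the exact posterior, every sampled term equals the log evidence, so the bound is tight; any other choice, such as the uniform distribution here, gives a lower value. Maximizing the ELBO therefore pushes the variational distribution toward the posterior.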

The assisted decision-making module predicts future disease progression based on the personalized state-space progression model and outputs the predictions to clinicians to assist their decisions. It covers the following aspects:

① Predicting future changes in a patient's indicators based on the personalized state-space progression model: for diseases with an accepted progression indicator, the model can predict the patient's future values of that indicator. Taking Parkinson's disease as an example, the score on Part III of the Unified Parkinson's Disease Rating Scale is an accepted indicator for assessing motor symptoms, and the model can predict this score over a future period.

② Predicting the risk of future disease (complication) onset, recurrence, or death, which serves as a warning to clinicians and patients. Taking Parkinson's disease as an example, the model can predict when a patient will develop cognitive impairment, prompting medication or other treatments that can improve symptoms.

③ Assisting clinicians in subtyping (new) patients, so that patients of different subtypes receive different symptomatic treatments. Taking Parkinson's disease as an example, patients can be divided into fast-progressing and slow-progressing groups according to their rate of disease progression; predicting the progression rate of a newly diagnosed patient can effectively help doctors intervene in the patient's condition.
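Aspects ① and ③ can be illustrated with a heavily simplified forecast. The sketch below assumes a scalar health state with a linear mean transition z' = a·z + drift and a linear emission mean score = c·z + d. It ignores the attention-weighted transitions and subtype-specific parameters of the full model. All function names, parameters, and the fast/slow threshold are hypothetical and chosen only for illustration.

```python
def forecast_scores(z_now, a, drift, c, d, n_visits):
    """Roll the assumed linear state transition forward and map each state to a score."""
    scores = []
    z = z_now
    for _ in range(n_visits):
        z = a * z + drift         # assumed mean state transition
        scores.append(c * z + d)  # assumed emission mean (e.g. a rating-scale score)
    return scores


def progression_group(scores, current_score, threshold_per_visit):
    """Label a patient 'fast' or 'slow' by the average predicted change per visit."""
    rate = (scores[-1] - current_score) / len(scores)
    return "fast" if rate >= threshold_per_visit else "slow"


if __name__ == "__main__":
    predicted = forecast_scores(z_now=1.0, a=1.0, drift=0.5, c=2.0, d=10.0, n_visits=3)
    print(predicted)
    print(progression_group(predicted, current_score=12.0, threshold_per_visit=0.8))
```

In practice the transition and emission parameters would come from the subtype to which the patient has been assigned, and the forecast would carry uncertainty rather than a single mean trajectory.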

The above are merely preferred embodiments of the present invention. Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution, or modify it into equivalent embodiments with equivalent changes. Therefore, any simple modification, equivalent change, or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution, still falls within the scope of protection of the technical solution of the present invention.

Claims

1. An assisted decision-making system for a disease based on a personalized state-space progression model, characterized in that the system comprises a data acquisition module, a personalized state-space progression model module, and an assisted decision-making module;

the data acquisition module is configured to acquire the patient's electronic medical records, the baseline data obtained at the first visit, and the follow-up record data obtained at multiple subsequent visits;

the personalized state-space progression model module is configured to nest patient clustering and disease progression trajectory identification together and to iterate the updates until convergence to obtain the personalized state-space progression model, and comprises a patient-heterogeneity explanation submodule and a disease progression model construction submodule;

the patient-heterogeneity explanation submodule is configured to cluster patients with similar disease progression trajectories into the same subtype;

the disease progression model construction submodule is configured to model the disease progression trajectory by constructing a state space, in which the state variable is the patient's health state and the observation variable is the patient's follow-up record data; the emission distribution of the disease progression model is obtained from the probability distribution of the observation variables corresponding to the patient's health state, and the state transition distribution of the disease progression model is obtained from the observation variables and state variables at all visits before the current time;

the assisted decision-making module is configured to predict the patient's future disease progression based on the personalized state-space progression model, helping clinicians make assisted decisions.

2. The assisted decision-making system for a disease based on a personalized state-space progression model according to claim 1, characterized in that the follow-up record data comprise demographic data, biomarkers, and clinical event information.

3. The assisted decision-making system for a disease based on a personalized state-space progression model according to claim 1, characterized in that the data acquired by the data acquisition module are preprocessed before being input into the personalized state-space progression model module, the data preprocessing comprising feature selection, temporal alignment, missing-value imputation, and data standardization.

4. The assisted decision-making system for a disease based on a personalized state-space progression model according to claim 1, characterized in that the patient-heterogeneity explanation submodule divides all patients into several subtypes; if a new patient is assigned to an existing subtype, the disease progression model parameters of that subtype are updated; the disease progression model parameters are assumed to follow a Gaussian distribution within each subtype, and a Monte Carlo sampling method is used to compute the probability that each patient belongs to a given subtype, thereby completing the patient clustering.

5. The assisted decision-making system for a disease based on a personalized state-space progression model according to claim 1, characterized in that the patient's follow-up record data are the observation variables of the state space and comprise continuous variables and categorical variables; the continuous variables are assumed to follow a Gaussian distribution and the categorical variables a Bernoulli distribution, and the emission distribution of the disease progression model is obtained from these two types of observation variables.

6. The assisted decision-making system for a disease based on a personalized state-space progression model according to claim 1, characterized in that the state transition distribution is obtained from weights computed by an attention mechanism; the attention weights model, through linear dynamics, the influence of past states on future states; the attention mechanism maps the patient's observation variables at each time step to a set of attention weights; and the state transition distribution at time t in a hidden Markov model is multiplied by the mapped attention weights and summed to represent the state transition distribution of the disease progression model.

7. The assisted decision-making system for a disease based on a personalized state-space progression model according to claim 6, characterized in that the attention mechanism is implemented by a sequence-to-sequence (Seq2Seq) model that uses an LSTM encoder-decoder architecture; the patient's observation variables at each time step are input to the LSTM encoder, and the final state and final output of the LSTM encoder are passed together to the LSTM decoder; in the LSTM decoder, the final state of the LSTM encoder is used as the initial state of the LSTM decoder, and the final output of the LSTM decoder is used as the input of the Seq2Seq model at the next time step; after the decoding iteration at time t-1, the attention weights of all times before time t are collected through a softmax output layer.

8. The assisted decision-making system for a disease based on a personalized state-space progression model according to claim 1, characterized in that variational inference is used to obtain the posterior distribution of the disease progression model parameters, which is used to learn the disease progression model parameters and to estimate the patient's real-time health state.

9. The assisted decision-making system for a disease based on a personalized state-space progression model according to claim 8, characterized in that the variational inference is performed by maximizing the evidence lower bound, which transforms the posterior inference problem for the disease progression model parameters into an optimization problem whose model parameters are learned by stochastic gradient descent.

10. The assisted decision-making system for a disease based on a personalized state-space progression model according to claim 1, characterized in that the assisted decision-making comprises predicting future changes in the patient's indicators based on the personalized state-space progression model, predicting the risk of future disease onset, recurrence, or death, and assisting clinicians in subtyping new patients so that patients of different subtypes receive different symptomatic treatments.