WO2022227294A1 - Disease risk prediction method and system based on multimodal fusion - Google Patents

Disease risk prediction method and system based on multimodal fusion

Info

Publication number
WO2022227294A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
features
risk prediction
disease risk
unstructured
Prior art date
Application number
PCT/CN2021/106860
Other languages
English (en)
French (fr)
Inventor
刘治
李玉军
胡喜风
胡伟风
Original Assignee
山东大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 山东大学
Priority to US17/910,556 (US20240203599A1)
Publication of WO2022227294A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • The present application relates to the field of medical big data information processing, and in particular to a disease risk prediction method and system based on multimodal fusion.
  • EHR: electronic health record.
  • The digitization and subsequent analysis of medical records is an area of digital transformation that aims to collect multiple kinds of medical information about patients in the form of EHRs, including digitized measurements (laboratory results), verbal descriptions (symptoms, notes, vital signs, etc.) and images (X-rays, CT and MR scans, etc.), and to record the patient's course of treatment. This digitization creates opportunities to mine health records to improve quality of care and clinical outcomes.
  • Electronic health records contain structured and unstructured data with important research and clinical value.
  • With the standardization and digitization of large amounts of EHR data, there is an urgent need to mine large volumes of multi-source heterogeneous data and build risk prediction models that enable personalized medicine.
  • Most previous attempts were built on structured EHR fields, so much of the information in unstructured text data was lost.
  • The problem of limited and unevenly distributed datasets: data collected without a purpose often leaves the completeness, accuracy and granularity of the records too inconsistent to form a systematic whole, resulting in missing and non-standard data. Collecting data therefore takes considerable manpower and material resources. Limited by time and funding, the number of good samples that can be obtained is limited. For example, in some embodiments of the present invention there are only 1300 good samples, and the positive and negative samples are unevenly distributed, which greatly affects the learning and training of deep neural networks.
  • The present invention uses stacked Transformer encoder modules to vectorize the text medical records effectively; the encoder captures the rich semantic relations carried by word order across a long text and represents medical entities correctly. The multi-source heterogeneous data is then fused at the feature level, taking full account of the characteristics of the different modalities, and the patient outcome is predicted.
  • The present invention provides a method for processing EHR data (including structured data and unstructured data), constructs a disease risk prediction model based on multimodal fusion, and provides a method and system for prediction using the model, as well as software and devices that implement these functions.
  • By fully fusing and mining the patient's demographic information, treatment information, diagnostic information, laboratory information and related text treatment records, the invention improves the ability to anticipate the patient's outcome. It can give doctors useful reference information, anticipate how the patient's condition will develop, assist the doctor in formulating a corresponding treatment plan, support timely treatment, and help prevent the condition from deteriorating. In addition, at each clinical visit the patient can be shown the expected course of the disease after personalized treatment, which improves their motivation for treatment.
  • Multimodal data refers to data collected on multiple different devices or in multiple different scenarios. Real-world datasets are often multimodal: for example, a story can be told through text narration as well as images or audio, and a document can be represented in several different languages or through user reviews.
  • A multimodal database is established by analyzing and processing multimodal data to obtain its important features and representative retrieval labels, and on that basis building a database that is convenient for subsequent retrieval.
  • Unstructured data refers to data without a fixed structure, such as office documents, text, pictures, various reports, images, and audio and video information in all formats.
  • Unstructured data in medicine includes medical images, electrocardiograms, text medical records, etc.
  • Structured data follows the traditional relational data model: row data stored in a database that can be represented by a two-dimensional table structure, for example data stored in CSV files, Excel files or two-dimensional tables.
  • The present invention provides the following technical features; one or more of them, in combination, constitute the technical solution of the present invention.
  • The present invention provides a disease risk prediction method based on multimodal fusion, the method comprising:
  • the data includes structured data and unstructured data; in the embodiments of the present invention, the unstructured data refers in particular to text;
  • before the structured data features and the unstructured data features are extracted, the disease risk prediction model further performs a data cleaning step;
  • the data cleaning includes replacing outliers, filling in missing values with the mean, and deleting dirty data.
  • A fully convolutional network (FCN) is used to extract the structured data features.
  • BERT: Bidirectional Encoder Representations from Transformers.
  • The operation of extracting the fusion feature includes: concatenating the unstructured data features and the structured data features along a specified dimension, applying the synthetic minority oversampling technique (SMOTE) to analyze the minority-class samples and synthesize new samples of that class so as to reduce the imbalance rate, and then extracting the fusion features with a segmented pooling operation.
  • SMOTE: Synthetic Minority Oversampling Technique.
  • The fused features are fed into fully connected (dense) layers, and disease risk prediction is then performed by a Softmax classifier.
  • The present invention uses a weighted combination of the cross-entropy loss and the hinge loss to jointly constrain the model.
  • Cross-entropy loss measures the difference between two probability distributions over the same random variable; the smaller the cross-entropy, the closer the two distributions are.
  • Hinge loss is used for binary classification. It requires not only that the classification be correct, but also that the confidence be high enough before the loss reaches its minimum. Because the hinge loss not only measures how well the model fits the training data but also, with an added regularization term, measures the complexity of the model itself, the risk of overfitting can be greatly reduced.
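As an illustration of how a weighted cross-entropy and hinge objective could be written, the following framework-free Python sketch combines the two terms for a two-class output. The weighting factor `alpha`, the margin of 1 and the function names are assumptions made for this example; the source does not state the weights or the exact hinge formulation, and the regularization term it mentions is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class (label is 0 or 1)."""
    return -math.log(softmax(logits)[label])


def hinge(logits, label, margin=1.0):
    """Binary hinge loss on the score difference (positive minus negative logit)."""
    score = logits[1] - logits[0]
    sign = 1.0 if label == 1 else -1.0
    return max(0.0, margin - sign * score)


def joint_loss(logits, label, alpha=0.5):
    """Weighted sum of the two losses; alpha is a hypothetical weighting factor."""
    return alpha * cross_entropy(logits, label) + (1.0 - alpha) * hinge(logits, label)
```

With this formulation the hinge term drops to zero once the correct class wins by at least the margin, while the cross-entropy term keeps rewarding better-calibrated probabilities.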
  • The present invention provides a method for processing EHR data, comprising:
  • the EHR data includes structured data and unstructured data;
  • the extracted fusion feature data is used as the data to be identified for medical purposes.
  • The data cleaning includes replacing outliers, filling in missing values with the mean, and deleting dirty data; preferably, the unstructured data is text.
  • An FCN is used to extract the structured data features.
  • BERT is used to extract the unstructured data features.
  • The operation of extracting the fusion feature includes: concatenating the unstructured data features and the structured data features along a specified dimension, using SMOTE to analyze the minority-class samples and synthesize new samples of that class so as to reduce the imbalance rate, and then extracting the fusion features with a segmented pooling operation.
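The concatenation and oversampling steps described above can be sketched as follows. This is a minimal, framework-free Python illustration of the general SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the patent's implementation. The function names, `k` and the seed are assumptions, and the segmented pooling step is omitted because the source does not specify it.

```python
import random


def concat_features(text_features, structured_features):
    """Join the two per-patient feature vectors along the feature dimension."""
    return list(text_features) + list(structured_features)


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation.

    minority: list of feature vectors (at least two) from the minority class.
    k: number of nearest minority neighbours to choose from (assumed value).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Place the new sample at a random point on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies on a segment between two real minority samples, the new samples stay inside the region the minority class already occupies.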
  • A method for constructing a disease risk prediction model of the present invention includes:
  • Obtaining EHR data, including structured and unstructured data, of patients with known disease risk outcomes; constructing datasets (a structured dataset and an unstructured dataset) from the EHR data, and constructing the label set from the known final outcomes;
  • Constructing a disease risk prediction network, including: constructing a feature extraction module for structured data, a feature extraction module for unstructured data, and a feature fusion module; the structured data feature extraction module and the unstructured data feature extraction module are connected in parallel and then joined in series at the decision layer of the feature fusion module; the disease risk prediction network is implemented with the PyTorch framework;
  • Using the label set as labels, training the constructed disease risk prediction network with the datasets (the structured dataset and the unstructured dataset) to obtain the disease risk prediction model;
  • Before the datasets are constructed, the method further includes a step of cleaning the acquired EHR data; the data cleaning includes replacing outliers, filling in missing values with the mean, and deleting dirty data.
  • The feature extraction module for structured data is an FCN module; the feature extraction module for unstructured data is a BERT module (Transformer module).
  • The feature fusion module concatenates the unstructured data features and the structured data features along a specified dimension, uses SMOTE to analyze the minority-class samples and synthesize new samples of that class so as to reduce the imbalance rate, and then extracts the fusion features with a segmented pooling operation;
  • When the datasets are used for training, the fused features are fed into the fully connected layer to train the Softmax classifier.
  • The present invention also includes the multimodal fusion-based disease risk prediction model constructed by the method of the third aspect.
  • The present invention provides a risk prediction system based on multimodal fusion, the system comprising:
  • a feature extraction module, which performs feature extraction on EHR data to obtain unstructured data features and structured data features;
  • a feature fusion module, which fuses the unstructured data features and the structured data features and extracts the fused features;
  • a classification module, which takes the extracted fusion features as input and produces the disease risk prediction result.
  • The feature extraction module includes a structured data feature extraction module and an unstructured data feature extraction module.
  • The structured data feature extraction module takes the preprocessed structured data as the input of the FCN and maps the data to each latent semantic node to obtain the structured data features.
  • The unstructured data feature extraction module uses BERT to extract features from the preprocessed unstructured data; preferably, the BERT consists of a BERT Encoder, the BERT Encoder consists of multiple BERT Layers, and each BERT Layer is an Encoder Block from the Transformer; each encoder layer contains two sub-layers, a self-attention layer and a feed-forward neural network layer.
  • The feature fusion module concatenates the unstructured data features and the structured data features along a specified dimension, uses SMOTE to analyze the minority-class samples and synthesize new samples of that class so as to reduce the imbalance rate, and then extracts the fusion features with a segmented pooling operation.
  • The classification module feeds the fusion features into the fully connected layer and then classifies them with the Softmax classifier to obtain the disease risk prediction result.
  • The system further includes a data acquisition module for acquiring EHR data.
  • The system further includes a data cleaning module, configured to preprocess the EHR data after it is acquired and before feature extraction is performed on it; the preprocessing includes replacing outliers, filling in missing values with the mean, and deleting dirty data.
  • The system further includes a result output module for outputting the disease risk prediction results.
  • The present invention provides a computer device comprising a memory and a processor; the memory stores a computer program, and when the processor executes the computer program it implements the steps of the method of any one of the above-mentioned first aspect and/or second aspect and/or third aspect of the present invention.
  • The present invention provides a computer-readable storage medium on which computer program instructions are stored; when the program instructions are executed by a processor, they implement the steps of the method of any one of the above-mentioned first aspect and/or second aspect and/or third aspect of the present invention.
  • The invention provides an end-to-end patient outcome prediction model.
  • The data that is read serves as the input of the model; it is mined and analyzed with deep learning methods, and the output is the predicted outcome event for the patient. This can give doctors useful reference information, anticipate how the patient's condition will develop, and support timely treatment, while also increasing patients' willingness to cooperate with treatment.
  • The invention uses the bidirectional language model BERT to extract features from medical text; it can compute over multiple sets of inputs in parallel to capture information from different subspaces.
  • The attention mechanism helps the model obtain contextual information more effectively, learn the word dependencies within a sentence, and capture the sentence's internal structure.
  • Data such as Chinese medical question-and-answer text, Chinese medical encyclopedias and Chinese electronic medical records are used, so that medical entities such as "abdominal pain" receive a more effective vectorized representation.
  • The invention uses multimodal fusion to preprocess, analyze and mine data such as patients' electronic medical records, past medical history and text records of patient charts, and constructs a risk prediction model for predicting patient outcomes. This provides an aid for making use of real clinical data and for assessing disease outcomes, helping physicians personalize treatment options for each patient.
  • FIG. 1 is a flowchart of a method for processing EHR data in a first embodiment of the present invention.
  • FIG. 2 is a structural diagram of a system for processing EHR data in the first embodiment of the present invention.
  • FIG. 3 is a functional flowchart of a feature fusion module in one or more embodiments of the present invention.
  • FIG. 4 is a flowchart of a method for predicting disease risk based on multimodal fusion in a third embodiment of the present invention.
  • FIG. 5 is a functional flowchart of a disease risk prediction model in one or more embodiments of the present invention.
  • FIG. 6 is a structural diagram of a risk prediction system based on multimodal fusion in a fourth embodiment of the present invention.
  • FIG. 7 is a structural diagram of a risk prediction system based on multimodal fusion in a fourth embodiment of the present invention.
  • FIG. 8 is a structural diagram of a risk prediction system based on multimodal fusion in a fourth embodiment of the present invention.
  • Although the terms first, second, third, etc. may be used in this application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other.
  • For example, the first information may also be referred to as the second information and, similarly, the second information may also be referred to as the first information, without departing from the scope of the present application.
  • Depending on the context, the word "if" as used herein can be interpreted as "at the time of", "when" or "in response to determining".
  • A first embodiment may be combined with a second embodiment, so long as the particular features, structures, functions or characteristics associated with these embodiments or specific implementations are not mutually exclusive.
  • The present invention provides a method for processing EHR data, comprising: acquiring EHR data, the data including structured data and unstructured data;
  • The processing flow for the EHR data is shown in FIG. 1: the structured data and the unstructured data are each cleaned, and feature extraction is then performed on the cleaned structured data and unstructured data respectively;
  • the separately extracted unstructured data features and structured data features are fused to extract fusion features;
  • the extracted fusion feature data is used as the data to be identified for medical purposes.
  • The present invention also provides a system for processing EHR data, the core modules of which include a feature extraction module and a feature fusion module;
  • The system may further include a data cleaning module, as shown in FIG. 2.
  • The data cleaning module replaces outliers, fills in missing values with the mean, and deletes dirty data. For example, the data can first be screened for outliers and the outliers replaced with null values; a weighted average of the data is then computed and used to fill in both the outliers and the missing values. SPSS can be used to clean the data.
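The screen, null and fill sequence in the example above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: outliers are detected with a hypothetical plausible-range rule (`low`, `high`), a plain arithmetic mean stands in for the weighted average because the source gives no weights, and the deletion of dirty records is left out because the source does not define which records count as dirty.

```python
from statistics import mean


def clean_column(values, low, high):
    """Replace outliers and missing entries in one numeric column with the mean.

    values: list of numbers, with None marking a missing entry.
    low, high: assumed plausible range; anything outside it is an outlier.
    """
    # Step 1: screen for outliers and replace them with null values.
    screened = [v if v is not None and low <= v <= high else None for v in values]
    # Step 2: average the remaining valid values.
    valid = [v for v in screened if v is not None]
    if not valid:
        raise ValueError("no valid values to compute a mean from")
    fill = mean(valid)
    # Step 3: fill both former outliers and missing values with that mean.
    return [fill if v is None else v for v in screened]
```

For a body-temperature column such as `[36.5, None, 37.5, 420.0]` with a plausible range of 30 to 45, both the missing entry and the 420.0 reading are filled with the mean of the two valid readings.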
  • The feature extraction module performs feature extraction on the structured data and the unstructured data (such as text) contained in the EHR data; optionally, the feature extraction module includes a structured data feature extraction module and an unstructured data feature extraction module.
  • The structured data feature extraction module takes the cleaned structured data as the input of the FCN and maps the data to each latent semantic node to obtain the structured data features. In this embodiment, the module learns the weight W through a Dense layer and thereby obtains the reset features of the structured data. Because the data is discrete, the positional information between features has little influence on the decision, so positional information can be discarded in this process.
  • The unstructured data feature extraction module uses BERT to extract features from the cleaned unstructured text data.
  • The BERT consists of a BERT Encoder; the BERT Encoder consists of multiple BERT Layers, each of which is an Encoder Block from the Transformer; each encoder layer contains two sub-layers, a self-attention layer and a feed-forward neural network layer.
  • Stacked Transformer encoder modules are used; the word embedding tensor, the sentence segment tensor and the position encoding tensor are obtained to extract, respectively, the semantic information, sentence information and position information of the medical text, and the vectorized representation of the text medical record is computed.
  • The connection layer concatenates the structured data features and the unstructured data features along the specified dimension, uses SMOTE to analyze the minority-class samples and synthesize new samples to reduce the imbalance rate, and adds a segmented pooling operation to extract the important information from the different structured data according to data type. Because medical datasets are usually small, batch normalization is sensitive to the batch size (batch_size); in the embodiments of the present invention, the output of each sub-layer therefore uses layer normalization.
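To make the batch-size argument concrete, here is a minimal layer normalization sketch in plain Python. It normalizes each sample's feature vector using only that sample's own mean and variance, so the result does not depend on how many other samples are in the batch. The function name and `eps` value are assumptions for this example, and the learnable gain and bias that frameworks usually attach are omitted.

```python
import math


def layer_norm(features, eps=1e-5):
    """Normalize one sample's feature vector to zero mean and unit variance."""
    mu = sum(features) / len(features)
    var = sum((x - mu) ** 2 for x in features) / len(features)
    return [(x - mu) / math.sqrt(var + eps) for x in features]
```

Batch normalization, by contrast, computes the mean and variance across the samples of a batch, which makes its statistics noisy when only a few samples are available per batch.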
  • The present invention provides a method for constructing a disease risk prediction model, comprising:
  • Acquiring EHR data of patients with known disease risk outcomes, the EHR data including structured data and unstructured data, where the unstructured data mainly refers to text; constructing the datasets (a structured dataset and a text dataset) from the EHR data, and constructing the label set from the final outcomes;
  • Data cleaning is performed on the acquired EHR data, including replacing outliers, filling in missing values with the mean, and deleting dirty data;
  • Constructing a disease risk prediction network, including: constructing a feature extraction module (FCN) for structured data, a feature extraction module (Transformer module) for unstructured data, and a feature fusion module; the structured data feature extraction module and the unstructured data feature extraction module are connected in parallel and then joined in series at the decision layer of the feature fusion module; the model architecture is implemented with the PyTorch framework;
  • The disease risk prediction network constructed above is trained with the datasets to obtain the disease risk prediction model; in this embodiment, the disease risk outcome is used as the label, and the fused features are fed into the fully connected layer to train the Softmax classifier and build the disease risk prediction model.
  • Cross-entropy loss measures the difference between two probability distributions over the same random variable; the smaller the cross-entropy, the closer the two distributions are.
  • Hinge loss is used for binary classification. It requires not only that the classification be correct, but also that the confidence be high enough before the loss reaches its minimum. Because the hinge loss not only measures how well the model fits the training data but also, with an added regularization term, measures the complexity of the model itself, the risk of overfitting can be greatly reduced.
  • The present invention provides a disease risk prediction method based on multimodal fusion, as shown in FIG. 4, which includes:
  • the EHR data of the patient to be predicted may include structured data and unstructured data (text);
  • the execution steps of the disease risk prediction model include:
  • the fusion features are the fusion of the unstructured data features and the structured data features;
  • a weighted combination of the cross-entropy loss and the hinge loss is employed to jointly constrain the model.
  • Cross-entropy loss measures the difference between two probability distributions over the same random variable; the smaller the cross-entropy, the closer the two distributions are.
  • Hinge loss is used for binary classification. It requires not only that the classification be correct, but also that the confidence be high enough before the loss reaches its minimum. Because the hinge loss not only measures how well the model fits the training data but also, with an added regularization term, measures the complexity of the model itself, the risk of overfitting can be greatly reduced.
  • The present invention provides a risk prediction system based on multimodal fusion, as shown in FIG. 6, including: a feature extraction module, a feature fusion module and a classification module.
  • The feature extraction module includes a structured data extraction module and an unstructured data extraction module, as shown in FIG. 7.
  • The risk prediction system based on multimodal fusion may further include a data acquisition module and/or a data cleaning module and/or a result output module.
  • The system may be as shown in FIG. 8.
  • The data cleaning module preprocesses the EHR data, including replacing outliers, filling in missing values with the mean, and deleting dirty data.
  • The cleaned unstructured data, such as text data, undergoes feature extraction in the unstructured data feature extraction module.
  • The core of the model consists of the BERT Encoder.
  • The BERT Encoder consists of multiple BERT Layers.
  • Each BERT Layer is in fact an Encoder Block from the Transformer.
  • Each encoder layer contains two sub-layers, a self-attention layer and a feed-forward neural network layer.
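The two sub-layers of an encoder block can be illustrated with a deliberately simplified, framework-free Python sketch. It shows single-head scaled dot-product self-attention followed by a position-wise feed-forward layer. Several simplifications are assumptions of this example rather than anything stated in the source: the query, key and value projections are taken to be the identity, only one head is used, ReLU is used as the activation, and the residual connections and normalization around each sub-layer are left out. An actual BERT layer uses learned projections, multiple heads and a GELU activation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Each token becomes a weighted average of all tokens in the sequence.

    tokens: list of equal-length vectors; weights come from scaled dot products.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def feed_forward(vec, w1, b1, w2, b2):
    """Position-wise two-layer network with a ReLU between the layers."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vec)) + b) for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]
```

Because every output token mixes information from the whole sequence, stacking such blocks lets distant words in a long medical record influence one another's representations.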
  • The cleaned structured data undergoes feature extraction in the structured data feature extraction module: the cleaned structured data is used as the input of the FCN, and the original data is mapped to each latent semantic node to obtain the structured data features.
  • The fusion module concatenates the features of the structured data and the text data along the specified dimension and uses SMOTE to analyze the minority-class samples and synthesize new samples of that class to reduce the imbalance rate. The segmented pooling operation is then used to extract the important information from the different structured data and obtain the fusion features.
  • The classification module feeds the fused features into the fully connected layer and then predicts the patient's outcome with the Softmax classifier.
  • The prediction result obtained by the classification module can be output through the result output module.
  • The system described in this embodiment can implement the disease risk prediction method based on multimodal fusion described in the third embodiment.
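The inference path through the modules described in this embodiment (structured feature extraction, concatenation, fully connected layer, Softmax) can be sketched end to end in plain Python. This is a toy illustration only: `text_vector` is a placeholder for the vector a BERT encoder would produce, the parameter names in `params` are hypothetical, a single dense layer stands in for the FCN, and SMOTE and segmented pooling are omitted. The source implements the architecture in PyTorch; plain Python is used here only to keep the example self-contained.

```python
import math


def dense(vec, weights, bias, relu=True):
    """Fully connected layer: one output per weight row, optional ReLU."""
    out = [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in out] if relu else out


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_risk(structured, text_vector, params):
    """Return class probabilities [low risk, high risk] for one patient."""
    # Structured branch: map the cleaned tabular values to learned features.
    structured_features = dense(structured, params["w_struct"], params["b_struct"])
    # Fusion: concatenate text and structured features along the feature axis.
    fused = list(text_vector) + structured_features
    # Decision layer: fully connected layer followed by Softmax.
    logits = dense(fused, params["w_out"], params["b_out"], relu=False)
    return softmax(logits)
```

A call needs two structured-branch parameters and two output-layer parameters whose shapes match the lengths of the structured input and of the fused vector.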
  • The present invention provides a computer device comprising a memory and a processor; the memory stores a computer program, and when the processor executes the computer program it implements the steps of the method described in the first embodiment;
  • The present invention provides a computer-readable storage medium on which computer program instructions are stored; when the program instructions are executed by a processor, they implement the steps of the method described in the first embodiment;
  • when the program instructions are executed by the processor, they implement the steps of the method described in the third embodiment.
  • The device embodiments described above are only illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

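A disease risk prediction method based on multimodal fusion: a patient's EHR data, comprising structured data and unstructured data, is acquired; the EHR data is input into a disease risk prediction model to obtain a disease risk prediction result; and the result is output. The steps executed by the disease risk prediction model comprise: extracting structured-data features and unstructured-data features; fusing the structured-data features and unstructured-data features and extracting fused features; and making a decision on the fused features to obtain the disease risk prediction result. Also disclosed are a multimodal-fusion-based disease risk prediction system that implements the method, a method for processing EHR data, a method for constructing the disease risk prediction model, and a computer device and a computer-readable storage medium capable of implementing the methods.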

Description

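Disease risk prediction method and system based on multimodal fusion
Technical Field
This application relates to the field of medical big-data information processing, and in particular to a disease risk prediction method and system based on multimodal fusion.
Background
The information disclosed in this Background section is intended only to improve understanding of the general background of this application. It should not be taken as an acknowledgement, or as any form of suggestion, that this information constitutes prior art already known to a person of ordinary skill in the art.
Electronic health records (EHRs) have created a large volume of inexpensive data for health research, covering electronic medical records, past medical history, free-text notes in patient records, and similar data. Digitization and the subsequent analysis of medical records form a field of digital transformation that aims to collect many kinds of medical information about a patient in the form of an EHR, including digitized measurements (laboratory results), verbal descriptions (symptoms and notes, vital signs, etc.) and images (X-ray, CT and MR scans, etc.), and to record the patient's course of treatment. This digitization creates an opportunity to mine health records in order to improve quality of care and clinical outcomes.
Clinicians, however, have only limited time to process all the available data and detect patterns across similar records. EHRs contain structured and unstructured data of significant research and clinical value. With the standardization and digitization of large amounts of EHR data, there is an urgent need to mine large volumes of multi-source heterogeneous data and build risk prediction models in order to enable personalized medicine. Most previous attempts were built on structured EHR fields, so much of the information in unstructured text data was lost.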
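Summary of the Invention
Having understood the shortcomings of the prior art, the inventors found that effectively mining medical text, and using effective data fusion to fuse multi-source heterogeneous data at a deep level, can avoid the limitations and one-sidedness that result from relying on a single kind of data. The inventors therefore went on to study the combination of deep learning with disease prediction. Combining the two, however, raises the following problems:
Limited dataset size and imbalanced distribution: data collected without a specific purpose often lacks a systematic standard for completeness, accuracy and granularity, so the data ends up missing or non-standardized. A certain amount of manpower and resources therefore has to be spent on data collection. Given limits on time and funding, the number of good samples that can be obtained is limited. In some embodiments of the invention, for example, only 1,300 good samples were obtained, and the positive and negative samples were unevenly distributed, which greatly affects the learning and training of a deep neural network.
Medical text cannot be used directly for computation: in existing approaches, medical text usually first has to be given a numerical representation. This text, however, is typically long and contains medical entities, and the vector representations of it produced with CNN (Convolutional Neural Network), word2vec (a word-vector model), LSTM (Long Short-Term Memory) or Bi-LSTM (Bidirectional Long Short-Term Memory) are unsatisfactory.
Furthermore, most real clinical data currently exists in multimodal form, yet there has been relatively little research on multimodality. Much work has been done on single-point breakthroughs, but considering only single-modality factors cannot provide a comprehensive assessment of potential risk, and clinical data remains under-exploited.
To address the shortcomings of existing research and the problems above, the invention uses stacked Transformer encoder modules to produce an effective vector representation of text medical records; this effectively captures the rich semantic relations carried by word order across long text and represents medical entities correctly. The multi-source heterogeneous data is then fused at the feature level, taking full account of the characteristics of each modality, and the patient's outcome is predicted. The invention provides a method for processing EHR data (comprising structured and unstructured data), constructs a disease risk prediction model based on multimodal fusion, and provides a method and system for prediction using that model, together with the software and devices that implement these functions. By fully fusing and mining the patient's demographic information, treatment information, diagnostic information, laboratory test information and related text treatment records, the invention improves the ability to anticipate patient outcomes. It can give physicians useful reference information, anticipate how a patient's condition will develop, assist physicians in drawing up a corresponding treatment plan, support timely treatment, and help prevent the condition from worsening. At each clinical visit it can also show the patient the expected course of the disease under personalized treatment, which can improve the patient's motivation to engage with treatment.
Multimodal data refers to data collected on several different devices or in several different scenarios. Real-world datasets are often multimodal: a story can be told in text or described with images or audio; a document can be expressed in several different languages or represented by user reviews; and so on. A multimodal database is built by analyzing and processing multimodal data to obtain its key features and representative retrieval labels, and on that basis building a database that facilitates later data retrieval.
Unstructured data refers to data without a fixed structure, for example office documents in any format, text, pictures, reports of all kinds, images, and audio and video information. Unstructured data in medicine includes medical images, electrocardiograms, text medical records, and the like.
Structured data refers to data in the traditional relational data model (row data) that is stored in a database and can be represented as a two-dimensional table, for example data stored in CSV or Excel files.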
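Specifically, the invention provides the following technical features; one of these features, or a combination of several of them, constitutes the technical solution of the invention.
In a first aspect, the invention provides a disease risk prediction method based on multimodal fusion, the method comprising:
acquiring EHR data of a patient whose risk is to be predicted, the data comprising structured data and unstructured data; in embodiments of the invention, the unstructured data refers in particular to text;
inputting the EHR data into a disease risk prediction model to obtain a disease risk prediction result;
outputting the disease risk prediction result.
The disease risk prediction model performs the following steps:
extracting structured-data features and unstructured-data features;
fusing the structured-data features and the unstructured-data features and extracting fused features;
making a decision on the fused features to obtain the disease risk prediction result.
In some embodiments of the invention, before extracting the structured-data features and unstructured-data features, the disease risk prediction model also performs a data cleaning step;
the data cleaning comprises replacing outliers, filling missing values with the mean, and deleting dirty data.
In some embodiments of the invention, a fully convolutional network (FCN) is used to extract the structured-data features.
In some embodiments of the invention, BERT (Bidirectional Encoder Representations from Transformers) is used to extract the unstructured-data features.
In some embodiments of the invention, the operation of extracting the fused features comprises: concatenating the unstructured-data features and the structured-data features along a specified dimension; applying the Synthetic Minority Oversampling Technique (SMOTE), which analyzes the minority-class samples and generates new samples of that class, to reduce the imbalance ratio; and then applying a segmented pooling operation to extract the fused features.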
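The oversampling step can be illustrated with a minimal sketch of SMOTE. The disclosure does not give an implementation, so the function name `smote`, the neighbor count `k`, the fixed seed and the toy feature vectors below are all assumptions made for illustration; a production system would more likely rely on an established library.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority-class samples by interpolating
    between a randomly chosen sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        neighbors = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p))
        )[:k]
        target = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, target)])
    return synthetic

# Hypothetical fused feature vectors for the minority class.
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.9, 1.9]]
new_samples = smote(minority, n_new=6)
```

Each synthetic sample lies on the line segment between two real minority samples, so the minority class grows without exact duplication of existing records.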
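In some embodiments of the invention, at prediction time the fused features are fed as input to fully connected (dense) layers, and disease risk is then predicted by a Softmax classifier.
Furthermore, in embodiments of the invention, the model is jointly constrained by a weighted combination of cross-entropy loss and hinge loss. Cross-entropy loss measures how much two different probability distributions over the same random variable differ; the smaller the cross-entropy loss, the closer the two distributions. Used on its own, however, cross-entropy loss tends to confuse the classification of cases near the class boundary. Hinge loss is designed for binary classification: it requires not only that the classification be correct, but also that the confidence be high enough before the loss becomes as small as possible. Because hinge loss measures how well the model fits the training data and, with an added regularization term, also measures the complexity of the model itself, it can greatly reduce the risk of overfitting.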
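One possible way to write the weighted objective is given below. The disclosure does not state the weights or the exact form of each term, so the weight $\lambda$, the binary label $y_i \in \{0, 1\}$, the signed label $t_i = 2y_i - 1$, the predicted probability $p_i$ and the raw score $s_i$ are all assumptions introduced here.

```latex
\begin{aligned}
L_{\mathrm{CE}}    &= -\frac{1}{N}\sum_{i=1}^{N}\bigl[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\bigr],\\
L_{\mathrm{hinge}} &= \frac{1}{N}\sum_{i=1}^{N}\max\bigl(0,\; 1 - t_i s_i\bigr),\\
L                  &= \lambda\, L_{\mathrm{CE}} + (1 - \lambda)\, L_{\mathrm{hinge}}, \qquad 0 \le \lambda \le 1 .
\end{aligned}
```

Under this form the hinge term stays positive for any sample whose margin $t_i s_i$ is below 1, even when it is classified correctly, which is the "sufficiently confident" requirement described above.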
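In a second aspect, the invention provides a method for processing EHR data, comprising:
acquiring EHR data, the data comprising structured data and unstructured data;
processing the structured data and the unstructured data separately, which comprises cleaning the structured data and the unstructured data separately, extracting features from the cleaned structured data and the cleaned unstructured data separately, and fusing the resulting unstructured-data features and structured-data features and then extracting fused features;
using the extracted fused feature data as data to be recognized, for medical purposes.
In some embodiments of the invention, the data cleaning comprises replacing outliers, filling missing values with the mean, and deleting dirty data; preferably, the unstructured data is text.
In some embodiments of the invention, an FCN is used to extract the structured-data features, and BERT is used to extract the unstructured-data features.
In some embodiments of the invention, the operation of extracting the fused features comprises: concatenating the unstructured-data features and the structured-data features along a specified dimension; applying SMOTE, which analyzes the minority-class samples and generates new samples of that class, to reduce the imbalance ratio; and then applying a segmented pooling operation to extract the fused features.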
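In a third aspect, the invention provides a method for constructing a disease risk prediction model, comprising:
acquiring EHR data of patients whose disease risk outcome is known, the data comprising structured data and unstructured data; constructing a dataset from the acquired EHR data, comprising a structured dataset and an unstructured dataset, and constructing a label set from the known final outcomes;
constructing a disease risk prediction network, which comprises constructing a feature extraction module for structured data, a feature extraction module for unstructured data, and a feature fusion module, wherein the structured-data feature extraction module and the unstructured-data feature extraction module are connected in parallel and then connected in series at the decision layer of the feature fusion module; the disease risk prediction network is implemented on the PyTorch framework;
training the constructed disease risk prediction network on the dataset (the structured dataset and the unstructured dataset), with the label set as labels, to obtain the disease risk prediction model;
and jointly constraining the model with a weighted combination of cross-entropy loss and hinge loss.
In some embodiments of the invention, a step of cleaning the acquired EHR data precedes construction of the dataset; the data cleaning comprises replacing outliers, filling missing values with the mean, and deleting dirty data.
In some embodiments of the invention, the feature extraction module for structured data is an FCN module, and the feature extraction module for unstructured data is a BERT module (Transformer module).
In some embodiments of the invention, the feature fusion module concatenates the unstructured-data features and the structured-data features along a specified dimension, applies SMOTE, which analyzes the minority-class samples and generates new samples of that class, to reduce the imbalance ratio, and then applies a segmented pooling operation to extract the fused features.
In some embodiments of the invention, during training on the dataset, the fused features are fed as input to the fully connected layer to train the Softmax classifier.
The invention further includes the multimodal-fusion-based disease risk prediction model constructed according to the third aspect above.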
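In a fourth aspect, the invention provides a risk prediction system based on multimodal fusion, the system comprising:
a feature extraction module, configured to extract features from the EHR data to obtain unstructured-data features and structured-data features;
a feature fusion module, configured to fuse the unstructured-data features and structured-data features and extract fused features;
a classification module, which takes the extracted fused features as input and produces the disease risk prediction result.
In some embodiments of the invention, the feature extraction module comprises a structured-data feature extraction module and an unstructured-data feature extraction module;
the structured-data feature extraction module takes the preprocessed structured data as input to the FCN and maps the data to latent semantic nodes to obtain the structured-data features.
The unstructured-data feature extraction module uses BERT to extract features from the preprocessed unstructured data; preferably, BERT consists of a BERT Encoder, the BERT Encoder consists of multiple BERT Layers, and each BERT Layer is an Encoder Block from the Transformer; each encoder layer contains two sub-layers, a self-attention layer and a feed-forward neural network layer.
In some embodiments of the invention, the feature fusion module concatenates the unstructured-data features and the structured-data features along a specified dimension, applies SMOTE, which analyzes the minority-class samples and generates new samples of that class, to reduce the imbalance ratio, and then applies a segmented pooling operation to extract the fused features.
In some embodiments of the invention, the classification module feeds the fused features as input to the fully connected layer and then classifies with a Softmax classifier to obtain the disease risk prediction result.
In some embodiments of the invention, the system further comprises a data acquisition module, configured to acquire the EHR data.
In some embodiments of the invention, the system further comprises a data cleaning module, configured to preprocess the EHR data after it is acquired and before features are extracted from it; in the preprocessing, the data cleaning module replaces outliers in the EHR data, fills missing values with the mean, and deletes dirty data.
In some embodiments of the invention, the system further comprises a result output module, configured to output the disease risk prediction result.
In a fifth aspect, the invention provides a computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of the first aspect and/or the second aspect and/or the third aspect above.
In a sixth aspect, the invention provides a computer-readable storage medium on which computer program instructions are stored, wherein the program instructions, when executed by a processor, implement the steps of the method according to any one of the first aspect and/or the second aspect and/or the third aspect above.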
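One or more of the above technical means can achieve the following beneficial effects:
The invention provides an end-to-end patient outcome prediction model. The patient's EHR data is read automatically and used as the model input; after the data is mined and analyzed with deep learning methods, the output is the predicted outcome event for the patient. This can give physicians useful reference information, anticipate how the patient's condition will develop, and support timely treatment, while also increasing the patient's willingness to cooperate with treatment.
The invention uses the bidirectional language model BERT to extract features from medical text; it can process multiple groups of inputs in parallel and capture information from different subspaces. The attention mechanism helps the model obtain contextual information more effectively, learn the word dependencies within a sentence, and capture the sentence's internal structure. The model is pre-trained on data such as Chinese medical question-and-answer corpora, Chinese medical encyclopedias and Chinese electronic medical records, so that medical entities such as "腹痛" (abdominal pain) receive a more effective vector representation.
The invention uses multimodal fusion to preprocess, analyze and mine data such as the patient's electronic medical records, past medical history and free-text record notes, and builds a risk prediction model that predicts the patient's outcome. It provides a supporting tool for making use of real clinical data and for assessing disease outcomes, and helps physicians provide each patient with a personalized treatment plan.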
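Brief Description of the Drawings
The accompanying drawings, which form part of this application, are provided to give a further understanding of the application; the illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it. Embodiments of the application are described in detail below with reference to the drawings, in which:
FIG. 1 is a flowchart of the method for processing EHR data in the first embodiment of the invention.
FIG. 2 is a structural diagram of the system for processing EHR data in the first embodiment of the invention.
FIG. 3 is a functional flowchart of the feature fusion module in one or more embodiments of the invention.
FIG. 4 is a flowchart of the multimodal-fusion-based disease risk prediction method in the third embodiment of the invention.
FIG. 5 is a functional flowchart of the disease risk prediction model in one or more embodiments of the invention.
FIG. 6 is a structural diagram of a multimodal-fusion-based risk prediction system in the fourth embodiment of the invention.
FIG. 7 is a structural diagram of a multimodal-fusion-based risk prediction system in the fourth embodiment of the invention.
FIG. 8 is a structural diagram of a multimodal-fusion-based risk prediction system in the fourth embodiment of the invention.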
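Detailed Description
The application is further described below with reference to specific embodiments. It should be understood that these embodiments serve only to illustrate the application and not to limit its scope. Experimental methods in the following embodiments for which no specific conditions are given generally follow conventional conditions or the conditions recommended by the manufacturer.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean A alone, B alone, or both A and B. The term "/and" herein describes another kind of association and indicates that two relationships are possible; for example, "A/and B" can mean A alone, or both A and B. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
The terminology used herein is for describing particular embodiments only and is not intended to limit the example embodiments of the application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises", "comprising", "includes" and/or "including", when used herein, specify the presence of the stated features, integers, steps, operations, units and/or components, and do not exclude the presence or addition of one or more other features, numbers, steps, operations, units, components and/or combinations thereof.
It should be understood that although the terms first, second, third, etc. may be used in this application to describe various kinds of information, the information should not be limited by these terms. The terms are used only to distinguish information of the same type from one another. For example, without departing from the scope of the application, first information could also be called second information and, similarly, second information could also be called first information. Depending on context, the word "if" as used here can be interpreted as "when", "upon" or "in response to determining".
Moreover, particular features, structures, functions or characteristics may be combined in any suitable manner in one or more embodiments. For example, the first embodiment may be combined with the second embodiment, provided that the particular features, structures, functions or characteristics associated with those embodiments are not mutually exclusive.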
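In a first embodiment, the invention provides a method for processing EHR data, comprising: acquiring EHR data, the data comprising structured data and unstructured data;
processing the EHR data, following the flow shown in FIG. 1, which comprises: processing the structured data and the unstructured data separately, including cleaning the structured data and the unstructured data separately, extracting features from the cleaned structured data and the cleaned unstructured data separately, and fusing the resulting unstructured-data features and structured-data features and then extracting fused features;
using the extracted fused feature data as data to be recognized, for medical purposes.
Based on the method of the first embodiment, the invention also provides a system for processing EHR data, whose core modules comprise a feature extraction module and a feature fusion module;
optionally, the EHR data to be processed can be cleaned after it is acquired, so the system may also include a data cleaning module, as shown in FIG. 2.
The data cleaning module replaces outliers, fills missing values with the mean, and deletes dirty data. For example, the data can first be screened for outliers and each outlier replaced with a null value; a weighted average of the data is then computed and used to replace the outliers and missing values. SPSS can be used to carry out the cleaning.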
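The cleaning procedure just described can be sketched for a single numeric column as follows. The plausibility range used to flag outliers, the equal weighting in the average, the function name `clean_column` and the sample readings are assumptions for illustration; the disclosure itself only names SPSS as one possible tool.

```python
def clean_column(values, low, high):
    """Replace out-of-range values with None, then fill every None
    (outlier or originally missing) with the mean of the valid values."""
    screened = [v if v is not None and low <= v <= high else None for v in values]
    valid = [v for v in screened if v is not None]
    if not valid:
        raise ValueError("column has no valid values to average")
    mean = sum(valid) / len(valid)  # equal weights assumed
    return [mean if v is None else v for v in screened]

# Hypothetical systolic blood pressure readings (mmHg):
# 999.0 is treated as an entry error, None as a missing value.
readings = [120.0, 130.0, 999.0, None, 110.0]
cleaned = clean_column(readings, low=50.0, high=250.0)
```

Screening before averaging matters: if the mean were computed first, the outlier itself would distort the value used for imputation.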
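The feature extraction module extracts features from the structured data and the unstructured data (for example, text) contained in the EHR data; optionally, it comprises a structured-data feature extraction module and an unstructured-data feature extraction module.
The structured-data feature extraction module takes the cleaned structured data as input to the FCN and maps the data to latent semantic nodes to obtain the structured-data features. In this embodiment, the module learns a weight matrix W through Dense layers and thereby obtains a re-weighted representation of the structured-data features. Because the data is discrete, positional information among the features has little influence on the decision, so positional information can optionally be discarded in this process.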
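The mapping performed by one Dense layer can be sketched as a matrix-vector product followed by a non-linearity. The ReLU activation, the layer sizes and the hand-picked weights below are assumptions; in the described system W would be learned during training.

```python
def dense(x, W, b):
    """Forward pass of one fully connected layer for a single sample:
    y = relu(W x + b), with one row of W per output (latent) node."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(W, b)
    ]

x = [1.0, 2.0, 3.0]        # three cleaned structured fields (hypothetical)
W = [[0.5, 0.0, -0.5],     # stand-in for learned weights
     [1.0, 1.0, 1.0]]
b = [0.0, -1.0]
features = dense(x, W, b)  # two latent-node activations
```

Every output node sees every input field, regardless of column order, which is consistent with the remark that positional information among structured fields can be discarded.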
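The unstructured-data feature extraction module uses BERT to extract features from the cleaned unstructured text data. BERT consists of a BERT Encoder, the BERT Encoder consists of multiple BERT Layers, and each BERT Layer is an Encoder Block from the Transformer; each encoder layer contains two sub-layers, a self-attention layer and a feed-forward neural network layer. In this embodiment, the module for mining unstructured text data uses stacked Transformer encoder modules: a word-embedding tensor, a sentence-segment tensor and a position-encoding tensor are obtained to capture, respectively, the semantic information, sentence information and positional information of the medical text, and from these the vector representation of the text medical record is computed.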
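In standard BERT the three tensors are combined by element-wise addition before entering the encoder stack; the disclosure does not spell out the combination, so treating it as a sum is an assumption here. The sketch below uses tiny hand-made lookup tables (`tok_table`, `seg_table`, `pos_table`), all hypothetical, in place of learned embeddings.

```python
def embed(token_ids, segment_ids, tok_table, seg_table, pos_table):
    """Return one input vector per token: the element-wise sum of its
    token embedding, segment embedding and position embedding."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vectors.append([
            t + s + p
            for t, s, p in zip(tok_table[tok], seg_table[seg], pos_table[pos])
        ])
    return vectors

tok_table = [[1.0, 2.0], [3.0, 4.0]]        # vocabulary of 2 tokens, dimension 2
seg_table = [[10.0, 0.0], [0.0, 10.0]]      # sentence A / sentence B
pos_table = [[0.0, 0.0], [100.0, 100.0]]    # positions 0 and 1
inputs = embed([1, 0], [0, 0], tok_table, seg_table, pos_table)
```

The same token receives a different input vector at a different position or in a different sentence segment, which is how word order and sentence membership reach the encoder.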
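For the feature fusion module, as shown in FIG. 3, a concatenation layer concatenates the structured-data features and the unstructured-data features along a specified dimension; SMOTE, which analyzes the minority-class samples and generates new samples of that class, is used to reduce the imbalance ratio; and an added segmented pooling operation extracts the important information from each kind of data separately, according to data type. Because medical datasets usually have few samples and batch normalization is affected by batch_size, in embodiments of the invention the output of each sub-layer uses layer normalization.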
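A minimal sketch of concatenation, segmented pooling and layer normalization for one sample follows. The disclosure does not specify the pooling function or segment boundaries, so max pooling over two segments (one per modality) is an assumption, as are the function names and toy values; the learnable scale and shift of layer normalization are omitted.

```python
import math

def concat(structured, text):
    """Join the two feature vectors along the feature dimension."""
    return structured + text

def segmented_max_pool(fused, boundary):
    """Pool each modality's segment separately, so that neither modality's
    strongest signal is masked by the other."""
    return [max(fused[:boundary]), max(fused[boundary:])]

def layer_norm(x, eps=1e-5):
    """Normalize one sample across its own features (independent of batch size)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

structured = [0.2, 0.9, 0.4]   # hypothetical structured-data features
text = [1.5, -0.3]             # hypothetical text features
fused = concat(structured, text)
pooled = segmented_max_pool(fused, boundary=len(structured))
normed = layer_norm(fused)
```

Because layer normalization uses only the statistics of a single sample, its output does not change with batch size, which is the property the paragraph above relies on for small medical datasets.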
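In a second embodiment, the invention provides a method for constructing a disease risk prediction model, comprising:
acquiring EHR data of patients whose disease risk outcome is known (the data comprises structured data and unstructured data, the unstructured data being mainly text); constructing a dataset (a structured dataset and a text dataset) from the EHR data, and a label set from the final outcomes;
optionally, cleaning the acquired EHR data, the cleaning comprising replacing outliers, filling missing values with the mean, and deleting dirty data;
constructing a disease risk prediction network, which comprises constructing a feature extraction module for structured data (FCN), a feature extraction module for unstructured data (Transformer module), and a feature fusion module, wherein the structured-data feature extraction module and the unstructured-data feature extraction module are connected in parallel and then connected in series at the decision layer of the feature fusion module; the model architecture is implemented on the PyTorch framework;
training the network constructed above on the dataset, with the label set as labels, to construct the disease risk prediction model; in this embodiment, with the disease risk outcome as the label, the fused features are fed as input to the fully connected layer and the Softmax classifier is trained, yielding the disease risk prediction model.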
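The decision step (fully connected layer followed by Softmax) can be sketched as below. The weights are hand-picked stand-ins for trained parameters, and the two-class setup, function names and input values are assumptions for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the maximum
    keeps math.exp from overflowing."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(fused, W, b):
    """One fully connected layer producing a score per class, then Softmax."""
    logits = [sum(w * f for w, f in zip(row, fused)) + bias for row, bias in zip(W, b)]
    return softmax(logits)

fused = [0.9, 1.5]                 # hypothetical fused features
W = [[1.0, -1.0], [-1.0, 1.0]]     # stand-in for trained weights, one row per class
b = [0.0, 0.0]
probs = classify(fused, W, b)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The class with the highest probability is reported as the predicted outcome; the probabilities themselves can also be surfaced as a risk score.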
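Inputting the EHR data of a patient whose risk is to be predicted into the trained disease risk prediction model outputs the patient's outcome category.
Further, the disease risk prediction model is jointly constrained by a weighted combination of cross-entropy loss and hinge loss. Cross-entropy loss measures how much two different probability distributions over the same random variable differ; the smaller the cross-entropy loss, the closer the two distributions. Used on its own, however, cross-entropy loss tends to confuse the classification of cases near the class boundary. Hinge loss is designed for binary classification: it requires not only that the classification be correct, but also that the confidence be high enough before the loss becomes as small as possible. Because hinge loss measures how well the model fits the training data and, with an added regularization term, also measures the complexity of the model itself, it can greatly reduce the risk of overfitting.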
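A per-sample version of such a weighted loss can be sketched as follows. The weight `lam`, the use of a sigmoid to turn the raw score into a probability, and the function name `combined_loss` are assumptions, since the disclosure does not give the weights or the exact formulation; the regularization term mentioned above is left out.

```python
import math

def combined_loss(score, y, lam=0.5):
    """Weighted sum of binary cross-entropy and hinge loss for one sample.
    score: raw model output (positive favors class 1); y: label, 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-score))             # sigmoid probability of class 1
    ce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    t = 2 * y - 1                                   # map {0, 1} to {-1, +1}
    hinge = max(0.0, 1.0 - t * score)               # zero only when the margin is at least 1
    return lam * ce + (1.0 - lam) * hinge

confident = combined_loss(3.0, 1)    # correct, with a large margin
marginal = combined_loss(0.2, 1)     # correct, but close to the boundary
wrong = combined_loss(-2.0, 1)       # misclassified
```

The marginal case is still penalized by the hinge term even though it is classified correctly, which pushes boundary samples away from the decision surface.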
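In a third embodiment, based on the disease risk prediction model constructed in the second embodiment, the invention provides a disease risk prediction method based on multimodal fusion which, as shown in FIG. 4, comprises:
acquiring EHR data of a patient whose risk is to be predicted, where the EHR data may comprise structured data and unstructured data (text);
inputting the acquired EHR data into the disease risk prediction model to obtain a disease risk prediction result;
outputting the disease risk prediction result.
The steps performed by the disease risk prediction model, as shown in FIG. 5, comprise:
extracting structured-data features and unstructured-data features;
extracting fused features, the fused features being a fusion of the unstructured-data features and the structured-data features;
making a decision on the fused features to obtain the disease risk prediction result.
In this embodiment, the model is jointly constrained by a weighted combination of cross-entropy loss and hinge loss. Cross-entropy loss measures how much two different probability distributions over the same random variable differ; the smaller the cross-entropy loss, the closer the two distributions. Used on its own, however, cross-entropy loss tends to confuse the classification of cases near the class boundary. Hinge loss is designed for binary classification: it requires not only that the classification be correct, but also that the confidence be high enough before the loss becomes as small as possible. Because hinge loss measures how well the model fits the training data and, with an added regularization term, also measures the complexity of the model itself, it can greatly reduce the risk of overfitting.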
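In a fourth embodiment, the invention provides a risk prediction system based on multimodal fusion which, as shown in FIG. 6, comprises a feature extraction module, a feature fusion module and a classification module.
The feature extraction module comprises a structured-data extraction module and an unstructured-data extraction module, as shown in FIG. 7.
On the basis of this embodiment, the multimodal-fusion-based risk prediction system may further comprise a data acquisition module and/or a data cleaning module and/or a result output module.
For example, in this embodiment the system may be as shown in FIG. 8.
As shown in FIG. 8, after the system acquires the EHR data of the patient to be assessed (comprising structured data and unstructured data, the latter being text, for example), the data cleaning module preprocesses the EHR data, which comprises replacing outliers, filling missing values with the mean, and deleting dirty data.
The cleaned unstructured data, for example text data, undergoes feature extraction in the text feature extraction module, which applies the bidirectional language model BERT to the medical text. The core of the model is the BERT Encoder; the BERT Encoder consists of multiple BERT Layers, and each BERT Layer is in fact an Encoder Block from the Transformer. Each encoder layer contains two sub-layers: a self-attention layer and a feed-forward neural network layer.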
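The two sub-layers of an encoder layer can be sketched in a heavily simplified form. This sketch assumes a single attention head with no learned query/key/value projections (the input serves as all three) and omits the residual connections, layer normalization and multi-head structure of a real Transformer encoder; the weight matrices `W1` and `W2` and the input `X` are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)  # how strongly this token attends to each token
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

def feed_forward(x, W1, W2):
    """Position-wise feed-forward network: linear, ReLU, linear."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

def encoder_block(X, W1, W2):
    attended = self_attention(X)
    return [feed_forward(x, W1, W2) for x in attended]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three token vectors of dimension 2
identity = [[1.0, 0.0], [0.0, 1.0]]
encoded = encoder_block(X, identity, identity)
```

Each output vector mixes information from every token in the sequence, which is how the encoder captures dependencies between distant words in a long medical note.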
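The cleaned structured data undergoes feature extraction in the structured-data feature extraction module: it is used as input to the FCN, which maps the raw data to latent semantic nodes to obtain the structured-data features.
As shown in FIG. 3, the fusion module concatenates the features of the structured data and the features of the text data along a specified dimension, and applies SMOTE, which analyzes the minority-class samples and generates new samples of that class, to reduce the imbalance ratio. A segmented pooling operation is then used to extract the important information from the different kinds of data and obtain the fused features.
The classification module feeds the fused features extracted after fusion as input to the fully connected layer and then predicts the patient's outcome with a Softmax classifier.
Further, the prediction result obtained by the classification module can be output through the result output module.
Medical personnel can combine the output with their own judgment to reach a final conclusion.
The system described in this embodiment can implement the multimodal-fusion-based disease risk prediction method described in the third embodiment.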
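In a fifth embodiment, the invention provides a computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method described in the first embodiment;
and/or the processor, when executing the computer program, implements the steps of the method described in the second embodiment;
and/or the processor, when executing the computer program, implements the steps of the method described in the third embodiment.
In a sixth embodiment, the invention provides a computer-readable storage medium on which computer program instructions are stored, wherein the program instructions, when executed by a processor, implement the steps of the method described in the first embodiment;
and/or the program instructions, when executed by a processor, implement the steps of the method described in the second embodiment;
and/or the program instructions, when executed by a processor, implement the steps of the method described in the third embodiment.
The device embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
From the description of the embodiments above, a person skilled in the art will clearly understand that each embodiment can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by a combination of hardware and software. On this understanding, the essence of the above technical solution, or the part that contributes to the prior art, can be embodied in the form of a computer product; the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) that contain computer-usable program code.
The above are merely preferred embodiments of the application and are not intended to limit it. Although the application has been described in detail with reference to the foregoing embodiments, a person skilled in the art can still modify the technical solutions described in those embodiments or make equivalent substitutions for some of their technical features. Any modification, equivalent substitution, improvement or the like made within the spirit and principles of the application shall fall within its scope of protection.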

Claims (10)

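  1. A disease risk prediction method based on multimodal fusion, characterized in that the method comprises:
    acquiring a patient's EHR data, comprising structured data and unstructured data;
    inputting the EHR data into a disease risk prediction model to obtain a disease risk prediction result;
    outputting the disease risk prediction result;
    wherein the steps performed by the disease risk prediction model comprise:
    extracting structured-data features and unstructured-data features;
    fusing the structured-data features and the unstructured-data features and extracting fused features;
    making a decision on the fused features to obtain the disease risk prediction result.
  2. The method according to claim 1, characterized in that a fully convolutional network is used to extract the structured-data features;
    preferably, BERT is used to extract the unstructured-data features.
  3. The method according to claim 1 or 2, characterized in that the operation of extracting the fused features comprises: concatenating the unstructured-data features and the structured-data features along a specified dimension; applying SMOTE, which analyzes the minority-class samples and generates new samples of that class, to reduce the imbalance ratio; and then applying a segmented pooling operation to extract the fused features;
    preferably, at prediction time, the fused features are fed as input to the fully connected layer, and disease risk is then predicted by a Softmax classifier;
    preferably, the disease risk prediction model is jointly constrained by a weighted combination of cross-entropy loss and hinge loss.
  4. The method according to claim 1, characterized in that, before extracting the structured-data features and unstructured-data features, the disease risk prediction model further performs a data cleaning step;
    preferably, the data cleaning comprises replacing outliers, filling missing values with the mean, and deleting dirty data;
    preferably, the unstructured data is text.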
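  5. A risk prediction system based on multimodal fusion, characterized in that the system comprises:
    a feature extraction module, configured to extract features from EHR data to obtain unstructured-data features and structured-data features;
    a feature fusion module, configured to fuse the unstructured-data features and structured-data features and extract fused features;
    a classification module, which takes the extracted fused features as input and produces a disease risk prediction result.
  6. The system according to claim 5, characterized in that the feature extraction module comprises a structured-data feature extraction module and an unstructured-data feature extraction module;
    wherein the structured-data feature extraction module takes the structured data as input to an FCN and maps the data to latent semantic nodes to obtain the structured-data features;
    wherein the unstructured-data feature extraction module uses BERT to extract features from the unstructured data; preferably, BERT consists of a BERT Encoder, the BERT Encoder consists of multiple BERT Layers, and each BERT Layer is an Encoder Block from the Transformer; each encoder layer contains two sub-layers, a self-attention layer and a feed-forward neural network layer;
    preferably, the feature fusion module concatenates the unstructured-data features and the structured-data features along a specified dimension, applies SMOTE, which analyzes the minority-class samples and generates new samples of that class, to reduce the imbalance ratio, and then applies a segmented pooling operation to extract the fused features;
    preferably, the classification module feeds the fused features or the structured data as input to the fully connected layer and then predicts the patient's outcome with a Softmax classifier;
    preferably, the system further comprises a data acquisition module, configured to acquire the EHR data;
    preferably, the system further comprises a data cleaning module, configured to preprocess the EHR data after it is acquired and before features are extracted from it; in the preprocessing, the data cleaning module replaces outliers in the EHR data, fills missing values with the mean, and deletes dirty data;
    preferably, the system further comprises a result output module, configured to output the disease risk prediction result.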
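  7. A method for processing EHR data, characterized in that it comprises:
    acquiring EHR data, the data comprising structured data and unstructured data;
    processing the structured data and the unstructured data separately, which comprises cleaning the structured data and the unstructured data separately, extracting features from the cleaned structured data and the cleaned unstructured data separately, and fusing the resulting unstructured-data features and structured-data features and then extracting fused features;
    using the extracted fused feature data as data to be recognized, for medical purposes;
    preferably, the data cleaning comprises replacing outliers, filling missing values with the mean, and deleting dirty data; preferably, the unstructured data is text;
    preferably, an FCN is used to extract the structured-data features;
    preferably, BERT is used to extract the unstructured-data features;
    preferably, the operation of extracting the fused features comprises: concatenating the unstructured-data features and the structured-data features along a specified dimension; applying SMOTE, which analyzes the minority-class samples and generates new samples of that class, to reduce the imbalance ratio; and then applying a segmented pooling operation to extract the fused features.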
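  8. A method for constructing a disease risk prediction model, characterized in that it comprises:
    acquiring EHR data of patients whose disease risk outcome is known, the data comprising structured data and unstructured data; constructing a dataset from the acquired EHR data, comprising a structured dataset and an unstructured dataset, and constructing a label set from the known final outcomes;
    constructing a disease risk prediction network, which comprises constructing a feature extraction module for structured data, a feature extraction module for unstructured data, and a feature fusion module, wherein the structured-data feature extraction module and the unstructured-data feature extraction module are connected in parallel and then connected in series at the decision layer of the feature fusion module; the disease risk prediction network is implemented on the PyTorch framework;
    training the constructed disease risk prediction network on the dataset, with the label set as labels, to construct the disease risk prediction model;
    preferably, a step of cleaning the acquired EHR data precedes construction of the dataset, the data cleaning comprising replacing outliers, filling missing values with the mean, and deleting dirty data;
    preferably, the feature extraction module for structured data is an FCN module;
    preferably, the feature extraction module for unstructured data is a BERT module;
    preferably, the feature fusion module concatenates the unstructured-data features and the structured-data features along a specified dimension, applies SMOTE, which analyzes the minority-class samples and generates new samples of that class, to reduce the imbalance ratio, and then applies a segmented pooling operation to extract the fused features;
    preferably, during training on the dataset, the fused features are fed as input to the fully connected layer to train the Softmax classifier.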
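  9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 4;
    and/or the processor, when executing the computer program, implements the steps of the method according to claim 7;
    and/or the processor, when executing the computer program, implements the steps of the method according to claim 8.
  10. A computer-readable storage medium on which computer program instructions are stored, characterized in that the program instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 4;
    and/or the program instructions, when executed by a processor, implement the steps of the method according to claim 7;
    and/or the program instructions, when executed by a processor, implement the steps of the method according to claim 8.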
PCT/CN2021/106860 2021-04-30 2021-07-16 Disease risk prediction method and system based on multimodal fusion WO2022227294A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/910,556 US20240203599A1 (en) 2021-04-30 2021-07-16 Method and system of for predicting disease risk based on multimodal fusion

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110486200.2 2021-04-30
CN202110486200.2A CN113241135B (zh) 2021-04-30 2021-04-30 Disease risk prediction method and system based on multimodal fusion

Publications (1)

Publication Number Publication Date
WO2022227294A1 true WO2022227294A1 (zh) 2022-11-03

Family

ID=77131993

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/106860 WO2022227294A1 (zh) 2021-04-30 2021-07-16 Disease risk prediction method and system based on multimodal fusion

Country Status (3)

Country Link
US (1) US20240203599A1 (zh)
CN (1) CN113241135B (zh)
WO (1) WO2022227294A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
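CN115424724A (zh) * 2022-11-04 2022-12-02 之江实验室 Multimodal graph-forest-based auxiliary diagnosis system for lung cancer *** metastasis
CN116049397A (zh) * 2022-12-29 2023-05-02 北京霍因科技有限公司 Method for discovering and automatically classifying and grading sensitive information based on multimodal fusion
CN116246774A (zh) * 2023-03-15 2023-06-09 北京医准智能科技有限公司 Classification method, apparatus and device based on information fusion
CN117409930A (zh) * 2023-12-13 2024-01-16 江西为易科技有限公司 AI-based medical rehabilitation data processing method and system
CN117438023A (zh) * 2023-10-31 2024-01-23 灌云县南岗镇卫生院 Big-data-based hospital information management method and system
CN117992925A (zh) * 2024-04-03 2024-05-07 成都新希望金融信息有限公司 Risk prediction method and apparatus based on multi-source heterogeneous data and multimodal data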

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
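CN113241135B (zh) * 2021-04-30 2023-05-05 山东大学 Disease risk prediction method and system based on multimodal fusion
CN113707309A (zh) * 2021-08-31 2021-11-26 平安科技(深圳)有限公司 Machine-learning-based disease prediction method and apparatus
CN114067935B (zh) * 2021-11-03 2022-05-20 广西壮族自治区通信产业服务有限公司技术服务分公司 Epidemiological investigation method, system, electronic device and storage medium
CN114203295B (zh) * 2021-11-23 2022-05-20 国家康复辅具研究中心 Stroke risk prediction and intervention method and system
TWI829065B (zh) * 2022-01-06 2024-01-11 沐恩生醫光電股份有限公司 Data fusion system and operating method thereof
CN114463825B (zh) * 2022-04-08 2022-07-15 北京邮电大学 Face prediction method based on multimodal fusion and related device
CN114822880B (zh) * 2022-06-30 2023-02-28 北京超数时代科技有限公司 Hospital diagnosis and treatment information system based on domestically developed, independently controllable technology
CN115131642B (zh) * 2022-08-30 2022-12-27 之江实验室 Multimodal medical data fusion system based on multi-view subspace clustering
CN115844348A (zh) * 2023-02-27 2023-03-28 山东大学 Wearable-device-based graded-response early warning method and system for cardiac arrest
CN115862875B (zh) * 2023-02-27 2024-02-09 四川大学华西医院 Postoperative pulmonary complication prediction method and system based on multi-type feature fusion
CN116612886A (zh) * 2023-05-06 2023-08-18 广东省人民医院 Early auxiliary diagnosis method, system, device and storage medium for stroke
CN117217807B (zh) * 2023-11-08 2024-01-26 四川智筹科技有限公司 Non-performing asset valuation method based on multimodal high-dimensional features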

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160342764A1 (en) * 2015-05-19 2016-11-24 Universidad De Vigo System, computer-implemented method and computer program product for individualized multiple-disease quantitative risk assessment
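CN109117864A (zh) * 2018-07-13 2019-01-01 华南理工大学 Coronary heart disease risk prediction method, model and system based on heterogeneous feature fusion
CN111916207A (zh) * 2020-08-07 2020-11-10 杭州深睿博联科技有限公司 Disease recognition method and apparatus based on multimodal fusion
CN113241135A (zh) * 2021-04-30 2021-08-10 山东大学 Disease risk prediction method and system based on multimodal fusion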

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
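CN108428478B (zh) * 2018-02-27 2022-03-29 东北师范大学 Thyroid cancer risk prediction method based on heterogeneous medical data mining
CN109119130A (zh) * 2018-07-11 2019-01-01 上海夏先机电科技发展有限公司 Cloud-computing-based big-data health management system and method
CN111260209B (zh) * 2020-01-14 2022-03-11 山东大学 Cardiovascular disease risk prediction and assessment system combining electronic medical records and medical imaging
CN111680169A (zh) * 2020-06-03 2020-09-18 国网内蒙古东部电力有限公司 BERT-model-based data extraction method for electric power scientific and technological achievements
CN112199425A (zh) * 2020-09-16 2021-01-08 北京好医生云医院管理技术有限公司 Medical big-data center based on a hybrid database structure and construction method thereof
CN112182243B (zh) * 2020-09-27 2023-11-28 中国平安财产保险股份有限公司 Method, terminal and storage medium for constructing a knowledge graph based on an entity recognition model
CN112365987B (zh) * 2020-10-27 2023-06-06 平安科技(深圳)有限公司 Diagnostic data anomaly detection method and apparatus, computer device and storage medium
CN112463922A (zh) * 2020-11-25 2021-03-09 中国测绘科学研究院 Risk user identification method and storage medium
CN112652386A (zh) * 2020-12-25 2021-04-13 平安科技(深圳)有限公司 Triage data processing method and apparatus, computer device and storage medium
CN112633426B (zh) * 2021-03-11 2021-06-15 腾讯科技(深圳)有限公司 Method, apparatus, electronic device and storage medium for handling data class imbalance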

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LAILA RASMY; YANG XIANG; ZIQIAN XIE; CUI TAO; DEGUI ZHI: "Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction", ARXIV.ORG, 22 May 2020 (2020-05-22), pages 1 - 23, XP081683869 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
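CN115424724A (zh) * 2022-11-04 2022-12-02 之江实验室 Multimodal graph-forest-based auxiliary diagnosis system for lung cancer *** metastasis
CN116049397A (zh) * 2022-12-29 2023-05-02 北京霍因科技有限公司 Method for discovering and automatically classifying and grading sensitive information based on multimodal fusion
CN116049397B (zh) * 2022-12-29 2024-01-02 北京霍因科技有限公司 Method for discovering and automatically classifying and grading sensitive information based on multimodal fusion
CN116246774A (zh) * 2023-03-15 2023-06-09 北京医准智能科技有限公司 Classification method, apparatus and device based on information fusion
CN116246774B (zh) * 2023-03-15 2023-11-24 浙江医准智能科技有限公司 Classification method, apparatus and device based on information fusion
CN117438023A (zh) * 2023-10-31 2024-01-23 灌云县南岗镇卫生院 Big-data-based hospital information management method and system
CN117438023B (zh) * 2023-10-31 2024-04-26 灌云县南岗镇卫生院 Big-data-based hospital information management method and system
CN117409930A (zh) * 2023-12-13 2024-01-16 江西为易科技有限公司 AI-based medical rehabilitation data processing method and system
CN117409930B (zh) * 2023-12-13 2024-02-13 江西为易科技有限公司 AI-based medical rehabilitation data processing method and system
CN117992925A (zh) * 2024-04-03 2024-05-07 成都新希望金融信息有限公司 Risk prediction method and apparatus based on multi-source heterogeneous data and multimodal data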

Also Published As

Publication number Publication date
US20240203599A1 (en) 2024-06-20
CN113241135A (zh) 2021-08-10
CN113241135B (zh) 2023-05-05

Similar Documents

Publication Publication Date Title
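WO2022227294A1 (zh) Disease risk prediction method and system based on multimodal fusion
CN111316281B (zh) Method and system for machine-learning-based semantic classification of numerical data in natural-language contexts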
Pereira et al. COVID-19 identification in chest X-ray images on flat and hierarchical classification scenarios
Tayarani Applications of artificial intelligence in battling against covid-19: A literature review
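CN108831559B (zh) Chinese electronic medical record text analysis method and system
CN109460473B (zh) Multi-label classification method for electronic medical records based on symptom extraction and feature representation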
US20200303072A1 (en) Method and system for supporting medical decision making
US11610678B2 (en) Medical diagnostic aid and method
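CN113015977A (zh) Deep-learning-based diagnosis and referral of diseases and disorders using natural language processing
WO2023078025A1 (zh) Auxiliary differential diagnosis system for fever of unknown origin based on a task decomposition strategy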
Li et al. Intelligent diagnosis with Chinese electronic medical records based on convolutional neural networks
Carchiolo et al. Medical prescription classification: a NLP-based approach
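CN111666477A (zh) Data processing method and apparatus, smart device and medium
CN113284572B (zh) Multimodal heterogeneous medical data processing method and related apparatus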
Motwani et al. Enhanced framework for COVID-19 prediction with computed tomography scan images using dense convolutional neural network and novel loss function
Kaswan et al. AI-based natural language processing for the generation of meaningful information electronic health record (EHR) data
Cao et al. Automatic ICD code assignment based on ICD’s hierarchy structure for Chinese electronic medical records
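JP2023510667A (ja) Character acquisition, page processing and knowledge graph construction method and apparatus, and medium
CN114191665A (zh) Classification method and classification apparatus for patient-ventilator asynchrony during mechanical ventilation
CN117542467A (zh) Method for automatically constructing a disease-specific standard database based on patient data
CN116884612A (zh) Intelligent analysis method, apparatus, device and storage medium for disease risk levels
CN116313141A (zh) Knowledge-graph-based intelligent consultation method for fever of unknown origin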
Agoston Big data, artificial intelligence, and machine learning in neurotrauma
Basha et al. Deep learning neural network (DLNN)-based classification and optimization algorithm for organ inflammation disease diagnosis
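CN113990489A (zh) Intelligent data processing and analysis-mining system for clinical syndrome diagnosis and treatment in traditional Chinese medicine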

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 17910556

Country of ref document: US

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21938761

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21938761

Country of ref document: EP

Kind code of ref document: A1