CN115171905A

CN115171905A - Tumor patient similarity calculation method based on one-hot coding unsupervised clustering

Info

Publication number: CN115171905A
Application number: CN202210695043.0A
Authority: CN
Inventors: 张如奎; 刘雷; 朱超宇
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2022-06-20
Filing date: 2022-06-20
Publication date: 2022-10-11
Anticipated expiration: 2042-06-20
Also published as: CN115171905B

Abstract

The invention discloses a tumor patient similarity calculation method based on one-hot coding and unsupervised clustering. First, one-hot coding is applied uniformly to every observation index of the clinical data to obtain a feature embedding matrix. KMeans unsupervised clustering is then performed on the feature embedding matrix to generate a patient similarity network (PSN). Next, a clinical outcome correlation analysis is carried out based on the overall survival (OS) of the tumor patients, and the statistical difference between the survival curves of the clustered patient groups is tested to obtain a cPSN that is highly correlated with clinical outcome. Finally, for a target tumor patient to be evaluated, the KNN algorithm is used to retrieve the group of patients in the cPSN most similar to the target patient, and the range and fineness of the retrieved group are selected by adjusting the K value. The method overcomes the difficulty of encoding and integrating multi-modal medical data and the dependence of existing algorithms on physician labeling, and constructs a cPSN that effectively reflects patient similarity.

Description

Tumor patient similarity calculation method based on one-hot coding unsupervised clustering

Technical Field

The invention belongs to the technical field of intelligent medicine, is applicable to intelligent healthcare and precision medicine, and relates to a tumor patient similarity calculation method based on one-hot coding and unsupervised clustering.

Background

The clinical data of tumors differ markedly from the clinical data of other diseases. The tumor data with the highest information density and greatest clinical value are histopathology data and molecular genetics data, which have essentially no counterpart in other diseases; as a result, models and algorithms built for other, conventional diseases perform poorly and lack granularity in oncology and are unlikely to bring clinical benefit. Pathological examination is mainly used to determine whether a tumor lesion is benign or malignant and to determine the stage and pathological type of the tumor, while molecular testing can be used to identify the cause of the tumor, to classify the tumor molecularly, to determine marker expression, and so on. Together, pathological examination and molecular testing describe the tumor characteristics and the tumor microenvironment comprehensively and support a scientific diagnosis, providing an important basis for doctors to select treatment methods, formulate reasonable treatment plans, evaluate treatment effects and judge prognosis.

Although tumors are highly heterogeneous across patients, some patients are always similar to one another. At present, however, how to define and evaluate patient similarity remains an open problem in both clinical practice and medical research. Patient similarity calculation, which evaluates the similarity between patients by mathematical computation over their multi-modal, heterogeneous data, appears to be a solution. Generally, the first step of patient similarity calculation is to determine a strategy for integrating the multi-modal data; the second step is to define a patient similarity metric so that the distance or similarity score between patients is computed in a systematic and consistent manner; the third step is to build a patient similarity network (PSN) and carry out cluster analysis, feature analysis and the like within it; finally, for a new patient to be evaluated, a group of patients most similar to the target patient is located or defined in the PSN based on the similarity score.

Current patient similarity calculations are almost entirely applied to non-neoplastic diseases, typically drawing on hospitalization, diagnosis, treatment, prescription, laboratory test and physiological monitoring data from electronic medical records (EMRs). Some current methods use only continuous numerical variables and compute the Euclidean distance; others encode disease features with the hierarchical ICD (International Classification of Diseases) codes and compute the distance between a parent node and each child node to evaluate similarity, or convert medical record information into a medical knowledge graph, represent the entity nodes as vectors, and compute the path similarity of the nodes. These code-conversion methods have obvious defects: the data must be converted into another system such as ICD codes or a knowledge graph, so the calculation is indirect, is affected by many factors, and involves a complex procedure, all of which impair the accuracy of the results.

With the development of deep learning, similarity learning and prediction have been performed with a convolutional neural network (CNN) by representing a disease as a vector or a matrix. The model obtained in this way is highly specific to its data: although it performs well on the experimental data set, its generalization ability is weak. In addition, a CNN belongs to supervised learning; in fact, any similarity network that adopts supervised or semi-supervised learning requires the similarity of a portion of the patients to be labeled in advance and requires the parameter weights and the similarity threshold to be trained. In practice this is affected by the subjectivity of the doctors and by the quality of the neural network algorithm, so the results are insufficiently reliable. These two characteristics greatly reduce the clinical application value of deep learning, as represented by neural networks, in tumor patient similarity assessment.

Disclosure of Invention

In order to overcome the difficulties that multi-modal medical data are hard to encode and integrate and that algorithms depend on physician labeling, the invention applies one-hot encoding uniformly to all clinical data, performs the patient similarity calculation with an unsupervised method, and finally constructs a patient similarity network highly correlated with clinical outcome, the cPSN (clinical PSN), so as to effectively reflect patient similarity. The cPSN can be used to precisely locate tumor patients, to quickly formulate effective treatment intervention plans by reference to past cases, to accurately predict clinical outcomes, and so on.

The technical scheme of the invention is specifically described as follows.

The invention provides a tumor patient similarity calculation method based on one-hot coding unsupervised clustering, which comprises the following steps of:

(1) Uniformly encoding each observation index of the clinical data by one-hot coding to obtain a characteristic embedding matrix;

(2) Performing KMeans clustering on the feature embedding matrix to generate a patient similarity network PSN;

(3) Carrying out clinical outcome correlation analysis on the clustered grouped patients based on the overall survival time OS of the tumor patients, and evaluating the statistical difference of the survival curves of the clustered different grouped patients to obtain a patient similarity network cPSN with highly relevant clinical outcomes;

(4) For a target tumor patient to be evaluated, a group of patients most similar to the target patient is obtained in the cPSN by using a K-nearest-neighbor algorithm based on distance calculation, and the range and fineness of the retrieved group are selected by adjusting the K value.

In the invention, in step (1), the observation indexes of the clinical data comprise numerical variables, classification variables and clinical qualitative descriptions. Each item of histopathological data, such as superior mesenteric vein/portal vein involvement or a qualitative description of surgical margin status, can be made an independent variable; for molecular genetic data, such as gene mutation data from clinical gene testing and immunohistochemical data, each gene mutation and the expression level of each gene can be used as an independent variable.

In the invention, in step (1), when one-hot coding is adopted, classification variables are coded directly, while numerical variables and clinical qualitative descriptions are first converted into classification variables; each classification state of each variable is recorded as one one-hot feature. Suppose a set of samples has M observation indexes, denoted $X_1, X_2, \ldots, X_M$. Each observation index $X_i$ has $n_i$ classification states, recorded as $x_{i,1}, x_{i,2}, \ldots, x_{i,n_i}$, so that in all there are $M' = \sum_{i=1}^{M} n_i$ one-hot features. Preferably, for numerical variables, the values in a set of samples are divided into 4 parts by the quartile method, forming 4 classification states; for clinical qualitative descriptions, N states form N classification states.

In the invention, when a missing value appears in an observation index of clinical data, the missing value is taken as an independent one-hot coding type, and a null value does not need to be filled.
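The encoding described above, including a dedicated state for missing values, can be sketched as follows. This is a minimal illustration and not the patented implementation: the function name `one_hot_encode`, the example state values, and the choice of `#UNK` as the label for a missing value (borrowed from the feature name that appears in Example 1) are assumptions made for the sketch.

```python
def one_hot_encode(records, features, missing_label="#UNK"):
    """Return (column_names, matrix) for a list of dict records.

    Every distinct state of every feature becomes one 0/1 column. A missing
    value (None or an absent key) gets its own column and is not imputed.
    """
    # Collect the states observed for each feature, in sorted order.
    states = {}
    for feature in features:
        seen = set()
        for record in records:
            value = record.get(feature)
            seen.add(missing_label if value is None else value)
        states[feature] = sorted(seen)

    columns = [f"{feature}_{state}" for feature in features for state in states[feature]]

    matrix = []
    for record in records:
        row = []
        for feature in features:
            value = record.get(feature)
            if value is None:
                value = missing_label
            row.extend(1 if value == state else 0 for state in states[feature])
        matrix.append(row)
    return columns, matrix


# Hypothetical records; the state values are illustrative only.
patients = [
    {"Lauren_classification": "intestinal", "TP53_mut": "yes"},
    {"Lauren_classification": "diffuse", "TP53_mut": None},
    {"Lauren_classification": "intestinal", "TP53_mut": "no"},
]
columns, matrix = one_hot_encode(patients, ["Lauren_classification", "TP53_mut"])
```

Each row contains exactly one 1 per original feature, so the missing `TP53_mut` value of the second patient is carried by the `TP53_mut_#UNK` column rather than by an imputed state.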

In the invention, in the step (2), the KMeans clustering algorithm is Lloyd-Forgy.

In the invention, in step (2), during KMeans clustering, the clustering quality for each selected number of clusters K is evaluated using the Silhouette score method or the gap statistic method.

In the invention, the patient similarity network PSN is the encoded cluster set of all patients; it is an M'-dimensional high-dimensional network that embodies the similarity distance between patients, where M' is the sum of the numbers of classification states. Further preferably, the high-dimensional network can be visualized by dimensionality reduction using the t-SNE method and displayed in two or three dimensions.

In the invention, in the step (3), the Kaplan-Meier method is used for carrying out the correlation analysis of clinical outcome; the statistical differences in the survival curves of the patients in the different groups after clustering were evaluated using the log-rank test, and cPSN, which is highly relevant for clinical outcome, was obtained based on the significance of the p-values.
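For readers unfamiliar with the Kaplan-Meier method, the product-limit estimate of the survival curve for one patient group can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used by the inventors: the function name `kaplan_meier` and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time at which a death occurred.

    times  : follow-up time of each patient
    events : 1 if death was observed at that time, 0 if the patient was censored
    """
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical overall survival in months for one cluster of patients.
os_months = [5, 8, 8, 12, 20]
death_observed = [1, 1, 0, 1, 0]
curve = kaplan_meier(os_months, death_observed)
```

With these five hypothetical patients, the estimated survival drops at months 5, 8 and 12 to roughly 0.8, 0.6 and 0.3; the censored patients leave the risk set without causing a drop.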

Compared with the prior art, the invention has the beneficial effects that: a group of patients can be embedded and represented in a high-dimensional space according to clinical characteristics, and similarity calculation is carried out on tumor patients; the method can efficiently encode any clinical data, and has strong data processing capability and good robustness.

The method can overcome the difficulties that the multi-modal medical data is difficult to encode and integrate and the algorithm depends on the marking of doctors, and construct the cPSN to truly restore the similarity of patients.

Aiming at missing values in different observation indexes of clinical data, the missing value is used as an independent one-hot coding type without filling a null value, so that the classification error caused by coding filling of other coding modes can be reduced.

The method performs KMeans unsupervised clustering and evaluates K with a statistical algorithm to obtain the optimal K; the whole process is unsupervised and requires no human intervention. The invention performs survival analysis on the constructed patient network model using OS, the "gold standard" for evaluating the clinical prognosis of tumors, as the clinical relevance evaluation, thereby ensuring the clinical significance and practical clinical value of the resulting PSN.

The invention applies one-hot coding uniformly to multi-modal, highly heterogeneous clinical data and is flexibly compatible with changes in observation indexes and observation states arising from different medical institutions, different doctors and different stages of medical development. The data processing method is simple and effective, widely applicable and highly extensible, and can group heterogeneous tumor patients by similarity with high precision.

Drawings

Fig. 1 is a two-dimensional visualization of the cPSN with 9 classification clusters for 114 gastric cancer patients.

Fig. 2 shows the survival curves of the 9 classification clusters of the 114 gastric cancer patients.

Detailed Description

The technical scheme of the invention is explained in detail by combining the drawings and the embodiment.

The invention develops a tumor patient similarity calculation method and system based on one-hot coding and unsupervised clustering, compatible with histopathology data, molecular genetics data and any other data considered clinically worth including. The patient similarity calculation uses an unsupervised method, and a cPSN highly correlated with clinical outcome is finally constructed, so that patient similarity is effectively reflected.

The clinical data of primary tumors are multi-modal and highly heterogeneous: some observation indexes are numerical variables (e.g., age, TMB value), some are classification variables (e.g., clinical stage, pathological type), and some are qualitative descriptions of clinical findings (e.g., tumor resectability, driver mutations, surgical paradigm). We apply one-hot coding uniformly to all clinical data types. For numerical variables, the values in a set of samples are divided into 4 parts by the quartile method, forming 4 classification states. For clinical qualitative descriptions, N states form N classification states. Suppose a set of samples has M observation indexes, denoted $X_1, X_2, \ldots, X_M$. Each observation index $X_i$ has $n_i$ states, recorded as $x_{i,1}, x_{i,2}, \ldots, x_{i,n_i}$, so that in all there are $M' = \sum_{i=1}^{M} n_i$ one-hot features.

One-hot coding works for all data types and is simple and efficient; although it ignores the gradation between the states of a classification variable, this does not affect patient similarity calculation based on large samples. Numerical and classification variables are in fact similar (a numerical value arises only because a measurement method can provide one), so converting numerical variables into classification variables by quartiles is scientifically sound and fits clinical practice. One-hot coding treats a missing value as an independent classification state, without filling in null values. One-hot coding is highly extensible: it can efficiently handle the fusion of numerical, image, text and gene detection data, and remains flexibly compatible even when the observation indexes and observation states change across medical institutions, doctors and stages of medical development.

After preprocessing, the raw data become a feature embedding matrix, on which KMeans clustering is performed. The clustering algorithm is Lloyd-Forgy and comprises the following steps:

1) Set the random seed used for initialization so that the clustering result is reproducible on every run.

2) Assign the Z points (Z being the total number of patients) to K classes using the coefficients $r_{zk}$: $r_{zk} = 1$ if point $x_z$ belongs to the k-th class, and $r_{zk} = 0$ otherwise.

3) The objective of the iteration is to minimize the loss function:

$J = \sum_{z=1}^{Z} \sum_{k=1}^{K} r_{zk} \lVert x_z - \mu_k \rVert^2$, where $\mu_k$ is the center point of the k-th class.

4) Calculate the distance from each point to every center point and update $r_{zk}$:

$r_{zk} = 1$ if $k = \arg\min_j \lVert x_z - \mu_j \rVert^2$; otherwise $r_{zk} = 0$.

5) Recalculate the center point of each class:

$\mu_k = \dfrac{\sum_{z=1}^{Z} r_{zk} x_z}{\sum_{z=1}^{Z} r_{zk}}$.

6) Repeating the steps 4) and 5) until convergence.
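The six steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names, the toy one-hot rows, the seed value and the rule of keeping the previous center when a class becomes empty are choices made for the sketch and are not specified in the source.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, k, seed=0, max_iter=100):
    """Lloyd iteration with Forgy initialization; returns (labels, centers, loss)."""
    rng = random.Random(seed)  # step 1: fixed seed for reproducibility
    # Forgy initialization: k randomly chosen data points serve as centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # step 6: assignments no longer change, so we have converged
        labels = new_labels
        # Step 5: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # assumption: an empty class keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    # Step 3: the loss being minimized (sum of squared distances to centers).
    loss = sum(squared_distance(p, centers[label]) for p, label in zip(points, labels))
    return labels, centers, loss


# Hypothetical one-hot rows for four patients (three two-state features).
patients = [
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
]
labels, centers, loss = lloyd_kmeans(patients, k=2, seed=42)
```

Lloyd iteration only guarantees a local minimum of the loss, and the result depends on the initial centers, which is why fixing the seed in step 1 matters for reproducibility.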

K is set to each value from 2 to 10 in turn (10 may be replaced by a larger integer) to generate clustering results, and for each selected number of clusters K the clustering quality is evaluated using the Silhouette score or the gap statistic. For the Silhouette score, the K with the maximum score is optimal; for the gap statistic, the optimal K satisfies $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$, where $s_{k+1}$ denotes the standard deviation. The clustering result is visualized by dimensionality reduction with the t-SNE method and displayed in two or three dimensions.
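To make the Silhouette criterion concrete, the following sketch computes the mean Silhouette score of a labeled set of points in plain Python. It is an illustration only: the function name mirrors the usual terminology, the toy rows and labelings are hypothetical, and the gap statistic is not shown.

```python
import math


def silhouette_score(points, labels):
    """Mean Silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the Silhouette score needs at least two clusters")

    def mean_distance(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes a coefficient of 0
        a = mean_distance(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_distance(i, members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical one-hot rows and two candidate groupings of the same patients.
rows = [
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
]
good = silhouette_score(rows, [0, 0, 1, 1])
bad = silhouette_score(rows, [0, 1, 0, 1])
```

The grouping that keeps rows sharing two of three features together scores higher (about 0.36) than the alternative grouping (slightly below 0). In the procedure described above, the same comparison is made across the clusterings obtained for each K.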

After KMeans clustering, a patient similarity network (PSN) is generated. To examine the actual clinical significance of this PSN, we use overall survival (OS), the "gold standard" for assessing clinical prognosis and clinical benefit in tumor patients, and perform a clinical outcome correlation analysis with the Kaplan-Meier method. The statistical difference between the survival curves of the different patient groups is evaluated using the log-rank test, and, based on the significance of the p-value, a cPSN highly correlated with clinical outcome is obtained.
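As a rough illustration of the log-rank test, the sketch below compares the survival of two groups only; the analysis described in this document compares all clusters at once, which is a simplification made here for brevity. The function name and the survival times are hypothetical, and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]

    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in data if e == 1}):
        n_a = sum(1 for ti, _, g in data if g == 0 and ti >= t)  # at risk in A
        n_b = sum(1 for ti, _, g in data if g == 1 and ti >= t)  # at risk in B
        d_a = sum(1 for ti, e, g in data if g == 0 and ti == t and e == 1)
        d_b = sum(1 for ti, e, g in data if g == 1 and ti == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # deaths expected in A if the curves were equal
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("not enough information to compare the two groups")
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Hypothetical overall survival (months); every death is observed.
early = [1, 2, 3, 4, 5]
late = [10, 11, 12, 13, 14]
chi_square, p_value = logrank_two_groups(early, [1] * 5, late, [1] * 5)
```

A small p-value indicates that the two survival curves differ more than chance would explain; two groups with identical follow-up data give a statistic of 0 and a p-value of 1.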

For a target tumor patient to be evaluated, a group of patients most similar to the target patient is obtained in the cPSN by using a K-Nearest Neighbor (KNN) algorithm, the range and fineness of the target patient are selected by adjusting the K value (the K value can be manually adjusted when the target patient is positioned in a clinical practical operation), and according to clinical characteristics of the group of patients, a treatment scheme, a clinical outcome prediction and the like are quickly formulated by referring to past cases.
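The retrieval step can be sketched as a brute-force K-nearest-neighbor search over the one-hot rows. This is a minimal illustration, assuming Euclidean distance as stated in the claims; the function name and the cohort rows are hypothetical, and ties are broken by cohort order.

```python
import math


def k_nearest_patients(target, cohort, k):
    """Return the indices of the k cohort rows closest to the target row."""
    distances = [(math.dist(target, row), index) for index, row in enumerate(cohort)]
    distances.sort()  # nearest first; equal distances fall back to index order
    return [index for _, index in distances[:k]]


# Hypothetical one-hot rows of four past patients and one new patient.
cohort = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
]
target = [1, 0, 1, 0]
closest = k_nearest_patients(target, cohort, k=1)
wider = k_nearest_patients(target, cohort, k=3)
```

Raising K from 1 to 3 widens the reference group from the single identical case to cases that share one of the two features, which is the range and fineness trade-off described above.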

Example 1

We collected clinical data from 114 gastric cancer patients, containing 15 features.

1. Quantitative features

1) There are 4 quantitative features: ['HRD_sum', 'Ploid', 'Age', 'TMB'].

2) The values of each quantitative feature are sorted and divided into four equal parts; the position of each quartile Q in the sorted data is calculated with the formula $i/4 \times (n-1) + 1$, where i indicates which quartile point is being computed (i = 1, 2, 3) and n is the number of data values.

3) The numerical values of each quantitative feature are converted into a qualitative feature.
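The quartile position formula and the conversion to a qualitative feature can be sketched as follows. The source gives only the position formula; the linear interpolation used when the position is not an integer, the handling of values equal to a cut point, the function names and the ages are assumptions made for this sketch.

```python
def quartile(values, i):
    """i-th quartile (i = 1, 2, 3) at 1-based position i/4 * (n - 1) + 1."""
    ordered = sorted(values)
    n = len(ordered)
    position = i / 4 * (n - 1) + 1
    lower = int(position)  # 1-based index of the value at or below the position
    fraction = position - lower
    if lower >= n:
        return ordered[-1]
    # Assumption: interpolate linearly between the two neighboring values.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


def quartile_group(value, cuts):
    """Return 1..4: the quartile group of a value given the three cut points."""
    return 1 + sum(value > cut for cut in cuts)


# Hypothetical ages of nine patients.
ages = [34, 45, 51, 58, 62, 67, 70, 73, 80]
cuts = [quartile(ages, i) for i in (1, 2, 3)]
groups = [quartile_group(age, cuts) for age in ages]
```

For these nine ages the positions are 3, 5 and 7, so the cut points are 51, 62 and 70, and each age is replaced by a group label from 1 to 4 that can then be one-hot encoded like any other qualitative feature.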

2. Qualitative features

1) There are 11 qualitative features: ['Diploid_sequenza', 'Differentiation', 'Lauren_classification', 'Prognosis_stage', 'Lymph_node_status', 'Metastatis', 'Recurrence', 'ERBB2_amp_IHC', 'CDH1_mut', 'TP53_mut', 'Histologic_diagnosis'].

2) Null values are treated as a special feature value.

3) One-hot encoding is performed for each feature; e.g., Diploid_sequenza is decomposed into 4 features: ['Diploid_sequenza_CIN', 'Diploid_sequenza_CS', 'Diploid_sequenza_CS/CIN', 'Diploid_sequenza_#UNK'].

4) In the end, the 15 original clinical features are decomposed into a total of 65 one-hot features.

3. KMeans clustering

1) After preprocessing, the raw data become a 114 × 65 feature matrix in which each value lies in [0, 1]; KMeans clustering is performed using sklearn.

2) K is set to each value from 2 to 10 in turn to generate clustering results, and dimensionality-reduction visualization is performed using sklearn. On evaluation, K = 9 gave the relatively best clustering, so we obtained a patient similarity network (PSN) with 9 classification clusters, as shown in Fig. 1.

4. Correlation analysis of clinical outcome

OS survival analysis was performed on the 9 patient groups obtained above using the R packages survival and survminer; the log-rank test showed a statistically significant difference between the survival curves of the different groups (p = 2e-08). This is therefore a cPSN highly correlated with clinical outcome. As shown in Fig. 2, the survival curves of the different groups are far apart, indicating a significant difference in survival between the groups, and this difference results from the clustering that follows the similarity calculation.

Claims

1. A tumor patient similarity calculation method based on one-hot coding unsupervised clustering is characterized by comprising the following steps:

(1) Uniformly adopting one-hot coding for each observation index of clinical data to obtain a characteristic embedding matrix;

(2) Performing KMeans unsupervised clustering on the feature embedding matrix to generate a patient similarity network PSN;

(3) Carrying out clinical outcome correlation analysis on the clustered patients based on the overall survival time OS of the tumor patients, and evaluating the statistical difference of survival curves of the patients in different groups to obtain a patient similarity network cPSN with highly relevant clinical outcomes;

(4) For a target tumor patient to be evaluated, a group of patients most similar to the target patient are obtained in the cPSN by using a K-nearest neighbor algorithm based on Euclidean distance calculation, and the range and the fineness of the target patient are selected by adjusting the K value.

2. The tumor patient similarity calculation method based on one-hot coded unsupervised clustering according to claim 1, wherein in step (1), the observed indicators of the clinical data comprise numerical variables, categorical variables and clinical qualitative descriptions.

3. The tumor patient similarity calculation method based on one-hot coding unsupervised clustering according to claim 2, wherein in the step (1), when one-hot coding is adopted, the classification variables are directly coded, numerical variables and clinical qualitative descriptions are respectively converted into classification variables, and each classification state of each variable is recorded as one one-hot feature; assume that there are M observation indexes in a set of samples, recorded as $X_1, X_2, \ldots, X_M$; each observation index $X_i$ has $n_i$ classification states, recorded as $x_{i,1}, x_{i,2}, \ldots, x_{i,n_i}$; in all, there are $M' = \sum_{i=1}^{M} n_i$ one-hot features.

4. The tumor patient similarity calculation method based on one-hot coded unsupervised clustering according to claim 3, wherein for numerical variables, dividing the numerical values in a group of samples into 4 parts according to a quartile method to form 4 classification variables; for clinical qualitative profiling, there are N states that form N categorical variables.

5. The tumor patient similarity calculation method based on unsupervised clustering by one-hot coding according to claim 1, wherein in the step (1), when a missing value appears in the observation index of the clinical data, the missing value is used as a one-hot coding type without filling in empty values.

6. The tumor patient similarity calculation method based on one-hot coded unsupervised clustering according to claim 1, wherein in the step (2), in KMeans unsupervised clustering, for each selected clustering number K, the clustering effect is evaluated by using a Silhouette score method or a gap statistic method, so as to determine the optimal K value.

7. The tumor patient similarity calculation method based on one-hot coded unsupervised clustering according to claim 1, wherein, in step (2), the generated patient similarity network PSN represents the similarity distance between patients, which is a set of encoded clusters of all patients, and is an M 'dimensional high-dimensional network, and M' is the sum of classification states.

8. The tumor patient similarity calculation method based on one-hot coded unsupervised clustering according to claim 7, wherein in the step (2), the generated high-dimensional network PSN is visualized in a dimension reduction manner by using a t-SNE method, and two-dimensional or three-dimensional display is adopted.

9. The tumor patient similarity calculation method based on one-hot coded unsupervised clustering according to claim 1, wherein in step (3), the Kaplan-Meier method is used for clinical outcome correlation analysis; the statistical differences in survival curves of these different groups of patients were assessed using the log-rank test, and cPSN, which is highly relevant for clinical outcome, was obtained based on the significance of the p-value.