CN112735529A - Breast cancer prognosis model construction method, application method and electronic equipment - Google Patents

Publication number
CN112735529A
Authority
CN
China
Prior art keywords
breast cancer
risk
model
cancer sample
sample set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110061949.2A
Other languages
Chinese (zh)
Inventor
王一澎
张毅
冯林
程书钧
张开泰
肖汀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cancer Hospital and Institute of CAMS and PUMC
Original Assignee
Cancer Hospital and Institute of CAMS and PUMC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Genetics & Genomics (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Analytical Chemistry (AREA)
  • Primary Health Care (AREA)
  • Pathology (AREA)
  • Biomedical Technology (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A method for constructing a breast cancer prognosis model, an application method, an electronic device and a storage medium are provided. The construction method comprises the following steps: obtaining transcriptome expression profile data for a plurality of breast cancer sample sets, wherein each breast cancer sample set comprises a plurality of breast cancer samples; analyzing the immune infiltration condition of each breast cancer sample in each breast cancer sample set based on the transcriptome expression profile data of each breast cancer sample set, and determining a first immune infiltration group and a second immune infiltration group in each breast cancer sample set; determining differentially expressed genes between the first and second immune infiltration groups in each breast cancer sample set based on transcriptome expression data of the first and second immune infiltration groups in each breast cancer sample set; determining candidate genes based on the differentially expressed genes between the first and second immune infiltration groups in each breast cancer sample set; and constructing a risk scoring model based on the candidate genes.

Description

Breast cancer prognosis model construction method, application method and electronic equipment
Technical Field
The embodiment of the disclosure relates to a method for constructing a breast cancer prognosis model, an application method, electronic equipment and a storage medium.
Background
Prognosis refers to the empirically predicted course of a disease. Prognosis mainly concerns three aspects: what outcome will occur, the likelihood of a poor outcome, and the point in time at which it occurs. The purpose of studying and grading prognosis is to facilitate understanding of the degree of harm that diseases cause to humans, to explore factors affecting prognosis, and to investigate specific measures for improving prognosis. Prognostic analysis is a highly practical form of clinical study that has a guiding role in clinical practice.
Breast cancer (BRCA) is one of the most common malignancies and is also the second leading cause of cancer death in women worldwide. In China alone, BRCA is expected to account for 15% of all new cancer cases in women, and it is the leading cause of cancer death in women under the age of 45. Breast cancer is a heterogeneous disease with different biological phenotypes, treatment regimens and prognoses. Clinicopathological characteristics such as age, molecular subtype and tumor AJCC stage are related to prognosis and to subsequent treatment schemes. Although many molecular classifications have been widely used in clinical diagnosis and are important indicators for guiding the selection of treatment regimens, most studies, including those based on breast cancer molecular markers such as estrogen receptor (ER), progesterone receptor (PR), Ki-67 and HER2, tend to focus on tumor characteristics, while studies of the tumor microenvironment are rare.
Disclosure of Invention
At least some embodiments of the present disclosure provide a method of constructing a breast cancer prognosis model. The construction method comprises the following steps: obtaining transcriptome expression profile data for a plurality of breast cancer sample sets, wherein each breast cancer sample set of the plurality of breast cancer sample sets comprises a plurality of breast cancer samples; analyzing the immune infiltration condition of each breast cancer sample in each breast cancer sample set based on the transcriptome expression profile data of each breast cancer sample set, and determining a first immune infiltration group and a second immune infiltration group in each breast cancer sample set; determining differentially expressed genes between the first and second immune infiltration groups in each breast cancer sample set based on transcriptome expression data of the first and second immune infiltration groups in each breast cancer sample set; determining candidate genes based on the differentially expressed genes between the first and second immune infiltration groups in each breast cancer sample set; and constructing a risk score model based on the candidate genes, wherein the breast cancer prognosis model comprises the risk score model.
For example, in the construction method provided by some embodiments of the present disclosure, analyzing the immune infiltration condition of each breast cancer sample in each breast cancer sample set based on the transcriptome expression profile data of each breast cancer sample set, and determining a first immune infiltration group and a second immune infiltration group in each breast cancer sample set, includes: quantifying the infiltration of multiple types of immune cells in each breast cancer sample of each breast cancer sample set by single-sample gene set enrichment analysis based on the transcriptome expression profile data of each breast cancer sample set, and analyzing the similarity of all breast cancer samples in each breast cancer sample set based on the quantification result, so as to determine the first immune infiltration group and the second immune infiltration group in each breast cancer sample set.
For example, in the construction method provided by some embodiments of the present disclosure, determining the candidate genes based on the differentially expressed genes between the first and second immune infiltration groups in each breast cancer sample set comprises: intersecting the differentially expressed genes between the first and second immune infiltration groups across the plurality of breast cancer sample sets to obtain the candidate genes.
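The intersection step can be illustrated with a minimal Python sketch. The function name `candidate_genes` and the example gene lists are hypothetical and chosen only for illustration; the sample set names follow the specific example given later in the description, and the gene symbols are borrowed from the disclosure without implying which sample set they were actually found in.

```python
from functools import reduce


def candidate_genes(deg_by_sample_set):
    """Intersect the differentially expressed genes (DEGs) of every sample set."""
    gene_sets = [set(genes) for genes in deg_by_sample_set.values()]
    if not gene_sets:
        return set()
    return reduce(set.intersection, gene_sets)


# Hypothetical DEG lists: the assignment of genes to each list is invented.
degs = {
    "BRCA_OURS": ["KLRB1", "RAC2", "FABP7", "CELSR2"],
    "TCGA_BRCA": ["RAC2", "KLRB1", "FGFBP1", "CELSR2"],
}
candidates = sorted(candidate_genes(degs))
print(candidates)
```

Only genes that are differentially expressed in every sample set survive, which is why adding further sample sets can only shrink the candidate list.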
For example, in some embodiments of the disclosure, the constructing the risk scoring model based on the candidate genes includes: acquiring a training data set; and screening the candidate genes by LASSO-Cox regression analysis in combination with a ten-fold cross-validation method to determine the genes used to construct the risk scoring model and the risk scoring model, wherein the risk scoring model is represented as:
$$RS = c_1 E_1 + \dots + c_N E_N$$
wherein $RS$ represents the risk score, $E_i$ represents the expression value of the i-th gene used to construct the risk scoring model, $c_i$ represents the coefficient of the i-th gene used to construct the risk scoring model, and $N$ represents the number of genes used to construct the risk scoring model.
For example, in some embodiments of the present disclosure, the number of genes used to construct the risk scoring model is 10, and the genes used to construct the risk scoring model include C14orf79, C1orf168, C1orf226, CELSR2, FABP7, FGFBP1, IL-10, KLRB1, PLEKHO1, and RAC2; the risk scoring model is expressed as:
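$$
\begin{aligned}
RS ={}& E_{\mathrm{C14orf79}} \times (-0.114731735) + E_{\mathrm{C1orf168}} \times (-0.019429183) \\
&+ E_{\mathrm{C1orf226}} \times (-0.049258060) + E_{\mathrm{CELSR2}} \times (-0.055863001) \\
&+ E_{\mathrm{FABP7}} \times (-0.028295228) + E_{\mathrm{FGFBP1}} \times (-0.008174118) \\
&+ E_{\mathrm{IL\text{-}10}} \times 0.020753075 + E_{\mathrm{KLRB1}} \times (-0.121245004) \\
&+ E_{\mathrm{PLEKHO1}} \times (-0.049187024) + E_{\mathrm{RAC2}} \times (-0.003657534),
\end{aligned}
$$
wherein $E_{\mathrm{C14orf79}}$ represents the expression value of the gene C14orf79, $E_{\mathrm{C1orf168}}$ represents the expression value of the gene C1orf168, $E_{\mathrm{C1orf226}}$ represents the expression value of the gene C1orf226, $E_{\mathrm{CELSR2}}$ represents the expression value of the gene CELSR2, $E_{\mathrm{FABP7}}$ represents the expression value of the gene FABP7, $E_{\mathrm{FGFBP1}}$ represents the expression value of the gene FGFBP1, $E_{\mathrm{IL\text{-}10}}$ represents the expression value of the gene IL-10, $E_{\mathrm{KLRB1}}$ represents the expression value of the gene KLRB1, $E_{\mathrm{PLEKHO1}}$ represents the expression value of the gene PLEKHO1, and $E_{\mathrm{RAC2}}$ represents the expression value of the gene RAC2.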
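As an illustration, the ten-gene formula above can be evaluated with a short Python sketch. The coefficients are copied from the formula; the function name `risk_score`, the dictionary layout, and the example expression values are assumptions made here for illustration and are not taken from the disclosure.

```python
# Coefficients of the ten-gene risk scoring model, as given in the formula.
COEFFICIENTS = {
    "C14orf79": -0.114731735,
    "C1orf168": -0.019429183,
    "C1orf226": -0.049258060,
    "CELSR2": -0.055863001,
    "FABP7": -0.028295228,
    "FGFBP1": -0.008174118,
    "IL-10": 0.020753075,
    "KLRB1": -0.121245004,
    "PLEKHO1": -0.049187024,
    "RAC2": -0.003657534,
}


def risk_score(expression):
    """Return RS = sum(c_i * E_i) for one subject.

    expression: dict mapping gene symbol -> expression value (hypothetical layout).
    """
    missing = [gene for gene in COEFFICIENTS if gene not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


# Hypothetical subject whose ten genes all have expression value 1.0.
example = {gene: 1.0 for gene in COEFFICIENTS}
print(round(risk_score(example), 9))
```

Because nine of the ten coefficients are negative, higher expression of most of these genes lowers the score; only IL-10 contributes positively.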
For example, in some embodiments of the disclosure, the constructing the risk scoring model based on the candidate genes further includes: evaluating the predictive performance of the risk scoring model based on the training dataset.
For example, in some embodiments of the present disclosure, a method of constructing a risk score model is provided, wherein evaluating the predictive performance of the risk score model based on the training dataset comprises: calculating a risk score for each subject in the training dataset based on the risk score model; determining a group cutoff value according to the risk scores of all the subjects in the training data set, and dividing the subjects in the training data set into a first high risk group and a first low risk group according to the group cutoff value; and assessing whether the first high-risk group and the first low-risk group have a significant difference in survival using a Kaplan-Meier curve of the training dataset.
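The grouping and survival-curve steps can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the disclosure does not say at this point how the group cutoff value is chosen, so the median of the training risk scores is used here purely as an assumption; the function names and the toy data are hypothetical; and a significance test for the difference between the curves (commonly a log-rank test) is omitted.

```python
from statistics import median


def split_by_cutoff(scores, cutoff):
    """Return (high, low) index lists; scores above the cutoff are high risk."""
    high = [i for i, s in enumerate(scores) if s > cutoff]
    low = [i for i, s in enumerate(scores) if s <= cutoff]
    return high, low


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as (time, survival probability) at each event time.

    times: follow-up time per subject; events: 1 for death, 0 for censored.
    """
    curve, survival = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical training data: risk scores, follow-up times, event indicators.
scores = [0.9, 0.1, 0.7, 0.2, 0.8, 0.3]
times = [2, 10, 3, 12, 5, 9]
events = [1, 0, 1, 1, 0, 0]

cutoff = median(scores)  # assumption: median split
high, low = split_by_cutoff(scores, cutoff)
km_high = kaplan_meier([times[i] for i in high], [events[i] for i in high])
km_low = kaplan_meier([times[i] for i in low], [events[i] for i in low])
print(cutoff, high, low)
print(km_high, km_low)
```

The same `cutoff` value would then be reused unchanged when grouping subjects from another data set, so that the groups are defined on the training data only.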
For example, in the construction method provided by some embodiments of the present disclosure, evaluating the predictive performance of the risk scoring model based on the training dataset further includes: performing multifactor Cox regression analysis on the training dataset to evaluate the robustness of the risk score in predicting survival; and evaluating the goodness-of-fit of the risk scoring model using receiver operating characteristic (ROC) curve analysis of the training dataset.
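To make the ROC step concrete, the area under the ROC curve (AUC) can be computed from risk scores and binary outcome labels with a few lines of Python. This is a simplified sketch: it assumes each subject has already been labeled 1 (event occurred by a chosen time point) or 0 (no event), and it ignores censoring, which a time-dependent ROC analysis of survival data would have to handle. The function name and the data are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical risk scores and event labels (1 = event by the chosen time).
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
auc = roc_auc(scores, labels)
print(auc)
```

An AUC of 0.5 corresponds to a score that ranks subjects no better than chance, and 1.0 to a score that ranks every subject with an event above every subject without one.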
For example, in some embodiments of the disclosure, constructing the risk scoring model based on the candidate genes further includes: obtaining a validation dataset; and validating the efficacy of the risk scoring model based on the validation dataset.
For example, in the construction method provided by some embodiments of the present disclosure, validating the efficacy of the risk scoring model based on the validation dataset includes: calculating a risk score for each subject in the validation dataset based on the risk score model; and classifying all subjects in the validation dataset into a second high risk group and a second low risk group according to the group cutoff value, and verifying whether the second high risk group and the second low risk group have a significant difference in survival using a Kaplan-Meier curve of the validation dataset.
For example, the construction method provided by some embodiments of the present disclosure further comprises: combining the risk score with the prognostic indicators of pathological stage and age, and constructing a nomogram model by multi-factor Cox regression analysis; wherein the breast cancer prognosis model further comprises the nomogram model.
At least some embodiments of the present disclosure further provide an application method of a breast cancer prognosis model, wherein the breast cancer prognosis model includes the risk score model constructed by the construction method provided in any embodiment of the present disclosure, and the application method includes: obtaining tumor tissue transcript expression data of a subject, wherein the tumor tissue transcript expression data of the subject comprises expression values of genes used to construct the risk score model; and calculating a risk score for the subject according to the risk score model based on tumor tissue transcript expression data for the subject.
At least some embodiments of the present disclosure further provide another application method of a breast cancer prognosis model, wherein the breast cancer prognosis model includes the nomogram model constructed by the construction method provided in any embodiment of the present disclosure, and the application method includes: obtaining age, pathological stage, and tumor tissue transcript expression data of a subject, wherein the tumor tissue transcript expression data of the subject comprises expression values of genes used to construct the risk score model; calculating a risk score for the subject according to the risk score model based on tumor tissue transcript expression data for the subject; and predicting survival of the subject according to the nomogram model based on the age, pathological stage, and risk score of the subject.
At least some embodiments of the present disclosure also provide an electronic device, comprising: a memory for non-transitory storage of computer readable instructions; and a processor for executing the computer readable instructions, wherein when the computer readable instructions are executed by the processor, the construction method provided by any embodiment of the disclosure or the application method provided by any embodiment of the disclosure is performed.
At least some embodiments of the present disclosure also provide a storage medium that stores non-transitory computer-readable instructions, wherein the non-transitory computer-readable instructions, when executed by a computer, cause the construction method provided by any embodiment of the present disclosure or the application method provided by any embodiment of the present disclosure to be performed.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.
Fig. 1 is a block flow diagram of a method for constructing a breast cancer prognosis model according to some embodiments of the present disclosure;
Fig. 2 is a schematic diagram illustrating the immune cell infiltration of a breast cancer sample set according to some embodiments of the present disclosure;
Fig. 3 is a schematic illustration of candidate gene identification based on differentially expressed genes in two breast cancer sample sets according to some embodiments of the present disclosure;
Fig. 4 is an exemplary flowchart corresponding to step S500 shown in Fig. 1 according to some embodiments of the present disclosure;
Fig. 5 is a graph of partial likelihood deviance versus log(λ) in a LASSO-Cox regression analysis according to some embodiments of the present disclosure;
Fig. 6 is an exemplary flowchart corresponding to step S530 shown in Fig. 4 according to some embodiments of the present disclosure;
Fig. 7 is a schematic visualization of the risk scores obtained from the training data set (TCGA_BRCA data set) according to some embodiments of the present disclosure;
Fig. 8 is a schematic diagram of a Kaplan-Meier curve for the training data set (TCGA_BRCA data set) according to some embodiments of the present disclosure;
Fig. 9 is a Cox regression forest plot obtained by performing a multi-factor Cox regression analysis on the training data set (TCGA_BRCA data set) according to some embodiments of the present disclosure;
Fig. 10 is a schematic diagram of a ROC curve for the training data set (TCGA_BRCA data set) according to some embodiments of the present disclosure;
Fig. 11 is an exemplary flowchart corresponding to step S550 shown in Fig. 4 according to some embodiments of the present disclosure;
Fig. 12 is a schematic diagram of a Kaplan-Meier curve for the validation data set (METABRIC data set) according to some embodiments of the present disclosure;
Fig. 13 is a block flow diagram of another method for constructing a breast cancer prognosis model according to some embodiments of the present disclosure;
Fig. 14 is a schematic diagram of a nomogram model according to some embodiments of the present disclosure;
Fig. 15 is a block flow diagram of a method for applying a breast cancer prognosis model according to some embodiments of the present disclosure;
Fig. 16 is a block flow diagram of another method for applying a breast cancer prognosis model according to some embodiments of the present disclosure;
Fig. 17 is a schematic block diagram of an electronic device according to some embodiments of the present disclosure; and
Fig. 18 is a schematic block diagram of a storage medium according to some embodiments of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described below clearly and completely with reference to the accompanying drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. Also, the use of the terms "a," "an," or "the" and similar referents do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items.
To maintain the following description of the embodiments of the present disclosure clear and concise, a detailed description of known functions and known components have been omitted from the present disclosure. When any component of an embodiment of the present disclosure appears in more than one drawing, that component is represented by the same or similar reference numeral in each drawing.
The tumor microenvironment is mainly composed of tumor-associated fibroblasts, immune cells, extracellular matrix, various growth factors, inflammatory factors, special physicochemical characteristics (such as hypoxia and low pH), cancer cells and the like. Cells in the microenvironment can be grouped into different classes; each cell type has complex and significant interactions with other cells, and some robust patterns of cell infiltration exist. Research on tumor molecular biology in recent years shows that, besides cancer cells, immune cells in the tumor microenvironment play a key role in the biology of tumor occurrence and development, and they also significantly influence the effect of clinical treatment. Therefore, studying the number of immune cells participating in the immune response at the tumor site is of great significance for improving the understanding of tumor-host biology.
At least some embodiments of the present disclosure provide a method of constructing a breast cancer prognosis model. The construction method comprises the following steps: obtaining transcriptome expression profile data for a plurality of breast cancer sample sets, wherein each breast cancer sample set of the plurality of breast cancer sample sets comprises a plurality of breast cancer samples; analyzing the immune infiltration condition of each breast cancer sample in each breast cancer sample set based on the transcriptome expression profile data of each breast cancer sample set, and determining a first immune infiltration group and a second immune infiltration group in each breast cancer sample set; determining differentially expressed genes between the first and second immune infiltration groups in each breast cancer sample set based on transcriptome expression data of the first and second immune infiltration groups in each breast cancer sample set; determining candidate genes based on the differentially expressed genes between the first and second immune infiltration groups in each breast cancer sample set; and constructing a risk scoring model based on the candidate genes.
Some embodiments of the present disclosure also provide an application method, an electronic device and a storage medium of the breast cancer prognosis model corresponding to the above construction method.
According to the construction method of the breast cancer prognosis model provided by the embodiment of the disclosure, a breast cancer sample is divided into a first immune infiltration group and a second immune infiltration group according to the immune infiltration condition in a tumor microenvironment, candidate genes are screened based on the differential expression genes between the first immune infiltration group and the second immune infiltration group, and then a risk score model is constructed, wherein the risk score model has good prediction precision in the aspect of breast cancer patient prognosis; based on the risk scoring model, a nomogram model with higher prediction precision can be further constructed, and the nomogram model can provide a more optimized quantitative method for the clinical prognosis evaluation of breast cancer patients, so that a reference can be provided for the breast cancer patients to improve the prognosis of the breast cancer patients.
Some embodiments of the present disclosure and examples thereof are described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Fig. 1 is a block flow diagram of a method for constructing a breast cancer prognosis model according to some embodiments of the present disclosure. For example, the method for constructing a breast cancer prognosis model may be applied to a computing device, where the computing device includes any electronic device with a computing function, such as a smart phone, a laptop, a tablet, a desktop, a server, a cloud service, and the like, and the embodiment of the disclosure is not limited thereto. For example, the computing device has a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU), and the computing device further includes a memory. The Memory is, for example, a nonvolatile Memory (e.g., a Read Only Memory (ROM)) on which codes of an operating system are stored. For example, the memory further stores codes or instructions, and the codes or instructions can be executed to implement the method for constructing the breast cancer prognosis model provided by the embodiment of the disclosure.
For example, as shown in fig. 1, the construction method includes the following steps S100 to S500.
Step S100: obtaining transcriptome expression profile data for a plurality of breast cancer sample sets, wherein each breast cancer sample set of the plurality of breast cancer sample sets comprises a plurality of breast cancer samples.
For example, in some embodiments, multiple batches of breast cancer samples can be collected in-house, each batch including multiple breast cancer samples; total RNA extraction and transcriptome sequencing are then performed, and standard expression profile data are obtained after data alignment, filtering and normalization, so that transcriptome expression profile data of multiple breast cancer sample sets (one breast cancer sample set for each batch) can be obtained. For example, the sample collection method and the sequencing method used should be the same for each breast cancer sample in the same batch, so that the transcriptome expression profile data of the breast cancer samples in the same batch are usually comparable, that is, the transcriptome expression profile data of the breast cancer samples in the same breast cancer sample set are comparable. For example, the sample collection method and the sequencing method used may differ between batches of breast cancer samples, so that the transcriptome expression profile data of breast cancer samples from different batches may not be comparable with each other, that is, the transcriptome expression profile data of breast cancer samples in different breast cancer sample sets may not be comparable with each other.
For example, in other embodiments, public breast cancer data (including transcriptome expression profile data) may be downloaded from a public database on the Internet. For example, in a public database, data within the same data set are usually comparable with each other, whereas data from different data sets may not be; therefore, each data set can be used as one breast cancer sample set, so that transcriptome expression profile data of a plurality of breast cancer sample sets can also be obtained.
For example, in still other embodiments, a mixture of the two approaches can be used to obtain transcriptome expression profile data for a plurality of breast cancer sample sets.
For example, transcriptome expression profile data for a plurality of breast cancer sample sets may be entered into a computing device.
For example, in one specific example, 43 tumor tissue samples from breast cancer patients are collected for total RNA extraction and transcriptome sequencing, and the data are aligned, filtered and normalized to obtain standard expression profile data, thereby obtaining transcriptome expression profile data for one breast cancer sample set (referred to as the "BRCA_OURS" sample set). The raw transcriptome sequencing data of this sample set are stored in the human Genome Sequence Archive (GSA) under accession number HRA000272. In addition, the breast cancer data of the TCGA_BRCA dataset are downloaded from the public database cBioPortal, thereby obtaining transcriptome expression profile data of another breast cancer sample set (referred to as the "TCGA_BRCA" sample set). Thus, in this specific example, transcriptome expression profile data for two breast cancer sample sets (the BRCA_OURS sample set and the TCGA_BRCA sample set) are obtained via step S100.
Step S200: analyzing the immune infiltration condition of each breast cancer sample in each breast cancer sample set based on the transcriptome expression profile data of each breast cancer sample set, and determining a first immune infiltration group and a second immune infiltration group in each breast cancer sample set.
For example, in some embodiments, step S200 may include: quantifying the immune infiltration condition of multiple immune infiltration cells of each breast cancer sample in each breast cancer sample set by single sample gene set enrichment analysis (ssGSEA) based on the transcriptome expression profile data of each breast cancer sample set, and performing similarity analysis on all the breast cancer samples in each breast cancer sample set based on the quantified result to determine a first immune infiltration group and a second immune infiltration group in each breast cancer sample set.
The most widely accepted and most commonly used immune cell markers at present are the immune cell marker genes published in Immunity by Bindea G. et al. in 2013, which allow information on 24 types of immune cells to be extracted; see Bindea, G., Mlecnik, B., Tosolini, M., et al. (2013) Spatiotemporal Dynamics of Intratumoral Immune Cells Reveal the Immune Landscape in Human Cancer. Immunity, 39, 782-795. https://doi.org/10.1016/j.immuni.2013.10.003. This document is hereby incorporated by reference in its entirety as part of the present application. To assess the immune microenvironment of a tumor patient, the immune status of a breast cancer patient can be assessed from the tumor tissue transcriptome expression profile.
For example, in some implementations, the infiltration of the 24 types of immune cells described above can be quantified in the patient's tumor tissue transcriptome using ssGSEA. For example, in other implementations, the "GSVA" package of the R language software can also be used to quantify the infiltration of the 24 types of immune cells in the patient's tumor tissue transcriptome. FIG. 2 shows the infiltration of the 24 types of immune cells (Mast cells, NK cells, etc., as shown on the right side of FIG. 2) in the 43 breast cancer patient samples of the BRCA_OURS sample set of the previous example.
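The per-sample quantification described above can be illustrated with a minimal sketch of a single-sample enrichment score. This is a simplified rank-based random-walk statistic written for illustration, not the GSVA package's implementation; the function name `ssgsea_score`, the weighting exponent `alpha = 0.25`, and the gene identifiers are assumptions and do not come from the source.

```python
def ssgsea_score(expression, gene_set, alpha=0.25):
    """Simplified single-sample enrichment score (illustrative sketch).

    expression: dict mapping gene -> expression value for ONE sample.
    gene_set:   marker genes of one immune cell type (hypothetical names here).
    alpha:      rank-weighting exponent (assumed value).
    """
    # Order genes from highest to lowest expression; the top gene gets rank N.
    ordered = sorted(expression, key=expression.get, reverse=True)
    n = len(ordered)
    members = set(gene_set) & set(ordered)
    if not members or len(members) == n:
        raise ValueError("gene set must overlap the profile without covering it")

    weights = {gene: (n - i) ** alpha for i, gene in enumerate(ordered)}
    total_in = sum(weights[g] for g in members)
    n_out = n - len(members)

    # Walk down the ranked list, accumulating the two empirical distributions
    # and summing their difference at every position.
    score = p_in = p_out = 0.0
    for gene in ordered:
        if gene in members:
            p_in += weights[gene] / total_in
        else:
            p_out += 1.0 / n_out
        score += p_in - p_out
    return score


# Toy profile with hypothetical gene names g1..g6 (g1 most expressed).
sample = {"g1": 6.0, "g2": 5.0, "g3": 4.0, "g4": 3.0, "g5": 2.0, "g6": 1.0}
high = ssgsea_score(sample, ["g1", "g2"])  # marker genes near the top
low = ssgsea_score(sample, ["g5", "g6"])   # marker genes near the bottom
```

Repeating the call for each of the 24 marker gene sets and each sample would give a sample-by-cell-type score matrix of the kind visualized in FIG. 2.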
For example, in some embodiments, the similarity between the breast cancer samples in each breast cancer sample set can be calculated as the Euclidean distance to obtain a distance matrix, and unsupervised hierarchical clustering is then performed by a minimum variance method (e.g., the ward.D method or the ward.D2 method, etc.) to determine two classes (i.e., two immunoinfiltration groups), thereby implementing the similarity analysis and determining a first immunoinfiltration group and a second immunoinfiltration group in each breast cancer sample set. For example, as shown in fig. 2, for the BRCA_OURS sample set in the foregoing specific example, two immunoinfiltration groups can be determined, and the amount of infiltrating immune cells differs significantly between the two groups, indicating that their immune microenvironments differ, so that the differentially expressed genes between the two immunoinfiltration groups can be further determined. For example, as shown in fig. 2, the immunoinfiltration group with a high level of immune cell infiltration (e.g., the first immunoinfiltration group) may be referred to as the "high immunoinfiltration group" (n = 16), and the immunoinfiltration group with a low level of immune cell infiltration (e.g., the second immunoinfiltration group) may be referred to as the "low immunoinfiltration group" (n = 27).
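As a rough illustration of the similarity analysis, the following sketch groups samples into two clusters with Ward's minimum-variance criterion on squared Euclidean distances between cluster centroids. It is a plain-Python stand-in for a statistics package, not the procedure actually used for FIG. 2; the function name `ward_two_groups` and the toy scores are hypothetical.

```python
def ward_two_groups(profiles):
    """Agglomerative clustering with Ward's criterion, stopped at two clusters.

    profiles: dict mapping sample id -> list of immune-cell scores (same length).
    Returns a list with two sorted lists of sample ids.
    """
    # Each cluster is stored as (member ids, centroid vector).
    clusters = [([sid], list(vec)) for sid, vec in profiles.items()]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares caused by merging a and b.
        na, nb = len(a[0]), len(b[0])
        sq_dist = sum((x - y) ** 2 for x, y in zip(a[1], b[1]))
        return na * nb / (na + nb) * sq_dist

    while len(clusters) > 2:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[i], clusters[j]
        na, nb = len(a[0]), len(b[0])
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(a[1], b[1])]
        merged = (a[0] + b[0], centroid)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(members) for members, _ in clusters]


# Toy enrichment scores for six hypothetical samples and two cell types.
toy_profiles = {
    "s1": [0.90, 0.80], "s2": [1.00, 0.90], "s3": [0.95, 0.85],
    "s4": [0.10, 0.20], "s5": [0.00, 0.10], "s6": [0.05, 0.15],
}
groups = ward_two_groups(toy_profiles)
```

The group whose members have the higher average scores would then be labeled the high immunoinfiltration group.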
For example, for the TCGA_BRCA sample set in the foregoing specific example, the above operations may also be performed to determine the first immunoinfiltration group and the second immunoinfiltration group corresponding to the TCGA_BRCA sample set; for specific details, reference may be made to the processing of the BRCA_OURS sample set in the foregoing specific example, which is not repeated herein.
Step S300: determining differentially expressed genes between the first and second immunoinfiltration groups in each breast cancer sample set based on transcriptome expression data of the first and second immunoinfiltration groups in each breast cancer sample set.
For example, in some embodiments, for each breast cancer sample set, it may first be assumed that there is no difference in the expression of a certain gene between its first and second immunoinfiltration groups (i.e., the null hypothesis); then, under this hypothesis, the probability of observing the expression values of the gene in the first and second immunoinfiltration groups is calculated by a t-test, thereby obtaining a P value (P-value). If the P value is less than 0.05, the observed data would be a low-probability event under the null hypothesis, so the null hypothesis should be rejected; that is, the first and second immunoinfiltration groups differ significantly in the expression of the gene, and the gene can be determined to be a differentially expressed gene between the first and second immunoinfiltration groups.
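The test described above can be sketched for a single gene as follows. This is a minimal illustration assuming a classical two-sample Student t-test with pooled variance and a two-sided P value; it is not the moderated statistic of a dedicated differential expression package, and the function names and toy expression values are hypothetical.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_coef) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided P value, integrating the density on [0, |t|] with Simpson's rule."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_mass = total * h / 3
    return max(0.0, 1.0 - 2.0 * central_mass)


def student_t_test(group_a, group_b):
    """Return (t statistic, two-sided P value) for one gene in two groups."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    if pooled == 0:
        raise ValueError("both groups are constant; the t statistic is undefined")
    t_stat = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t_stat, two_sided_p(t_stat, df)


# Toy expression values of one gene in the two immunoinfiltration groups.
t_stat, p_value = student_t_test([5.1, 5.3, 4.9, 5.2], [1.0, 1.2, 0.9, 1.1])
is_differential = p_value < 0.05
```

When thousands of genes are tested in this way, a multiple-testing correction is commonly applied to the P values; the source does not state whether one was used.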
For example, in some embodiments, the "limma" package of the R language software can be used to calculate differentially expressed genes between the first and second immunoinfiltration groups in each breast cancer sample set.
For example, for the specific example described above, fig. 3 shows the number of differentially expressed genes (i.e., the "differential genes" in fig. 3) for each of the two sample sets (the BRCA_OURS sample set and the TCGA_BRCA sample set). As shown in fig. 3, the number of differentially expressed genes between the first and second immunoinfiltration groups in the BRCA_OURS sample set is 3444 (2512 + 932), of which 2512 are found only in the BRCA_OURS sample set; the number of differentially expressed genes between the first and second immunoinfiltration groups in the TCGA_BRCA sample set is 3414 (2482 + 932), of which 2482 are found only in the TCGA_BRCA sample set; and 932 differentially expressed genes are shared by the two sample sets.
Step S400: candidate genes were determined based on differentially expressed genes between the first and second immunoinfiltration groups in each breast cancer sample set.
For example, in some embodiments, step S400 may include: taking the intersection of the differentially expressed genes between the first immunoinfiltration group and the second immunoinfiltration group across the plurality of breast cancer sample sets to obtain the candidate genes. Candidate genes obtained in this way reflect what the data of the different breast cancer sample sets have in common.
For example, for the specific example described above, the differentially expressed genes of the two sample sets (the BRCA_OURS sample set and the TCGA_BRCA sample set) may be intersected to obtain the candidate genes. For example, as shown in fig. 3, 932 differentially expressed genes are finally obtained as candidate genes for the foregoing specific example.
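The intersection step can be sketched with ordinary set operations. The gene identifiers below are placeholders generated only to reproduce the counts reported for fig. 3; they are not real gene names, and the function name `candidate_genes` is hypothetical.

```python
def candidate_genes(deg_sets):
    """Intersect the differentially expressed gene sets of several sample sets."""
    sets = [set(genes) for genes in deg_sets]
    if not sets:
        return set()
    return set.intersection(*sets)


# Placeholder identifiers that mimic the counts of the specific example.
shared = {f"shared_{i}" for i in range(932)}
degs_ours = shared | {f"ours_only_{i}" for i in range(2512)}
degs_tcga = shared | {f"tcga_only_{i}" for i in range(2482)}

candidates = candidate_genes([degs_ours, degs_tcga])
print(len(degs_ours), len(degs_tcga), len(candidates))  # prints: 3444 3414 932
```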
Step S500: constructing a risk scoring model based on the candidate genes.
For example, breast cancer prognosis models include risk scoring models.
For example, in some embodiments, as shown in fig. 4, step S500 may include the following steps S510 to S530.
Step S510: a training data set (referred to as "training set") is obtained.
For example, in some embodiments, tumor tissue samples of several breast cancer patients can be collected, total RNA extraction and transcriptome sequencing can be performed, and standard expression profile data can be obtained after data comparison, filtering and standardization, so that transcriptome expression profile data of these breast cancer patients can be obtained; the transcriptome expression profile data of these breast cancer patients may be combined with their clinical information to form a training dataset. For example, clinical information for breast cancer patients typically includes survival information, and may also include age, pathology stage (e.g., tumor AJCC stage), and the like.
For example, in other embodiments, breast cancer data (including transcriptome expression profile data as well as survival information, etc.) in a common database may be employed as the training data set.
It should be noted that, the embodiment of the present disclosure does not limit the manner of acquiring the training data set.
For example, for the specific example previously described, the TCGA_BRCA dataset (including transcriptome expression profile data, survival information, etc.) may be used as the training dataset.
Step S520: screening candidate genes by combining LASSO-Cox regression analysis with a ten-fold cross-validation method to determine genes for constructing a risk scoring model and the risk scoring model.
For example, the LASSO-Cox regression analysis method (LASSO-penalized Cox regression) is a model construction method that combines the Least Absolute Shrinkage and Selection Operator (LASSO) with Cox proportional hazards regression analysis.
For example, for the specific example described above, LASSO-Cox regression analysis can be performed using the "glmnet" package of the R language software to screen candidate genes, while using a ten-fold cross-validation method to determine the best regularization parameter λ to determine the genes and their corresponding coefficients for constructing the risk scoring model, thereby obtaining the risk scoring model.
Fig. 5 shows the partial likelihood deviance versus Log(λ) when the model for the foregoing specific example is built by LASSO-Cox regression analysis combined with a ten-fold cross-validation method. The partial likelihood deviance characterizes the performance of the model: the smaller the partial likelihood deviance, the better the performance of the model. The two dashed lines in fig. 5 indicate two particular values of λ (lambda), respectively: lambda.min and lambda.1se (from left to right); lambda.min is the λ value that gives the minimum cross-validated partial likelihood deviance, and lambda.1se is the largest λ value whose cross-validated partial likelihood deviance is within one standard error (SE) of that minimum, which yields a model with good performance and the fewest independent variables.
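The choice between the two λ values can be sketched as below, assuming the cross-validation curve is already available as three parallel lists. The function name `select_lambdas` and all numbers are hypothetical; the selection rule is intended to mirror the commonly used one-standard-error rule rather than reproduce any package exactly.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Pick lambda.min and lambda.1se from a cross-validation curve.

    lambdas: candidate regularization values.
    cv_mean: mean cross-validated partial likelihood deviance per lambda.
    cv_se:   standard error of that mean per lambda.
    """
    best_error = min(cv_mean)
    # lambda.min: the lambda with the smallest mean error (largest one on ties).
    lambda_min = max(lam for lam, err in zip(lambdas, cv_mean) if err == best_error)
    best_index = lambdas.index(lambda_min)
    # lambda.1se: the largest lambda whose error is within one SE of the minimum.
    threshold = cv_mean[best_index] + cv_se[best_index]
    lambda_1se = max(lam for lam, err in zip(lambdas, cv_mean) if err <= threshold)
    return lambda_min, lambda_1se


# Hypothetical cross-validation results for five candidate values.
lam_min, lam_1se = select_lambdas(
    lambdas=[0.001, 0.01, 0.05, 0.1, 0.5],
    cv_mean=[12.4, 12.1, 12.0, 12.2, 13.0],
    cv_se=[0.3, 0.3, 0.3, 0.3, 0.3],
)
print(lam_min, lam_1se)  # prints: 0.05 0.1
```

A larger λ shrinks more coefficients to zero, which is why lambda.1se tends to give the sparser gene signature.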
For example, in general, the risk scoring model may be expressed as:
$$RS = c_1 E_1 + \cdots + c_N E_N$$
wherein $RS$ represents the risk score, $E_i$ represents the expression value of the $i$-th gene used to construct the risk scoring model, $c_i$ represents the coefficient of the $i$-th gene used to construct the risk scoring model, and $N$ represents the number of genes used to construct the risk scoring model.
For example, for the specific example described above, the regularization parameter λ is set to lambda.1se, and the genes finally screened out for constructing the risk score model include 10 genes: C14orf79, C1orf168, C1orf226, CELSR2, FABP7, FGFBP1, IL-10, KLRB1, PLEKHO1, and RAC2. The corresponding risk score model is expressed as:
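$$
\begin{aligned}
RS ={} & E_{\text{C14orf79}} \times (-0.114731735) + E_{\text{C1orf168}} \times (-0.019429183) \\
& + E_{\text{C1orf226}} \times (-0.049258060) + E_{\text{CELSR2}} \times (-0.055863001) \\
& + E_{\text{FABP7}} \times (-0.028295228) + E_{\text{FGFBP1}} \times (-0.008174118) \\
& + E_{\text{IL-10}} \times 0.020753075 + E_{\text{KLRB1}} \times (-0.121245004) \\
& + E_{\text{PLEKHO1}} \times (-0.049187024) + E_{\text{RAC2}} \times (-0.003657534),
\end{aligned}
$$
wherein $E_{\text{C14orf79}}$ represents the expression value of the gene C14orf79, $E_{\text{C1orf168}}$ represents the expression value of the gene C1orf168, $E_{\text{C1orf226}}$ represents the expression value of the gene C1orf226, $E_{\text{CELSR2}}$ represents the expression value of the gene CELSR2, $E_{\text{FABP7}}$ represents the expression value of the gene FABP7, $E_{\text{FGFBP1}}$ represents the expression value of the gene FGFBP1, $E_{\text{IL-10}}$ represents the expression value of the gene IL-10, $E_{\text{KLRB1}}$ represents the expression value of the gene KLRB1, $E_{\text{PLEKHO1}}$ represents the expression value of the gene PLEKHO1, and $E_{\text{RAC2}}$ represents the expression value of the gene RAC2.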
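The ten-gene risk score of the specific example can be evaluated with a few lines of code. The sketch below copies the coefficients from the formula above; the function name `risk_score` and the toy input are hypothetical, and the expression values are assumed to be on the same scale as the data used to fit the model.

```python
# Coefficients of the ten genes, copied from the risk score formula above.
COEFFICIENTS = {
    "C14orf79": -0.114731735,
    "C1orf168": -0.019429183,
    "C1orf226": -0.049258060,
    "CELSR2": -0.055863001,
    "FABP7": -0.028295228,
    "FGFBP1": -0.008174118,
    "IL-10": 0.020753075,
    "KLRB1": -0.121245004,
    "PLEKHO1": -0.049187024,
    "RAC2": -0.003657534,
}


def risk_score(expression):
    """Weighted sum of expression values, RS = c_1*E_1 + ... + c_N*E_N.

    expression: dict mapping gene name -> expression value for one subject.
    """
    missing = [gene for gene in COEFFICIENTS if gene not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


# Toy subject in which every gene has an expression value of 1.0.
toy_subject = {gene: 1.0 for gene in COEFFICIENTS}
toy_score = risk_score(toy_subject)
```

Because nine of the ten coefficients are negative, higher expression of most of these genes lowers the score.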
For example, in some embodiments, as shown in fig. 4, step S500 may further include the following step S530.
Step S530: based on the training dataset, the predictive performance of the risk scoring model is evaluated.
For example, in some implementations, as shown in fig. 6, step S530 may include the following steps S531-S533.
Step S531: a risk score is calculated for each subject in the training dataset based on the risk score model.
For example, for the specific example previously described, the risk score for each breast cancer sample (i.e., subject) in the training dataset (TCGA_BRCA dataset) can be calculated based on its corresponding risk score model.
Step S532: determining a group cutoff value according to the risk scores of all subjects in the training data set, and dividing the subjects in the training data set into a first high risk group and a first low risk group according to the group cutoff value.
For example, in some embodiments, the median of the risk scores of all subjects in the training dataset may be taken as the group cutoff value. According to the group cutoff value, the subjects in the training dataset may be divided into a first high risk group and a first low risk group, wherein the risk score of each subject in the first high risk group is greater than the group cutoff value and the risk score of each subject in the first low risk group is less than or equal to the group cutoff value.
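A minimal sketch of the median split follows. The function names and the toy risk scores are hypothetical; the comparison rule (strictly greater than the cutoff means high risk) follows the text above.

```python
from statistics import median


def split_by_median(scores):
    """Split subjects into high and low risk groups at the median risk score.

    scores: dict mapping subject id -> risk score.
    Returns (cutoff, high risk ids, low risk ids).
    """
    cutoff = median(scores.values())
    high = sorted(sid for sid, value in scores.items() if value > cutoff)
    low = sorted(sid for sid, value in scores.items() if value <= cutoff)
    return cutoff, high, low


def assign_group(score, cutoff):
    """Classify one subject against a previously determined cutoff."""
    return "high" if score > cutoff else "low"


# Toy risk scores for four hypothetical training subjects.
training_scores = {"p1": -0.5, "p2": -0.2, "p3": 0.1, "p4": 0.4}
cutoff, high_group, low_group = split_by_median(training_scores)
```

The helper `assign_group` is included because the same cutoff can later be reused to classify subjects that were not part of the training dataset.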
Fig. 7 shows a visualization of the risk scores obtained from the training dataset (TCGA_BRCA dataset) in the specific example described above. Fig. 7 includes three subplots: upper, middle and lower. As shown in fig. 7, the upper subplot shows the distribution of the risk scores of the training dataset (i.e., "risk scores" in fig. 7), with the subjects classified into a first high risk group (i.e., "high risk group" in fig. 7) and a first low risk group (i.e., "low risk group" in fig. 7) according to the median of the risk scores of the training dataset (i.e., the group cutoff value); the middle subplot shows the survival data of the subjects of the training dataset (i.e., "overall survival" in fig. 7), which can be used to plot a Kaplan-Meier curve; the lower subplot shows the expression heatmap of the 10 genes used to construct the risk scoring model, reflecting the expression of the 10 genes in the training dataset.
Step S533: the Kaplan-Meier curve of the training dataset was used to assess whether the first high risk group and the first low risk group had significant differences in survival.
For example, a Kaplan-Meier curve may be plotted based on the survival data of the subjects in the training dataset. Fig. 8 shows the Kaplan-Meier curves obtained from the training dataset (TCGA_BRCA dataset) in the foregoing specific example, specifically including the survival curve of the first high risk group (i.e., "high risk group" in fig. 8) and the survival curve of the first low risk group (i.e., "low risk group" in fig. 8). As shown in fig. 8, the two survival curves are clearly separated, and the P value characterizing the significance of the difference (P < 0.0001) is far less than 0.05, so the Kaplan-Meier curves shown in fig. 8 verify that the first high risk group and the first low risk group differ significantly in survival.
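For illustration, the survival comparison can be sketched with a Kaplan-Meier estimator and a two-group log-rank test. The source does not name the test behind the reported P value, so the log-rank test here is an assumption; the function names and the toy follow-up data are hypothetical.

```python
import math


def kaplan_meier(times, events):
    """Kaplan-Meier estimate; returns [(time, survival)] at each event time.

    times:  follow-up time of each subject.
    events: 1 if the subject died at that time, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, P value)."""
    times_a, times_b = list(times_a), list(times_b)
    events_a, events_b = list(events_a), list(events_b)
    pooled = zip(times_a + times_b, events_a + events_b)
    event_times = sorted({t for t, e in pooled if e})
    observed_a = expected_a = var = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # still at risk in group A
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / var
    # Upper tail of a chi-square distribution with one degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Toy follow-up data (months) for a high risk and a low risk group.
high_times, high_events = [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]
low_times, low_events = [10, 11, 12, 13, 14], [1, 1, 1, 1, 1]
chi2, p_value = logrank_test(high_times, high_events, low_times, low_events)
```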
For example, in some embodiments, as shown in fig. 6, step S530 may include the following step S534.
Step S534: a multi-factor Cox regression analysis is performed on the training dataset to assess the robustness of the risk score for predicting survival.
For example, in some embodiments, a multi-factor Cox regression analysis may be performed on the training dataset to evaluate the robustness of the risk score for predicting survival, thereby determining the prognostic value of the risk score. Fig. 9 shows the results of the multi-factor Cox regression analysis based on the training dataset in the foregoing specific example. As shown in fig. 9, when the multi-factor Cox regression analysis is performed with the risk score, age, AJCC stage, and molecular typing (including the four molecular subtypes Basal, Her2, LumA, and LumB) as factors, the risk score has the highest hazard ratio (HR = 15.975, 95% confidence interval (CI) = 7.643-33.39) among the prognostic indicators, compared with the other clinical characteristics (e.g., age, AJCC stage, and molecular typing). This demonstrates the robustness of the risk score for predicting the survival of breast cancer patients, i.e., the risk score has a high prognostic value for breast cancer patients.
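As a worked illustration of how a hazard ratio and its confidence interval relate to a Cox regression coefficient, the sketch below assumes a Wald-type 95% interval, exp(β ± 1.96 × SE). The source does not state how the interval in fig. 9 was computed, so this is an assumption; the function names are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def hazard_ratio_ci(beta, se, z=Z_95):
    """Hazard ratio and Wald confidence limits from a Cox coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def beta_se_from_hr(hr, lower, upper, z=Z_95):
    """Recover the coefficient and its standard error from a reported HR and CI."""
    beta = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


# Values reported for the risk score in the specific example.
beta, se = beta_se_from_hr(15.975, 7.643, 33.39)
hr, lower, upper = hazard_ratio_ci(beta, se)
```

Under this assumption the reported figures are internally consistent: the logarithm of the hazard ratio lies midway between the logarithms of the two confidence limits.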
For example, in some embodiments, as shown in fig. 6, step S530 may include the following step S535.
Step S535: the goodness of fit of the risk score model was evaluated using ROC curve (receiver operating characteristic curve) analysis of the training dataset.
For example, in some embodiments, the ROC curve may be plotted based on the training dataset (TCGA_BRCA dataset) using the "survivalROC" package of the R language software. Fig. 10 shows four ROC curves based on different prognostic indicators obtained from the training dataset in the foregoing specific example, specifically including a three-year survival rate ROC curve based on molecular typing, a three-year survival rate ROC curve based on tumor AJCC staging, a three-year survival rate ROC curve based on the risk score, and a three-year survival rate ROC curve based on tumor AJCC staging combined with the risk score. For example, the area under the ROC curve (AUC) can be used to evaluate the goodness of fit of models constructed based on different prognostic indicators, and the AUC value typically ranges between 0.5 and 1. The closer the AUC value is to 1.0, the higher the reliability of the model; the closer the AUC value is to 0.5, the lower the reliability of the model (the dashed line in fig. 10 corresponds to the case where the AUC value is equal to 0.5). As shown in fig. 10, the AUC value of the three-year survival rate ROC curve based on the risk score is significantly greater than that of the three-year survival rate ROC curve based on molecular typing, and is also significantly greater than that of the three-year survival rate ROC curve based on tumor AJCC staging, indicating that the risk score has higher reliability as an independent prognostic indicator. That is to say, the risk scoring model corresponding to the above specific example has high reliability and thus has high application value.
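The AUC idea can be illustrated with a deliberately simplified sketch. It treats three-year survival status as a plain binary label and ignores censoring, so it is not the time-dependent estimator provided by a survival ROC package; the function name `auc` and the toy data are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    scores: prognostic indicator per subject (higher means higher risk).
    labels: 1 if the subject died within the time horizon, else 0.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    # Count the pairs in which the subject who died has the higher score;
    # ties count as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy risk scores and three-year outcomes for four hypothetical subjects.
toy_auc = auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
print(toy_auc)  # prints: 0.75
```

An AUC of 0.5 corresponds to an indicator that ranks subjects no better than chance, which matches the dashed reference line described for fig. 10.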
For example, in some embodiments, as shown in fig. 4, step S500 may further include the following step S540 and step S550.
Step S540: a verification data set is obtained.
For example, in some embodiments, tumor tissue samples of several breast cancer patients may be collected separately, and then total RNA extraction and transcriptome sequencing are performed, and standard expression profile data is obtained after data comparison, filtering and standardization, so as to obtain transcriptome expression profile data of these breast cancer patients; the transcriptome expression profile data of these breast cancer patients may together with their clinical information constitute a validation dataset. For example, clinical information for breast cancer patients typically includes survival information, and may also include age, pathology stage (e.g., tumor AJCC stage), and the like.
For example, in other embodiments, breast cancer data (including transcriptome expression profile data as well as clinical information, etc.) in a common database may be employed as the validation dataset.
It should be understood that the sample collection method, sequencing method, etc. used by the validation dataset should be the same as the sample collection method, sequencing method, etc. used by the training dataset, and that the samples in the validation dataset are typically different from the samples in the training dataset.
For example, for the specific example previously described, the METABRIC dataset (including transcriptome expression profile data, clinical information, etc.) may be downloaded from the public database cBioPortal as the validation dataset.
Step S550: based on the validation dataset, the efficacy of the risk scoring model is validated.
For example, in some implementations, as shown in fig. 11, step S550 may include the following steps S551 to S552.
Step S551: a risk score is calculated for each subject in the validation dataset based on the risk score model.
For example, for the foregoing specific example, a risk score for each sample (i.e., subject) in the validation dataset (METABRIC dataset) may be calculated based on its corresponding risk score model.
Step S552: the subjects in the validation dataset are divided into a second high risk group and a second low risk group according to the group cutoff value, and the Kaplan-Meier curve of the validation dataset is used to validate whether the second high risk group and the second low risk group have a significant difference in survival.
For example, for the foregoing specific example, the subjects in the validation data set may be classified into a second high risk group and a second low risk group according to the group cutoff value determined in the foregoing step S532 (e.g., the median of the risk scores of the training data set), wherein the risk score of the subject in the second high risk group is greater than the group cutoff value and the risk score of the subject in the second low risk group is less than or equal to the group cutoff value.
For example, further, a Kaplan-Meier curve may be plotted based on the survival data of the subjects in the validation dataset. Fig. 12 shows the Kaplan-Meier curves derived from the validation dataset in the foregoing specific example, specifically including the survival curve of the second high risk group (i.e., "high risk group" in fig. 12) and the survival curve of the second low risk group (i.e., "low risk group" in fig. 12). As shown in fig. 12, the two survival curves are clearly separated, and the P value characterizing the significance of the difference (P < 0.0001) is far less than 0.05, so the Kaplan-Meier curves shown in fig. 12 verify that the second high risk group and the second low risk group differ significantly in survival. That is, the risk scoring model has a good predictive effect.
For example, in some embodiments, step S550 may further include: verifying the goodness of fit of the risk scoring model using ROC curve analysis of the validation dataset. For example, for the foregoing specific example, an ROC curve may be plotted based on the validation dataset using the "survivalROC" package of the R language software, and the reliability and application value of the risk score model corresponding to the foregoing specific example may be verified based on the AUC value of the ROC curve; for specific details, reference may be made to the related description of step S535, which is not repeated herein.
Fig. 13 is a block flow diagram of another method for constructing a breast cancer prognosis model according to some embodiments of the present disclosure. For example, as shown in fig. 13, the method for constructing a breast cancer prognosis model further includes the following step S600 on the basis of the aforementioned steps S100 to S500.
Step S600: combining three prognostic indicators, namely the risk score given by the risk score model, the pathological stage, and the age, and constructing a nomogram model using multi-factor Cox regression analysis.
For example, the breast cancer prognosis model also includes a nomogram model. For example, referring to the Cox regression forest plot shown in fig. 9, in addition to the P value corresponding to the risk score (<0.001) being less than 0.05, the P value corresponding to age (<0.001) is also less than 0.05, and the P values corresponding to Stage III and Stage IV of the AJCC staging (i.e., pathological staging) are also less than 0.05, indicating that age and pathological stage are also factors significantly related to prognosis. In addition, referring to fig. 10, the AUC value of the three-year survival rate ROC curve based on tumor AJCC staging combined with the risk score is slightly greater than the AUC value of the three-year survival rate ROC curve based on the risk score alone, which shows that combining the risk score with other prognostic indicators (e.g., age, tumor AJCC staging, etc.) can improve the reliability of the model. Therefore, a nomogram model can be constructed using multi-factor Cox regression analysis by combining the three prognostic indicators of risk score, pathological stage and age. The nomogram model sets a scoring scale based on the regression coefficients of all independent variables and assigns a score to each value of each independent variable, so that a total score can be calculated for each breast cancer patient. The total score is then converted, through a transfer function, into the probability of the outcome event, so that a prognosis probability is obtained for each breast cancer patient.
Fig. 14 shows a nomogram model constructed according to the foregoing specific example. According to the nomogram model, the three-year and five-year survival rates of breast cancer patients can be predicted. For example, the method of using the nomogram model includes: reading off the score that vertically corresponds to each of the breast cancer patient's risk score, pathological stage and age, summing these scores to obtain a total score, and then determining the three-year and five-year survival rates that vertically correspond to the total score.
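The scoring logic of a nomogram can be sketched as follows. All coefficients, value ranges, and the baseline survival are invented for illustration and are not the values behind fig. 14; pathological stage is treated as a number from 1 to 4 purely to keep the sketch short, and the function names are hypothetical.

```python
import math


def nomogram_points(betas, ranges, values):
    """Convert predictor values into nomogram points on a 0-100 scale.

    betas:  dict predictor -> Cox regression coefficient (hypothetical here).
    ranges: dict predictor -> (minimum, maximum) shown on the nomogram axis.
    values: dict predictor -> value for one patient.
    Returns (points per predictor, total points).
    """
    # The predictor with the largest effect across its range spans 0-100 points.
    spans = {k: abs(betas[k]) * (ranges[k][1] - ranges[k][0]) for k in betas}
    scale = 100.0 / max(spans.values())
    points = {}
    for name, beta in betas.items():
        low, high = ranges[name]
        reference = low if beta >= 0 else high  # value that scores zero points
        points[name] = scale * beta * (values[name] - reference)
    return points, sum(points.values())


def survival_probability(baseline_survival, linear_predictor):
    """Cox model survival S(t) = S0(t) ** exp(lp) for a given time horizon."""
    return baseline_survival ** math.exp(linear_predictor)


# Hypothetical coefficients and axis ranges, for illustration only.
betas = {"risk_score": 2.5, "age": 0.03, "stage": 0.6}
ranges = {"risk_score": (-1.0, 0.5), "age": (30, 90), "stage": (1, 4)}
patient = {"risk_score": -0.25, "age": 60, "stage": 2}

points, total = nomogram_points(betas, ranges, patient)
```

In a drawn nomogram the total points are read against a survival axis; `survival_probability` shows the underlying relationship, with the baseline survival at three or five years taken from the fitted Cox model.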
For example, in some embodiments, an ROC curve or the like may be further used to evaluate whether the reliability of the nomogram model is better than that of the risk score, the pathological stage, or the age alone; for specific details, reference may be made to the foregoing step S535 and the related description of fig. 10, which are not repeated herein.
It should be noted that the method for constructing a breast cancer prognosis model provided in the embodiments of the present disclosure is mainly described based on a specific example, but the specific example should not be considered as a limitation to the embodiments of the present disclosure.
It should also be noted that, in the embodiments of the present disclosure, the flow of the above-mentioned construction method (for example, the construction method shown in fig. 1 and the construction method shown in fig. 13) may include more or fewer operations, and these operations may be performed sequentially or in parallel. Although the flow of the construction method described above includes a plurality of operations occurring in a certain order, it should be clearly understood that the order of the plurality of operations is not limited.
According to the construction method provided by the embodiment of the disclosure, a breast cancer sample is divided into a first immune infiltration group and a second immune infiltration group according to the immune infiltration condition in a tumor microenvironment, candidate genes are screened based on the differential expression genes between the first immune infiltration group and the second immune infiltration group, and then a risk scoring model is constructed, wherein the risk scoring model has good prediction precision in the aspect of prognosis of a breast cancer patient; based on the risk scoring model, a nomogram model with higher prediction precision can be further constructed, and the nomogram model can provide a more optimized quantitative method for the clinical prognosis evaluation of breast cancer patients, so that a reference can be provided for the breast cancer patients to improve the prognosis of the breast cancer patients.
At least some embodiments of the present disclosure also provide a method for applying a breast cancer prognosis model. For example, the breast cancer prognosis model includes a risk score model constructed according to the construction method shown in fig. 1. Fig. 15 is a block flow diagram of an application method of a breast cancer prognosis model according to some embodiments of the present disclosure. For example, as shown in fig. 15, the application method includes the following steps S710 to S720.
Step S710: obtaining tumor tissue transcriptome expression data of the subject, wherein the tumor tissue transcriptome expression data of the subject includes expression values of the genes used to construct the risk score model.
For example, in some embodiments, a tumor tissue sample of a subject (e.g., a breast cancer patient) can be collected and transcriptome sequencing performed to obtain the transcriptome expression data. The transcriptome expression data can be entered into a computing device.
Step S720: calculating a risk score for the subject according to the risk score model based on the subject's tumor tissue transcriptome expression data.
For example, in some embodiments, the expression values of the genes used to construct the risk score model in the tumor tissue transcriptome expression data of the subject can be substituted into the risk score model to calculate the risk score for the subject.
For example, in some embodiments, a subject can be prognostically evaluated based on the subject's risk score. For example, whether a subject is in a "high risk" state or a "low risk" state can be assessed qualitatively based on the relative magnitude of the subject's risk score and the group cutoff value of the risk score model. Here, reference may be made to the related description in the foregoing step S532, and the description is not repeated here.
At least some embodiments of the present disclosure also provide another application method of a breast cancer prognosis model. For example, the breast cancer prognosis model includes a nomogram model constructed according to the construction method shown in fig. 13 (and also includes the risk score model obtained as an intermediate result of that method). Fig. 16 is a block flow diagram of an application method of another breast cancer prognosis model provided in some embodiments of the present disclosure. For example, as shown in fig. 16, the application method includes the following steps S810 to S830.
Step S810: acquiring the age, pathological stage and tumor tissue transcriptome expression data of the subject, wherein the tumor tissue transcriptome expression data of the subject includes expression values of the genes used to construct the risk score model.
For example, in some embodiments, age information and pathological stage information of a subject may be collected, e.g., the pathological stage information may be obtained by tumor AJCC staging; a tumor tissue sample of the subject (e.g., a breast cancer patient) can also be collected and transcriptome sequencing performed to obtain the transcriptome expression data. The age information, pathological stage information and tumor tissue transcriptome expression data can be entered into a computing device.
Step S820: calculating a risk score for the subject according to the risk score model based on the subject's tumor tissue transcriptome expression data.
For example, in some embodiments, the expression values of the genes used to construct the risk score model in the tumor tissue transcriptome expression data of the subject can be substituted into the risk score model to calculate the risk score for the subject.
Step S830: based on the age, pathological stage, and risk score of the subject, the survival rate of the subject is predicted according to the nomogram model.
For example, for how the survival rate of the subject is predicted according to the nomogram model, reference may be made to the description of the method of using the nomogram model in the foregoing step S600, which is not repeated here. For example, for the specific example previously described, the three-year and five-year survival rates of a subject can be predicted according to the nomogram model shown in fig. 14.
The application method provided by the embodiments of the present disclosure can be used to perform prognostic evaluation on a subject (for example, a breast cancer patient) according to the risk scoring model or nomogram model constructed by the aforementioned construction method, and the result of the prognostic evaluation can provide a reference for the breast cancer patient to improve the prognosis of the breast cancer patient.
At least some embodiments of the present disclosure also provide an electronic device. Fig. 17 is a schematic block diagram of an electronic device provided in some embodiments of the present disclosure. For example, as shown in FIG. 17, the electronic device 100 includes one or more memories 110 and one or more processors 120.
For example, the memory 110 is used to non-transitorily store computer readable instructions, and the processor 120 is used to execute the computer readable instructions. For example, the computer readable instructions, when executed by the processor 120, perform the construction method or the application method provided by any of the embodiments of the present disclosure.
For example, the memory 110 and the processor 120 may be in direct or indirect communication with each other. For example, in some embodiments, as shown in fig. 17, the electronic device 100 may further include a system bus 130, and the memory 110 and the processor 120 may communicate with each other through the system bus 130; for example, the processor 120 may access the memory 110 through the system bus 130. For example, in other embodiments, components such as the memory 110 and the processor 120 may communicate over a network connection. The network may include a wireless network, a wired network, and/or any combination of wireless and wired networks. The network may include a local area network, the Internet, a telecommunications network, an Internet of Things based on the Internet and/or a telecommunications network, and/or any combination thereof. The wired network may communicate by using twisted pair, coaxial cable, or optical fiber transmission, for example, and the wireless network may communicate by using a 3G/4G/5G mobile communication network, Bluetooth, Zigbee, or WiFi, for example. The present disclosure does not limit the type and function of the network.
For example, the processor 120 may control other components in the electronic device to perform desired functions. The processor 120 may be a device having data processing capability and/or program execution capability, such as a Central Processing Unit (CPU), a Tensor Processing Unit (TPU), or a Graphics Processing Unit (GPU). The CPU may be of an X86 or ARM architecture, or the like. The GPU may be integrated directly onto the motherboard as a separate component, or built into the north bridge chip of the motherboard. The GPU may also be built into the CPU.
For example, the memory 110 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory, and the like. The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), USB memory, flash memory, and the like.
For example, one or more computer instructions may be stored on the memory 110 and executed by the processor 120 to implement various functions. Various applications and various data may also be stored in the computer readable storage medium, such as peripheral blood leukocyte transcript expression data of the subject, various clinical information, a risk scoring model, a nomogram model, and various data used and/or generated by the applications, among others.
For example, some of the computer instructions stored by the memory 110, when executed by the processor 120, may perform one or more steps according to the construction method described above. For example, other computer instructions stored by memory 110, when executed by processor 120, may perform one or more steps of the application method according to the description above.
For example, as shown in FIG. 17, the electronic device 100 may further include an input interface 140 that allows an external device to communicate with the electronic device 100. For example, the input interface 140 may be used to receive instructions from an external computer device, from a user, and the like. The electronic device 100 may also include an output interface 150 that interconnects the electronic device 100 and one or more external devices. For example, the electronic device 100 may output the aforementioned risk scoring model, the nomogram model, the risk score, the survival rate of the subject, and the like through the output interface 150. External devices that communicate with the electronic device 100 through the input interface 140 and the output interface 150 may be included in an environment that provides any type of user interface with which a user may interact. Examples of user interface types include graphical user interfaces, natural user interfaces, and the like. For example, a graphical user interface may accept input from a user through an input device such as a keyboard, mouse, or remote control, and provide output on an output device such as a display. A natural user interface, by contrast, may enable a user to interact with the electronic device 100 free of the constraints imposed by input devices such as keyboards, mice, and remote controls. Instead, a natural user interface may rely on speech recognition, touch and stylus recognition, gesture recognition on and near the screen, air gestures, head and eye tracking, speech and semantics, vision, and machine intelligence, among others.
In addition, although the electronic apparatus 100 is illustrated as a single system in fig. 17, it is understood that the electronic apparatus 100 may be a distributed system, and may be arranged as a cloud facility (including a public cloud or a private cloud). Thus, for example, several devices may communicate over a network connection and may collectively perform tasks described as being performed by electronic device 100.
For example, for the specific implementation process and details of the construction method, reference may be made to the related description in the foregoing embodiments of the construction method of the breast cancer prognosis model, and repeated details are omitted here. Likewise, for the specific implementation process and details of the application method, reference may be made to the related description in the foregoing embodiments of the application method of the breast cancer prognosis model, and repeated details are omitted here.
For example, in some embodiments, the electronic device 100 may include, but is not limited to, a smartphone, a laptop, a tablet, a desktop computer, a server, a cloud service, and so forth.
It should be noted that the electronic device provided in the embodiments of the present disclosure is illustrative rather than restrictive. Depending on practical application needs, the electronic device may further include other conventional components or structures; for example, to implement the necessary functions of the electronic device, a person skilled in the art may provide other conventional components or structures according to the specific application scenario. The embodiments of the present disclosure are not limited in this respect.
For technical effects of the electronic device provided by the embodiment of the present disclosure, reference may be made to corresponding descriptions about a construction method or an application method in the above embodiments, and details are not repeated here.
At least some embodiments of the present disclosure also provide a storage medium. Fig. 18 is a schematic diagram of a storage medium provided in at least some embodiments of the present disclosure. For example, as shown in FIG. 18, the storage medium 200 stores non-transitory computer-readable instructions 201, and when the non-transitory computer-readable instructions 201 are executed by a computer (including a processor), the construction method or the application method provided by any embodiment of the present disclosure may be performed.
For example, one or more computer instructions may be stored on the storage medium 200. Some of the computer instructions stored on the storage medium 200 may be, for example, instructions for implementing one or more steps in the aforementioned construction method. Some of the computer instructions stored on the storage medium 200 may be, for example, instructions for implementing one or more steps of the aforementioned application method.
For example, the storage medium 200 may include a storage component of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a compact disc read only memory (CD-ROM), a flash memory, or any combination of the above storage media, and may also be other suitable storage media. For example, the storage medium may also be the memory 110 shown in fig. 17, and reference may be made to the foregoing description for related descriptions, which are not described herein again.
For technical effects of the storage medium provided by the embodiments of the present disclosure, reference may be made to corresponding descriptions about a construction method or an application method in the foregoing embodiments, which are not described herein again.
For the present disclosure, there are the following points to be explained:
(1) The drawings of the embodiments of the present disclosure show only the structures related to the embodiments of the present disclosure; for other structures, reference may be made to general designs.
(2) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
The above is only a specific embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope of the present disclosure shall be covered by the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (15)

1. A method for constructing a breast cancer prognosis model comprises the following steps:
obtaining transcriptome expression profile data for a plurality of breast cancer sample sets, wherein each breast cancer sample set of the plurality of breast cancer sample sets comprises a plurality of breast cancer samples;
analyzing the immune infiltration condition of each breast cancer sample in each breast cancer sample set based on the transcriptome expression profile data of each breast cancer sample set, and determining a first immune infiltration group and a second immune infiltration group in each breast cancer sample set;
determining differentially expressed genes between the first immune infiltration group and the second immune infiltration group in each breast cancer sample set based on the transcriptome expression profile data of the first immune infiltration group and the second immune infiltration group in each breast cancer sample set;
determining candidate genes based on the differentially expressed genes between the first immune infiltration group and the second immune infiltration group in each breast cancer sample set; and
constructing a risk scoring model based on the candidate genes,
wherein the breast cancer prognosis model comprises the risk scoring model.
2. The construction method according to claim 1, wherein analyzing the immune infiltration condition of each breast cancer sample in each breast cancer sample set based on the transcriptome expression profile data of each breast cancer sample set, and determining a first immune infiltration group and a second immune infiltration group in each breast cancer sample set comprises:
quantifying, by single-sample gene set enrichment analysis based on the transcriptome expression profile data of each breast cancer sample set, the immune infiltration condition of multiple types of immune infiltrating cells in each breast cancer sample of each breast cancer sample set, and analyzing the similarity of all breast cancer samples in each breast cancer sample set based on the quantification result, so as to determine the first immune infiltration group and the second immune infiltration group in each breast cancer sample set.
3. The construction method according to claim 1 or 2, wherein determining the candidate genes based on the differentially expressed genes between the first immune infiltration group and the second immune infiltration group in each breast cancer sample set comprises:
intersecting the differentially expressed genes between the first immune infiltration group and the second immune infiltration group across the plurality of breast cancer sample sets to obtain the candidate genes.
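The intersection step of claim 3 can be illustrated with a minimal Python sketch. The function name `intersect_degs`, the sample-set labels, and the placeholder symbols `GENE_X` and `GENE_Y` are hypothetical and do not come from the disclosure; the sketch only assumes that each breast cancer sample set yields a collection of differentially expressed gene symbols.

```python
from functools import reduce


def intersect_degs(deg_sets):
    """Return the genes that are differentially expressed in every sample set."""
    deg_sets = [set(s) for s in deg_sets]
    if not deg_sets:
        return set()
    # Fold pairwise intersection over all sample sets.
    return reduce(set.intersection, deg_sets)


# Hypothetical per-sample-set DEG lists, for illustration only.
degs_by_set = {
    "set_A": {"KLRB1", "RAC2", "FABP7", "GENE_X"},
    "set_B": {"KLRB1", "RAC2", "GENE_Y"},
    "set_C": {"RAC2", "KLRB1", "FABP7"},
}

candidates = intersect_degs(degs_by_set.values())
print(sorted(candidates))
```

Only genes present in all sets survive, so adding more sample sets can only shrink the candidate list.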
4. The construction method according to claim 1 or 2, wherein constructing the risk scoring model based on the candidate genes comprises:
acquiring a training data set; and
screening the candidate genes by LASSO-Cox regression analysis in combination with a ten-fold cross-validation method to determine the genes used to construct the risk scoring model and the risk scoring model, wherein the risk scoring model is represented as:
$$RS = c_1 E_1 + \dots + c_N E_N$$
wherein $RS$ represents the risk score, $E_i$ represents the expression value of the $i$-th gene used to construct the risk scoring model, $c_i$ represents the coefficient of the $i$-th gene used to construct the risk scoring model, and $N$ represents the number of genes used to construct the risk scoring model.
5. The construction method according to claim 4, wherein the number of genes used to construct the risk scoring model is 10, and the genes used to construct the risk scoring model include C14orf79, C1orf168, C1orf226, CELSR2, FABP7, FGFBP1, IL-10, KLRB1, PLEKHO1 and RAC2; the risk scoring model is expressed as:
$$
\begin{aligned}
RS ={} & E_{\text{C14orf79}} \times (-0.114731735) + E_{\text{C1orf168}} \times (-0.019429183) + E_{\text{C1orf226}} \times (-0.049258060) \\
& + E_{\text{CELSR2}} \times (-0.055863001) + E_{\text{FABP7}} \times (-0.028295228) + E_{\text{FGFBP1}} \times (-0.008174118) \\
& + E_{\text{IL-10}} \times 0.020753075 + E_{\text{KLRB1}} \times (-0.121245004) + E_{\text{PLEKHO1}} \times (-0.049187024) \\
& + E_{\text{RAC2}} \times (-0.003657534),
\end{aligned}
$$
wherein $E_{\text{C14orf79}}$ represents the expression value of the gene C14orf79, $E_{\text{C1orf168}}$ represents the expression value of the gene C1orf168, $E_{\text{C1orf226}}$ represents the expression value of the gene C1orf226, $E_{\text{CELSR2}}$ represents the expression value of the gene CELSR2, $E_{\text{FABP7}}$ represents the expression value of the gene FABP7, $E_{\text{FGFBP1}}$ represents the expression value of the gene FGFBP1, $E_{\text{IL-10}}$ represents the expression value of the gene IL-10, $E_{\text{KLRB1}}$ represents the expression value of the gene KLRB1, $E_{\text{PLEKHO1}}$ represents the expression value of the gene PLEKHO1, and $E_{\text{RAC2}}$ represents the expression value of the gene RAC2.
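As an illustration of how the ten-gene formula of claim 5 might be evaluated, the following Python sketch computes the weighted sum. The coefficients are copied from the claim; the names `COEFFICIENTS` and `risk_score`, the dictionary-based input, and the expression values in the usage example are assumptions made for illustration. The sketch also assumes the expression values are normalized the same way as the data used to fit the model, which the claim does not specify.

```python
# Coefficients as recited in claim 5 (gene symbol -> coefficient).
COEFFICIENTS = {
    "C14orf79": -0.114731735,
    "C1orf168": -0.019429183,
    "C1orf226": -0.049258060,
    "CELSR2": -0.055863001,
    "FABP7": -0.028295228,
    "FGFBP1": -0.008174118,
    "IL-10": 0.020753075,
    "KLRB1": -0.121245004,
    "PLEKHO1": -0.049187024,
    "RAC2": -0.003657534,
}


def risk_score(expression):
    """Compute RS = sum(c_i * E_i) over the ten model genes.

    `expression` maps gene symbols to expression values; every model
    gene must be present, otherwise a KeyError is raised.
    """
    missing = [gene for gene in COEFFICIENTS if gene not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


# Hypothetical subject with every gene expressed at 1.0 (illustration only).
example = {gene: 1.0 for gene in COEFFICIENTS}
print(round(risk_score(example), 9))
```

Nine of the ten coefficients are negative, so in this model higher expression of most genes lowers the score, and only IL-10 raises it.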
6. The construction method of claim 4, wherein constructing the risk scoring model based on the candidate genes further comprises:
evaluating the predictive performance of the risk scoring model based on the training dataset.
7. The construction method according to claim 6, wherein evaluating the predictive performance of the risk scoring model based on the training dataset comprises:
calculating a risk score for each subject in the training dataset based on the risk scoring model;
determining a group cutoff value according to the risk scores of all the subjects in the training data set, and dividing the subjects in the training data set into a first high risk group and a first low risk group according to the group cutoff value; and
evaluating whether the first high-risk group and the first low-risk group have a significant difference in survival using a Kaplan-Meier curve of the training dataset.
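The grouping and survival-curve steps of claim 7 can be sketched in plain Python as follows. The claim does not state how the group cutoff value is chosen, so the median used here is an assumption, as are the names `split_by_cutoff` and `kaplan_meier` and the toy data. The sketch computes the Kaplan-Meier (product-limit) estimate only; a significance test between the two curves (commonly a log-rank test) is not shown.

```python
from statistics import median


def split_by_cutoff(scores, cutoff=None):
    """Split subject IDs into high- and low-risk groups.

    `scores` maps subject ID -> risk score. If no cutoff is given, the
    median score is used (an assumption; the claim leaves this open).
    """
    if cutoff is None:
        cutoff = median(scores.values())
    high = [sid for sid, s in scores.items() if s > cutoff]
    low = [sid for sid, s in scores.items() if s <= cutoff]
    return cutoff, high, low


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    `times` are follow-up times; `events` are 1 for an observed event
    (e.g. death) and 0 for a censored observation.
    """
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy data for illustration only.
scores = {"a": 0.1, "b": 0.5, "c": 0.9, "d": 0.3}
cutoff, high, low = split_by_cutoff(scores)
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(cutoff, high, low)
print(curve)
```

Passing the training-set cutoff explicitly as `cutoff=` lets the same function group subjects of another dataset with an unchanged threshold.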
8. The construction method according to claim 7, wherein evaluating the predictive performance of the risk scoring model based on the training dataset further comprises:
performing multifactor Cox regression analysis on the training dataset to evaluate the robustness of the risk score in predicting survival; and
assessing the goodness-of-fit of the risk scoring model using receiver operating characteristic (ROC) curve analysis of the training dataset.
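For the ROC (receiver operating characteristic) curve analysis named in claim 8, the area under the curve can be computed without plotting, as in the minimal Python sketch below. It assumes a binary outcome at a fixed time point (for example, event versus no event by some follow-up time); the disclosure may instead use a time-dependent ROC, which this sketch does not implement. The function name `roc_auc` and the toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive subject
    (label 1) has a higher score than a randomly chosen negative
    subject (label 0), with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only: 1 = event observed, 0 = no event.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two outcome groups.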
9. The construction method according to claim 7 or 8, wherein constructing the risk scoring model based on the candidate genes further comprises:
obtaining a validation dataset; and
validating the efficacy of the risk scoring model based on the validation dataset.
10. The construction method according to claim 9, wherein validating the efficacy of the risk scoring model based on the validation dataset comprises:
calculating a risk score for each subject in the validation dataset based on the risk scoring model; and
dividing all subjects in the validation dataset into a second high-risk group and a second low-risk group according to the group cutoff value, and validating whether the second high-risk group and the second low-risk group have a significant difference in survival using a Kaplan-Meier curve of the validation dataset.
11. The construction method according to claim 1 or 2, further comprising:
combining the prognostic indicators of risk score, pathological stage, and age, and constructing a nomogram model using multifactor Cox regression analysis;
wherein the breast cancer prognosis model further comprises the nomogram model.
12. A method for applying a breast cancer prognosis model, wherein the breast cancer prognosis model comprises the risk scoring model constructed according to the construction method of any one of claims 1 to 11, and the application method comprises:
obtaining tumor tissue transcript expression data of a subject, wherein the tumor tissue transcript expression data of the subject comprises expression values of the genes used to construct the risk scoring model; and
calculating a risk score for the subject according to the risk scoring model based on the tumor tissue transcript expression data of the subject.
13. A method for applying a breast cancer prognosis model, wherein the breast cancer prognosis model comprises the nomogram model constructed according to the construction method of claim 11, and the application method comprises:
obtaining the age, pathological stage, and tumor tissue transcript expression data of a subject, wherein the tumor tissue transcript expression data of the subject comprises expression values of the genes used to construct the risk scoring model;
calculating a risk score for the subject according to the risk scoring model based on the tumor tissue transcript expression data of the subject; and
predicting survival of the subject according to the nomogram model based on the age, pathological stage, and risk score of the subject.
14. An electronic device, comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the construction method according to any one of claims 1-11 or the application method according to claim 12 or 13.
15. A storage medium storing non-transitory computer readable instructions, wherein the non-transitory computer readable instructions, when executed by a computer, perform the construction method according to any one of claims 1-11 or the application method according to claim 12 or 13.
Application number: CN202110061949.2A; Priority date: 2021-01-18; Filing date: 2021-01-18; Title: Breast cancer prognosis model construction method, application method and electronic equipment; Status: Pending; Publication: CN112735529A (en)


Publications (1)

Publication Number Publication Date
CN112735529A true CN112735529A (en) 2021-04-30

Family

ID=75592006


Country Status (1)

Country Link
CN (1) CN112735529A (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination