WO2022214036A1 - Method for predicting drug sensitivity state, device, and storage medium

Publication number
WO2022214036A1
Authority
WO
WIPO (PCT)
Prior art keywords
drug
information
gene
neural network
variation
Prior art date
Application number
PCT/CN2022/085628
Other languages
French (fr)
Chinese (zh)
Inventor
王凯
罗培韬
俞燕飞
Original Assignee
至本医疗科技(上海)有限公司
上海至本医学检验所有限公司
Priority date
Filing date
Publication date
Application filed by 至本医疗科技(上海)有限公司, 上海至本医学检验所有限公司

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • G16H70/40 ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Definitions

  • the present disclosure relates generally to biological information processing, and in particular, to methods, devices, and storage media for predicting drug susceptibility states.
  • tumor cell heterogeneity often leads to unstable drug response, which presents a major challenge in the field of tumor drug development.
  • the traditional method for predicting drug susceptibility status uses combined mutation detection of drug metabolism-related gene loci (DPYD*2A, DPYD*5A, DPYD*9A, MTHFR, TS, and GSTP1) to give patients drug susceptibility guidance.
  • the traditional methods for predicting drug susceptibility status have the disadvantages that the generality is not ideal, and the accuracy of the predicted drug susceptibility is not high.
  • the present disclosure provides a method, a computing device and a computer storage medium for predicting a drug sensitivity state, which can accurately predict drug sensitivity and have good generality.
  • a method of predicting a drug susceptibility state includes: acquiring gene variation information of a sample to be tested and drug information about a drug, the drug information including at least a drug identifier and drug molecular formula structure information; acquiring drug sensitivity state data determined by a cell activity test on a cell and a corresponding drug; preprocessing the gene variation information and the drug information to generate multiple gene variation characterization data and multiple drug characterization data for combining into multiple input sample sets; extracting, based on a first neural network model, features of the gene variation characterization data in an input sample set to generate gene variation features; extracting, based on a second neural network model, features of the drug characterization data in the input sample set to generate drug features; fusing the gene variation features and the drug features; and extracting, based on a third neural network model, features of the fused gene variation features and drug features to predict the drug sensitivity state of the sample to be tested for the corresponding drug, the first neural network model, the second neural network model and the third neural network model being trained via multiple samples.
  • a computing device comprises: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the computing device to perform the method of the first aspect of the present disclosure.
  • a computer-readable storage medium has stored thereon machine-executable instructions that, when executed, cause a machine to perform the method of the first aspect of the present disclosure.
  • the sample to be tested is a cell line or a primary cell.
  • the drug susceptibility status data is determined through a cell activity test on the cell line and the corresponding drug.
  • generating a plurality of gene variation characterization data and a plurality of drug characterization data for combining into multiple input sample sets includes: generating, based on the preprocessed gene variation information, a one-dimensional gene variation characterization feature and a two-dimensional gene variation characterization feature, respectively, the one-dimensional gene variation characterization feature indicating cell line identification information, gene identification information, and variation impact type information, and the two-dimensional gene variation characterization feature indicating cell line identification information and microsatellite instability state information of the cell line; and generating a third gene variation characterization feature based on the two-dimensional gene variation characterization feature and corresponding two-dimensional weight data.
  • generating the multiple gene variation characterization data and multiple drug characterization data for combining into multiple input sample sets further includes: generating, based on the preprocessed drug information, drug characterization data in simplified molecular-input line-entry system (SMILES) format, drug characterization data in chemical fingerprint format, and drug characterization data in adjacency matrix structure graph format; and combining the gene variation characterization features with the drug characterization data in SMILES format, in chemical fingerprint format, and in adjacency matrix structure graph format to generate multiple input sample sets, each input sample set including one gene variation characterization feature and one drug characterization data.
  • preprocessing the gene variation information and drug information further includes: selecting, from the acquired gene variation information of the cell line, gene variation information associated with genes belonging to a predetermined set; generating variant effect type information from the selected gene variation information; and, based on the drug susceptibility status data determined by cell viability assays on cell lines and corresponding drugs, removing gene variation information and drug information that meet at least one of the following: the obtained drug susceptibility status data of the cell line and corresponding drug are all unstable; and the corresponding drug lacks drug molecular formula structure information.
  • the variant impact type information includes information on gene activation, gene inactivation, gene rearrangement, potential clinical significance, unclear clinical significance, and drug resistance.
  • the microsatellite instability status information includes information on microsatellite stability, microsatellite low instability, microsatellite high instability, and uncertain microsatellite stability.
  • the number of eigenvalues of the one-dimensional gene variation characterization feature is equal to the number of genes multiplied by the number of gene mutation states, plus the number of microsatellite instability states of the cell line; the rows of the two-dimensional gene variation characterization feature indicate the corresponding genes of the cell line, and the columns of the two-dimensional gene variation characterization feature indicate variant effect type information or microsatellite instability status information.
  • the method for predicting drug susceptibility status further comprises: determining the first neural network model and the second neural network model such that the first neural network model matches the type of gene variation characterization data in the input sample set, and the second neural network model matches the type of drug characterization data in the input sample set.
  • the method for predicting drug susceptibility status further includes: dividing each input sample set into a training data set, a validation data set, and a test data set; and, for each input sample set, determining, based on root mean square error, how well the first neural network model, the second neural network model, and the third neural network model trained on the training data set fit the validation data set, so as to determine the first neural network model, the second neural network model and the third neural network model to be applied to the test data set.
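  • The split-and-select step above can be sketched as follows. This is a minimal illustration, not the disclosure's implementation: the 80/10/10 split ratio and the helper names `split_dataset`, `rmse`, and `select_best` are assumptions introduced here.

```python
import math
import random


def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle samples and split them into training, validation, and test subsets.

    The 80/10/10 ratio is an assumption; the source does not state one.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def select_best(validation_rmse_by_candidate):
    """Return the candidate (an input sample set with its trained models)
    whose validation RMSE is lowest."""
    return min(validation_rmse_by_candidate, key=validation_rmse_by_candidate.get)
```

  • Each candidate here stands for one input sample set together with the three models trained on it; only the winner on the validation data set would then be evaluated on the test data set.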
  • FIG. 1 shows a schematic diagram of a system for a method of predicting a drug susceptibility state according to an embodiment of the present disclosure.
  • FIG. 2 shows a flowchart of a method for predicting a drug susceptibility state according to an embodiment of the present disclosure.
  • FIG. 3 shows a schematic diagram of a neural network structure for predicting a drug sensitivity state according to an embodiment of the present disclosure.
  • Figure 4 schematically shows a schematic diagram of the 3D structure of a certain drug.
  • FIG. 5 shows a flowchart of a method for combining into sets of input samples according to an embodiment of the present disclosure.
  • FIG. 6 shows a flowchart of a data preprocessing method for gene variation information and drug information according to an embodiment of the present disclosure.
  • Figure 7 schematically illustrates a block diagram of an electronic device suitable for implementing embodiments of the present disclosure.
  • the term “including” and variations thereof mean open-ended inclusion, i.e., "including but not limited to".
  • the term “or” means “and/or” unless specifically stated otherwise.
  • the term “based on” means “based at least in part on”.
  • the terms “one example embodiment” and “one embodiment” mean “at least one example embodiment.”
  • the term “another embodiment” means “at least one additional embodiment.”
  • the terms “first”, “second”, etc. may refer to different or the same objects.
  • traditional methods for predicting drug susceptibility status mostly focus on predicting IC50 values of a single type of drug against a few genes, so the generality of the models used is not ideal; and because they do not take into account the effect of different drug and gene representations on model accuracy, the predicted drug sensitivity is less accurate.
  • example embodiments of the present disclosure propose a scheme for predicting drug susceptibility status.
  • in this scheme, the gene variation information of the sample to be tested and the drug information about the drug are acquired and preprocessed to generate multiple gene variation characterization data and multiple drug characterization data, which are combined into multiple input sample sets.
  • the present disclosure can enable the input sample set to contain multiple different feature representations of drugs and genes, thereby enabling the present disclosure to consider the effects of different feature representations of drugs and genes on the accuracy of the prediction model.
  • in addition, the present disclosure uses the first neural network model and the second neural network model to extract the features of the gene variation characterization data and the features of the drug characterization data, respectively, and fuses the extracted gene variation features and drug features; it then uses the trained third neural network model to extract features of the fused gene variation features and drug features, for use in determining a prediction result about the drug sensitivity state of the sample to be tested for the corresponding drug.
  • the present disclosure can treat the multiple feature representations of genes and drugs, and the feature extraction methods of different neural networks, as factors in parameter tuning and optimization of the neural network model for predicting drug sensitivity, so that the drug sensitivity prediction scheme of the present disclosure can cover a wider range of genes and drugs, with better generalization ability and more accurate predicted values. Therefore, the present disclosure can accurately predict drug sensitivity and has good generality.
  • FIG. 1 shows a schematic diagram of a system 100 for a method of predicting a drug susceptibility state according to an embodiment of the present disclosure.
  • the system 100 includes, for example, a computing device 110, an information server 150 and a network 170.
  • the computing device 110 may exchange data with the information server 150 in a wired or wireless manner through the network 170.
  • Computing device 110 is used to predict drug susceptibility status. Specifically, the computing device 110 is used to obtain the gene variation information of the sample to be tested and the drug information about the drug, and to obtain the drug sensitivity state data determined by the cell activity test on the cell and the corresponding drug. The computing device 110 is further configured to preprocess the gene variation information and the drug information, so as to generate multiple gene variation characterization data and multiple drug characterization data for combining into multiple input sample sets. The computing device 110 is further configured to extract features of the gene variation characterization data in the input sample set to generate gene variation features, to extract features of the drug characterization data in the input sample set to generate drug features, to fuse the gene variation features and the drug features, and to extract features of the fused gene variation features and drug features based on the third neural network model, so as to predict the drug sensitivity state of the sample to be tested for the corresponding drug.
  • computing device 110 may have one or more processing units, including special-purpose processing units such as GPUs, FPGAs, and ASICs, as well as general-purpose processing units such as CPUs. Additionally, one or more virtual machines may also be running on each computing device.
  • the computing device 110 is, for example, a server configured with a GPU, and the server is compatible with PyTorch and TensorFlow.
  • the server is also configured with, for example and without limitation, CUDA (8.0 or 9.0), a graphics driver, and Anaconda or Miniconda software.
  • the server is also configured with various software such as Python, torch, numpy, xlrd, Pillow, rdkit, for example and without limitation.
  • the computing device 110 includes, for example, a gene variation information and drug information acquisition unit 112, a drug sensitivity state data acquisition unit 114, a preprocessing unit 116, a gene variation feature generation unit 118, a drug feature generation unit 120, a fusion unit 122, and a drug sensitivity state prediction unit 124.
  • the above-mentioned gene variation information and drug information acquisition unit 112, drug sensitivity state data acquisition unit 114, preprocessing unit 116, gene variation feature generation unit 118, drug feature generation unit 120, fusion unit 122, and drug sensitivity state prediction unit 124 can be configured on one or more computing devices 110.
  • the gene variation information and drug information acquisition unit 112 is used to acquire gene variation information of the sample to be tested and drug information about drugs, the drug information including at least a drug identifier and drug molecular formula structure information.
  • the drug sensitivity state data acquisition unit 114 is used to acquire drug sensitivity state data determined by the cell activity test on the cells and the corresponding drug.
  • the preprocessing unit 116 is used to preprocess the gene variation information and drug information, so as to generate multiple gene variation characterization data and multiple drug characterization data for combining into multiple input sample sets.
  • the gene variation feature generation unit 118 is configured to extract the features of the gene variation characterization data in the input sample set based on the first neural network model, so as to generate the gene variation features.
  • the drug feature generation unit 120 is used to extract features of the drug characterization data in the input sample set based on the second neural network model, so as to generate drug features.
  • the fusion unit 122 is used to fuse the gene variation features and the drug features.
  • the drug sensitivity state prediction unit 124 is used to extract features of the fused gene variation features and drug features based on the third neural network model, so as to predict the drug sensitivity state of the sample to be tested for the corresponding drug; the first neural network model, the second neural network model and the third neural network model are trained via multiple samples.
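  • The data flow through units 118, 120, 122 and 124 can be sketched as below. This is a toy stand-in under stated assumptions, not the disclosed models: each "neural network" is reduced to a single dense layer, fusion is assumed to be concatenation (the text only says "fuse"), and the names `linear`, `relu`, `predict_sensitivity` and the `params` keys are hypothetical.

```python
def linear(weights, bias, x):
    """One dense layer: `weights` is a list of rows, `x` a flat feature list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in x]


def predict_sensitivity(gene_input, drug_input, params):
    """Two feature extractors, a fusion step, and a prediction head."""
    # First model (unit 118): gene variation features.
    gene_features = relu(linear(params["gene_w"], params["gene_b"], gene_input))
    # Second model (unit 120): drug features.
    drug_features = relu(linear(params["drug_w"], params["drug_b"], drug_input))
    # Fusion (unit 122): concatenation is an assumption made for this sketch.
    fused = gene_features + drug_features
    # Third model (unit 124): maps fused features to one predicted value.
    return linear(params["head_w"], params["head_b"], fused)[0]


toy_params = {
    "gene_w": [[1.0, 0.0], [0.0, 1.0]], "gene_b": [0.0, 0.0],
    "drug_w": [[0.5, 0.5]], "drug_b": [0.0],
    "head_w": [[1.0, 1.0, -1.0]], "head_b": [0.5],
}
prediction = predict_sensitivity([1.0, -2.0], [2.0, 4.0], toy_params)
```

  • In a real system the three models would be trained jointly on many samples so that the predicted value approximates the measured drug sensitivity state data.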
  • FIG. 2 shows a flowchart of a method 200 for predicting a drug susceptibility state according to an embodiment of the present disclosure.
  • FIG. 3 shows a schematic diagram of a neural network structure for predicting a drug sensitivity state according to an embodiment of the present disclosure. It should be understood that the method 200 may be performed, for example, at the electronic device 700 described in FIG. 7. It may also be executed at the computing device 110 depicted in FIG. 1. It should be understood that the method 200 may also include additional actions not shown and/or the actions shown may be omitted, and the scope of the present disclosure is not limited in this regard.
  • the computing device 110 acquires the gene variation information of the sample to be tested and the drug information about the drug, where the drug information includes at least the drug identifier and the drug molecular formula structure information.
  • the sample to be tested is, for example and without limitation, primary cells, cell strains or cell lines.
  • primary cells are cells that are dissociated from tissue into single cells, by protease digestion or other methods, and cultured in vitro to simulate conditions in the body.
  • a cell line is a population of cells propagated after the first successful passage of a primary cell culture; the term also refers to cultured cells that can be serially passaged over long periods.
  • Cell lines are, for example and without limitation, tumor cells. Tumor cells may carry a variety of mutations.
  • Each cell line has a defined cell line identifier (e.g., cell line name). The embodiments of the present disclosure are exemplified below by taking the sample to be tested as a cell line.
  • Gene variation information of the sample to be tested includes, for example, the cell line ID and the gene variation information corresponding to that cell line ID, such as single nucleotide variation (SNV), gene copy number variation (CNV), gene structure variation (SV) and microsatellite instability (MSI).
  • Drug identifiers are, for example, drug names, such as Camptothecin and Vinblastine shown in Table 1 below.
  • the drug formula structure information is, for example, the SMILES formula.
  • the computing device 110 may obtain the gene variation information of cell lines and the cellular reactivity data of the corresponding drugs from public databases of cell line genomic information and drug responsiveness, such as NCI-60, the Genomics of Drug Sensitivity in Cancer (GDSC), and the Cancer Cell Line Encyclopedia (CCLE).
  • computing device 110 may obtain, from GDSC, whole-exome sequencing (WES) gene variation information of cell lines together with drug IC50 data.
  • the range of drugs covered by the term "relevant drug of the sample to be tested (e.g., cell line)" may be larger than the range of drugs covered by the term "corresponding drug of the cell line" in "cell reactivity data of the corresponding drug of the cell line".
  • the related drug is, for example and without limitation, a targeted drug for tumor cells.
  • Targeted drugs are, for example, specially designed drugs that can identify tumor cell-specific gene mutations and target already defined carcinogenic sites. Some related drugs are prone to develop resistance.
  • Reasons for drug resistance include, for example: the target itself is altered by mutation, making the targeted drug less effective for a specific cell line; or cell lines (e.g., tumor cells) find new ways to achieve target-independent tumor growth.
  • the structural information of the drug molecular formula can be obtained, for example, as follows: first, determine the name of the drug used, such as Trametinib, or its drug identifier, such as the drug CID number 11707110; then, obtain the SMILES molecular formula through PaDEL software, or look up the SMILES molecular formula of the corresponding drug by its name at https://pubchem.ncbi.nlm.nih.gov/.
  • computing device 110 obtains drug susceptibility status data as determined by cell viability assays of cells and corresponding drugs.
  • the drug sensitivity state data is, for example, the half inhibitory concentration value (IC50 value) of the drug for the corresponding cell line, obtained by performing the cell activity test on the cell line and the corresponding drug.
  • the drug sensitivity state data can also be the half inhibitory concentration value (IC50 value) of the drug for the corresponding cell, obtained through the cell activity test on the primary cell and the corresponding drug.
  • the IC50 value can be used to measure the ability of the corresponding drug to induce apoptosis in a given sample (for example, primary cells or a cell line): the stronger the induction ability, the lower the value. Conversely, the value indicates the degree of resistance of a given cell to the corresponding drug.
  • GDSC stores fluorescence data representing cell viability when cells are treated with different drug concentrations in the laboratory, as well as the logarithmic half inhibitory concentration values (LN_IC50), obtained by fitting IC50 curves, of more than 500 drugs against more than 1,000 human tumor cell lines.
  • the computing device 110 obtains the log half inhibitory concentration IC50 values (LN_IC50) of more than 500 drugs against more than 1000 human tumor cell lines from the GDSC public database.
  • the following table 1 exemplifies the drug susceptibility status data determined by the cell viability test of the cell line and the corresponding drug.
  • the drug susceptibility status data (logarithmic IC50 value, LN_IC50) of the cell line named HCC1954 against the corresponding drug named Camptothecin is -0.251083.
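  • As a worked illustration of the logarithmic scale, the sketch below converts between IC50 and LN_IC50. It assumes that LN_IC50 denotes the natural logarithm of the IC50 and that the IC50 is expressed in micromolar; neither assumption is stated in the text, and the function names are hypothetical.

```python
import math


def ln_ic50(ic50_micromolar):
    """Natural logarithm of an IC50 value (unit assumed to be micromolar)."""
    return math.log(ic50_micromolar)


def ic50_from_ln(ln_value):
    """Inverse transform: recover the IC50 from its natural logarithm."""
    return math.exp(ln_value)


# Value quoted above for HCC1954 and Camptothecin.
camptothecin_ic50 = ic50_from_ln(-0.251083)  # roughly 0.78 under the stated assumptions
```

  • A negative LN_IC50 therefore corresponds to an IC50 below 1 in the assumed unit, and lower values indicate greater sensitivity.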
  • the computing device 110 preprocesses the genetic variation information and the drug information to generate multiple genetic variation characterization data and multiple drug characterization data for combining into sets of input samples.
  • gene variation characterization data includes, for example, three different types: a one-dimensional gene variation characterization feature (abbreviated "Multi-vec"), a two-dimensional gene variation characterization feature (abbreviated "Multi-mat"), and a third gene variation characterization feature (abbreviated "Multi-mat embed").
  • the various gene variation characterization data are generated, for example, as follows: the computing device 110 generates a one-dimensional gene variation characterization feature and a two-dimensional gene variation characterization feature based on the preprocessed gene variation information, the one-dimensional gene variation characterization feature indicating cell line identification information, gene identification information and variation impact type information, and the two-dimensional gene variation characterization feature indicating cell line identification information and microsatellite instability state information of the cell line; and generates a third gene variation characterization feature based on the two-dimensional gene variation characterization feature and corresponding two-dimensional weight data. The variation impact type information includes, for example, information on gene activation, gene inactivation, gene rearrangement, potential clinical significance, unclear clinical significance, and drug resistance.
  • the one-dimensional gene variation characteristic feature is, for example, a one-dimensional vector represented by the following expression (1) (which is, for example, a one-dimensional array composed of a plurality of feature values of 0 or 1).
  • One-dimensional genetic variation characterization features may be referred to as "Multi-vec" for short.
  • the number of eigenvalues of a one-dimensional genetic variation characterizing feature is equal to the number of genes multiplied by the number of gene mutation states plus the number of microsatellite instability states of the cell line.
  • the characteristic value of 1 indicates that the corresponding gene has the corresponding type of gene mutation state or the cell line has the corresponding microsatellite instability state.
  • 1*(N*M+K) represents the number of eigenvalues of the one-dimensional gene variation characteristic feature.
  • N represents the number of genes.
  • M represents the number of gene mutation states.
  • K represents the number of microsatellite instability states of the cell line.
  • Table 2 schematically shows, for example, each feature value in the one-dimensional gene variation characterization feature Multi-vec of the cell line named Ca9-22.
  • a feature value of "1" corresponding to ABCB1_del indicates that a deletion variant (Del) exists in the corresponding gene ABCB1.
  • the eigenvalue "1" corresponding to ABCB1_VUS indicates the presence of Variants of Uncertain Significance (VUS) in the corresponding gene ABCB1.
  • a feature value of "0" corresponding to A1CF_VUS indicates that there is no variant of undetermined significance (VUS) in the corresponding gene A1CF.
  • a feature value of "1" corresponding to MSI-S indicates the presence of a "microsatellite stable” MSI state.
  • a characteristic value of "0" corresponding to MSI-H indicates the absence of a "microsatellite highly unstable" MSI state.
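  • The Multi-vec encoding described above can be sketched as a flat 0/1 vector of length N*M + K. The gene list, the two variant states, the "MSI-L" label, and the value for A1CF_del are illustrative assumptions; only the ABCB1_del, ABCB1_VUS, A1CF_VUS, MSI-S and MSI-H values follow the Table 2 description. The function name `build_multi_vec` is hypothetical.

```python
def build_multi_vec(genes, mutation_states, msi_states, observed_variants, observed_msi):
    """Return (labels, values) for a one-dimensional 0/1 characterization vector.

    The vector has len(genes) * len(mutation_states) + len(msi_states) entries.
    """
    labels = [f"{gene}_{state}" for gene in genes for state in mutation_states]
    labels += list(msi_states)
    active = {f"{gene}_{state}" for gene, state in observed_variants}
    active.add(observed_msi)
    values = [1 if label in active else 0 for label in labels]
    return labels, values


genes = ["A1CF", "ABCB1"]                 # N = 2 (illustrative subset)
mutation_states = ["del", "VUS"]          # M = 2 (illustrative subset)
msi_states = ["MSI-S", "MSI-L", "MSI-H"]  # K = 3 (labels partly assumed)

labels, multi_vec = build_multi_vec(
    genes, mutation_states, msi_states,
    observed_variants={("ABCB1", "del"), ("ABCB1", "VUS")},
    observed_msi="MSI-S",
)
# labels:    A1CF_del A1CF_VUS ABCB1_del ABCB1_VUS MSI-S MSI-L MSI-H
# multi_vec: [0, 0, 1, 1, 1, 0, 0]
```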
  • the two-dimensional gene variation characteristic feature is, for example, a two-dimensional matrix represented by the following expression (2) (which is, for example, a two-dimensional matrix composed of a plurality of eigenvalues of 0 or 1).
  • the two-dimensional genetic variation characterization feature may be referred to as "Multi-mat" for short.
  • the rows of the two-dimensional gene variation characterization feature indicate the corresponding genes of the cell line, and the columns indicate variant effect type information or microsatellite instability status information.
  • N*M+K represents the dimension of the two-dimensional gene variation characteristic feature.
  • N represents the number of genes.
  • M represents the number of gene mutation states.
  • K represents the number of microsatellite instability states of the cell line.
  • Table 3 schematically shows, for example, each feature value in the two-dimensional gene variation characterization feature Multi-mat of the cell line named Ca9-22.
  • A1CF and ABCB1 represent corresponding genes.
  • the eigenvalue "0" in the VUS column corresponding to the A1CF row indicates that there is no variant of undetermined significance (VUS) in the corresponding gene A1CF.
  • the characteristic value "1" corresponding to the MSI-S column of the ABCB1 row indicates that the corresponding gene ABCB1 has a "microsatellite stable" MSI status.
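  • A matching sketch for Multi-mat follows, with one row per gene and one column per variant effect type or MSI state. Two points are assumptions made for this illustration: the matrix is taken to have N rows and M + K columns (as the row and column description suggests), and the cell-line-level MSI state is repeated in every gene row (as the ABCB1 / MSI-S entry suggests). The function name `build_multi_mat` and the "MSI-L" label are hypothetical.

```python
def build_multi_mat(genes, mutation_states, msi_states, observed_variants, observed_msi):
    """Return (columns, matrix) for a two-dimensional 0/1 characterization matrix."""
    columns = list(mutation_states) + list(msi_states)
    matrix = []
    for gene in genes:
        # Variant columns: 1 if this gene carries the given variant type.
        row = [1 if (gene, state) in observed_variants else 0 for state in mutation_states]
        # MSI columns: cell-line-level state, repeated per gene row (assumption).
        row += [1 if state == observed_msi else 0 for state in msi_states]
        matrix.append(row)
    return columns, matrix


columns, multi_mat = build_multi_mat(
    genes=["A1CF", "ABCB1"],
    mutation_states=["del", "VUS"],
    msi_states=["MSI-S", "MSI-L", "MSI-H"],
    observed_variants={("ABCB1", "del"), ("ABCB1", "VUS")},
    observed_msi="MSI-S",
)
# columns:   del VUS MSI-S MSI-L MSI-H
# A1CF  row: [0, 0, 1, 0, 0]
# ABCB1 row: [1, 1, 1, 0, 0]
```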
  • the third gene variation characteristic feature is generated based on the two-dimensional gene variation characteristic feature and the corresponding two-dimensional weight data.
  • the third gene variation characteristic feature may be referred to as "Multi-mat embed" for short, which is generated by multiplying the two-dimensional matrix of Multi-mat by a corresponding two-dimensional weight matrix, for example.
  • the two-dimensional weight matrix is iteratively adjusted according to the training of the first neural network model.
  • Multi-mat embed is based on Multi-mat multiplied by a neural network embedding layer (for example, multiplied by the weight value of the embedding layer of the neural network).
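  • The multiplication that produces Multi-mat embed can be sketched as a plain matrix product. The weight values and the embedding width of 2 are arbitrary placeholders; in the described scheme the weights would be learned during training of the first neural network model. The helper name `matmul` is hypothetical.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x d), both given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Toy Multi-mat: 2 genes x 3 columns of 0/1 feature values.
multi_mat = [
    [0, 0, 1],
    [1, 1, 1],
]
# Placeholder two-dimensional weight matrix: 3 columns -> embedding width 2.
weights = [
    [0.5, -1.0],
    [0.25, 2.0],
    [1.0, 0.0],
]
multi_mat_embed = matmul(multi_mat, weights)
# Each gene row becomes a dense vector: [[1.0, 0.0], [1.75, 1.0]]
```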
  • drug characterization data includes, for example, drug characterization data in simplified molecular-input line-entry system (SMILES) format (abbreviated "Smiles-mat"), drug characterization data in chemical fingerprint format (abbreviated "Fingerprint"), and drug characterization data in adjacency matrix structure graph format (abbreviated "graph").
  • the drug characterization data in SMILES format is, for example, obtained with PaDEL software.
  • SMILES characteristic of a drug is shown in the following expression (3).
  • The drug characterization data in chemical fingerprint format is generated, for example, by a chemical fingerprinting method that converts the drawn molecule of a drug into a stream of 0 and 1 bits.
  • The drug characterization data in adjacency matrix structure graph format is, for example, a two-dimensional adjacency matrix abstracted from the molecule based on the SMILES molecular formula structure information.
  • The drug characterization data in chemical fingerprint format and in adjacency matrix structure graph format are described in detail below with reference to FIG. 5 and are not repeated here.
  • The method includes: combining one gene variation characterization data selected from the plurality of gene variation characterization data with one drug characterization data selected from the plurality of drug characterization data, respectively, to generate multiple input sample sets.
  • Each input sample set includes one gene variant characterization data and one drug characterization data.
  • the computing device 110 extracts features of the gene variation characterization data in the input sample set based on the first neural network model to generate gene variation features.
  • The first neural network model is constructed, for example, based on a convolutional neural network (CNN) model.
  • The first neural network model includes, for example, a convolution layer and a pooling activation layer.
  • For example, one first neural network model is constructed based on a graph convolutional neural network (GCN).
  • Another first neural network model is constructed based on a convolutional neural network (CNN).
  • GCNs are suited to feature extraction from topological graphs in an abstract sense (for example, irregular graphs in which each graph has an unordered set of nodes of variable size and each node has a different number of adjacent nodes, so that it is difficult to convolve with a kernel of one fixed size).
  • CNNs extract spatial features effectively, especially from neatly arranged data such as the pixel matrix of an image, but their traditional discrete convolution has difficulty handling such irregular data. The first neural network models constructed based on different models therefore extract features from the gene variation characterization data in different ways, which helps determine a first neural network model whose network structure matches each kind of gene variation characterization data.
  • For example, different input sample sets are generated by combining multiple (e.g., 3) kinds of gene variation characterization data with multiple (e.g., 3) kinds of drug characterization data, and a first neural network or a second neural network is chosen to match the features of each sample set.
  • Each of the first neural network and the second neural network is trained on the training set using two different feature fusion strategies, based on third neural networks constructed from CNN and MLP, to generate a model trained for each feature combination.
  • The root mean square error (RMSE) is then used as the criterion to compare how well the first neural network, the second neural network, and the third neural network fit the validation set, and the model structure with the best performance is applied to the test set as the final model.
  • In this way, a first neural network model whose network structure matches each kind of gene variation characterization data can be determined.
  • The gene variation characterization data 312 in the input sample set is, for example, a third gene variation characterization feature (such as a feature represented in Multi-vec format), and the gene variation characterization data 312 is input into the first neural network model (not shown; for example, the first neural network model is constructed based on the CNN model).
  • Features are then extracted through the convolution layer and the pooling activation layer of the first neural network model to generate gene variation features (for example, the gene feature map 322 shown in FIG. 3).
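  • The convolution, activation, and pooling sequence mentioned above can be illustrated on a one-dimensional input. This is a toy sketch rather than the disclosed network: the kernel values, the pooling size, the choice of ReLU, and the input vector are arbitrary assumptions, and a real model would learn its kernels and use a deep learning framework.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation, the operation CNN convolution layers apply."""
    n = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n)
    ]


def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


encoded = [0, 1, 1, 0, 1, 0]   # hypothetical encoded input row
kernel = [1.0, -1.0]           # hypothetical kernel (learned in practice)

feature_map = max_pool1d(relu(conv1d(encoded, kernel)), size=2)
```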
  • the computing device 110 extracts features of the drug characterizing data in the input sample set based on the second neural network model in order to generate drug features.
  • The second neural network model is constructed, for example, based on the CNN model.
  • the second neural network model includes, for example, a convolution layer and a pooling activation layer.
  • the second neural network model may include a plurality of second neural network models constructed based on different models.
  • a second neural network model is constructed based on long short-term memory (LSTM).
  • Another second neural network model is constructed based on a convolutional neural network (CNN).
  • The second neural network models constructed based on different models likewise extract features from the drug characterization data in different ways, so a second neural network model whose network structure matches each kind of drug characterization data can be determined.
  • The drug characterization data 310 in the input sample set (for example, drug characterization data in Smiles-mat format) is input into a second neural network model (not shown; for example, the second neural network model is constructed based on the CNN model), and features are extracted through the convolution layer and the pooling activation layer of the second neural network model to generate drug features (for example, the drug feature map 320 shown in FIG. 3).
  • The computing device 110 fuses the gene variation features and the drug features. As shown in FIG. 3, the computing device 110 fuses 324 (e.g., concatenates) the gene feature map 322 generated by the first neural network and the drug feature map 320 generated by the second neural network for input to the third neural network model 330.
  • The computing device 110 extracts, based on the third neural network model, features of the fused gene variation features and drug features in order to predict the drug sensitivity state of the sample to be tested (for example, but not limited to, a cell line) for the corresponding drug. The first neural network model, the second neural network model, and the third neural network model are trained via multiple samples.
  • The third neural network model is constructed based on, for example, a multilayer perceptron (MLP).
  • The third neural network model is, for example, a neural network composed of fully connected layers with at least one hidden layer.
  • the third neural network model includes two fully connected layers.
  • the third neural network model may include a plurality of third neural network models constructed based on different models.
  • a third neural network model is constructed based on MLP.
  • Another third neural network model is constructed based on a convolutional neural network (CNN). The calculation method of the third neural network model constructed by MLP will be described below in conjunction with expressions (4) and (5).
  • represents an activation function
  • the activation function includes a ReLU function, a sigmoid function or a tanh function.
  • H stands for hidden layer.
  • O stands for output layer.
  • X stands for input.
  • b_h represents the coefficients of the hidden layer.
  • W_h represents the weight of the hidden layer.
  • b_o represents the coefficients of the output layer.
  • W_o represents the weight of the output layer.
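  • Expressions (4) and (5) themselves are not reproduced in this text. Given the symbol definitions above, a standard MLP with one hidden layer would take the form below. This is an assumed reconstruction, not the disclosed formulas: the symbol φ for the activation function, the reading of b_h and b_o as bias terms, and the row-vector convention (X multiplied on the left) are all assumptions.

```latex
\begin{aligned}
H &= \varphi\left(X W_h + b_h\right) \\
O &= H W_o + b_o
\end{aligned}
```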
  • The gene feature map 322 generated by the first neural network and the drug feature map 320 generated by the second neural network are fused 324 (e.g., concatenated) and input into the third neural network model 330 to predict the drug sensitivity state 332 of the sample to be tested (for example, without limitation, a cell line) for the corresponding drug.
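  • Fusion by concatenation followed by an MLP with one hidden layer can be sketched as follows. The feature vectors, weights, and biases are invented for illustration, ReLU is picked from the activation functions listed above, and a single scalar output is assumed; none of these values comes from the disclosure.

```python
def dense(x, weights, bias):
    """Fully connected layer: x (length n) times weights (n x m), plus bias (length m)."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def predict(gene_features, drug_features, w_h, b_h, w_o, b_o):
    """Concatenate the two feature vectors and run a one-hidden-layer MLP."""
    fused = gene_features + drug_features      # fusion by concatenation
    hidden = relu(dense(fused, w_h, b_h))      # hidden layer
    return dense(hidden, w_o, b_o)[0]          # scalar sensitivity score


# Hypothetical flattened feature maps and parameters.
gene_features = [1.0, 0.0]
drug_features = [0.5]
w_h = [[1.0, -1.0], [0.0, 2.0], [2.0, 0.0]]   # 3 inputs -> 2 hidden units
b_h = [0.0, 0.5]
w_o = [[0.5], [1.0]]                           # 2 hidden units -> 1 output
b_o = [0.1]

score = predict(gene_features, drug_features, w_h, b_h, w_o, b_o)
```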
  • L(y, f(x)) represents a loss function.
  • y represents the predicted result for the drug sensitivity state of the cell line for the corresponding drug.
  • f(x) represents the true value of the drug sensitivity state of the cell line for the corresponding drug. This true value is determined, for example, from the drug sensitivity state data obtained from cell viability assays of cell lines and corresponding drugs.
  • The gene variation information of the sample to be tested and the drug information of the corresponding drug are obtained, and the acquired gene variation information and drug information are preprocessed to generate multiple kinds of gene variation characterization data and multiple kinds of drug characterization data.
  • The input sample sets therefore carry multiple different feature representations of drugs and genes, so the present disclosure can take into account the influence of different feature representations of drugs and genes on the accuracy of the prediction model.
  • In addition, the present disclosure uses the first neural network and the second neural network to extract features of the gene variation characterization data and of the drug characterization data, respectively, fuses the extracted gene variation features and drug features, and uses the trained third neural network model to extract features of the fused gene variation features and drug features in order to determine a prediction result for the drug sensitivity state of the sample to be tested for the corresponding drug.
  • The present disclosure can thus treat both the multiple feature representations of genes and drugs and the feature extraction methods of different neural networks as factors in parameter tuning and optimization of the neural network model for predicting drug sensitivity, so that the drug sensitivity prediction scheme covers a wider range of genes and drugs, generalizes better, and predicts more accurately. The present disclosure can therefore accurately predict drug sensitivity and has good generality.
  • The method 200 further includes: the computing device 110 divides each input sample set into a training data set, a validation data set, and a test data set; and, for each input sample set, determines, based on the root mean square error, how well the first neural network model, the second neural network model, and the third neural network model obtained by training on the training data set fit the validation data set, in order to determine the first neural network model, the second neural network model, and the third neural network model to be applied to the test data set.
  • The computing device 110 divides each input sample set processed in step 206 into a training set, a validation set, and a test set in a certain proportion, following the principle of identical distribution and random sampling.
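  • A random split of this kind can be sketched as follows. The disclosure only says "a certain proportion", so the 80/10/10 ratio, the fixed seed, and the function name `split_samples` are assumptions made for the example.

```python
import random


def split_samples(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the samples and split them into training, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],            # remainder becomes the test set
    )


# 100 placeholder sample indices standing in for one input sample set.
train_set, val_set, test_set = split_samples(range(100))
```

  • Shuffling before slicing keeps the three subsets drawn from the same distribution, which is the point of the random sampling principle mentioned above.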
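  • For example, the training set is used to train the models with two different feature fusion strategies, based on third neural networks constructed from CNN and MLP, to generate a model trained for each feature combination.
  • The root mean square error (RMSE) is then used as the criterion for comparing how well the models fit the validation set.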
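  • Selecting a model structure by validation error can be sketched as below. The candidate names and the prediction values are hypothetical placeholders; only the use of root mean square error as the comparison criterion comes from the text.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def select_best(validation_truth, predictions_by_model):
    """Return the name of the model with the lowest validation RMSE, plus all scores."""
    scores = {
        name: rmse(validation_truth, preds)
        for name, preds in predictions_by_model.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical validation targets and per-model predictions.
truth = [1.0, 2.0, 3.0]
candidates = {
    "cnn+mlp": [1.0, 2.0, 4.0],
    "gcn+mlp": [1.5, 2.5, 3.5],
}

best, scores = select_best(truth, candidates)
```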
  • FIG. 5 shows a flowchart of a method 500 for combining into sets of input samples, according to an embodiment of the present disclosure. It should be understood that the method 500 may be performed, for example, at the electronic device 700 described in FIG. 7 . It may also execute at the computing device 110 depicted in FIG. 1 . It should be understood that the method 500 may also include additional actions not shown and/or the actions shown may be omitted, and the scope of the present disclosure is not limited in this regard.
  • The computing device 110 generates, based on the preprocessed gene variation information, a one-dimensional gene variation characterizing feature and a two-dimensional gene variation characterizing feature. The one-dimensional gene variation characterizing feature indicates cell line identification information, gene identification information, and variation impact type information; the two-dimensional gene variation characterizing feature indicates cell line identification information and cell line microsatellite instability status information.
  • the computing device 110 generates a third gene variation characteristic feature based on the two-dimensional gene variation characteristic feature and the corresponding two-dimensional weight data.
  • the third gene variation characteristic feature may be referred to as "Multi-mat embed" for short, which is generated by multiplying the two-dimensional matrix of Multi-mat by a corresponding two-dimensional weight matrix, for example.
  • the two-dimensional weight matrix is iteratively adjusted according to the training of the first neural network model.
  • Multi-mat embed is based on Multi-mat multiplied by a neural network embedding layer (for example, multiplied by the weight value of the embedding layer of the neural network).
  • the computing device 110 generates, based on the preprocessed drug information, drug characterization data in a simplified molecular linear input canonical format, drug characterization data in a chemical fingerprint format, and drug characterization data in an adjacency matrix structure graph format.
  • Drug characterization data in a chemical fingerprint format which is generated, for example, based on a drug-based chemical fingerprinting method, for converting the drawn molecules into a stream of 0 and 1 bits.
  • Drug characterization data in a chemical fingerprint format may be referred to, for example, as a "Fingerprint" for short.
  • The fingerprint type is, for example, MACCS keys, which yield a string over the characters “0” and “1”, such as “0000000000000000001000010000000001100010000000000000000000100100000000010100100001010100110001100100011100110100010011100101000101111100111111111111001010111110”.
  • Drug characterization data in Fingerprint format is generated, for example, by directly converting the SMILES features through the rdkit package.
  • Table 4 below schematically shows, for example, the SMILES molecular formula structure information of the drug named Trametinib (drug ID 11707110) and its drug characterization data in Fingerprint format.
  • In the adjacency matrix structure graph, each node (e.g., node 410) represents an atom in the SMILES formula, and each horizontal bar (e.g., horizontal bar 420) represents an edge connection between nodes (e.g., node 422 and node 424).
  • a drug can be collectively represented based on the adjacency matrix of the drug and the attribute matrix of the atoms.
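  • The pairing of an adjacency matrix with an atom attribute matrix can be sketched on a small molecule. The example uses ethanol (SMILES "CCO") with heavy atoms only; the one-hot element encoding, the element list, and the function name `build_graph` are assumptions, and in practice the atoms and bonds would be parsed from the SMILES string by a cheminformatics toolkit.

```python
def build_graph(atoms, bonds, elements=("C", "N", "O")):
    """Return (adjacency matrix, atom attribute matrix) for a molecule.

    atoms is a list of element symbols, one per node; bonds is a list of
    (i, j) index pairs, one per edge. Attributes are one-hot element codes.
    """
    n = len(atoms)
    adjacency = [[0] * n for _ in range(n)]
    for i, j in bonds:
        adjacency[i][j] = 1
        adjacency[j][i] = 1                    # bonds are undirected
    attributes = [[1 if atom == e else 0 for e in elements] for atom in atoms]
    return adjacency, attributes


# Ethanol, SMILES "CCO": two carbons and one oxygen in a chain.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
adjacency, attributes = build_graph(atoms, bonds)
```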
  • The computing device 110 combines one of the one-dimensional gene variation characterizing feature, the two-dimensional gene variation characterizing feature, and the third gene variation characterizing feature with one of the drug characterization data in the simplified molecular linear input canonical format, the drug characterization data in the chemical fingerprint format, and the drug characterization data in the adjacency matrix structure graph format, respectively, to generate multiple input sample sets, each of which includes one gene variation characterizing feature and one drug characterization data.
  • the computing device 110 generates 9 different sets of input samples in combination based on 3 kinds of gene variant characterization data and 3 kinds of drug characterization data.
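  • The 3 x 3 pairing can be enumerated with a Cartesian product. The labels below reuse the short names from the text where one is given; "one-dimensional" is a placeholder label for the one-dimensional gene variation characterizing feature.

```python
from itertools import product

GENE_REPRESENTATIONS = ["one-dimensional", "Multi-mat", "Multi-mat embed"]
DRUG_REPRESENTATIONS = ["Smiles-mat", "Fingerprint", "graph"]

# Each pair names one input sample set: (gene representation, drug representation).
input_sample_sets = list(product(GENE_REPRESENTATIONS, DRUG_REPRESENTATIONS))
```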
  • By the above means, the present disclosure can combine gene variation characterization data and drug characterization data in different representations into a variety of different types of input sample sets for the prediction model, which enriches the data learned by the prediction model and takes into account more application scenarios involving multiple gene mutations in cell lines and multiple drug combinations.
  • FIG. 6 shows a flowchart of a data preprocessing method 600 for gene variation information and drug information according to an embodiment of the present disclosure. It should be understood that the method 600 may be performed, for example, at the electronic device 700 described in FIG. 7. It may also execute at the computing device 110 depicted in FIG. 1. It should be understood that the method 600 may also include additional actions not shown and/or actions shown may be omitted, and the scope of the present disclosure is not limited in this regard.
  • the computing device 110 selects gene variation information associated with genes belonging to a predetermined set from the acquired gene variation information of the cell line.
  • The predetermined set is, for example, a set of genes associated with tumors.
  • For example, the computing device 110 screens more than 600 (e.g., 654) important tumor-related genes from the acquired cell line genes, and selects, from the acquired cell line gene variation information, the gene variation information associated with these more than 600 important genes.
  • In this way, gene variation information that is not highly correlated with tumors can be removed from the acquired gene variation information of the cell line, which helps improve the efficiency of subsequent model training and the accuracy of prediction.
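  • Restricting the variation records to a predetermined gene set amounts to a membership filter. The two-gene panel below stands in for the set of more than 600 genes, and the record fields and the non-panel gene name `GENE_X` are hypothetical.

```python
# Stand-in for the predetermined tumor-related gene set (more than 600 genes in the text).
TUMOR_GENE_PANEL = {"TSC2", "ROS1"}


def select_panel_variants(variants, panel):
    """Keep only the variation records whose gene belongs to the panel."""
    return [v for v in variants if v["gene"] in panel]


variants = [
    {"sample": "ZR-75-30", "gene": "TSC2"},
    {"sample": "ZR-75-30", "gene": "GENE_X"},   # hypothetical non-panel gene
    {"sample": "ZR-75-30", "gene": "ROS1"},
]

selected = select_panel_variants(variants, TUMOR_GENE_PANEL)
```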
  • the computing device 110 annotates the selected gene variation information to generate variation impact type information.
  • the computing device 110 performs biological function annotation classification on the variation of the screened genes.
  • the original cell line variation information is converted into first-type variation information and second-type variation information through biological function annotation classification.
  • The first type of variation information indicates association information between cell lines, genes, and variation impact types, for example: cell line (or “Sample name”) - gene (or “Gene name”) - variation impact type (or “MUT_TYPE”).
  • The second type of variation information indicates association information between cell lines and microsatellite instability states, for example: cell line (or “Sample name”) - microsatellite instability status (or “MSI status”).
  • Variation impact types include, for example: gene activation (or “active”), gene inactivation (or “inactive”), gene rearrangement (or “Fusion”), potential clinical significance (or “other”), unknown clinical significance (or “VUS”), and drug resistance (or “resistant”).
  • Table 5 illustrates the first type of variation information, i.e., the association information between cell lines, genes, and variation impact types.
  • For example, the gene named TSC2 in the cell line named ZR-75-30 has a variation impact type of unknown clinical significance, and the gene named ROS1 in the cell line named ZR-75-30 has a variation impact type of potential clinical significance.
  • Microsatellite instability states include, for example: microsatellite stable (or “MSI-S”), microsatellite low instability (or “MSI-L”), microsatellite high instability (or “MSI-H”), and uncertain (or “Unsure”).
  • Table 6 illustrates the second type of variation information, ie, the association information between cell lines and microsatellite instability states.
  • the microsatellites of the cell line designated CW-2 are highly unstable (or "MSI-H").
  • The computing device 110 removes gene variation information and drug information that meet at least one of the following: the acquired drug sensitivity state data for the cell line and the corresponding drug are unstable; and the corresponding drug lacks drug molecular formula structure information. For example, the computing device 110 removes the gene variation data of cell lines whose IC50 experimental values are unstable or whose corresponding drug has no SMILES drug formula.
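  • The two removal conditions can be sketched as a record filter. The field names `ic50_stable` and `smiles`, the cell line and drug names, and the way stability is flagged are assumptions for illustration; the text does not say how instability of the IC50 values is decided.

```python
def clean_records(records):
    """Drop records with unstable sensitivity data or no SMILES structure string."""
    return [r for r in records if r["ic50_stable"] and r.get("smiles")]


# Hypothetical cell line / drug pairs.
records = [
    {"cell_line": "LINE_A", "drug": "DRUG_A", "ic50_stable": True, "smiles": "CCO"},
    {"cell_line": "LINE_B", "drug": "DRUG_A", "ic50_stable": False, "smiles": "CCO"},
    {"cell_line": "LINE_A", "drug": "DRUG_B", "ic50_stable": True, "smiles": None},
]

cleaned = clean_records(records)
```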
  • By the above means, the present disclosure can convert quantitative gene variation information into qualitative variation impact type information and clean out data whose attribute values are missing or unstable, thereby obtaining complete preprocessed data, which in turn helps improve the machine learning performance of the neural network model.
  • FIG. 7 schematically illustrates a block diagram of an electronic device 700 suitable for implementing embodiments of the present disclosure.
  • the device 700 may be a device for implementing the methods 200 , 500 and 600 shown in FIGS. 2 , 5 and 6 .
  • The device 700 includes a central processing unit (CPU) 701 that can perform various appropriate actions and processes according to computer program instructions stored in read-only memory (ROM) 702 or computer program instructions loaded from a storage unit 708 into random access memory (RAM) 703.
  • The RAM 703 can also store various programs and data required for the operation of the device 700.
  • the CPU, ROM, and RAM are connected to each other through a bus 704 .
  • An input/output (I/O) interface 705 is also connected to bus 704 .
  • A number of components in the device 700 are connected to the I/O interface 705, including an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The central processing unit 701 performs the various methods and processes described above, such as methods 200, 500 and 600.
  • methods 200 , 500 and 600 may be implemented as a computer software program stored on a machine-readable medium, such as storage unit 708 .
  • part or all of the computer program may be loaded and/or installed on the device 700 via the ROM and/or the communication unit 709 .
  • the CPU may be configured to perform one or more actions of methods 200, 500, and 600 by any other suitable means (eg, by means of firmware).
  • the present disclosure may be a method, an apparatus, a system and/or a computer program product.
  • the computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for carrying out various aspects of the present disclosure.
  • a computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the above.
  • A computer-readable storage medium is not to be construed as a transient signal per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.
  • the computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer-readable program instructions from a network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
  • Computer program instructions for carrying out operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or instructions in one or more programming languages.
  • The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • The remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • Custom electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), can be personalized by utilizing state information of the computer readable program instructions, and these electronic circuits can execute the computer readable program instructions to implement various aspects of the present disclosure.
  • These computer readable program instructions may be provided to a processor in a voice interaction device, or to a processing unit of a general purpose computer, a special purpose computer, or other programmable data processing apparatus, to produce a machine such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, produce means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer readable program instructions can also be stored in a computer readable storage medium. The instructions cause a computer, programmable data processing apparatus, and/or other equipment to operate in a specific manner, so that the computer readable medium on which the instructions are stored constitutes an article of manufacture comprising instructions for implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other equipment to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other equipment to produce a computer-implemented process , thereby causing instructions executing on a computer, other programmable data processing apparatus, or other device to implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
  • Each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which contains one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented in dedicated hardware-based systems that perform the specified functions or actions, or can be implemented in a combination of dedicated hardware and computer instructions.

Abstract

The present disclosure relates to a method for predicting a drug sensitivity state, a computing device, and a storage medium. The method comprises: obtaining genetic variation information of a sample to be detected and drug information of a related drug; obtaining drug sensitivity state data determined by a cell activity test about cells and a corresponding drug; pre-processing the genetic variation information and the drug information so as to generate multiple genetic variation characterization data and multiple drug characterization data to be combined into multiple groups of input sample sets; generating genetic variation features on the basis of a first neural network model; generating drug features on the basis of a second neural network model; and on the basis of a third neural network model, extracting fused genetic variation features and drug features, so as to predict the drug sensitivity state of the sample to be detected for the corresponding drug. The present disclosure can accurately predict drug sensitivity and has better universality.

Description

Method, device and storage medium for predicting drug sensitivity state
Technical Field
The present disclosure relates generally to biological information processing and, in particular, to a method, a device, and a storage medium for predicting a drug sensitivity state.
背景技术Background technique
With the development of molecular biology and sequencing technology and in-depth research into the molecular mechanisms of tumorigenesis, precision treatment of tumors has shown broad application prospects. However, the heterogeneity of tumor cells often leads to unstable drug responses, which poses a major challenge for tumor drug development.
A conventional scheme for predicting a drug sensitivity state is, for example, to detect mutations in a combination of gene loci related to drug metabolism (DPYD*2A, DPYD*5A, DPYD*9A, MTHFR, TS, and GSTP1) and thereby provide drug sensitivity guidance for patients who are to be treated with fluorouracil drugs.
Such conventional schemes mostly focus on predicting IC50 values that reflect the resistance of a small number of genes to a single type of drug, so the models they use do not generalize well. Moreover, because they do not take into account how different representations of drugs and genes affect model accuracy, the accuracy of the predicted drug sensitivity is low.
In summary, conventional schemes for predicting a drug sensitivity state generalize poorly and predict drug sensitivity with low accuracy.
Summary
The present disclosure provides a method, a computing device, and a computer storage medium for predicting a drug sensitivity state, which can accurately predict drug sensitivity and generalize well.
According to a first aspect of the present disclosure, a method for predicting a drug sensitivity state is provided. The method includes: acquiring genetic variation information of a sample to be tested and drug information of a related drug, the drug information including at least a drug identifier and drug molecular formula structure information; acquiring drug sensitivity state data determined through a cell activity test on cells and a corresponding drug; preprocessing the genetic variation information and the drug information to generate multiple kinds of genetic variation characterization data and multiple kinds of drug characterization data, which are combined into multiple input sample sets; extracting, based on a first neural network model, features of the genetic variation characterization data in an input sample set to generate genetic variation features; extracting, based on a second neural network model, features of the drug characterization data in the input sample set to generate drug features; fusing the genetic variation features and the drug features; and extracting, based on a third neural network model, features of the fused genetic variation features and drug features to predict the drug sensitivity state of the sample to be tested with respect to the corresponding drug, wherein the first, second, and third neural network models are trained on multiple samples.
According to a second aspect of the present disclosure, a computing device is further provided. The device includes: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the computing device to perform the method of the first aspect of the present disclosure.
According to a third aspect of the present disclosure, a computer-readable storage medium is further provided. The computer-readable storage medium stores machine-executable instructions that, when executed, cause a machine to perform the method of the first aspect of the present disclosure.
In some embodiments, the sample to be tested is a cell line or primary cells, and the drug sensitivity state data is determined through a cell activity test on the cell line and the corresponding drug.
In some embodiments, generating the multiple kinds of genetic variation characterization data and the multiple kinds of drug characterization data to be combined into multiple input sample sets includes: generating, based on the preprocessed genetic variation information, a one-dimensional genetic variation characterization feature and a two-dimensional genetic variation characterization feature, the one-dimensional genetic variation characterization feature indicating cell line identification information, gene identification information, and variation impact type information, and the two-dimensional genetic variation characterization feature indicating cell line identification information and microsatellite instability state information of the cell line; and generating a third genetic variation characterization feature based on the two-dimensional genetic variation characterization feature and corresponding two-dimensional weight data.
In some embodiments, generating the multiple kinds of genetic variation characterization data and the multiple kinds of drug characterization data to be combined into multiple input sample sets further includes: generating, based on the preprocessed drug information, drug characterization data in Simplified Molecular Input Line Entry System (SMILES) format, drug characterization data in chemical fingerprint format, and drug characterization data in adjacency-matrix graph format; and combining one genetic variation characterization feature, chosen from the one-dimensional genetic variation characterization feature, the two-dimensional genetic variation characterization feature, and the third genetic variation characterization feature, with one kind of drug characterization data, chosen from the SMILES-format, chemical-fingerprint-format, and adjacency-matrix-graph-format drug characterization data, to generate multiple input sample sets, each input sample set including one genetic variation characterization feature and one kind of drug characterization data.
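The pairing of genetic variation representations with drug representations described above can be sketched as follows. This is a minimal illustration only: the label strings are hypothetical placeholders, and exhaustive pairing (three by three, giving nine input sample sets) is an assumption of the sketch rather than something the embodiment states.

```python
from itertools import product

# Illustrative labels only; the disclosure does not prescribe these names.
GENE_REPRESENTATIONS = ("one_dimensional", "two_dimensional", "third_weighted")
DRUG_REPRESENTATIONS = ("smiles", "chemical_fingerprint", "adjacency_graph")


def build_input_sample_sets(gene_representations, drug_representations):
    """Pair each gene representation with each drug representation.

    Every returned set holds exactly one gene representation and one drug
    representation. Exhaustive pairing is an assumption of this sketch.
    """
    return [
        {"gene": gene, "drug": drug}
        for gene, drug in product(gene_representations, drug_representations)
    ]


sample_sets = build_input_sample_sets(GENE_REPRESENTATIONS, DRUG_REPRESENTATIONS)
for sample_set in sample_sets:
    print(sample_set["gene"], "+", sample_set["drug"])
```

Each resulting set can then be used to train and evaluate its own model configuration, so that the effect of the chosen representations on accuracy can be compared.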
In some embodiments, preprocessing the genetic variation information and the drug information further includes: selecting, from the acquired genetic variation information of the cell line, genetic variation information associated with genes belonging to a predetermined set; annotating the selected genetic variation information to generate variation impact type information; and removing, based on the selected drug sensitivity state data determined through cell activity tests on cell lines and corresponding drugs, genetic variation information and drug information that meet at least one of the following: the acquired drug sensitivity state data for the cell line and the corresponding drug is unstable; and the corresponding drug lacks drug molecular formula structure information.
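The preprocessing steps above can be sketched as a simple filter. All field names, the `impact_lookup` table, and the sample identifiers below are hypothetical; in particular, how instability of a measurement is decided and how annotation is performed are not specified here, so both are represented by precomputed inputs.

```python
def preprocess(variants, responses, smiles_by_drug, gene_panel, impact_lookup):
    """Filter and annotate raw records (illustrative field names).

    variants:       list of dicts with "cell_line", "gene", "variant"
    responses:      list of dicts with "cell_line", "drug", "ic50", "stable"
    smiles_by_drug: dict mapping drug name -> structure string, or None
    gene_panel:     set of gene names forming the predetermined set
    impact_lookup:  dict mapping (gene, variant) -> variation impact type
    """
    # Step 1: keep only variants in genes that belong to the predetermined set.
    selected = [v for v in variants if v["gene"] in gene_panel]

    # Step 2: annotate each kept variant with a variation impact type.
    annotated = [
        {**v, "impact": impact_lookup.get((v["gene"], v["variant"]),
                                          "unknown_significance")}
        for v in selected
    ]

    # Step 3: drop pairs with unstable measurements or missing drug structure.
    kept_responses = [
        r for r in responses
        if r["stable"] and smiles_by_drug.get(r["drug"])
    ]

    # Interpretation used in this sketch: variants are kept only for cell
    # lines that still have at least one usable response.
    usable_cell_lines = {r["cell_line"] for r in kept_responses}
    annotated = [v for v in annotated if v["cell_line"] in usable_cell_lines]
    return annotated, kept_responses


# Placeholder records; none of these names or values come from a real data set.
variants = [
    {"cell_line": "CELL_1", "gene": "GENE_A", "variant": "var1"},
    {"cell_line": "CELL_1", "gene": "GENE_Z", "variant": "var2"},
    {"cell_line": "CELL_2", "gene": "GENE_A", "variant": "var3"},
]
responses = [
    {"cell_line": "CELL_1", "drug": "DRUG_X", "ic50": 0.8, "stable": True},
    {"cell_line": "CELL_2", "drug": "DRUG_X", "ic50": 2.5, "stable": False},
    {"cell_line": "CELL_1", "drug": "DRUG_Y", "ic50": 1.1, "stable": True},
]
smiles_by_drug = {"DRUG_X": "CCO", "DRUG_Y": None}  # "CCO" is a placeholder
impact_lookup = {("GENE_A", "var1"): "activation"}

kept_variants, kept_responses = preprocess(
    variants, responses, smiles_by_drug, {"GENE_A", "GENE_B"}, impact_lookup
)
print(kept_variants)
print(kept_responses)
```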
In some embodiments, the variation impact type information includes information on gene activation, gene inactivation, gene rearrangement, potential clinical significance, unknown clinical significance, and drug resistance; and the microsatellite instability state information includes information on microsatellite stability, low microsatellite instability, high microsatellite instability, and undetermined microsatellite stability.
In some embodiments, the number of feature values of the one-dimensional genetic variation characterization feature equals the number of genes multiplied by the number of gene mutation states, plus the number of microsatellite instability states of the cell line; the rows of the two-dimensional genetic variation characterization feature indicate the corresponding genes of the cell line, and its columns indicate the variation impact type information or the microsatellite instability state information.
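The length rule for the one-dimensional feature can be illustrated with a small encoder. This sketch assumes that the gene mutation states are the six variation impact types listed above and that the layout is gene-major with the microsatellite states appended at the end; the state labels and gene names are hypothetical.

```python
# Assumed state vocabularies, taken from the lists in the embodiments above.
IMPACT_TYPES = [
    "activation", "inactivation", "rearrangement",
    "potentially_significant", "unknown_significance", "resistance",
]
MSI_STATES = ["MSS", "MSI-L", "MSI-H", "uncertain"]


def encode_one_dimensional(genes, impacts_by_gene, msi_state):
    """Build a 0/1 vector of length len(genes) * len(IMPACT_TYPES) + len(MSI_STATES)."""
    n_impacts = len(IMPACT_TYPES)
    vector = [0] * (len(genes) * n_impacts + len(MSI_STATES))
    for gene_index, gene in enumerate(genes):
        for impact in impacts_by_gene.get(gene, ()):
            # Set the slot for this (gene, impact type) pair.
            vector[gene_index * n_impacts + IMPACT_TYPES.index(impact)] = 1
    # The trailing block one-hot encodes the cell line's microsatellite state.
    vector[len(genes) * n_impacts + MSI_STATES.index(msi_state)] = 1
    return vector


genes = ["GENE_A", "GENE_B", "GENE_C"]  # placeholder gene names
impacts_by_gene = {"GENE_A": ["activation"], "GENE_C": ["inactivation", "resistance"]}
vector = encode_one_dimensional(genes, impacts_by_gene, "MSI-H")
print(len(vector))  # 3 genes * 6 impact types + 4 MSI states = 22
print(vector)
```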
In some embodiments, the method for predicting a drug sensitivity state further includes: determining the first neural network model and the second neural network model such that the first neural network model matches the kind of genetic variation characterization data in the input sample set and the second neural network model matches the drug characterization data in the input sample set.
In some embodiments, the method for predicting a drug sensitivity state includes: dividing each input sample set into a training data set, a validation data set, and a test data set; and for each input sample set, determining, based on the root mean square error, how well the first, second, and third neural network models trained on the training data set fit the validation data set, in order to determine the first, second, and third neural network models to be applied to the test data set.
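The split and the validation-based selection can be sketched as below. The 80/10/10 split ratio, the fixed random seed, and the toy candidate predictors are assumptions made for illustration; the embodiment only specifies the three-way split and the use of root mean square error on the validation data set.

```python
import math
import random


def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle and split samples into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def rmse(predicted, observed):
    """Root mean square error between two equally long sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def select_best(candidates, val_inputs, val_targets):
    """Pick the candidate predictor with the lowest validation RMSE."""
    scores = {
        name: rmse([predict(x) for x in val_inputs], val_targets)
        for name, predict in candidates.items()
    }
    return min(scores, key=scores.get), scores


samples = list(range(10))
train_set, val_set, test_set = split_dataset(samples)

# Toy stand-ins for trained model configurations (hypothetical).
candidates = {
    "model_a": lambda x: 2.0 * x,
    "model_b": lambda x: 2.0 * x + 1.0,
}
best_name, scores = select_best(candidates, [1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(best_name, scores)
```

Only the configuration selected on the validation data set is then evaluated on the test data set, which keeps the test estimate independent of the selection step.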
This Summary is provided to introduce a selection of concepts in a simplified form that are further described in the Detailed Description below. It is not intended to identify key features or essential features of the present disclosure, nor is it intended to limit the scope of the present disclosure.
Brief Description of the Drawings
FIG. 1 shows a schematic diagram of a system for implementing a method for predicting a drug sensitivity state according to an embodiment of the present disclosure.
FIG. 2 shows a flowchart of a method for predicting a drug sensitivity state according to an embodiment of the present disclosure.
FIG. 3 shows a schematic diagram of a neural network structure for predicting a drug sensitivity state according to an embodiment of the present disclosure.
FIG. 4 schematically shows the 3D structure of a drug.
FIG. 5 shows a flowchart of a method for combining data into multiple input sample sets according to an embodiment of the present disclosure.
FIG. 6 shows a flowchart of a data preprocessing method for genetic variation information and drug information according to an embodiment of the present disclosure.
FIG. 7 schematically shows a block diagram of an electronic device suitable for implementing embodiments of the present disclosure.
In the drawings, the same or corresponding reference numerals denote the same or corresponding parts.
Detailed Description
Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
As used herein, the term "include" and its variants denote open-ended inclusion, that is, "including but not limited to." Unless otherwise stated, the term "or" means "and/or." The term "based on" means "based at least in part on." The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment." The term "another embodiment" means "at least one further embodiment." The terms "first," "second," and the like may refer to different or identical objects.
As mentioned above, conventional schemes for predicting a drug sensitivity state mostly focus on predicting IC50 values that reflect the resistance of a small number of genes to a single type of drug, so the models they use do not generalize well. Moreover, because they do not take into account how different representations of drugs and genes affect model accuracy, the accuracy of the predicted drug sensitivity is low.
To at least partially address one or more of the above problems and other potential problems, example embodiments of the present disclosure propose a scheme for predicting a drug sensitivity state. In this scheme, genetic variation information of a sample to be tested and drug information of a related drug are acquired, and the acquired genetic variation information and drug information are preprocessed to generate multiple kinds of genetic variation characterization data and multiple kinds of drug characterization data, which are combined into multiple input sample sets. The input sample sets therefore carry several different feature representations of drugs and genes, which allows the present disclosure to take into account how those representations affect the accuracy of the prediction model. In addition, the present disclosure uses a first neural network and a second neural network to extract features of the genetic variation characterization data and of the drug characterization data, respectively, and fuses the extracted genetic variation features and drug features; it then uses a trained third neural network model to extract features of the fused genetic variation features and drug features in order to determine a prediction of the drug sensitivity state of the sample to be tested with respect to the corresponding drug. The present disclosure can combine the multiple feature representations of genes and drugs with the feature extraction approaches of different neural networks as factors in tuning and selecting the neural network model used to predict drug sensitivity, so that the drug sensitivity prediction scheme covers a wider range of genes and drugs, generalizes better, and yields more accurate predictions. The present disclosure can therefore accurately predict drug sensitivity and generalize well.
FIG. 1 shows a schematic diagram of a system 100 for implementing a method for predicting a drug sensitivity state according to an embodiment of the present disclosure. As shown in FIG. 1, the system 100 includes, for example, a computing device 110, a bioinformatics server 150, and a network 170. The computing device 110 may exchange data with the bioinformatics server 150 over the network 170 in a wired or wireless manner.
The computing device 110 is used to predict a drug sensitivity state. Specifically, the computing device 110 is configured to acquire genetic variation information of a sample to be tested and drug information of a drug related to the sample to be tested, and to acquire drug sensitivity state data determined through a cell activity test on cells and a corresponding drug. The computing device 110 is further configured to preprocess the genetic variation information and the drug information to generate multiple kinds of genetic variation characterization data and multiple kinds of drug characterization data, which are combined into multiple input sample sets. The computing device 110 is further configured to extract features of the genetic variation characterization data in an input sample set to generate genetic variation features, to extract features of the drug characterization data in the input sample set to generate drug features, and to extract, based on a third neural network model, features of the fused genetic variation features and drug features in order to predict the drug sensitivity state of the sample to be tested with respect to the corresponding drug. The sample to be tested is, for example and without limitation, primary cells, a cell strain, or a cell line. In some embodiments, the computing device 110 may have one or more processing units, including special-purpose processing units such as GPUs, FPGAs, and ASICs, and general-purpose processing units such as CPUs. In addition, one or more virtual machines may run on each computing device. In some embodiments, the computing device 110 is, for example, a GPU-equipped server compatible with PyTorch and TensorFlow. The server is, for example and without limitation, further configured with CUDA (8.0 or 9.0) and a graphics card driver, and with Anaconda or Miniconda. In some embodiments, the server is, for example and without limitation, further configured with several of the following software packages: Python, torch, numpy, xlrd, Pillow, and rdkit.
The computing device 110 includes, for example, a genetic variation information and drug information acquisition unit 112, a drug sensitivity state data acquisition unit 114, a preprocessing unit 116, a genetic variation feature generation unit 118, a drug feature generation unit 120, a fusion unit 122, and a drug sensitivity state prediction unit 124. These units may be configured on one or more computing devices 110.
The genetic variation information and drug information acquisition unit 112 is configured to acquire genetic variation information of a sample to be tested and drug information of a related drug, the drug information including at least a drug identifier and drug molecular formula structure information.
The drug sensitivity state data acquisition unit 114 is configured to acquire drug sensitivity state data determined through a cell activity test on cells and a corresponding drug.
The preprocessing unit 116 is configured to preprocess the genetic variation information and the drug information to generate multiple kinds of genetic variation characterization data and multiple kinds of drug characterization data, which are combined into multiple input sample sets.
The genetic variation feature generation unit 118 is configured to extract, based on a first neural network model, features of the genetic variation characterization data in an input sample set to generate genetic variation features.
The drug feature generation unit 120 is configured to extract, based on a second neural network model, features of the drug characterization data in the input sample set to generate drug features.
The fusion unit 122 is configured to fuse the genetic variation features and the drug features.
The drug sensitivity state prediction unit 124 is configured to extract, based on a third neural network model, features of the fused genetic variation features and drug features in order to predict the drug sensitivity state of the sample to be tested with respect to the corresponding drug, the first, second, and third neural network models being trained on multiple samples.
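The data flow through the feature generation, fusion, and prediction units can be sketched with a deliberately tiny, untrained model written in plain Python. Everything here is an assumption made for illustration: the class and function names are hypothetical, each "network" is a single fully connected layer with random weights, and fusion is done by concatenation, which is only one possible way to fuse the two feature vectors.

```python
import random


def dense(weights, biases, inputs):
    """One fully connected layer followed by ReLU."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class SensitivityModel:
    """Hypothetical stand-in for the first, second, and third networks."""

    def __init__(self, gene_dim, drug_dim, hidden=4, seed=0):
        rng = random.Random(seed)
        self.gene_net = init_layer(gene_dim, hidden, rng)       # first network
        self.drug_net = init_layer(drug_dim, hidden, rng)       # second network
        self.head_hidden = init_layer(2 * hidden, hidden, rng)  # third network
        self.head_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]

    def predict(self, gene_inputs, drug_inputs):
        gene_features = dense(*self.gene_net, gene_inputs)
        drug_features = dense(*self.drug_net, drug_inputs)
        fused = gene_features + drug_features  # fusion by concatenation (assumed)
        hidden = dense(*self.head_hidden, fused)
        # Linear output: a single regression value standing in for the
        # predicted drug sensitivity state.
        return sum(w * h for w, h in zip(self.head_out, hidden))


gene_inputs = [1, 0, 1, 0, 0]    # toy genetic variation representation
drug_inputs = [0.2, 0.1, 0.7]    # toy drug representation
model = SensitivityModel(gene_dim=5, drug_dim=3)
prediction = model.predict(gene_inputs, drug_inputs)
print(prediction)
```

In practice the three networks would be trained jointly on the input sample sets and would be chosen to suit each representation; the sketch only shows how the two branches feed a shared prediction head.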
A method for predicting a drug sensitivity state according to an embodiment of the present disclosure is described below with reference to FIG. 2 and FIG. 3. FIG. 2 shows a flowchart of a method 200 for predicting a drug sensitivity state according to an embodiment of the present disclosure. FIG. 3 shows a schematic diagram of a neural network structure for predicting a drug sensitivity state according to an embodiment of the present disclosure. It should be understood that the method 200 may be performed, for example, at the electronic device 700 described in FIG. 7, or at the computing device 110 described in FIG. 1. It should also be understood that the method 200 may include additional actions not shown and/or may omit actions shown; the scope of the present disclosure is not limited in this respect.
At step 202, the computing device 110 acquires genetic variation information of a sample to be tested and drug information of a drug related to the sample to be tested, the drug information including at least a drug identifier and drug molecular formula structure information.
The sample to be tested is, for example and without limitation, primary cells, a cell strain, or a cell line. Primary cells are cells obtained as single cells from tissue by protease treatment or other methods and cultured in vitro under conditions that mimic the body. A cell line is the cell population propagated after a primary cell culture has been successfully passaged for the first time; the term also refers to cultured cells that can be passaged continuously over a long period. The cell line is, for example and without limitation, tumor cells, which may involve many kinds of variation. Each cell line has a definite cell line identifier (for example, a cell line name). The scheme of the present disclosure is illustrated below with embodiments in which the sample to be tested is a cell line; it should be understood that the sample to be tested is not limited to a cell line and may instead be, for example, primary cells. The genetic variation information of the sample to be tested (for example, a cell line) includes, for example, the cell line identifier and the genetic variation information corresponding to that identifier, such as single nucleotide variation (SNV), copy number variation (CNV), structural variation (SV), and microsatellite instability (MSI). The drug identifier is, for example, a drug name, such as Camptothecin and Vinblastine shown in Table 1 below. The drug molecular formula structure information is, for example, a SMILES string.
In some embodiments, the computing device 110 may acquire the genetic variation information of cell lines and the cellular response activity data of the drugs corresponding to the cell lines from public databases of cell line genomic information and drug response activity, such as NCI-60, the Genomics of Drug Sensitivity in Cancer (GDSC) database, and the Cancer Cell Line Encyclopedia (CCLE). In some embodiments, the computing device 110 may acquire from GDSC the whole-exome sequencing (WES) genetic variation information of cell lines for which drug IC50 data is available.
It should be understood that the range of drugs covered by the term "drug related to the sample to be tested (for example, a cell line)" may be wider than the range of drugs covered by the term "drug corresponding to the cell line" as used in "cellular response activity data of the drug corresponding to the cell line." A related drug is, for example and without limitation, a targeted drug directed at tumor cells. A targeted drug is, for example, a drug specifically designed to recognize genetic variations unique to tumor cells and to act on oncogenic sites that have already been identified. Some related drugs are prone to drug resistance. Reasons for drug resistance include, for example, the following: the target itself is altered by mutation, so that the targeted drug has no marked effect on a particular cell line; or the cell line (for example, tumor cells) finds a new pathway to tumor growth that does not depend on the target.
The drug molecular formula structure information (for example, a SMILES string) may be obtained, for example, as follows. First, the name of the drug used (such as Trametinib) or a drug identifier (such as the drug CID number 11707110) is determined. Then, the SMILES string of the corresponding drug is obtained through the PaDel software, or by looking up the drug name at https://pubchem.ncbi.nlm.nih.gov/.
At step 204, the computing device 110 acquires drug sensitivity state data determined through a cell activity test on cells and a corresponding drug. The drug sensitivity state data is, for example, the half-maximal inhibitory concentration (IC50) of the drug for the corresponding cell line, obtained by performing a cell activity test on the cell line and the corresponding drug. The drug sensitivity state data may also be the IC50 of the drug for the corresponding cells, obtained by performing a cell activity test on primary cells and the corresponding drug.
The half-maximal inhibitory concentration (IC50) is the concentration of the measured antagonist at which inhibition reaches half of its maximum. It indicates the amount of a drug or substance (inhibitor) needed to inhibit a biological process, or a component of that process such as an enzyme, a cell receptor, or a microorganism, by half. The IC50 value can be used to measure the ability of the corresponding drug to induce apoptosis in a sample to be tested (for example, primary cells or a cell line): the stronger the inducing ability, the lower the value. Conversely, the value indicates how resistant the cells are to the corresponding drug.
For example, GDSC stores fluorescence data representing cell viability when cells are treated with different drug concentrations in the laboratory, as well as the logarithmic half-maximal inhibitory concentration values (LN_IC50), obtained by fitting IC50 curves, of more than 500 drugs against more than 1,000 human tumor cell lines. The computing device 110 obtains these LN_IC50 values of more than 500 drugs against more than 1,000 human tumor cell lines from the public GDSC database.
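Because the stored values are logarithms, they can be negative even though a concentration cannot. The following sketch converts between the two scales; it assumes that LN_IC50 denotes the natural logarithm of the IC50 and that the IC50 is expressed in micromolar units, and the numeric values are illustrative only.

```python
import math


def ln_ic50_to_ic50(ln_ic50):
    """Convert a natural-log IC50 back to a concentration (assumed micromolar)."""
    return math.exp(ln_ic50)


def ic50_to_ln_ic50(ic50):
    """Convert a positive concentration to its natural logarithm."""
    if ic50 <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return math.log(ic50)


# A negative log value corresponds to an IC50 below 1 (in the assumed unit).
print(round(ln_ic50_to_ic50(-0.25), 4))  # 0.7788
# A lower value means a lower concentration is needed for half inhibition,
# that is, the cells are more sensitive to the drug.
print(ln_ic50_to_ic50(-4.0) < ln_ic50_to_ic50(-0.25))  # True
```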
Table 1 below illustrates drug sensitivity state data determined through cell activity tests on cell lines and corresponding drugs. As shown in Table 1, the drug sensitivity state data (IC50) of the cell line named HCC1954 with respect to the corresponding drug named Camptothecin is -0.251083, and the drug sensitivity state data of the cell line named VA-ES-BJ with respect to the corresponding drug named Vinblastine is -4.0475.
Table 1
Figure PCTCN2022085628-appb-000001
Figure PCTCN2022085628-appb-000002
在步骤206处,计算设备110针对基因变异信息、药物信息进行预处理,以便生成多种基因变异表征数据和多种药物表征数据,以用于组合成多组输入样本集。At step 206, the computing device 110 preprocesses the genetic variation information and the drug information to generate multiple genetic variation characterization data and multiple drug characterization data for combining into sets of input samples.
The multiple kinds of gene variation characterization data include, for example, three different types: a one-dimensional gene variation characterization feature (abbreviated "Multi-vec"), a two-dimensional gene variation characterization feature (abbreviated "Multi-mat"), and a third gene variation characterization feature (abbreviated "Multi-mat embed").
The manner of generating the multiple kinds of gene variation characterization data includes, for example: the computing device 110 generates, based on the preprocessed gene variation information, a one-dimensional gene variation characterization feature and a two-dimensional gene variation characterization feature, respectively, where the one-dimensional gene variation characterization feature indicates cell line identification information, gene identification information, and variation effect type information, and the two-dimensional gene variation characterization feature indicates cell line identification information and microsatellite instability state information of the cell line; and generates a third gene variation characterization feature based on the two-dimensional gene variation characterization feature and corresponding two-dimensional weight data. The variation effect type information includes, for example, gene activation, gene inactivation, gene rearrangement, potential clinical significance, uncertain clinical significance, and drug resistance.
Regarding the one-dimensional gene variation characterization feature: it is, for example, the one-dimensional vector represented by expression (1) below (for example, a one-dimensional array composed of multiple feature values that are each 0 or 1). The one-dimensional gene variation characterization feature may be abbreviated "Multi-vec". Its number of feature values equals the number of genes multiplied by the number of gene mutation states, plus the number of microsatellite instability states of the cell line. A feature value of 1 indicates that the corresponding gene has the corresponding type of mutation state, or that the cell line has the corresponding microsatellite instability state.
$$[1,0,0,1,\ldots,0,1,1]_{1\times(N\times M+K)} \tag{1}$$
In expression (1) above, 1*(N*M+K) represents the number of feature values of the one-dimensional gene variation characterization feature. N represents the number of genes. M represents the number of gene mutation states. K represents the number of microsatellite instability states of the cell line. Table 2 below schematically shows the feature values in the one-dimensional gene variation characterization feature Multi-vec of the cell line named Ca9-22. For example, the feature value "1" corresponding to ABCB1_del indicates that the corresponding gene ABCB1 has a gene deletion variation (Del). The feature value "1" corresponding to ABCB1_VUS indicates that the corresponding gene ABCB1 has a variant of uncertain significance (VUS). The feature value "0" corresponding to A1CF_VUS indicates that the corresponding gene A1CF has no variant of uncertain significance (VUS). The feature value "1" corresponding to MSI-S indicates that the "microsatellite stable" MSI state is present. The feature value "0" corresponding to MSI-H indicates that the "microsatellite instability-high" MSI state is absent.
Table 2
Figure PCTCN2022085628-appb-000003
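The construction of the Multi-vec feature described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `build_multi_vec`, the two-gene toy panel, and the reduced lists of mutation states and MSI states are assumptions chosen so that the result matches the example values quoted for Ca9-22 (ABCB1_del = 1, ABCB1_VUS = 1, A1CF_VUS = 0, MSI-S = 1, MSI-H = 0).

```python
from typing import List, Sequence, Set, Tuple


def build_multi_vec(
    genes: Sequence[str],
    states: Sequence[str],
    msi_states: Sequence[str],
    variants: Set[Tuple[str, str]],
    msi: str,
) -> Tuple[List[str], List[int]]:
    """Return feature names and 0/1 values of length N*M+K."""
    # One slot per (gene, mutation state) pair, followed by one slot per MSI state.
    names = [f"{g}_{s}" for g in genes for s in states] + list(msi_states)
    values = [1 if (g, s) in variants else 0 for g in genes for s in states]
    values += [1 if m == msi else 0 for m in msi_states]
    return names, values


# Toy panel (hypothetical): N = 2 genes, M = 2 mutation states, K = 2 MSI states.
genes = ["A1CF", "ABCB1"]
states = ["del", "VUS"]
msi_states = ["MSI-S", "MSI-H"]
ca9_22_variants = {("ABCB1", "del"), ("ABCB1", "VUS")}

names, multi_vec = build_multi_vec(genes, states, msi_states, ca9_22_variants, "MSI-S")
feature = dict(zip(names, multi_vec))
```

The vector length here is 2*2+2 = 6, in line with the N*M+K count given for expression (1).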
Regarding the two-dimensional gene variation characterization feature: it is, for example, the two-dimensional matrix represented by expression (2) below (for example, a two-dimensional matrix composed of multiple feature values that are each 0 or 1). The two-dimensional gene variation characterization feature may be abbreviated "Multi-mat". Its rows indicate the corresponding genes of the cell line, and its columns indicate the variation effect type information or the microsatellite instability state information.
Figure PCTCN2022085628-appb-000004
In expression (2) above, N*M+K represents the dimension of the two-dimensional gene variation characterization feature. N represents the number of genes. M represents the number of gene mutation states. K represents the number of microsatellite instability states of the cell line. Table 3 below schematically shows the feature values in the two-dimensional gene variation characterization feature Multi-mat of the cell line named Ca9-22. For example, A1CF and ABCB1 denote the corresponding genes. The feature value "0" in the A1CF row and VUS column indicates that the corresponding gene A1CF has no variant of uncertain significance (VUS). The feature value "1" in the ABCB1 row and MSI-S column indicates that the "microsatellite stable" MSI state is present for the corresponding gene ABCB1.
Table 3
Figure PCTCN2022085628-appb-000005
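A possible reading of the Multi-mat layout is sketched below: one row per gene, one column per mutation state, followed by one column per MSI state. Because the text gives a "1" in the ABCB1 row under the MSI-S column, this sketch assumes that the cell-line-level MSI state is repeated on every gene row; that choice, the function name `build_multi_mat`, and the toy panel are all assumptions rather than details confirmed by the source.

```python
from typing import List, Sequence, Set, Tuple


def build_multi_mat(
    genes: Sequence[str],
    states: Sequence[str],
    msi_states: Sequence[str],
    variants: Set[Tuple[str, str]],
    msi: str,
) -> Tuple[List[str], List[List[int]]]:
    """Return column labels and a 0/1 matrix with one row per gene."""
    columns = list(states) + list(msi_states)
    matrix = []
    for gene in genes:
        # Per-gene mutation state flags.
        row = [1 if (gene, s) in variants else 0 for s in states]
        # Cell-line-level MSI flags, repeated on each row (assumption).
        row += [1 if m == msi else 0 for m in msi_states]
        matrix.append(row)
    return columns, matrix


# Toy panel (hypothetical), mirroring the Ca9-22 values quoted in the text.
genes = ["A1CF", "ABCB1"]
columns, multi_mat = build_multi_mat(
    genes,
    states=["del", "VUS"],
    msi_states=["MSI-S", "MSI-H"],
    variants={("ABCB1", "del"), ("ABCB1", "VUS")},
    msi="MSI-S",
)
a1cf_row = multi_mat[genes.index("A1CF")]
abcb1_row = multi_mat[genes.index("ABCB1")]
```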
Regarding the third gene variation characterization feature: it is generated, for example, based on the two-dimensional gene variation characterization feature and corresponding two-dimensional weight data. The third gene variation characterization feature may be abbreviated "Multi-mat embed"; it is generated, for example, by multiplying the two-dimensional Multi-mat matrix by a corresponding two-dimensional weight matrix. The two-dimensional weight matrix is adjusted iteratively during training of the first neural network model. In other words, Multi-mat embed is Multi-mat multiplied by a neural network embedding layer (for example, by the weight values of the network's embedding layer).
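The multiplication that turns Multi-mat into Multi-mat embed can be sketched in plain Python as below. The embedding width (3), the random initialisation, and the helper names `matmul` and `init_weights` are assumptions for illustration; in the scheme described above the weight matrix would be learned during training rather than left at its initial values.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x c) matrix by a (c x d) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_weights(n_in: int, n_out: int, seed: int = 0) -> Matrix:
    """Small random weights standing in for a trainable embedding layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]


# Toy Multi-mat: 2 genes x 4 columns (2 mutation states + 2 MSI states).
multi_mat = [[0, 0, 1, 0], [1, 1, 1, 0]]
weights = init_weights(n_in=4, n_out=3)

# Multi-mat embed: each gene row becomes a dense 3-dimensional vector.
multi_mat_embed = matmul(multi_mat, weights)
```

Because each Multi-mat row is a 0/1 indicator, every embedded row is simply the sum of the weight rows selected by its non-zero columns.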
The multiple kinds of drug characterization data include, for example, three different types: drug characterization data in simplified molecular-input line-entry system format (abbreviated "Smiles-mat"), drug characterization data in chemical fingerprint format (abbreviated "Fingerprint"), and drug characterization data in adjacency matrix structure graph format (abbreviated "graph").
Regarding the drug characterization data in simplified molecular-input line-entry system format: it is, for example, drug characterization data in SMILES (simplified molecular-input line-entry system) format obtained with the PaDEL software. Such data may be abbreviated "Smiles-mat". For example, the SMILES feature of a drug is as shown in expression (3) below.
C1=CC2=C(C3=CC=N3)C=C2)N=C1       (3)
To convert the SMILES feature shown in expression (3) into Smiles-mat, the distinct elements of the SMILES feature are first counted and split out; for example, expression (3) above is split into 8 distinct elements: C, 1, =, 2, (, 3, N, ). Then a two-dimensional matrix is generated in which each row is labeled with one distinct element (for example, 8 rows for the 8 distinct elements above) and each column marks whether that element appears at a given position of the SMILES feature (the SMILES molecular formula). An example is the drug characterization data 310 shown in FIG. 3, which is drug characterization data in Smiles-mat format.
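The conversion described above can be sketched as a character-level one-hot encoding. This is a minimal illustration: the function name `smiles_to_mat` is hypothetical, and splitting per character follows the 8-element example in the text (a production tokenizer would presumably need to treat multi-character atoms such as Cl or Br as single elements, which this sketch does not do).

```python
from typing import List, Tuple


def smiles_to_mat(smiles: str) -> Tuple[List[str], List[List[int]]]:
    """One-hot encode a SMILES string: rows are distinct characters, columns are positions."""
    # Distinct characters in order of first appearance.
    tokens = list(dict.fromkeys(smiles))
    # matrix[r][c] is 1 when position c of the string holds token r.
    matrix = [[1 if ch == tok else 0 for ch in smiles] for tok in tokens]
    return tokens, matrix


# The string of expression (3), copied as given in the text.
SMILES = "C1=CC2=C(C3=CC=N3)C=C2)N=C1"
tokens, smiles_mat = smiles_to_mat(SMILES)
```

For this string the distinct elements come out in the same order as listed in the text, and the resulting matrix has 8 rows and one column per character of the string.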
Regarding the drug characterization data in chemical fingerprint format: it is generated, for example, by a chemical fingerprinting method for drugs, which converts a drawn molecule into a stream of 0 and 1 bits.
Regarding the drug characterization data in adjacency matrix structure graph format: it is, for example, a two-dimensional adjacency matrix of the molecule abstracted from the structural information of the SMILES molecular formula.
The drug characterization data in chemical fingerprint format and in adjacency matrix structure graph format are described in detail below with reference to FIG. 5 and are not elaborated here.
The method of combining into multiple input sample sets includes, for example: combining one kind of gene variation characterization data selected from the multiple kinds of gene variation characterization data with one kind of drug characterization data selected from the multiple kinds of drug characterization data, so as to generate multiple input sample sets. Each input sample set includes one kind of gene variation characterization data and one kind of drug characterization data. The method of combining into multiple input sample sets is described in detail below with reference to FIG. 5 and is not elaborated here.
At step 208, the computing device 110 extracts, based on a first neural network model, features of the gene variation characterization data in the input sample set, so as to generate gene variation features.
Regarding the first neural network model: it is constructed, for example, based on a convolutional neural network (CNN) model, and includes, for example, convolution layers and pooling/activation layers. For example, one first neural network model is constructed based on a graph convolutional network (GCN), and another first neural network model is constructed based on a convolutional neural network (CNN). It should be understood that a GCN is well suited to extracting features from topological graphs in the abstract sense (for example, graphs are irregular: each graph has a variable number of unordered nodes, and each node has a different number of neighboring nodes, so it is difficult to perform convolution with a kernel of a single fixed size). A CNN is well suited to extracting spatial features efficiently, especially from regularly arranged pixel matrices in image data, but its conventional discrete convolution is difficult to apply to irregular data of this kind. The first neural network models constructed on different base models therefore extract features from the gene variation characterization data in different ways, which makes it possible to determine, for each kind of gene variation characterization data, a first neural network model with a matching network structure. In some embodiments, multiple kinds (for example, 3) of gene variation characterization data and multiple kinds (for example, 3) of drug characterization data are combined to generate different input sample sets, and a first neural network or second neural network corresponding to the features of each sample set is constructed. Each first neural network and second neural network is trained on the training set under two different feature fusion strategies, namely third neural networks constructed based on CNN and on MLP, to generate a model trained for each feature combination. Then, with MSE (root mean square error) as the evaluation criterion, the fit of each first, second, and third neural network on the validation set is compared, and the best-performing model structure is applied to the test set as the final model. In this way, a first neural network model with a matching network structure can be determined for each kind of gene variation characterization data.
As shown in FIG. 3, the gene variation characterization data 312 in the input sample set is, for example, a third gene variation characterization feature (for example, represented in Multi-vec format). The gene variation characterization data 312 is input into the first neural network model (not shown; this first neural network model is constructed, for example, based on a CNN model), and features are extracted by the convolution layers and pooling/activation layers of the first neural network model to generate gene variation features (for example, the gene feature map 322 shown in FIG. 3).
At step 210, the computing device 110 extracts, based on a second neural network model, features of the drug characterization data in the input sample set, so as to generate drug features.
Regarding the second neural network model: it is constructed, for example, based on a CNN model, and includes, for example, convolution layers and pooling/activation layers. In some embodiments, the second neural network model may include multiple second neural network models constructed on different base models. For example, one second neural network model is constructed based on long short-term memory (LSTM), and another is constructed based on a convolutional neural network (CNN). The second neural network models constructed on different base models extract features from the drug characterization data in different ways, so that a second neural network model with a matching network structure can be determined for each kind of drug characterization data. As shown in FIG. 3, the drug characterization data 310 in the input sample set (for example, drug characterization data in Smiles-mat format) is input into the second neural network model (not shown; this second neural network model is constructed, for example, based on a CNN model), and features are extracted by the convolution layers and pooling/activation layers of the second neural network model to generate drug features (for example, the drug feature map 320 shown in FIG. 3).
At step 212, the computing device 110 fuses the gene variation features and the drug features. As shown in FIG. 3, the computing device 110 fuses 324 (for example, concatenates) the gene feature map 322 generated by the first neural network and the drug feature map 320 generated by the second neural network, for input into the third neural network model 330.
At step 214, the computing device 110 extracts, based on a third neural network model, features of the fused gene variation features and drug features, for use in predicting the drug sensitivity state of the sample to be tested (for example, but not limited to, a cell line) for the corresponding drug. The first neural network model, the second neural network model, and the third neural network model are trained on multiple samples.
Regarding the third neural network model: it is, for example, a regression model, constructed, for example, based on a multilayer perceptron (MLP). The third neural network model is, for example, a neural network composed of fully connected layers with at least one hidden layer; for example, it includes two fully connected layers. In some embodiments, the third neural network model may include multiple third neural network models constructed on different base models. For example, one third neural network model is constructed based on MLP, and another is constructed based on a convolutional neural network (CNN). The computation of the third neural network model constructed from an MLP is described below with reference to expressions (4) and (5).
$$H = \phi(X W_h + b_h) \tag{4}$$
$$O = H W_o + b_o \tag{5}$$
In expressions (4) and (5) above, φ represents the activation function; in some embodiments, the activation function is a ReLU, sigmoid, or tanh function. H represents the hidden layer. O represents the output layer. X represents the input. b_h represents the coefficient (bias term) of the hidden layer. W_h represents the weights of the hidden layer. b_o represents the coefficient (bias term) of the output layer. W_o represents the weights of the output layer. As shown in FIG. 3, the gene feature map 322 generated by the first neural network and the drug feature map 320 generated by the second neural network are fused 324 (for example, concatenated) and input into the third neural network model 330 to predict the drug sensitivity state 332 of the cells (for example, but not limited to, a cell line) for the corresponding drug.
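Expressions (4) and (5), together with the concatenation-based fusion, can be traced numerically with a small pure-Python sketch. The feature values, layer sizes, and weights below are hand-picked toy numbers (assumptions, not trained parameters), ReLU is chosen as φ from the options listed above, and the helper names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU activation, one of the options for phi."""
    return [max(0.0, x) for x in v]


def affine(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Compute x W + b for a single sample; w has one row per input feature."""
    return [sum(xi * wi for xi, wi in zip(x, col)) + bj for col, bj in zip(zip(*w), b)]


def mlp_forward(x: Vector, w_h: Matrix, b_h: Vector, w_o: Matrix, b_o: Vector) -> Vector:
    h = relu(affine(x, w_h, b_h))   # expression (4): H = phi(X W_h + b_h)
    return affine(h, w_o, b_o)      # expression (5): O = H W_o + b_o


# Toy feature vectors standing in for the flattened feature maps 322 and 320.
gene_features = [0.2, -0.1, 0.4]
drug_features = [0.5, 0.3]
fused = gene_features + drug_features  # fusion by concatenation

# Hand-picked toy parameters: 5 inputs -> 2 hidden units -> 1 output.
W_h = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
b_h = [0.0, 0.1]
W_o = [[2.0], [-1.0]]
b_o = [0.5]

prediction = mlp_forward(fused, W_h, b_h, W_o, b_o)
```

With these numbers the hidden layer evaluates to approximately [0.9, 0.2] and the single regression output to approximately 2.1, which would play the role of a predicted LN_IC50-style value.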
The computation of the loss function of the third neural network model is described below with reference to expression (6).
$$L(y, f(x)) = (y - f(x))^2 \tag{6}$$
In expression (6) above, L(y, f(x)) represents the loss function. y represents the predicted drug sensitivity state of the cell line for the corresponding drug. f(x) represents the true value of the drug sensitivity state of the cell line for the corresponding drug. The true value is determined, for example, from the drug sensitivity state data determined by cell viability assays of the cell line with the corresponding drug.
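The squared-error loss of expression (6) can be written directly, as in the sketch below. Averaging the per-sample loss over a batch is an assumption added here for illustration (the text only gives the per-sample form), and the function names are hypothetical.

```python
from typing import Sequence


def squared_error(y: float, fx: float) -> float:
    """Per-sample loss of expression (6): (y - f(x))^2."""
    return (y - fx) ** 2


def mean_squared_error(ys: Sequence[float], fxs: Sequence[float]) -> float:
    """Average of the per-sample loss over a batch (assumed aggregation)."""
    if len(ys) != len(fxs) or not ys:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(squared_error(y, fx) for y, fx in zip(ys, fxs)) / len(ys)


# Toy values: one predicted and one reference drug sensitivity value.
single_loss = squared_error(-0.25, -0.75)
batch_loss = mean_squared_error([1.0, 2.0], [3.0, 2.0])
```

Because the loss is symmetric in its two arguments, it does not matter numerically which of y and f(x) is taken as the prediction and which as the reference value.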
In the above solution, the gene variation information of the sample to be tested and the drug information of the corresponding drug are acquired and preprocessed to generate multiple kinds of gene variation characterization data and multiple kinds of drug characterization data, which are combined into multiple input sample sets. The input sample sets of the present disclosure therefore carry multiple different feature representations of drugs and genes, so that the present disclosure can take into account the effect of these different representations on the accuracy of the prediction model. In addition, the present disclosure uses the first neural network and the second neural network to extract features of the gene variation characterization data and of the drug characterization data, respectively, fuses the extracted gene variation features and drug features, and uses the trained third neural network model to extract features of the fused gene variation features and drug features in order to determine the predicted drug sensitivity state of the sample to be tested for the corresponding drug. The present disclosure can thus treat the combination of multiple feature representations of genes and drugs with the feature extraction methods of different neural networks as a factor in parameter tuning and model selection for the drug sensitivity prediction network, so that the drug sensitivity prediction scheme of the present disclosure covers a wider range of genes and drugs, generalizes better, and yields more accurate predicted values. The present disclosure can therefore predict drug sensitivity accurately and with good general applicability.
In some embodiments, the method 200 further includes: the computing device 110 divides each input sample set into a training data set, a validation data set, and a test data set; and, for each input sample set, determines, based on the root mean square error, how well the first neural network model, the second neural network model, and the third neural network trained on the training data set fit the validation data set, in order to determine the first neural network model, the second neural network model, and the third neural network to be applied to the test data set.
For example, the computing device 110 splits each of the multiple input sample sets processed in step 206 into a training set, a validation set, and a test set at a certain ratio, following the principle of identically distributed random sampling. The 3 kinds of gene variation characterization data and 3 kinds of drug characterization data are combined to generate different input sample sets, and a first neural network or second neural network corresponding to the features of each sample set is constructed. Each first neural network or second neural network is trained on the training set under two different feature fusion strategies, namely third neural networks constructed based on CNN and on MLP, to generate a model trained for each feature combination. Then, with MSE (root mean square error) as the evaluation criterion, the fit of each first, second, and third neural network on the validation set is compared, and the best-performing model structure is applied to the test set as the final model. Thus, by using the combination of multiple feature representations of genes and drugs as a means of parameter tuning and model selection for the drug resistance prediction model, the drug resistance prediction model of the present disclosure not only covers a wider range of genes and drugs but also fits better and yields more accurate predicted values.
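The split-and-select procedure described above can be sketched as follows. The 80/10/10 ratio, the fixed seed, the function names, and the validation scores in the usage lines are assumptions for illustration only; the text specifies just "a certain ratio" and does not report scores. The criterion is computed here as a root mean square error; ranking candidates by mean squared error would select the same model, since the square root is monotonic.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    samples: Sequence[T],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Randomly split samples into training, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between reference and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def select_best(validation_scores: Dict[str, float]) -> str:
    """Pick the feature/model combination with the lowest validation error."""
    return min(validation_scores, key=validation_scores.get)


# Usage with placeholder data: 100 sample indices and made-up validation scores.
train, val, test = split_dataset(range(100))
best = select_best({"Multi-vec + Fingerprint": 0.9, "Multi-mat + graph": 0.7})
```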
FIG. 5 shows a flowchart of a method 500 for combining into multiple input sample sets according to an embodiment of the present disclosure. It should be understood that the method 500 may be performed, for example, at the electronic device 700 described in FIG. 7, or at the computing device 110 described in FIG. 1. It should also be understood that the method 500 may include additional actions not shown and/or may omit actions shown; the scope of the present disclosure is not limited in this respect.
At step 502, the computing device 110 generates, based on the preprocessed gene variation information, a one-dimensional gene variation characterization feature and a two-dimensional gene variation characterization feature, respectively. The one-dimensional gene variation characterization feature indicates cell line identification information, gene identification information, and variation effect type information; the two-dimensional gene variation characterization feature indicates cell line identification information and microsatellite instability state information of the cell line.
At step 504, the computing device 110 generates a third gene variation characterization feature based on the two-dimensional gene variation characterization feature and corresponding two-dimensional weight data. The third gene variation characterization feature may be abbreviated "Multi-mat embed"; it is generated, for example, by multiplying the two-dimensional Multi-mat matrix by a corresponding two-dimensional weight matrix. The two-dimensional weight matrix is adjusted iteratively during training of the first neural network model. In other words, Multi-mat embed is Multi-mat multiplied by a neural network embedding layer (for example, by the weight values of the network's embedding layer).
At step 506, the computing device 110 generates, based on the preprocessed drug information, drug characterization data in simplified molecular-input line-entry system format, drug characterization data in chemical fingerprint format, and drug characterization data in adjacency matrix structure graph format.
Regarding the drug characterization data in chemical fingerprint format: it is generated, for example, by a chemical fingerprinting method for drugs, which converts a drawn molecule into a stream of 0 and 1 bits. Such data may be abbreviated "Fingerprint". The fingerprint type is, for example, MACCS keys. The drug characterization data in chemical fingerprint format includes, for example, a predetermined number of keys (each "0" or "1"), for example "00000000000000000000001000010000000001100010000000000000000000100100000000010100100001010100110001100100011100110100010011100101000101111100111111111111001010111111110", where each key corresponds to a specific molecular feature. The Fingerprint-format drug characterization data is generated, for example, by converting the SMILES feature directly with the rdkit package. Table 4 below schematically shows the SMILES molecular formula structure information and the Fingerprint-format drug characterization data of the drug named Trametinib with drug identifier 11707110.
Table 4
Figure PCTCN2022085628-appb-000006
Regarding the drug characterization data in adjacency matrix structure graph format: it is, for example, a two-dimensional adjacency matrix of the molecule abstracted from the structural information of the SMILES molecular formula. Its generation is described below with reference to FIG. 4, which schematically shows the 3D structure 400 of a drug. As shown in FIG. 4, each node (for example, node 410) represents one atom in the SMILES molecular formula, and a bar (for example, bar 420) indicates that an edge connects two nodes (for example, node 422 and node 424). For example, depending on the application scenario, the dimension of the drug adjacency matrix is set to M (for example, M is 100). If the number of atoms of a drug is less than M, the surplus entries default to 0, forming a two-dimensional M*M (100*100) matrix in which a 1 is entered at the corresponding position where two atoms are connected and all other entries are 0. An attribute matrix of the atoms is added as well: assuming each atom has N (N=10) attributes, the attribute matrix is (100*10). The adjacency matrix of the drug and the attribute matrix of its atoms together represent one drug.
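The padded adjacency matrix and atom attribute matrix described above can be sketched as below. To stay self-contained, the sketch takes the atoms and bonds as already-parsed inputs instead of parsing SMILES; the function name `build_graph_features`, the three-atom toy molecule, and the choice of attribute values (atomic number in the first slot, zeros elsewhere) are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def build_graph_features(
    atom_attrs: Sequence[Sequence[float]],
    bonds: Sequence[Tuple[int, int]],
    max_atoms: int = 100,
    n_attrs: int = 10,
) -> Tuple[List[List[int]], List[List[float]]]:
    """Return a zero-padded (max_atoms x max_atoms) adjacency matrix and a
    zero-padded (max_atoms x n_attrs) atom attribute matrix."""
    if len(atom_attrs) > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    adjacency = [[0] * max_atoms for _ in range(max_atoms)]
    for i, j in bonds:
        # Bonds are undirected, so the matrix is symmetric.
        adjacency[i][j] = 1
        adjacency[j][i] = 1
    attributes = [[0.0] * n_attrs for _ in range(max_atoms)]
    for idx, attrs in enumerate(atom_attrs):
        if len(attrs) != n_attrs:
            raise ValueError("each atom needs exactly n_attrs attributes")
        attributes[idx] = [float(a) for a in attrs]
    return adjacency, attributes


# Toy molecule with three heavy atoms in a chain (C-C-O).
toy_attrs = [[6.0] + [0.0] * 9, [6.0] + [0.0] * 9, [8.0] + [0.0] * 9]
toy_bonds = [(0, 1), (1, 2)]
adjacency, attributes = build_graph_features(toy_attrs, toy_bonds)
```

Rows and columns beyond the third atom stay at zero, which corresponds to the default padding described above for drugs with fewer than M atoms.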
At step 508, the computing device 110 combines one gene variation characterization feature, selected from the one-dimensional gene variation characterization feature, the two-dimensional gene variation characterization feature, and the third gene variation characterization feature, with one kind of drug characterization data, selected from the drug characterization data in the simplified molecular input line entry system (SMILES) format, the drug characterization data in the chemical fingerprint format, and the drug characterization data in the adjacency matrix structure graph format, so as to generate multiple input sample sets. Each input sample set includes one gene variation characterization feature and one kind of drug characterization data. For example, based on 3 kinds of gene variation characterization data and 3 kinds of drug characterization data, the computing device 110 generates 9 different input sample sets.
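The pairing in step 508 amounts to a Cartesian product of the two feature families. A minimal sketch, in which the string labels are hypothetical stand-ins for the actual feature arrays:

```python
from itertools import product

# Labels stand in for the three gene variation representations and the
# three drug representations named in step 508.
GENE_FEATURES = ["one_dimensional", "two_dimensional", "third"]
DRUG_FEATURES = ["smiles", "fingerprint", "adjacency_matrix"]

# Each input sample set pairs exactly one gene representation with one
# drug representation.
input_sample_sets = [
    {"gene_feature": g, "drug_feature": d}
    for g, d in product(GENE_FEATURES, DRUG_FEATURES)
]
```

The resulting list has 3 × 3 = 9 entries, matching the 9 input sample sets mentioned in the text.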
Most conventional drug resistance prediction models focus on predicting the drug resistance IC50 value of a single type of drug against a small number of genes, and models built at the cell line level and the drug level do not consider how different feature representations of drugs and genes affect model accuracy. By contrast, with the above approach the present disclosure can combine gene variation characterization data and drug characterization data in different representations into multiple different types of input sample sets for the prediction model. This enriches the data set the prediction model learns from and covers application scenarios that combine multi-gene mutations of multiple cell lines with multiple drugs.
FIG. 6 shows a flowchart of a data preprocessing method 600 for gene variation information and drug information according to an embodiment of the present disclosure. It should be understood that the method 600 may be performed, for example, at the electronic device 800 described in FIG. 8, or at the computing device 110 described in FIG. 1. It should also be understood that the method 600 may include additional actions not shown and/or may omit actions shown; the scope of the present disclosure is not limited in this regard.
At step 602, the computing device 110 selects, from the acquired gene variation information of the cell lines, the gene variation information associated with genes belonging to a predetermined set.
The predetermined set is, for example, a set of tumor-related genes. For example, based on the gene variation information, the computing device 110 screens out more than 600 (e.g., 654) important tumor-related genes from the genes of the acquired cell lines, and selects, from the acquired gene variation information of the cell lines, the gene variation information associated with these genes. This removes gene variation information that has low relevance to tumors, which helps improve the efficiency of subsequent model training and the accuracy of prediction.
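The selection in step 602 can be pictured as a membership filter against the predetermined gene set. The sketch below is an assumption about data layout only: the record fields, the function name `select_panel_variants`, and the two-gene panel are hypothetical (the disclosed set has more than 600 genes), and `GENE_X` is a made-up placeholder for a gene outside the set.

```python
def select_panel_variants(variants, panel_genes):
    """Keep only variant records whose gene belongs to the predetermined set."""
    panel = set(panel_genes)  # set lookup keeps the filter linear in len(variants)
    return [v for v in variants if v["gene"] in panel]


# Tiny illustrative panel; the real set is described as 600+ tumor-related genes.
PANEL = ["TSC2", "ROS1"]

variants = [
    {"sample": "ZR-75-30", "gene": "TSC2"},
    {"sample": "ZR-75-30", "gene": "ROS1"},
    {"sample": "ZR-75-30", "gene": "GENE_X"},  # hypothetical gene outside the panel
]

selected = select_panel_variants(variants, PANEL)
```

Only the records for genes in the panel survive, so later annotation and encoding steps operate on a smaller, tumor-focused input.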
At step 604, the computing device 110 annotates the selected gene variation information to generate variation impact type information. For example, the computing device 110 performs biological function annotation and classification on the variations of the retained genes. This converts quantitative gene variation information into qualitative variation impact type information, which standardizes the gene variation information and simplifies data analysis, for example the training of the prediction model and the prediction of results.
For example, biological function annotation and classification converts the original cell line variation information into first-type variation information and second-type variation information. The first-type variation information indicates, for example, the association among cell line, gene, and variation impact type, in the form cell line ("Sample name") - gene ("Gene name") - variation impact type ("MUT_TYPE"). The second-type variation information indicates, for example, the association between cell line and microsatellite instability status, in the form cell line ("Sample name") - microsatellite instability status ("MSI status").
The variation impact type information ("MUT_TYPE") includes, for example, gene activation ("active"), gene inactivation ("inactive"), gene rearrangement ("Fusion"), potential clinical significance ("other"), unknown clinical significance ("VUS"), and drug resistance ("resistant"). For example, Table 5 below illustrates the first-type variation information, that is, the association among cell line, gene, and variation impact type. In the cell line named ZR-75-30, for instance, the variation impact type of the gene named TSC2 is unknown clinical significance, and the variation impact type of the gene named ROS1 is potential clinical significance.
Table 5
Figure PCTCN2022085628-appb-000007
The microsatellite instability status includes, for example, the following types: microsatellite stable ("MSI-S"), microsatellite instability-low ("MSI-L"), microsatellite instability-high ("MSI-H"), and uncertain ("Unsure"). For example, Table 6 below illustrates the second-type variation information, that is, the association between cell line and microsatellite instability status. The cell line named CW-2, for instance, has microsatellite instability-high status ("MSI-H").
Table 6
Cell line name    Microsatellite instability status (MSI status)
697               MSI-S
5637              MSI-S
201T              MSI-L
22RV1             MSI-H
CRO-AP3           Unsure
CS1               MSI-S
CW-2              MSI-H
LIM1215           Unsure
LK-2              MSI-S
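One way to turn the two kinds of variation information above into a single flat vector is a one-hot layout whose length equals the number of genes times the number of variation impact types plus the number of MSI states, which is the feature count stated in claim 7. The ordering of slots, the function name `encode_cell_line`, and the treatment of the six listed impact types as exhaustive are all assumptions made for this sketch.

```python
MUT_TYPES = ["active", "inactive", "Fusion", "other", "VUS", "resistant"]
MSI_STATES = ["MSI-S", "MSI-L", "MSI-H", "Unsure"]


def encode_cell_line(genes, gene_mut_types, msi_status):
    """One-hot encode a cell line as genes x MUT_TYPES slots plus MSI slots.

    genes: ordered list of panel genes
    gene_mut_types: dict mapping gene name -> MUT_TYPE label (absent = no variant)
    msi_status: one of MSI_STATES
    """
    n_types = len(MUT_TYPES)
    vector = [0] * (len(genes) * n_types + len(MSI_STATES))
    for g_idx, gene in enumerate(genes):
        mut = gene_mut_types.get(gene)
        if mut is not None:
            vector[g_idx * n_types + MUT_TYPES.index(mut)] = 1
    # The MSI one-hot block follows all per-gene blocks.
    vector[len(genes) * n_types + MSI_STATES.index(msi_status)] = 1
    return vector


# Gene labels follow the ZR-75-30 example in the text; the MSI status passed
# here is a placeholder value, not a measured one.
vec = encode_cell_line(["TSC2", "ROS1"], {"TSC2": "VUS", "ROS1": "other"}, "MSI-S")
```

Under these assumptions a panel of 654 genes would give 654 × 6 + 4 = 3928 entries per cell line.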
At step 606, the computing device 110 removes gene variation information and drug information that meet at least one of the following conditions: the acquired drug sensitivity state data for the cell line and the corresponding drug is unstable; or the corresponding drug lacks drug molecular formula structure information. For example, the computing device 110 removes the gene variation data of cell lines whose experimental IC50 values are unstable or whose drugs have no SMILES molecular formula.
With the above approach, the present disclosure can convert quantitative gene variation information into qualitative variation impact type information and clean out data with missing attribute values or unstable attributes. The result is complete preprocessed data, which in turn helps improve the machine learning performance of the neural network models.
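The cleaning rule in step 606 can be sketched as a simple record filter. The text does not define how an IC50 measurement is judged unstable, so the boolean field `ic50_stable` below is a hypothetical flag assumed to be computed upstream; the other field names, the drug identifiers, and the SMILES placeholders are likewise illustrative only.

```python
def clean_records(records):
    """Drop records with unstable IC50 data or without a SMILES string."""
    return [
        r for r in records
        if r["ic50_stable"] and r.get("smiles")  # None or "" counts as missing
    ]


# Hypothetical records; "CCO" is a stand-in SMILES, not a real study drug.
records = [
    {"cell_line": "697", "drug_id": "drug_A", "ic50_stable": True, "smiles": "CCO"},
    {"cell_line": "5637", "drug_id": "drug_A", "ic50_stable": False, "smiles": "CCO"},
    {"cell_line": "201T", "drug_id": "drug_B", "ic50_stable": True, "smiles": None},
]

cleaned = clean_records(records)
```

Only the first record passes both checks; the second is dropped for unstable IC50 data and the third for missing molecular structure information.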
FIG. 7 schematically shows a block diagram of an electronic device 700 suitable for implementing embodiments of the present disclosure. The device 700 may be used to perform the methods 200, 500, and 600 shown in FIG. 2, FIG. 5, and FIG. 6. As shown in FIG. 7, the device 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processes according to computer program instructions stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM may also store the various programs and data required for the operation of the device 700. The CPU, the ROM, and the RAM are connected to one another via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Multiple components of the device 700 are connected to the I/O interface 705, including an input unit 706, an output unit 707, and a storage unit 708. The central processing unit 701 performs the methods and processes described above, such as the methods 200, 500, and 600. For example, in some embodiments, the methods 200, 500, and 600 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM and/or a communication unit 709. When the computer program is loaded into the RAM and executed by the CPU, one or more operations of the methods 200, 500, and 600 described above may be performed. Alternatively, in other embodiments, the CPU may be configured to perform one or more actions of the methods 200, 500, and 600 in any other suitable manner (e.g., by means of firmware).
It should further be noted that the present disclosure may be a method, an apparatus, a system, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for carrying out various aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), is personalized by using state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.
Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor in a voice interaction device, or to a processing unit of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium and cause a computer, a programmable data processing apparatus, and/or other devices to operate in a particular manner, so that the computer-readable medium storing the instructions constitutes an article of manufacture that includes instructions implementing aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, causing a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executing on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions.
Various embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The above are merely optional embodiments of the present disclosure and are not intended to limit the present disclosure. For those skilled in the art, the present disclosure may have various modifications and changes. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (10)

  1. A method for predicting a drug sensitivity state, comprising:
    acquiring gene variation information of a sample to be tested and drug information about a drug, the drug information including at least a drug identifier and drug molecular formula structure information;
    acquiring drug sensitivity state data determined by a cell activity assay of cells with the corresponding drug, the drug sensitivity state data being a half-maximal inhibitory concentration value;
    preprocessing the gene variation information and the drug information to generate multiple kinds of gene variation characterization data and multiple kinds of drug characterization data for combining into multiple input sample sets;
    extracting, based on a first neural network model, features of the gene variation characterization data in an input sample set to generate gene variation features;
    extracting, based on a second neural network model, features of the drug characterization data in the input sample set to generate drug features;
    fusing the gene variation features and the drug features; and
    extracting, based on a third neural network model, features of the fused gene variation features and drug features for predicting the drug sensitivity state of the sample to be tested with respect to the corresponding drug, wherein the first neural network model, the second neural network model, and the third neural network model are trained on multiple samples.
  2. The method according to claim 1, wherein the sample to be tested is a cell line or primary cells, and the drug sensitivity state data is determined by a cell activity assay of the cell line with the corresponding drug.
  3. The method according to claim 2, wherein generating multiple kinds of gene variation characterization data and multiple kinds of drug characterization data for combining into multiple input sample sets comprises:
    generating, based on the preprocessed gene variation information, a one-dimensional gene variation characterization feature and a two-dimensional gene variation characterization feature, the one-dimensional gene variation characterization feature indicating cell line identification information, gene identification information, and variation impact type information, and the two-dimensional gene variation characterization feature indicating cell line identification information and microsatellite instability status information of the cell line; and
    generating a third gene variation characterization feature based on the two-dimensional gene variation characterization feature and corresponding two-dimensional weight data.
  4. The method according to claim 3, wherein generating multiple kinds of gene variation characterization data and multiple kinds of drug characterization data for combining into multiple input sample sets comprises:
    generating, based on the preprocessed drug information, drug characterization data in the simplified molecular input line entry system (SMILES) format, drug characterization data in the chemical fingerprint format, and drug characterization data in the adjacency matrix structure graph format; and
    combining one gene variation characterization feature, selected from the one-dimensional gene variation characterization feature, the two-dimensional gene variation characterization feature, and the third gene variation characterization feature, with one kind of drug characterization data, selected from the drug characterization data in the SMILES format, the drug characterization data in the chemical fingerprint format, and the drug characterization data in the adjacency matrix structure graph format, so as to generate multiple input sample sets, each input sample set including one gene variation characterization feature and one kind of drug characterization data.
  5. The method according to claim 2, wherein preprocessing the gene variation information and the drug information further comprises:
    selecting, from the acquired gene variation information of the cell line, gene variation information associated with genes belonging to a predetermined set;
    annotating the selected gene variation information to generate variation impact type information; and
    removing gene variation information and drug information that meet at least one of the following: the acquired drug sensitivity state data for the cell line and the corresponding drug is unstable; and the corresponding drug lacks drug molecular formula structure information.
  6. The method according to claim 3, wherein the variation impact type information includes information on gene activation, gene inactivation, gene rearrangement, potential clinical significance, unknown clinical significance, and drug resistance, and the microsatellite instability status information includes information on microsatellite stable, microsatellite instability-low, microsatellite instability-high, and uncertain microsatellite stability.
  7. The method according to claim 3, wherein the number of feature values of the one-dimensional gene variation characterization feature equals the number of genes multiplied by the number of gene mutation states, plus the number of microsatellite instability states of the cell line; rows of the two-dimensional gene variation characterization feature indicate the corresponding genes of the cell line; and columns of the two-dimensional gene variation characterization feature indicate variation impact type information or microsatellite instability status information.
  8. The method according to claim 3, further comprising:
    determining the first neural network model and the second neural network model such that the first neural network model matches the kind of gene variation characterization data in the input sample set and the second neural network model matches the drug characterization data in the input sample set.
  9. A computing device, comprising:
    at least one processing unit; and
    at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the computing device to perform the method according to any one of claims 1 to 8.
  10. A computer-readable storage medium having machine-executable instructions stored thereon, the machine-executable instructions, when executed, causing a machine to perform the method according to any one of claims 1 to 8.
PCT/CN2022/085628 2021-04-09 2022-04-07 Method for predicting drug sensitivity state, device, and storage medium WO2022214036A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110380471.XA CN112768089B (en) 2021-04-09 2021-04-09 Method, apparatus and storage medium for predicting drug sensitivity status
CN202110380471.X 2021-04-09

Publications (1)

Publication Number Publication Date
WO2022214036A1 true WO2022214036A1 (en) 2022-10-13

Family

ID=75691385

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/085628 WO2022214036A1 (en) 2021-04-09 2022-04-07 Method for predicting drug sensitivity state, device, and storage medium

Country Status (2)

Country Link
CN (1) CN112768089B (en)
WO (1) WO2022214036A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403657A (en) * 2023-03-20 2023-07-07 本源量子计算科技(合肥)股份有限公司 Drug response prediction method and device, storage medium and electronic device
CN116705194A (en) * 2023-06-06 2023-09-05 之江实验室 Method and device for predicting drug cancer suppression sensitivity based on graph neural network
CN117275608A (en) * 2023-09-08 2023-12-22 浙江大学 Cooperative attention-based method and device for cooperative prediction of interpretable anticancer drugs
CN117524346A (en) * 2023-11-20 2024-02-06 东北林业大学 Multi-view cancer drug response prediction system

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112768089B (en) * 2021-04-09 2021-06-22 至本医疗科技(上海)有限公司 Method, apparatus and storage medium for predicting drug sensitivity status
CN113284553B (en) * 2021-05-28 2023-01-10 南昌大学 Method for testing binding capacity of drug target for treating drug addiction
CN114334078B (en) * 2022-03-14 2022-06-14 至本医疗科技(上海)有限公司 Method, electronic device, and computer storage medium for recommending medication
WO2023221125A1 (en) * 2022-05-20 2023-11-23 京东方科技集团股份有限公司 Drug sensitivity prediction method, model training method, storage medium and device
CN116110509B (en) * 2022-11-15 2023-08-04 浙江大学 Method and device for predicting drug sensitivity based on histology consistency pretraining

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108877953A (en) * 2018-06-06 2018-11-23 中南大学 A kind of drug sensitivity prediction method based on more similitude networks
CN110232978A (en) * 2019-06-14 2019-09-13 西安电子科技大学 Cancer cell system therapeutic agent prediction technique based on multidimensional network
CN110867254A (en) * 2019-11-18 2020-03-06 北京市商汤科技开发有限公司 Prediction method and device, electronic device and storage medium
US20200365270A1 (en) * 2019-05-15 2020-11-19 International Business Machines Corporation Drug efficacy prediction for treatment of genetic disease
CN112768089A (en) * 2021-04-09 2021-05-07 至本医疗科技(上海)有限公司 Method, apparatus and storage medium for predicting drug sensitivity status

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
US10262107B1 (en) * 2013-03-15 2019-04-16 Bao Tran Pharmacogenetic drug interaction management system
CN107609326A (en) * 2017-07-26 2018-01-19 同济大学 Drug sensitivity prediction method in cancer precision medicine
KR101953762B1 (en) * 2017-09-25 2019-03-04 (주)신테카바이오 Drug indication and response prediction systems and method using AI deep learning based on convergence of different category data
CN111724911A (en) * 2020-05-13 2020-09-29 深圳哲源生物科技有限责任公司 Target drug sensitivity prediction method and device, terminal device and storage medium
CN111627515B (en) * 2020-05-29 2023-07-18 上海商汤智能科技有限公司 Medicine recommendation method, device, electronic equipment and medium
CN111798030B (en) * 2020-06-02 2023-07-25 中国科学院合肥物质科学研究院 Drug sensitivity prediction method and device based on deep genetic information features
CN112435754B (en) * 2020-09-30 2022-04-08 天津大学 Method for predicting drug sensitivity based on deep factorization machine

Cited By (6)

Publication number Priority date Publication date Assignee Title
CN116403657A (en) * 2023-03-20 2023-07-07 本源量子计算科技(合肥)股份有限公司 Drug response prediction method and device, storage medium and electronic device
CN116705194A (en) * 2023-06-06 2023-09-05 之江实验室 Method and device for predicting drug cancer suppression sensitivity based on graph neural network
CN116705194B (en) * 2023-06-06 2024-06-04 之江实验室 Method and device for predicting drug cancer suppression sensitivity based on graph neural network
CN117275608A (en) * 2023-09-08 2023-12-22 浙江大学 Cooperative attention-based method and device for cooperative prediction of interpretable anticancer drugs
CN117275608B (en) * 2023-09-08 2024-04-26 浙江大学 Cooperative attention-based method and device for cooperative prediction of interpretable anticancer drugs
CN117524346A (en) * 2023-11-20 2024-02-06 东北林业大学 Multi-view cancer drug response prediction system

Also Published As

Publication number Publication date
CN112768089A (en) 2021-05-07
CN112768089B (en) 2021-06-22

Similar Documents

Publication Publication Date Title
WO2022214036A1 (en) Method for predicting drug sensitivity state, device, and storage medium
Clarke et al. Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods
Villanea et al. Multiple episodes of interbreeding between Neanderthal and modern humans
Haibe-Kains et al. Inconsistency in large pharmacogenomic studies
AU2017338775B2 (en) Phenotype/disease specific gene ranking using curated, gene library and network based data structures
Rangel et al. Phylogenetic uncertainty revisited: Implications for ecological analyses
Rudolph et al. Elucidation of signaling pathways from large-scale phosphoproteomic data using protein interaction networks
Gonzalez et al. GEnomes Management Application (GEM.app): A New Software Tool for Large‐Scale Collaborative Genome Analysis
Ma et al. Modeling disease progression using dynamics of pathway connectivity
US20100318528A1 (en) Sequence-centric scientific information management
CN106971071A (en) Clinical decision support system and method
Zhang et al. Accounting for tumor purity improves cancer subtype classification from DNA methylation data
Nettleton et al. Identification of differentially expressed gene categories in microarray studies using nonparametric multivariate analysis
Pataki et al. Understanding and predicting ciprofloxacin minimum inhibitory concentration in Escherichia coli with machine learning
Deng et al. Dissecting the genetic structure and admixture of four geographical Malay populations
Li et al. A gene-based information gain method for detecting gene–gene interactions in case–control studies
US20210090686A1 (en) Single cell rna-seq data processing
Matsui et al. phyC: Clustering cancer evolutionary trees
Dinov et al. The perfect neuroimaging-genetics-computation storm: collision of petabytes of data, millions of hardware devices and thousands of software tools
Marchetti-Bowick et al. A time-varying group sparse additive model for genome-wide association studies of dynamic complex traits
Peng et al. Improving drug response prediction based on two-space graph convolution
Bremer et al. Realistic gene transfer to gene duplication ratios identify different roots in the bacterial phylogeny using a tree reconciliation method
Akinola et al. A systems level comparison of Mycobacterium tuberculosis, Mycobacterium leprae and Mycobacterium smegmatis based on functional interaction network analysis
L’Hostis et al. Knowledge-based mechanistic modeling accurately predicts disease progression with gefitinib in EGFR-mutant lung adenocarcinoma
Audoux et al. SimBA: A methodology and tools for evaluating the performance of RNA-Seq bioinformatic pipelines

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22784102; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 22784102; Country of ref document: EP; Kind code of ref document: A1)