CN112635080A

CN112635080A - Deep learning-based drug prediction method and device

Info

Publication number: CN112635080A
Application number: CN202110053416.XA
Authority: CN
Inventors: 杨东; 许田
Original assignee: Fosun Lingzhi Shanghai Pharmaceutical Technology Co ltd
Current assignee: Hangzhou shenai Technology Co.,Ltd.
Priority date: 2021-01-15
Filing date: 2021-01-15
Publication date: 2021-04-09

Abstract

The invention relates to a method (100) for drug prediction, comprising: receiving a genomic feature vector (130) derived from a tumor sample of a subject, the genomic feature vector representing genetic mutation information of the tumor; receiving a transcriptome feature vector (140) derived from the tumor sample, the transcriptome feature vector representing gene expression levels of the tumor; deriving a molecular structure feature vector (150) for a drug based on the chemical structure of the drug; and inputting at least one of the genomic feature vector and the transcriptome feature vector and the molecular structure feature vector into a trained deep neural network to determine whether the drug is effective against the tumor (160). The method can predict effective drugs for tumor patients more accurately. Preferably, the method further comprises determining a drug target feature vector (155) of the drug based on the pathway and characteristics targeted by the drug, and additionally inputting the drug target feature vector into the deep neural network to predict the drug.

Description

Deep learning-based drug prediction method and device

Technical Field

The present invention relates to the field of bioinformatics technology, and more particularly, to a method, an apparatus, a computer program product and a computer readable storage medium for drug prediction based on deep learning, which combine genetic characteristics of tumor patients and characteristics of anticancer drugs to achieve accurate prediction of anticancer drugs.

Background

The discussion in this section is not admitted to be prior art by reference. Similarly, the problems mentioned in this section should not be considered as having been recognized in the prior art.

Cancer is today the greatest threat to human health and remains difficult to overcome. In China, the incidence of cancer is rising rapidly and cancer has become the leading cause of death. Cancer is caused by mutations in the genome of a human individual. Unlike normal cells, cancer cells exhibit unlimited growth, transformation, and metastasis, and are therefore difficult to destroy.

Currently, the main approach to cancer treatment is to use molecularly targeted drugs to inhibit the development of cancer. The disadvantage of this targeted therapy is that the biomarkers used to predict drug efficacy cover only a limited patient population, and the accuracy and specificity of the resulting efficacy guidance are far from sufficient. Another approach is to transplant the tumor into an animal, administer the compound to the animal, and observe how the tumor grows in order to determine the effect of the compound on the tumor.

In the face of these challenges, human cancer cell lines provide new vehicles for predicting drug response, facilitating the screening of candidate drugs for cancer treatment. The accuracy of predicting the drug response can be improved by predicting the drug response by analyzing the molecular data of the cancer cell line. According to different data and theories, the current drug response prediction methods are mainly divided into the following two categories:

First, machine learning-based drug response prediction methods.

With the continuing development of machine learning theory, methods that use machine learning to predict drug response have achieved better results. Their advantage is that drug response is studied at the level of gene expression in cancer cell lines; their drawback is that only the characteristics of the tumor cells are considered, so prediction accuracy still needs to be improved.

Second, network-based drug response prediction methods.

A network can reflect the relationships between its nodes. Similar cancer cell lines have been found to respond similarly to similar drugs. A cancer cell line similarity network describes similarities between cancer cell lines, a drug similarity network describes similarities between drugs, and information propagation methods are applied to these similarity networks to predict drug response. The advantage of these methods is that drug response is studied through the similarity relations between cancer cell lines and between drugs; their drawback is that the similarities are computed from a single level only, so prediction accuracy still needs to be improved.

A deep neural network is an artificial neural network that uses multiple nonlinear and complex transformation layers to model high-level features successively. It provides feedback through backpropagation, which carries the difference between the observed output and the predicted output back through the network to adjust the parameters. Deep neural networks have evolved with the availability of large training data sets, increased parallel and distributed computing power, and the development of complex training algorithms.

Given that sequencing data is multi-dimensional and high-dimensional, deep neural networks have broad prospects in bioinformatics research due to their wide applicability and enhanced predictive power. Convolutional neural networks have been used to address sequencing-based issues in genomics, such as motif discovery, pathogen variant recognition, and gene expression inference. The recurrent neural network can capture long-range dependencies in variable-length sequencing data (e.g., protein or DNA sequencing).

Disclosure of Invention

In view of the above problems of the prior art, embodiments of the present invention provide a drug prediction method, a drug prediction analysis apparatus, a computer program product for performing the method, and a computer readable medium storing the program, which combine the genetic characteristics of a tumor patient with the characteristics of anti-cancer drugs and, by means of the self-learning capability of a deep neural network, achieve fast and accurate drug prediction.

According to an aspect of the present invention, there is provided a computer-implemented method of drug prediction, the method comprising: receiving a genomic feature vector derived from a tumor sample of a subject, the genomic feature vector representing genetic mutation information of the tumor; receiving a transcriptome feature vector derived from the tumor sample, the transcriptome feature vector representing a gene expression level of the tumor; deriving a molecular structure feature vector of a drug according to the chemical structure of the drug; and inputting at least one of the genomic feature vector and the transcriptome feature vector and the molecular structure feature vector into a trained deep neural network to determine whether the drug is effective against the tumor.

The inventive concept of the present invention is based on the recognition that drugs can be predicted more accurately for tumor patients when the structure of the drug and the interactions between different drugs and tumor cells of different genetic backgrounds are considered together. Preferably, the genomic features, the transcriptome features, and the molecular structure features of the drug are all considered at the same time, which can further improve the accuracy of drug prediction.

According to a preferred embodiment of the present invention, the method may further comprise: receiving respective genomic sequencing data for the tumor sample and a normal sample of the subject; and receiving transcriptome sequencing data for the tumor sample of the subject. Preferably, the genomic feature vector is determined by mutation detection of genomic sequencing data of the tumor sample with reference to genomic sequencing data of the normal sample, and the transcriptome feature vector is determined from gene expression values of the tumor derived from the transcriptome sequencing data. As will be appreciated by those skilled in the art, given the increasing maturity of gene sequencing technology, there are more and more companies specializing in gene sequencing, and thus embodiments of the present invention can obtain genomic and transcriptome characteristics of tumors from these companies or their analytical products, as well as from analysis of the obtained genomic and transcriptome sequencing data to obtain genomic and transcriptome characteristics of tumors. Of course, embodiments of the present invention may additionally include the step of analyzing the tumor and normal samples to obtain genomic and transcriptome sequencing data, and even genomic and transcriptome signatures.

According to a preferred embodiment of the present invention, the method may further comprise: determining a drug target feature vector of the drug according to the targeted pathway and characteristics of the drug; and additionally inputting the drug target feature vector into the trained deep neural network to determine whether the drug is effective against the tumor. Thus, correlating the molecular structural features of the drug with the drug target features to genomic and/or transcriptome features of the tumor patient also improves the accuracy of drug prediction. Preferably, the drug target feature vector may be obtained by classifying the drug according to the mechanism of action of the molecular compound of the drug and then encoding.

According to a preferred embodiment of the present invention, the molecular structure feature vector may be calculated based on the atomic features, chemical bond features, and connection relationships between atoms in the molecular compound of the drug. Preferably, the atomic features are selected from at least one feature of the group consisting of: atomic species, atomic mass, chirality, aromaticity, type of hybrid orbital, number of chemical bonds attached to it, number of hydrogen atoms attached to it, and formal charge carried by the atom. Preferably, the chemical bond characteristic is at least one characteristic selected from the group consisting of: type of chemical bond, conjugation, cyclization, and steric properties. Preferably, the linkage between the atoms indicates the interconnection between the respective atoms in the molecular compound.

According to a preferred embodiment of the present invention, the deriving may include inputting a molecular graph of the molecular compound of the drug into a Graph Neural Network (GNN) in the deep neural network, and extracting in turn the features of each atom and chemical bond in the molecular compound according to the connection relationships between the atoms to calculate the molecular structure feature vector. The molecular graph contains information about the atomic features, the chemical bond features, and the connection relationships between atoms.

According to a preferred embodiment of the invention, the input may comprise: combining and concatenating at least one of the genomic feature vector and the transcriptome feature vector with at least the molecular structure feature vector to form an input feature vector; and inputting the input feature vector into the trained deep neural network to determine whether the drug is effective against the tumor. In one example, the input may include: combining and concatenating at least one of the genomic feature vector and the transcriptome feature vector with the molecular structure feature vector and the drug target feature vector to form an input feature vector; and inputting the input feature vector into the trained deep neural network to determine whether the drug is effective against the tumor.

According to a preferred embodiment of the present invention, the input may further include: inputting the input feature vector for each of the at least one drug into the trained deep neural network in turn to calculate a probability that each drug is effective against the tumor. Preferably, the at least one drug includes at least two drugs, and the inputting may further include: ranking the effectiveness probabilities of the at least two drugs to select the most effective drug. Thus, if all existing anticancer drugs are input into the deep neural network of the present invention, it is possible to analyze which drug is most effective for a given tumor patient, so that the patient can be treated promptly and effectively.

According to a preferred embodiment of the present invention, the deep neural network may be trained using pharmacodynamic results for different tumors and different drugs as training data. Preferably, the deep neural network may be trained by stochastic gradient descent using binary cross-entropy as the loss function.
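The binary cross-entropy loss named above can be illustrated with a minimal sketch. The function name, the clipping constant `eps`, and the sample labels and probabilities are illustrative assumptions and are not taken from the specification.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy over a batch.

    y_true: list of labels, 1 (drug effective) or 0 (drug not effective).
    p_pred: list of predicted probabilities of the "effective" class.
    eps:    clipping constant (an assumption) that avoids log(0).
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Illustrative batch: two tumor/drug pairs, both predicted at 0.5.
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
print(loss)
```

In stochastic gradient descent, a loss of this kind would be evaluated on small batches of tumor/drug pairs and its gradient used to update the network parameters.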

According to a preferred embodiment of the invention, the deep Neural Network may comprise a Feed-Forward Neural Network (FFNN), wherein an output layer of the Feed-Forward Neural Network comprises a Softmax classifier. Preferably, the hyper-parameters of the feedforward neural network may include the number of neural network layers, the number of nodes of each hidden layer of the neural network, and a dropout rate. These network hyper-parameters determine the structure of the feedforward neural network. Preferably, the number of nodes of the input layer of the feedforward neural network depends on the length of the input feature vector.

According to another aspect of the present invention, there is provided a drug prediction analysis apparatus including: a memory having executable instructions stored thereon; and a processor; wherein the processor is configured to execute the executable instructions to perform the method steps recited in the preceding paragraphs.

According to yet another aspect of the present invention, there is provided a machine-readable storage medium having stored thereon executable instructions, wherein the executable instructions, when executed by a machine, cause the machine to perform the method steps recited in the preceding paragraph.

According to yet another aspect of the invention, there is provided a computer program product comprising executable instructions, characterized in that the executable instructions, when executed by a processor, cause the processor to perform the method steps of the preceding paragraph.

Other objects and effects of the present invention will become more apparent and more easily understood by referring to the description taken in conjunction with the accompanying drawings.

Drawings

The features, characteristics, advantages and benefits of the present invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings.

FIG. 1 shows a general flow diagram of a computer-implemented drug prediction method 100 according to one embodiment of the invention.

FIG. 2 shows a schematic view of a drug prediction analysis device 200 according to another embodiment of the invention.

FIG. 3 shows the molecular structure of a compound of a certain drug and its molecular graph.

FIG. 4 shows an architecture diagram 300 of a deep neural network, according to one embodiment of the present invention.

In the drawings the same reference numerals indicate similar or corresponding features and/or functions.

Detailed Description

The subject matter described herein will now be discussed with reference to example embodiments. It should be understood that these embodiments are discussed only to enable those skilled in the art to better understand and thereby implement the subject matter described herein, and are not intended to limit the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, substitute, or add various procedures or components as needed. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may also be combined in other examples.

As used herein, the term "include" and its variants are open-ended terms meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first," "second," and the like may refer to different or the same object. Other definitions, whether explicit or implicit, may be included below. The definition of a term is consistent throughout the specification unless the context clearly dictates otherwise.

Cancer is today the greatest threat to human health. Existing drug prediction methods for tumors either consider only the characteristics of tumor cells or focus on the targeting pathway of drugs, and the accuracy of their predictions needs to be improved. The inventors of the present application therefore propose, for the first time, to combine the structure of the drug, the drug target, and the interactions between different drugs and tumor cells of different genetic backgrounds, and to predict drugs for tumor patients by means of the self-learning capability of a deep neural network, thereby improving the accuracy of drug prediction.

Hereinafter, various embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 shows a general flow diagram of a computer-implemented drug prediction method 100 according to one embodiment of the invention. The drug prediction method 100 shown in fig. 1 will be described in detail below with reference to the drug prediction analysis apparatus 200 shown in fig. 2, the compound molecular structure shown in fig. 3 and its molecular diagram, and the neural network architecture diagram 300 shown in fig. 4.

As shown in fig. 1, genomic sequencing data for a tumor sample and a normal sample of a subject (e.g., a tumor patient) is received at block 110, for example, by a processor 220. A "tumor sample" generally refers to a sample derived from a diseased site or tissue of a tumor patient, such as a lung tumor tissue sample of a lung cancer patient, obtained e.g. by surgery or biopsy. A "normal sample" is generally a sample from a non-diseased site or tissue of the same tumor patient, usually taken from beside the diseased site or tissue and also called "paracancerous normal tissue"; for lung cancer, for example, normal lung tissue near the lung tumor is taken as the normal control. For example, the genome sequencing data may be acquired using state-of-the-art high-throughput sequencing technologies, such as Whole Exome Sequencing (WES), Whole Genome Sequencing (WGS), and the like. In one example, the sequencing of the genome may also be performed by sequencing companies that specialize in providing sequencing services, such as Huada Gene (BGI), so that the drug prediction method 100 need only receive genome sequencing results from these sequencing companies. For example, the received genome sequencing results may be stored in the memory 210 (shown in fig. 2) of the drug prediction analysis device 200.

At block 120, transcriptome sequencing data for the tumor sample of the subject (e.g., a tumor patient) is received. For example, the transcriptome sequencing data may be obtained using various known RNA sequencing techniques (RNA-seq). In one example, sequencing of transcriptomes may also be performed by sequencing companies that specialize in providing sequencing services (e.g., Huada Gene (BGI), Illumina, Thermo Fisher, etc.), so that the drug prediction method 100 need only receive transcriptome sequencing results from these sequencing companies. For example, the received transcriptome sequencing results may be stored in the memory 210 (shown in FIG. 2) of the drug prediction analysis device 200.

At block 130, mutation detection is performed, e.g., by processor 220, on the genomic sequencing data of the tumor sample with reference to the genomic sequencing data of the normal sample to determine the genomic feature vector of the tumor. In one example, mutation detection is performed on the tumor genomic DNA sequencing data to obtain gene mutation information on the point mutations, insertion-deletion mutations, and fusion gene mutations of the tumor somatic cells. A wide variety of known techniques are suitable for mutation detection in genomic sequencing data. In a preferred embodiment, the genomic sequencing data is subjected to mutation detection using at least two mutation detection methods.

In another example, at block 130, a genomic feature vector derived from a tumor sample of a subject is received, e.g., by processor 220, the genomic feature vector representing genetic mutation information of the tumor. As described above, as gene sequencing technologies become more mature, there are more and more companies dedicated to gene sequencing work, so that the embodiments of the present invention can analyze the obtained genome sequencing data to obtain the genome features of the tumor, and can also obtain the genome features of the tumor from these companies or their analysis products and convert them into genome feature vectors.

In a preferred embodiment, the obtained gene mutation information can be functionally annotated, for example using the PolyPhen prediction software (http://genetics.bwh.harvard.edu/pph2/), to obtain the genomic feature vector of the tumor, $F_{dna} = (M^i_1, M^i_2, M^i_3, \ldots, M^i_n)$, where $M^i_j$ indicates whether the $j$-th gene of the $i$-th tumor carries a mutation that causes gene dysfunction: $M^i_j \in \{1, 0\}$, where 1 represents the presence of a mutation that causes gene dysfunction and 0 represents the absence of a mutation or the presence of a mutation that does not cause gene dysfunction. Here $j \in \{1, 2, 3, \ldots, n\}$, where $n$ is the number of genes; for example, $n$ may be the total of about 30,000 human genes, or 800, 1000, 1500, 2000, etc. genes. One skilled in the art will appreciate that other software or methods may be used to functionally annotate gene mutations.
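As a minimal sketch of how such a binary genomic feature vector could be assembled, assume the annotation step has already produced the set of genes carrying a function-damaging mutation. The function name, the four-gene panel, and the annotation result are hypothetical and serve only as an illustration.

```python
def genomic_feature_vector(gene_panel, damaging_genes):
    """Return the binary vector (M_1, ..., M_n) for one tumor.

    gene_panel:     ordered list of the n genes that define the dimensions.
    damaging_genes: set of genes annotated as carrying a mutation that
                    causes gene dysfunction (hypothetical input, e.g.
                    parsed from the output of an annotation tool).
    """
    return [1 if gene in damaging_genes else 0 for gene in gene_panel]


# Illustrative panel with n = 4; real panels would be far larger.
panel = ["TP53", "KRAS", "EGFR", "BRAF"]
damaging = {"TP53", "EGFR"}
f_dna = genomic_feature_vector(panel, damaging)
print(f_dna)
```

Genes with no mutation and genes with only non-damaging mutations both map to 0, matching the definition of the vector entries.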

At block 140, gene expression values for the tumor are derived from the transcriptome sequencing data, e.g., by processor 220, to determine a transcriptome feature vector for the tumor. The DNA of the tumor's genes is transcribed into RNA, the RNA is translated into protein, and finally the protein performs functions in the body. "Gene expression value" refers to the amount of RNA formed by transcription of a gene. Amounts of RNA that are too high or too low may cause abnormal body functions. In one example, the gene expression values of the $i$-th tumor are normalized as $Z^i_j = (x^i_j - \mu)/\sigma$, where $x$ is the gene expression value obtained by RNA sequencing, to obtain the transcriptome feature vector of the tumor, $F^i_{rna} = (Z^i_1, Z^i_2, Z^i_3, \ldots, Z^i_n)$, where $Z^i_j$ is the normalized expression value of the $j$-th gene of the $i$-th tumor, $j \in \{1, 2, 3, \ldots, n\}$, and $n$ is the number of genes; for example, $n$ may be the total of about 30,000 human genes, or 800, 1000, 1500, 2000, etc. genes. In another example, the transcriptome feature vector for a tumor may be obtained directly from tumor gene expression values without normalization.
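The normalization step can be sketched as a standard z-score. The specification does not state over which set of values the mean and standard deviation are computed, so the sketch simply takes them over whatever list is passed in; that choice, the use of the population standard deviation, and the sample values are assumptions.

```python
from statistics import mean, pstdev


def zscore(values):
    """Normalize values as (x - mu) / sigma.

    mu and sigma are computed over the given list (an assumption; the
    reference population is not specified in the text).
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Illustrative expression values with mean 5 and standard deviation 2.
expression = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
f_rna = zscore(expression)
print(f_rna)
```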

In another example, a transcriptome feature vector derived from the tumor sample is received, e.g., by processor 220, the transcriptome feature vector representing gene expression levels of the tumor. As mentioned above, as gene sequencing technology becomes more mature, there are more and more companies dedicated to gene sequencing work, so that the embodiments of the present invention can analyze the obtained transcriptome sequencing data to obtain the transcriptome characteristics of the tumor, and can also obtain the transcriptome characteristics of the tumor from these companies or their analysis products and convert them into the transcriptome characteristic vector.

At block 145, a molecular graph of the molecular compound of a drug (e.g., any anti-cancer drug) is input, for example by processor 220, into a Graph Neural Network (GNN) within the deep neural network, and the features of each atom and chemical bond in the molecular compound are extracted in turn according to the connections between atoms to calculate a molecular structure feature vector. The molecular graph contains information about the atomic features, the chemical bond features, and the connections between atoms. Preferably, the graph neural network comprises a Message Passing Neural Network (MPNN), such as a D-MPNN. To explain the principle of the present invention with reference to fig. 3, the chemical structure of a compound molecule is treated as a graph in the sense of mathematical graph theory. Each atom in the compound is represented as a node (vertex) of the molecular graph; in the example of fig. 3, the compound includes 6 nodes (the hydrogen atoms H in the compound are not treated as nodes). Chemical bonds between atoms are represented by edges between nodes, and these edges reflect how the atoms are connected to one another. The features of each atom and each chemical bond in the compound are one-hot encoded (except for atomic mass, which is encoded as a real value), in the following manner:

atomic features include, but are not limited to:

(1) atomic species (100-dimensional, e.g., C, N, O, etc.);

(2) atomic mass (1-dimensional);

(3) chirality (4-dimensional, e.g., CW, CCW, uncertain, other);

(4) aromaticity (1-dimensional);

(5) hybrid orbital type (5-dimensional, e.g., sp2, sp3, sp3d, sp3d2);

(6) the number of chemical bonds attached to the atom (6-dimensional);

(7) the number of hydrogen atoms bonded to the atom (5-dimensional);

(8) formal charge carried by the atom (5-dimensional).

Chemical bond characteristics include, but are not limited to:

(1) type of chemical bond (4-dimensional, e.g., single bond, double bond, triple bond, aromatic bond);

(2) conjugation (1-dimensional);

(3) whether the bond is part of a ring (1-dimensional);

(4) steric properties (6-dimensional, e.g., E, Z, cis, trans, etc.).
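The encoding scheme listed above can be sketched as follows. The dimension counts are taken from the two lists; the helper names and the use of a plain index for the 6-dimensional steric category are illustrative assumptions, since the specification does not enumerate all six categories.

```python
# Dimensions of the atomic features as listed above.
ATOM_DIMS = {
    "species": 100, "mass": 1, "chirality": 4, "aromaticity": 1,
    "hybridization": 5, "num_bonds": 6, "num_hydrogens": 5, "formal_charge": 5,
}

BOND_TYPES = ["single", "double", "triple", "aromatic"]
NUM_STEREO_CLASSES = 6  # the text names E, Z, cis, trans "etc."


def one_hot(value, choices):
    """Return a list with a single 1 at the position of value in choices."""
    return [1 if value == choice else 0 for choice in choices]


def bond_features(bond_type, conjugated, in_ring, stereo_index):
    """Concatenate the four bond feature groups into one 12-dim vector.

    stereo_index is a hypothetical integer code in [0, 6) standing in
    for the steric category of the bond.
    """
    stereo = one_hot(stereo_index, range(NUM_STEREO_CLASSES))
    return one_hot(bond_type, BOND_TYPES) + [int(conjugated), int(in_ring)] + stereo


print(sum(ATOM_DIMS.values()))          # total atom feature length
print(bond_features("double", True, False, 0))
```

With the dimensions given in the lists, each atom is described by 127 values and each bond by 12 values.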

The connection relationship between atoms refers to how the atoms in the molecular compound are connected to one another. For example, in the example of fig. 3, atom No. 1 is attached to atoms No. 2 and No. 5; atom No. 2 is attached to atoms No. 1, No. 3, and No. 5; atom No. 3 is attached to atoms No. 2 and No. 4; atom No. 4 is attached to atoms No. 3, No. 5, and No. 6; atom No. 5 is attached to atoms No. 1, No. 2, and No. 4; atom No. 6 is attached only to atom No. 4.
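The connectivity of the fig. 3 example can be written as an adjacency list. The sketch assumes the symmetric reading in which atom No. 4 is bonded to atoms No. 3, No. 5, and No. 6, which is what the entries for atoms No. 3, No. 5, and No. 6 imply; the helper names are illustrative.

```python
# Adjacency list of the six heavy atoms in the fig. 3 example.
ADJACENCY = {
    1: [2, 5],
    2: [1, 3, 5],
    3: [2, 4],
    4: [3, 5, 6],
    5: [1, 2, 4],
    6: [4],
}


def is_symmetric(adjacency):
    """Check that every bond v-w is also recorded as w-v."""
    return all(v in adjacency[w] for v, nbrs in adjacency.items() for w in nbrs)


def undirected_edges(adjacency):
    """Return each chemical bond once, as a sorted (low, high) pair."""
    return sorted({tuple(sorted((v, w))) for v in adjacency for w in adjacency[v]})


print(is_symmetric(ADJACENCY))
print(undirected_edges(ADJACENCY))
```

A symmetry check of this kind is a cheap way to catch transcription errors in a hand-written connection table before it is passed to a graph neural network.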

In one example, the extraction process of the GNN molecular structure features is as follows:

(1) $h^0_{vw} = \mathrm{ReLU}(W_i \, \mathrm{cat}(x_v, e_{vw}))$, where $x_v$ is the feature vector of the $v$-th atom, $e_{vw}$ is the feature vector of the chemical bond from the $v$-th atom to the $w$-th atom, $W_i$ is the initial parameter matrix, ReLU is the activation function, and $h^0_{vw}$ is the hidden state of bond $vw$ at step 0;

(2) $m^{t+1}_{vw} = \sum_{k \in \{N(v) \setminus w\}} h^t_{kv}$, where $m^{t+1}_{vw}$ is the message for bond $vw$ at step $t+1$ and $N(v)$ is the set of atoms attached to atom $v$;

(3) $h^{t+1}_{vw} = \mathrm{ReLU}(h^0_{vw} + W_m m^{t+1}_{vw})$;

(4) $t \in \{1, 2, 3, \ldots, T\}$, $m_v = \sum_{k \in N(v)} h^T_{kv}$, where $T$ is the number of message-passing steps;

(5) $h_v = \mathrm{ReLU}(W_a \, \mathrm{cat}(x_v, m_v))$, where $W_a$ is a parameter matrix obtained by training the neural network;

(6) $H_{structure} = \sum_{v \in G} h_v$, where $G$ is the set of all atoms of the compound and $H_{structure}$ is the molecular structure feature vector extracted by the GNN.
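Steps (1) to (6) can be sketched in plain Python on a toy three-atom chain. This is not the trained network of the invention: the weight matrices, feature values, and function names are arbitrary illustrations, and the sketch reads step (4) as summing the final states of the bonds entering atom v, which is an assumption about the intended indices.

```python
def relu(vec):
    return [max(0.0, x) for x in vec]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]


def dmpnn_readout(atom_feats, bond_feats, neighbors, W_i, W_m, W_a, T):
    """Directed-bond message passing followed by a sum readout."""
    zero = [0.0] * len(W_i)
    # Step (1): initial hidden state of every directed bond v -> w.
    h0 = {(v, w): relu(matvec(W_i, atom_feats[v] + bond_feats[frozenset((v, w))]))
          for v in neighbors for w in neighbors[v]}
    h = dict(h0)
    for _ in range(T):
        new_h = {}
        for (v, w) in h:
            # Step (2): sum states of bonds k -> v, excluding k == w.
            message = zero
            for k in neighbors[v]:
                if k != w:
                    message = vec_add(message, h[(k, v)])
            # Step (3): update with a skip connection to h0.
            new_h[(v, w)] = relu(vec_add(h0[(v, w)], matvec(W_m, message)))
        h = new_h
    h_structure = zero
    for v in neighbors:
        # Step (4): aggregate the final states of bonds entering atom v.
        m_v = zero
        for k in neighbors[v]:
            m_v = vec_add(m_v, h[(k, v)])
        # Steps (5) and (6): atom representation, then sum over atoms.
        h_structure = vec_add(h_structure, relu(matvec(W_a, atom_feats[v] + m_v)))
    return h_structure


# Toy chain 1-2-3 with 2-dim atom features and 1-dim bond features.
atom_feats = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 0.0]}
bond_feats = {frozenset((1, 2)): [1.0], frozenset((2, 3)): [0.5]}
neighbors = {1: [2], 2: [1, 3], 3: [2]}
W_i = [[0.5, 0.1, 0.2], [0.3, 0.4, 0.1]]            # hidden x (atom + bond)
W_m = [[0.2, 0.0], [0.0, 0.2]]                      # hidden x hidden
W_a = [[0.1, 0.2, 0.3, 0.0], [0.0, 0.1, 0.0, 0.3]]  # hidden x (atom + hidden)
h_structure = dmpnn_readout(atom_feats, bond_feats, neighbors, W_i, W_m, W_a, T=3)
print(h_structure)
```

Because the readout sums over atoms, the length of the resulting vector depends only on the hidden size, not on the number of atoms in the compound.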

One skilled in the art will appreciate that the atomic features and chemical bond features are not limited to the examples described above, and some of the features of the examples described above may be used to determine the molecular structure feature vector of the drug. In one example, other means, such as molecular fingerprinting, may also be employed, e.g., by the processor 220, to derive a molecular structure feature vector for a drug from the chemical structure of the drug, see block 150.

At block 155, a drug target feature vector for the drug is determined, e.g., by the processor 220, based on the pathway and characteristics targeted by the drug. In one example, the molecular compounds of anticancer drugs are classified according to their mechanism of action (MOA) and then encoded to give the drug target feature vector $F_{moa}$ of the drug. For example, suppose a cell has abnormal expression (mRNA level) of gene A due to a mutation in gene A, resulting in an abnormal level of protein A. A drug is considered to target gene A if its (small) molecular compound can specifically reverse the abnormality of protein A without causing abnormalities in the levels of other proteins. If gene A belongs biologically to signaling pathway A, then the drug is classified as a drug targeting pathway A signaling. For example, drugs used to regulate apoptosis may be classified as "apoptosis regulation" drugs, and drugs used to control cell division may be classified as "mitosis" drugs. A drug whose mechanism of action is unknown or does not belong to any existing class may be classified as "other". For the specific classification of drugs, see, for example, Francesco Iorio et al., "A Landscape of Pharmacogenomic Interactions in Cancer" (Cell, 2016).

In one example, according to the pathways and characteristics targeted by anticancer drugs, the present invention classifies the existing anticancer drugs into 21 classes (Table 1) and one-hot encodes each anticancer drug accordingly to obtain a 21-dimensional drug target feature vector $F_{moa}$. In the encoding, the dimension corresponding to the class to which a drug belongs is encoded as 1, and the dimensions corresponding to all other classes are encoded as 0.

ABL signaling	apoptosis regulation	cell cycle
chromatin histone acetylation	chromatin histone methylation	chromatin other
cytoskeleton	DNA replication	EGFR signaling
ERK MAPK signaling	Genome integrity	IGFR signaling
JNK and p38 signaling	metabolism	mitosis
other	p53 signaling	PI3K signaling
RTK signaling	TOR signaling	WNT signaling

Table 1. Classification of anticancer drugs
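The one-hot encoding over the 21 classes of Table 1 can be sketched as follows. The class order follows the table row by row; the function name and the fallback of unlisted mechanisms to "other" are illustrative assumptions.

```python
MOA_CLASSES = [
    "ABL signaling", "apoptosis regulation", "cell cycle",
    "chromatin histone acetylation", "chromatin histone methylation", "chromatin other",
    "cytoskeleton", "DNA replication", "EGFR signaling",
    "ERK MAPK signaling", "Genome integrity", "IGFR signaling",
    "JNK and p38 signaling", "metabolism", "mitosis",
    "other", "p53 signaling", "PI3K signaling",
    "RTK signaling", "TOR signaling", "WNT signaling",
]


def moa_feature_vector(moa_class):
    """Return the 21-dimensional one-hot drug target feature vector.

    Mechanisms that are unknown or not in the table fall back to
    "other" (an assumption about how such drugs would be handled).
    """
    if moa_class not in MOA_CLASSES:
        moa_class = "other"
    return [1 if name == moa_class else 0 for name in MOA_CLASSES]


f_moa = moa_feature_vector("mitosis")
print(f_moa)
```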

At block 160, the molecular structure feature vector and at least one of the genomic feature vector and the transcriptome feature vector are input, e.g., by processor 220, into a trained deep neural network to determine whether the drug is effective against the tumor. In one example, at least one of the genomic feature vector and the transcriptome feature vector is concatenated with at least the molecular structure feature vector to form a (one-dimensional) input feature vector, and the input feature vector is input into the trained deep neural network. The deep neural network may comprise, for example, a Feed-Forward Neural Network (FFNN) comprising an input layer, at least one hidden layer, and an output layer. The input layer receives the (one-dimensional) input feature vector. The output layer includes a Softmax classifier. The hyper-parameters of the feedforward neural network include the number of layers of the neural network, the number of nodes of each hidden layer (hidden size), and the dropout rate (the probability with which each hidden layer randomly switches off a node); these determine the structure of the neural network. In addition, the number of nodes of the input layer depends on the length of the input feature vector (i.e., the combined length of $F_{rna}$ and/or $F_{dna}$, $H_{structure}$ and, where used, $F_{moa}$). In a preferred embodiment, the hyper-parameters of the FFNN are optimized by Bayesian optimization (see Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., de Freitas, N., "Taking the Human Out of the Loop: A Review of Bayesian Optimization", Proc. IEEE 2016, 104, 148-175), with 50 Bayesian optimization iterations and the following search space: number of hidden layer nodes, 600 to 4800; number of hidden layers, 2 to 4; dropout rate, 0 to 0.5. The number of message-passing steps T in the GNN is also searched as a hyper-parameter, over the range 2 to 6.
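The concatenation step and the hyper-parameter ranges can be sketched as follows. The ranges are those stated in the text; the function names, the dictionary keys, and the order in which the vectors are concatenated are assumptions, and the sketch does not implement Bayesian optimization itself.

```python
# Search space as stated in the text (inclusive ranges).
SEARCH_SPACE = {
    "hidden_size": (600, 4800),
    "num_hidden_layers": (2, 4),
    "dropout": (0.0, 0.5),
    "message_passing_steps": (2, 6),
}


def build_input(h_structure, f_dna=None, f_rna=None, f_moa=None):
    """Concatenate the available feature vectors into one flat vector.

    At least one of f_dna and f_rna is required; f_moa is optional.
    The concatenation order is an illustrative assumption.
    """
    if f_dna is None and f_rna is None:
        raise ValueError("need a genomic or a transcriptome feature vector")
    parts = [p for p in (f_dna, f_rna, h_structure, f_moa) if p is not None]
    return [float(x) for part in parts for x in part]


def in_search_space(config):
    """Check that a candidate configuration lies within the stated ranges."""
    return all(low <= config[name] <= high
               for name, (low, high) in SEARCH_SPACE.items())


x = build_input([0.1, 0.2], f_dna=[1, 0, 1], f_moa=[0, 1])
print(len(x))  # this length fixes the number of input-layer nodes
```

An optimizer would propose configurations inside `SEARCH_SPACE`, train a network for each, and keep the configuration with the best validation result.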

In a preferred embodiment, the drug target feature vector is additionally input into the trained deep neural network to determine whether the drug is effective against the tumor. In one example, the Softmax classifier of the output layer of the FFNN makes a two-class prediction of whether a given drug is effective or ineffective against the tumor and outputs a prediction probability for each class, i.e., two values: the probability of class 1 (effective) and the probability of class 2 (ineffective), the sum of which equals 1.
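For reference, the standard two-class Softmax can be written as follows. The symbols z_1 and z_2, denoting the two raw outputs (logits) of the output layer, are introduced here for illustration only and do not appear elsewhere in this disclosure.

```latex
p_{1} = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}, \qquad
p_{2} = \frac{e^{z_2}}{e^{z_1} + e^{z_2}}, \qquad
p_{1} + p_{2} = 1
```

Here p_1 is the probability of class 1 (the drug is effective) and p_2 is the probability of class 2 (the drug is not effective).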

In a preferred embodiment, the input feature vector of each of the at least one drug is input into the trained deep neural network in turn to calculate the probability that each drug is effective against the tumor. More preferably, the probability of each of at least two drugs (and more preferably of all anti-cancer drugs) being effective against the tumor is calculated, and the effective probabilities of the at least two drugs are ranked to select the most effective drug. For example, after a physician surgically resects a patient's tumor, or a biopsy sample is taken from the patient's tumor, the patient's sample is sent to a sequencing company for high-throughput DNA and RNA sequencing to obtain the patient features, such as the genomic feature vector F_dna and/or the transcriptome feature vector F_rna described above. In one example, assuming that the efficacy of four drugs A, B, C, and D is to be predicted, the features of drug A, drug B, drug C, and drug D (including at least the molecular structure feature vector H_structure) are in turn concatenated with the patient features (including the genomic feature vector F_dna and/or the transcriptome feature vector F_rna) to form a (one-dimensional) input feature vector, which is input into the trained deep neural network (e.g., into the feed-forward neural network FFNN). The probability that each drug is effective against the tumor is then given by the Softmax of the output layer of the network (e.g., the FFNN). The effective probabilities of the four drugs may then be ranked, e.g., by processor 220, to select the most effective drug. For example, if the effective probabilities of the drugs are ordered C>B>A>D, the most recommended drug is C, followed by drug B, while drug D is the least recommended.
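The ranking step can be illustrated with a short Python sketch. The probability values below are made up for the example, and the function name `rank_drugs` is hypothetical; in practice each value would be the "effective" probability produced by the Softmax output for the corresponding drug.

```python
def rank_drugs(effective_probability):
    # Sort drug names by predicted probability of being effective, highest first.
    return sorted(effective_probability, key=effective_probability.get, reverse=True)

# Hypothetical Softmax "effective" probabilities for four candidate drugs.
probabilities = {"A": 0.41, "B": 0.68, "C": 0.90, "D": 0.12}

ranking = rank_drugs(probabilities)
print(ranking)  # ['C', 'B', 'A', 'D']
```

With these values the first entry, drug C, would be the most recommended drug and the last entry, drug D, the least recommended.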

In one example, the deep neural network shown in FIG. 4 is trained using Stochastic Gradient Descent (SGD), with patients having different tumors and the pharmacodynamic results corresponding to different drugs as training data and binary cross entropy as the loss function. For example, drug A is known to be effective against tumor a. The drug features of drug A and the patient features of tumor a are then extracted separately according to the method described above. The combined features are input into the FFNN shown in FIG. 4, and, after the network computation, the Softmax of the output layer gives the probability that drug A is effective against tumor a and the probability that drug A is ineffective against tumor a. If the calculated probabilities are far from the true result, e.g., Softmax outputs a probability of 0.3 that drug A is effective and 0.7 that it is ineffective, which is inconsistent with the true drug effect, then the network parameters, e.g., W_a in the GNN, are modified (the manner of adjusting the parameters is determined by SGD) so that the calculated result is as consistent as possible with the true result, i.e., the prediction deviation is less than a predetermined threshold. The next drug-tumor pair is then fed into the network to continue training, until the prediction deviation is smaller than the predetermined threshold after all drug-tumor pairs have been processed by the network. The prediction deviation is measured by the cross-entropy loss function: for each group of training data, the prediction deviation, i.e., the loss, is calculated by the cross-entropy function. Training ends when, after the parameters have been adjusted over multiple iterations, the model loss can no longer be reduced.
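The 0.3 / 0.7 example above can be worked through numerically. The sketch below computes the cross-entropy loss for that prediction and applies one gradient-descent update. As a deliberate simplification, the update is applied directly to the two output logits rather than to the network weights (in a real network the same gradient would be backpropagated to parameters such as W_a); the learning rate of 0.5 is an arbitrary assumption.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    # Loss for one training pair: negative log-probability of the true class.
    return -math.log(probs[target_index])

def sgd_step(logits, target_index, learning_rate):
    # For Softmax followed by cross entropy, d(loss)/d(logit_k) = p_k - y_k.
    probs = softmax(logits)
    return [z - learning_rate * (p - (1.0 if k == target_index else 0.0))
            for k, (z, p) in enumerate(zip(logits, probs))]

# Logits chosen so that Softmax gives 0.3 (effective) and 0.7 (ineffective).
logits = [math.log(0.3), math.log(0.7)]
target = 0  # drug A is known to be effective, so class 0 is the true label

loss_before = cross_entropy(softmax(logits), target)
logits = sgd_step(logits, target, learning_rate=0.5)
loss_after = cross_entropy(softmax(logits), target)

print(round(loss_before, 3))     # 1.204, i.e. -ln(0.3)
print(loss_after < loss_before)  # True
```

The loss before the update equals -ln(0.3), and a single step in the direction of the negative gradient moves the predicted probability of the true class upward, which is the behavior the training loop relies on.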

Fig. 2 shows a schematic view of a medication prediction analysis device 200 according to another embodiment of the invention. As shown in fig. 2, the medication prediction analysis device 200 may include a memory 210 and a processor 220. Memory 210 has stored thereon executable instructions. The processor 220 may be configured to execute the executable instructions to perform the medication prediction method 100 shown in fig. 1. Those skilled in the art will appreciate that all of the above functions implemented by the processor 220 may be implemented by a single processor or may be implemented by multiple processors, respectively.

There is also provided in accordance with yet another embodiment of the present invention a machine-readable storage medium having stored thereon executable instructions, wherein the executable instructions, when executed by a machine, cause the machine to perform the medication prediction method 100 illustrated in fig. 1.

There is also provided, in accordance with yet another embodiment of the present invention, a computer program product including executable instructions, wherein the executable instructions, when executed by a machine, cause the machine to perform the medication prediction method 100 illustrated in fig. 1.

There is also provided in accordance with yet another embodiment of the present invention a medication prediction analysis device including a receiving unit. In one example, the receiving unit is configured to receive genomic sequencing data for a tumor sample and a normal sample of a subject. In another example, the receiving unit is further configured to receive transcriptome sequencing data for the tumor sample of the subject.

In a preferred embodiment, the medication prediction analysis device further comprises a patient characteristic determination unit. In one example, the patient feature determination unit is configured to perform mutation detection on the genomic sequencing data of the tumor sample with reference to the genomic sequencing data of the normal sample to determine a genomic feature vector of a tumor. In another example, the patient characteristic determination unit is further configured to derive gene expression values for the tumor from the transcriptome sequencing data to determine a transcriptome characteristic vector for the tumor.

In a preferred embodiment, the drug prediction analysis device further comprises a drug target feature determination unit configured to determine a drug target feature vector of the drug according to the targeted pathway and feature targeted by the drug.

In a preferred embodiment, the medication prediction analysis device further comprises an analysis unit. In one example, the analysis unit is configured to derive a molecular structure feature vector of a drug from a chemical structure of the drug. Preferably, the analysis unit is further configured to input at least one of the genomic feature vector and the transcriptome feature vector and the molecular structure feature vector into a trained deep neural network to determine whether the drug is effective against the tumor.

In another example, the analysis unit is further configured to input a compound molecular graph of a molecular compound for the drug into a Graph Neural Network (GNN) in the deep neural network and sequentially extract features on each atom and chemical bond in the molecular compound according to the connection relationship between the atoms to calculate the molecular structure feature vector, wherein the compound molecular graph contains information on the atom features, the chemical bond features, and the connection relationship between the atoms.
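The idea of propagating atom features along the connection relationship can be sketched in Python as follows. This is a strongly simplified, hypothetical illustration and not the GNN of this disclosure: learned weight matrices, chemical bond features, and nonlinearities are omitted, each step merely averages an atom's state with the sum of its neighbors' states, and the molecule-level vector is obtained by summing the atom states.

```python
def message_passing(atom_features, bonds, steps):
    # atom_features: one feature list per atom; bonds: (i, j) atom index pairs.
    neighbors = {i: [] for i in range(len(atom_features))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    hidden = [list(f) for f in atom_features]
    for _ in range(steps):
        updated = []
        for i, h in enumerate(hidden):
            # Sum the neighbor states, then average with the atom's own state.
            msg = [sum(hidden[n][k] for n in neighbors[i]) for k in range(len(h))]
            updated.append([0.5 * (a + b) for a, b in zip(h, msg)])
        hidden = updated

    # Readout: sum the atom states into one molecule-level feature vector.
    return [sum(h[k] for h in hidden) for k in range(len(hidden[0]))]

# Heavy atoms of ethanol (C-C-O) with toy features [is_carbon, is_oxygen].
atoms = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bonds = [(0, 1), (1, 2)]

h_structure = message_passing(atoms, bonds, steps=2)
print(h_structure)  # [3.0, 1.25]
```

The `steps` argument plays the role of the number of information transfer steps T; with more steps, each atom's state reflects atoms that are further away in the molecular graph.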

In a preferred embodiment, the analysis unit is further configured to input the input feature vectors for each of the at least two drugs into the trained deep neural network in turn to calculate a probability of each drug being effective against the tumor, and rank the effective probabilities of the at least two drugs to select the most effective drug.

The detailed description set forth above in connection with the appended drawings describes exemplary embodiments but does not represent all embodiments that may be practiced or fall within the scope of the claims. The term "exemplary" used throughout this specification means "serving as an example, instance, or illustration," and does not mean "preferred" or "advantageous" over other embodiments. The detailed description includes specific details for the purpose of providing an understanding of the described technology. However, the techniques may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described embodiments.

The "computer program" or "executable instructions" referred to throughout this disclosure may be stored or distributed on a machine-readable or computer-readable medium, or may be distributed over a network or otherwise.

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A computer-implemented medication prediction method (100), the method (100) comprising:

receiving a genomic feature vector (130) derived from a tumor sample of a subject, the genomic feature vector representing genetic mutation information of the tumor;

receiving a transcriptome feature vector (140) derived from the tumor sample, the transcriptome feature vector representing gene expression levels of the tumor;

deriving a molecular structure feature vector (150) for a drug based on the chemical structure of the drug; and

inputting at least one of the genomic feature vector and the transcriptome feature vector and the molecular structure feature vector into a trained deep neural network to determine whether the drug is effective against the tumor (160).

2. The method (100) of claim 1, further comprising:

receiving respective genomic sequencing data (110) for the tumor sample and a normal sample of the subject; and

receiving transcriptome sequencing data for the tumor sample of the subject (120);

wherein the genomic feature vector is determined by mutation detection of genomic sequencing data of the tumor sample with reference to genomic sequencing data of the normal sample, and

wherein the transcriptome feature vector is determined from gene expression values of the tumor derived from the transcriptome sequencing data.

3. The method (100) according to claim 1 or 2, further comprising:

determining a drug target feature vector (155) of the drug according to the targeted pathway and characteristics to which the drug is directed; and

additionally inputting the drug target feature vector into the trained deep neural network to determine whether the drug is effective against the tumor.

4. The method (100) according to claim 3, wherein the drug target feature vector is obtained by classifying the drug according to its molecular compound's mechanism of action and then encoding it.

5. The method (100) according to claim 1 or 2, wherein the molecular structure feature vector is calculated from the characteristics of atoms, chemical bonds and connections between atoms in the molecular compound of the drug,

wherein the atomic feature is at least one feature selected from the group consisting of: atomic species, atomic mass, chirality, aromaticity, type of hybrid orbital, number of chemical bonds associated therewith, number of hydrogen atoms associated therewith, and formal charge carried by the atom;

wherein the chemical bond characteristic is at least one characteristic selected from the group consisting of: type of chemical bond, conjugation, cyclization, and steric properties; and

wherein the linkage between the atoms indicates the interconnection between the atoms in the molecular compound.

6. The method (100) of claim 5, wherein the deriving comprises inputting a compound molecular graph for a molecular compound of the drug into a Graph Neural Network (GNN) in the deep neural network and sequentially extracting features on each atom and chemical bond in the molecular compound according to the connections between the atoms to compute the molecular structure feature vector (145), wherein the compound molecular graph contains information about the atom features, chemical bond features, and connections between atoms.

7. The method (100) according to claim 1 or 2, wherein the inputting comprises:

combining and concatenating at least one of the genomic feature vector and the transcriptome feature vector with at least the molecular structure feature vector to form an input feature vector; and

inputting the input feature vector into the trained deep neural network to determine whether the drug is effective against the tumor.

8. The method (100) of claim 7, wherein the inputting further comprises:

inputting the input feature vector of each of the at least one drug into the trained deep neural network in turn to calculate a probability that each drug is effective against the tumor.

9. The method (100) of claim 8, wherein the at least one drug includes at least two drugs,

wherein the inputting further comprises: ranking the probability of effectiveness of the at least two drugs to select the most effective drug.

10. The method (100) of claim 3, wherein the inputting comprises:

combining and concatenating at least one of the genomic feature vector and the transcriptome feature vector with the molecular structure feature vector and the drug target feature vector to form an input feature vector; and

11. The method (100) according to claim 1 or 2,

wherein the deep neural network is trained using different tumors and the drug efficacy results corresponding to different drugs;

wherein the deep neural network is trained by stochastic gradient descent using binary cross entropy as the loss function.

12. The method (100) according to claim 1 or 2, wherein the deep neural network comprises a feed-forward neural network (FFNN), wherein an output layer of the feed-forward neural network comprises a Softmax classifier, wherein the hyper-parameters of the feed-forward neural network comprise a number of neural network layers, a number of nodes per hidden layer of the neural network, and a dropout rate, and wherein the number of nodes of the input layer of the feed-forward neural network depends on the length of the input feature vector.

13. A medication prediction analysis device (200) comprising:

a memory (210) having executable instructions stored thereon; and

a processor (220); characterized in that the processor is configured to execute the executable instructions to perform the steps of the method (100) according to any one of claims 1-12.

14. A machine-readable storage medium having stored thereon executable instructions, which when executed by a machine, cause the machine to perform the steps of the method (100) according to any one of claims 1-12.

15. A computer program product comprising executable instructions, characterized in that the executable instructions, when executed on a processor, cause the processor to perform the steps of the method (100) according to any one of claims 1-12.