CN111429968A

CN111429968A - Method, electronic device, and computer storage medium for predicting tumor type

Info

Publication number: CN111429968A
Application number: CN202010166919.3A
Authority: CN
Inventors: 姚鸣; 张鹏; 王凯
Original assignee: Shanghai Zhiben Medical Laboratory Co ltd; Origimed Technology Shanghai Co ltd
Current assignee: Shanghai Zhiben Medical Laboratory Co ltd; Origimed Technology Shanghai Co ltd
Priority date: 2020-03-11
Filing date: 2020-03-11
Publication date: 2020-07-17
Anticipated expiration: 2040-03-11
Also published as: CN111429968B

Abstract

The present disclosure relates to a method, an electronic device, and a computer storage medium for predicting a tumor type. The method comprises: acquiring feature information about a tumor to be tested; acquiring alignment result information for the genome sequencing sequence of a test sample of the tumor to be tested against a reference genome sequence; generating mutation type data on a plurality of predetermined mutation types based on the alignment result information; generating input data for a prediction model based on the feature information and the mutation type data; and extracting feature values of the input data via the prediction model to predict the type of the tumor to be tested based on the extracted feature values, the prediction model being generated by training a machine learning model on a plurality of training samples. The present disclosure can improve the accuracy of predicting the tumor type of the primary site.

Description

Method, electronic device, and computer storage medium for predicting tumor type

Technical Field

The present disclosure relates generally to biological information processing, and in particular, to methods, electronic devices, and computer storage media for predicting tumor type.

Background

Determining the primary site of a cancer is a main basis for guiding clinical treatment. Conventional approaches for predicting tumor type are mainly histological, such as immunohistochemistry-based assessment and cross-sectional imaging of high-quality tumor tissue. Clinical treatment of cancer is closely related to the site of origin, the histopathological subtype, and the stage of the tumor. However, traditional histology-based tumor type prediction is challenging in many cases, especially for metastatic or poorly differentiated tumors, and it is sometimes difficult to determine the tumor type of the primary site unambiguously and accurately. An ambiguous or incorrect classification of tumor type may negatively affect the choice of treatment and the therapeutic effect.

In summary, the conventional scheme for predicting tumor type has the disadvantage that it is difficult to clearly and accurately determine the type of tumor at the primary site.

Disclosure of Invention

The present disclosure provides a method, an electronic device, and a computer storage medium for predicting a tumor type, which can improve accuracy of predicting a type of a tumor of a primary site.

According to a first aspect of the present disclosure, a method for predicting a tumor type is provided. The method comprises: acquiring feature information about a tumor to be tested; acquiring alignment result information for the genome sequencing sequence of a test sample of the tumor to be tested against a reference genome sequence; generating mutation type data on a plurality of predetermined mutation types based on the alignment result information; generating input data for a prediction model based on the feature information and the mutation type data; and extracting feature values of the input data via the prediction model to predict the type of the tumor to be tested based on the extracted feature values, the prediction model being generated by training a machine learning model on a plurality of training samples.

According to a second aspect of the present disclosure, there is also provided a computing device comprising: a memory configured to store one or more computer programs; and a processor coupled to the memory and configured to execute the one or more programs to cause the computing device to perform the method of the first aspect of the disclosure.

According to a third aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium. The non-transitory computer readable storage medium has stored thereon machine executable instructions which, when executed, cause a machine to perform the method of the first aspect of the disclosure.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.

Drawings

Fig. 1 shows a schematic diagram of a system 100 for implementing a method of predicting a tumor type according to an embodiment of the present disclosure;

Fig. 2 shows a flow diagram of a method 200 for predicting a tumor type according to an embodiment of the present disclosure;

Fig. 3 shows a schematic diagram of a prediction model 300 constructed based on a random forest model;

Fig. 4 shows a flow diagram of a method 400 for generating second data about genetic variations, in accordance with an embodiment of the present disclosure;

Fig. 5 shows a flow diagram of a method 500 for generating input data for a prediction model, in accordance with an embodiment of the present disclosure;

Fig. 6 shows a flow diagram of a method 600 for generating input data, in accordance with an embodiment of the present disclosure; and

Fig. 7 shows a block diagram of an electronic device 700 suitable for implementing embodiments of the present disclosure.

Like or corresponding reference characters designate like or corresponding parts throughout the several views.

Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object.

As mentioned above, with conventional histology-based tumor type prediction it is difficult to determine the tumor type of the primary site clearly and accurately for metastatic or poorly differentiated tumors, even though determining the primary site of a cancer is a main basis for guiding clinical treatment. Conventional methods for predicting tumor type therefore do not provide an accurate basis for cancer diagnosis and treatment.

It is understood that even if a tumor patient carries certain sensitizing mutations that can guide molecular targeted therapy, the clinical response is often associated with the primary site of the tumor. For example, the BRAF amino acid mutation V600E arises in tumors of many tissue sites, and its effect on the response to RAF inhibitors varies with the tumor type. Therefore, genome sequencing results that can accurately indicate the primary site of the tumor help provide an accurate basis for cancer diagnosis and treatment.

To address, at least in part, one or more of the above issues and other potential issues, an example embodiment of the present disclosure proposes a scheme for predicting a tumor type. The scheme comprises: acquiring feature information about a tumor to be tested; acquiring alignment result information for the genome sequencing sequence of a test sample of the tumor to be tested against a reference genome sequence; generating mutation type data on a plurality of predetermined mutation types based on the alignment result information; generating input data for a prediction model based on the feature information and the mutation type data; and extracting feature values of the input data via the prediction model to predict the type of the tumor to be tested based on the extracted feature values, the prediction model being generated by training a machine learning model on a plurality of training samples.

In the above scheme, the present disclosure generates input data based on the feature information of the tumor to be tested and on multiple genomic mutation features determined through genome sequencing alignment, extracts features of the input data via a prediction model trained on samples, and predicts the type of the tumor to be tested, which improves the accuracy of predicting the tumor type of the primary site. This is because, on the one hand, mutations accumulate in the DNA and form a history of tumor evolution that is not affected by the local metastatic environment, which helps indicate the primary site of the tumor more accurately; on the other hand, the scheme jointly considers the contribution to tumor type prediction of the feature information determined through traditional immunohistochemistry and clinical evaluation and the contribution of the multiple genomic mutation types obtained through DNA sequencing alignment, which helps improve the accuracy of predicting the tumor type of the primary site.

Fig. 1 shows a schematic diagram of a system 100 for implementing a method of predicting a tumor type according to an embodiment of the present disclosure. As shown in fig. 1, the system 100 includes: a data acquisition unit 112, a data conversion module 114, and a prediction model 116. In some embodiments, the system 100 further comprises: an alignment unit 110, a bioinformatics server 140, a network 150, and a server 120.

In some embodiments, the data acquisition unit 112, the data conversion module 114, and the prediction model 116 may be configured on one or more computing devices 130, and the alignment unit 110 may be independent of the computing device 130. The computing device 130 may interact with the alignment unit 110, the bioinformatics server 140, and the server 120 in a wired or wireless manner (e.g., via the network 150).

The computing device 130 is used to predict the type of a tumor via a prediction model, based on the acquired feature information of the tumor to be tested and on mutation type data generated from genome sequencing information of the test sample. In some embodiments, the computing device 130 may have one or more processing units, including special-purpose processing units such as GPUs, FPGAs, and ASICs, and general-purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device.

In some embodiments, the computing device 130 may obtain from the bioinformatics server 140, via the network 150, alignment result information for the genomic (DNA) sequencing sequences of a predetermined number (for example, without limitation, ten thousand) of tumor patients (for example, patients whose primary-site tumor type is known) against a reference genome sequence, for use in forming the training sample data for training the prediction model 116. In some embodiments, the computing device 130 may also be configured with a general processing pipeline, so that the alignment result information for the genome sequencing sequence of the test sample of the tumor to be tested against the reference genome sequence is generated via the configured pipeline from the output of the alignment unit 110. The computing device 130 may also obtain, directly from the server 120 via the network 150, feature information about the tumor to be tested and attribute information about the patient (i.e., the subject to which the tumor to be tested belongs). The feature information of the tumor to be tested is, for example, stage type information of the tumor determined through conventional immunohistochemistry and clinical evaluation. The attribute information of the patient is, for example, the patient's age and gender. In some embodiments, the feature information about the tumor to be tested and the attribute information of the patient may be input directly at the computing device 130.

The data acquisition unit 112 is used to acquire feature information about the tumor to be tested and alignment result information for the genome sequencing sequence of the test sample of the tumor to be tested against the reference genome sequence. In some embodiments, the data acquisition unit 112 is further configured to acquire attribute information about the subject (i.e., the patient) to which the tumor to be tested belongs. The data acquisition unit 112 transmits the acquired feature information, alignment result information, and attribute information to the data conversion module 114.

The data conversion module 114 is used to generate mutation type data on a plurality of predetermined mutation types based on the acquired alignment result information, and to convert the acquired feature information and attribute information, together with the generated mutation type data, into input data for the prediction model 116. In some embodiments, the data conversion module 114 may first perform a preliminary filtering of the acquired alignment result information and then generate the mutation type data based on the filtered alignment result information. This improves the reliability of the mutation data.

The mutation types mentioned above include, for example: synonymous mutation (same-sense mutation), missense mutation, nonsense mutation, stop codon mutation (termination codon mutation), copy number variation, and gene fusion. A synonymous mutation is one in which a base substitution changes a codon into another codon, but, because of the degeneracy of the genetic code, the amino acid encoded before and after the change is the same, so no mutation effect actually occurs. A missense mutation is one in which a base pair substitution changes a codon of the mRNA into a codon encoding a different amino acid. A missense mutation may cause structural and functional abnormality of a protein or enzyme in the body, resulting in disease. A nonsense mutation is one in which a codon encoding an amino acid is mutated into a stop codon, so that polypeptide chain synthesis terminates prematurely and yields a polypeptide fragment with no biological activity. A stop codon mutation is one in which a stop codon in the gene is mutated into a codon encoding an amino acid. Copy number variation is caused by genomic rearrangement and generally means an increase or decrease in the copy number of large genomic fragments of 1 kb or more in length, mainly appearing as deletions and duplications at the sub-microscopic level. A gene fusion is a chimeric gene formed by joining the coding regions of two or more genes end to end and placing them under the control of the same set of regulatory sequences (including a promoter, enhancer, ribosome binding sequence, and terminator). The expression product of the fusion gene is a fusion protein.

The prediction model 116 is used to extract feature values of the input data generated by the data conversion module 114 in order to predict the type of the tumor to be tested. The prediction model 116 is generated by training a machine learning model on a plurality of training samples. The prediction model 116 may be constructed based on a random forest model or on a deep learning network model.

A method for predicting a tumor type according to an embodiment of the present disclosure will be described below in conjunction with fig. 2. Fig. 2 shows a flow diagram of a method 200 for predicting a tumor type according to an embodiment of the present disclosure. It should be understood that the method 200 may be performed, for example, at the electronic device 700 depicted in fig. 7, or at the computing device 130 depicted in fig. 1. It should also be understood that the method 200 may include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 202, the computing device 130 obtains feature information about the tumor to be tested. In some embodiments, the feature information comprises, for example, stage type information about the tumor to be tested. In some embodiments, the computing device 130 may further obtain attribute information of the subject to which the tumor to be tested belongs.

Medically, staging of a tumor (or staging of a cancer) classifies the tumor, for example via traditional immunohistochemical and/or clinical assessment, based on factors such as tumor size, whether it has invaded neighboring organs, how many nearby lymph nodes the cancer cells have spread to, and whether it is present at a distant site (distant metastasis). For example, TNM staging in medicine includes stages one to four: stage one is the early stage; stages two and three are the middle stage, with stage two being early middle stage and stage three being late middle stage; and stage four is the late stage, i.e., extensive metastasis.

In some embodiments, the stage type information about the tumor is, for example, a five-dimensional feature vector generated from how the stage of the tumor to be tested is distributed over five stage types: stage one, stage two, stage three, stage four, and unknown. For example, if the tumor to be tested is at stage three, the five-dimensional feature vector is, for example, (0,0,1,0,0). That is, the feature value corresponding to stage three is "1", and the feature values corresponding to stage one, stage two, stage four, and the unknown stage type are each "0". If the tumor to be tested is at stages two and three, the five-dimensional feature vector is, for example, (0,1,1,0,0).
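
The stage encoding described above can be illustrated with a minimal sketch. The stage labels and the function name `encode_stage` are assumptions made for illustration and do not appear in the disclosure; only the five-slot layout and the two example vectors come from the text.

```python
# Order of the five slots: stage one, two, three, four, unknown.
# The label strings are hypothetical; the disclosure only fixes the ordering.
STAGE_TYPES = ("I", "II", "III", "IV", "unknown")


def encode_stage(stages):
    """Return a five-dimensional 0/1 vector marking which stage types apply."""
    observed = set(stages)
    unexpected = observed - set(STAGE_TYPES)
    if unexpected:
        raise ValueError(f"unrecognized stage types: {sorted(unexpected)}")
    return tuple(1 if stage in observed else 0 for stage in STAGE_TYPES)


print(encode_stage(["III"]))        # (0, 0, 1, 0, 0)
print(encode_stage(["II", "III"]))  # (0, 1, 1, 0, 0)
```

Allowing more than one slot to be set mirrors the second example in the text, where a tumor recorded at stages two and three yields (0,1,1,0,0).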

The attribute information about the subject to which the tumor to be tested belongs is the attribute information of the tumor patient. In some embodiments, the attribute information includes, for example, at least one of age information and gender information about that subject. Regarding gender information, some tumors, such as breast cancer, are clearly associated with gender, so using gender information as one of the inputs of the prediction model helps predict tumor types with high gender correlation accurately. Regarding age information, the computing device 130 may generate a feature value for the age information according to which of several predetermined age ranges the subject of the test sample belongs to. For example, the predetermined age ranges include: under 25 years old, 25 to 50 years old, 50 to 75 years old, and over 75 years old. If the subject of the test sample is 45 years old, the feature value for the age information is, for example, (0,1,0,0). In this way, differences between age groups in the occurrence, development, and change of tumors can be taken into account when predicting the tumor type without placing an excessive burden on data processing.
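
As a sketch of the age binning, the following assumes that an age falling exactly on a boundary (25, 50, or 75) is placed in the higher range; the text does not say how boundary ages are handled, and the name `encode_age` is hypothetical.

```python
import bisect

# Boundaries between the four age ranges named in the text.
# Assumption: a boundary age (25, 50, 75) is assigned to the higher range.
AGE_BOUNDARIES = (25, 50, 75)


def encode_age(age):
    """Return a one-hot vector over the four predetermined age ranges."""
    if age < 0:
        raise ValueError("age must be non-negative")
    index = bisect.bisect_right(AGE_BOUNDARIES, age)
    return tuple(1 if i == index else 0 for i in range(len(AGE_BOUNDARIES) + 1))


print(encode_age(45))  # (0, 1, 0, 0)
```

Using a coarse one-hot range rather than the raw age keeps the feature small, which matches the stated aim of not adding an excessive data processing burden.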

At block 204, the computing device 130 obtains alignment result information for the genome sequencing sequence of the test sample of the tumor to be tested against the reference genome sequence. It should be understood that the sequencing protocol used for the training samples should be consistent with the sequencing protocol used for the test sample.

In some embodiments, the test sample may be a tumor tissue sample or a blood sample of the subject to be tested. For example, after a tissue and/or blood sample of the subject is collected, the DNA of the collected sample may be obtained and then randomly sampled to generate the sequencing sequence of the test sample, e.g., by genome sequencing. The computing device 130 then aligns the genome sequencing sequence of the test sample with a reference genome sequence (e.g., the human hg19 reference genome) to generate the alignment result information.

The sequencing sequence is obtained, for example, via one of the following sequencing techniques: whole genome sequencing, whole exome sequencing, and probe sequencing of predetermined genes. In some embodiments, blood or tissue samples may be sequenced by different DNA sequencing means. For example, genomic (DNA) sequencing sequences can be obtained for tissue and blood samples using a clinical sequencing assay approved by the U.S. Food and Drug Administration (FDA), such as the MSK panel, whole exome sequencing (WES), or whole genome sequencing (WGS). Whole genome sequencing information contains the inherent associations between all genes and vital signs.

At block 206, the computing device 130 generates mutation type data for a plurality of predetermined mutation types based on the alignment result information. It is understood that DNA data serves as the underlying molecular data and carries a variety of information, including mutations, copy number alterations, and gene fusions. Further, DNA-level mutational signatures are associated with specific tumor types; e.g., APC loss-of-function mutations are commonly present in colorectal cancer, TMPRSS2-ERG fusions are commonly present in prostate cancer, and C > T substitutions are commonly present in cutaneous melanomas. In other types of cancer, a combination of genomic alterations typically occurs together; e.g., TP53 and CTNNB1 mutations typically co-occur in endometrial cancer. The absence of alterations that are highly prevalent in specific tumor types, such as KRAS mutations in pancreatic adenocarcinoma and gene fusions in certain tumors, may also provide important evidence for classifying specific tumor types. Therefore, it is advantageous to use the mutation type data of a plurality of predetermined mutation types as one basis for the input data of the prediction model, so as to fully consider the influence of the intrinsic association of DNA-level mutations on tumor characteristics. In some embodiments, generating mutation type data for a plurality of predetermined mutation types includes, for example, at least two of: generating first data regarding amino acid variation; generating second data regarding genetic variation; generating third data regarding copy number variation; and generating fourth data regarding fusion structural variation. In some embodiments, the mutation type data generated based on the alignment result information includes first data regarding amino acid variation, second data regarding genetic variation, third data regarding copy number variation, and fourth data regarding fusion structural variation.

Regarding the manner in which the first data regarding amino acid variation is generated, in some embodiments, it includes, for example: the computing device 130 obtains, based on the alignment result information, all amino acid variation information for each gene; and generates the first data regarding amino acid variation based on a comparison of that information with a predetermined set of amino acid variations, the predetermined set comprising a plurality of amino acid variations whose probability of occurrence is greater than or equal to a first predetermined probability threshold.

For example, the computing device 130 obtains all amino acid variation information for each gene of the tumor to be tested based on the alignment result information, and then determines whether each amino acid variation belongs to the common amino acid variation set. If an amino acid variation of a gene belongs to the common amino acid variation set, for example EGFR L858R, the feature value corresponding to EGFR L858R is expressed as, for example, "1"; if no amino acid variation of the gene is EGFR L858R, the feature value corresponding to EGFR L858R is expressed as, for example, "0".
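
A minimal sketch of this comparison follows. It assumes the probability of a variation is estimated as the fraction of cohort samples that carry it, which the text does not specify; the function names, the tiny cohort, and the 0.5 threshold are hypothetical and chosen only to keep the example small.

```python
from collections import Counter


def build_common_set(cohort, threshold):
    """Keep variations carried by at least `threshold` of the cohort samples.

    Assumption: frequency within a reference cohort stands in for the
    "probability" of an amino acid variation.
    """
    counts = Counter(v for sample in cohort for v in set(sample))
    total = len(cohort)
    return sorted(v for v, c in counts.items() if c / total >= threshold)


def encode_aa_variants(sample_variants, common_set):
    """One 0/1 feature per common variation: 1 if the sample carries it."""
    observed = set(sample_variants)
    return [1 if variant in observed else 0 for variant in common_set]


# Hypothetical four-sample cohort, for illustration only.
cohort = [
    {"EGFR L858R"},
    {"EGFR L858R", "KRAS G12D"},
    {"BRAF V600E", "KRAS G12D"},
    {"KRAS G12D"},
]
common = build_common_set(cohort, threshold=0.5)
print(common)                                                  # ['EGFR L858R', 'KRAS G12D']
print(encode_aa_variants(["EGFR L858R", "TP53 R175H"], common))  # [1, 0]
```

Variations outside the common set, such as the TP53 entry above, contribute nothing to the first data in this sketch.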

Regarding the manner in which the second data regarding genetic variation is generated, in some embodiments, it includes, for example: the computing device 130 determines the number of mutated sites on each gene based on the alignment result information; and generates the second data regarding genetic variation based on the number of mutated sites on each gene. For example, if probe sequencing data covers 450 genes, the second data regarding genetic variation may be 450 values, each representing how many sites of the corresponding gene are mutated. For example, if a gene has mutations at 3 sites, the feature value corresponding to that gene is 3. As another example, WES sequencing data covers about 20,000 genes, so the second data regarding genetic variation may contain about 20,000 feature values. The specific manner of generating the second data regarding genetic variation is described further below with reference to fig. 4 and is not detailed here.
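
The per-gene counting can be sketched as follows. The representation of a mutation as a (gene, position) pair, the four-gene panel, and the positions are assumptions for illustration; a real panel would list all of its genes (for example 450) in a fixed order.

```python
from collections import Counter

# Hypothetical four-gene panel standing in for a full probe panel.
PANEL_GENES = ("TP53", "EGFR", "KRAS", "APC")


def count_mutated_sites(mutations, genes=PANEL_GENES):
    """Return, for each panel gene in order, the number of distinct mutated sites.

    `mutations` is an iterable of (gene, position) pairs; repeated reports
    of the same site are counted once.
    """
    distinct_sites = set(mutations)
    per_gene = Counter(gene for gene, _position in distinct_sites)
    return [per_gene.get(gene, 0) for gene in genes]


# Positions are made-up integers, not real genomic coordinates.
calls = [("TP53", 101), ("TP53", 202), ("TP53", 303), ("EGFR", 404), ("TP53", 101)]
print(count_mutated_sites(calls))  # [3, 1, 0, 0]
```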

Regarding the manner in which the third data regarding copy number variation is generated, in some embodiments, it includes, for example: the computing device 130 determines, based on the alignment result information, whether at least one of an insertion fragment and a deletion fragment occurs for each gene, to generate the third data regarding copy number variation. Copy number variation (CNV) is an important component of genomic structural variation (SV). The mutation rate at CNV sites is much higher than that of single nucleotide polymorphisms (SNPs), and CNV is one of the important pathogenic factors of human disease. Therefore, generating the third data regarding copy number variation from the alignment result information of the test sample and using it as one of the inputs of the prediction model allows the tumor type of the primary site to be predicted more accurately. For example, the computing device 130 forms a two-dimensional feature matrix according to whether an insertion fragment occurs and whether a deletion fragment occurs for each gene. If neither an insertion nor a deletion occurs in a gene, the insertion and deletion feature values corresponding to that gene are both expressed as "0"; if an insertion occurs in a gene, the insertion feature value corresponding to that gene is expressed as, for example, "+2"; if a deletion occurs in a gene, the deletion feature value corresponding to that gene is expressed as, for example, "-2". The third data can thereby be generated in the form of a feature matrix.
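
One way to lay out the matrix described above is one row per gene with an insertion column and a deletion column. This layout, the function name, and the example genes are assumptions; only the values 0, +2, and -2 come from the text.

```python
def encode_cnv(genes, inserted, deleted):
    """Build a per-gene [insertion, deletion] feature matrix.

    Insertion column: +2 if an insertion fragment occurs, else 0.
    Deletion column:  -2 if a deletion fragment occurs, else 0.
    """
    return [
        [2 if gene in inserted else 0, -2 if gene in deleted else 0]
        for gene in genes
    ]


# Hypothetical genes and calls, for illustration only.
matrix = encode_cnv(
    genes=("ERBB2", "CDKN2A", "TP53"),
    inserted={"ERBB2"},
    deleted={"CDKN2A"},
)
print(matrix)  # [[2, 0], [0, -2], [0, 0]]
```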

Regarding the manner in which the fourth data regarding fusion structural variation is generated, in some embodiments, it includes, for example: the computing device 130 obtains structural variation information about fusions based on the alignment result information; and generates the fourth data regarding fusion structural variation based on a comparison of the fusions with a predetermined fusion set, the predetermined fusion set comprising a plurality of fusions whose probability of occurrence is greater than or equal to a second predetermined probability threshold. The predetermined fusion set is, for example, a common fusion set compiled, for example and without limitation, from a population of ten thousand people, and containing the gene fusions whose probability of occurrence is greater than or equal to the second predetermined probability threshold (e.g., 1%). The common fusion set thus includes a plurality of frequently occurring fusions. For example, if the computing device 130 determines, based on the alignment result information, that a fusion from the common fusion set is present in the test sample, the feature value corresponding to that fusion is expressed as, for example, "1"; if that fusion is not present in the test sample, the corresponding feature value is expressed as, for example, "0". The fourth data can thereby be generated in the form of a feature matrix. In this way, the influence of common gene fusions on tumor characteristics can be considered when predicting the tumor type.
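
The fusion comparison can be sketched in the same 0/1 style. Representing a fusion as an ordered pair of partner genes is an assumption, and the set below is a hypothetical three-entry stand-in for a common fusion set; TMPRSS2-ERG is the only entry taken from the text.

```python
# Hypothetical common fusion set; each fusion is an ordered (gene, gene) pair.
COMMON_FUSIONS = (("TMPRSS2", "ERG"), ("EML4", "ALK"), ("BCR", "ABL1"))


def encode_fusions(sample_fusions, common=COMMON_FUSIONS):
    """One 0/1 feature per common fusion: 1 if it is present in the sample."""
    observed = {tuple(pair) for pair in sample_fusions}
    return [1 if pair in observed else 0 for pair in common]


print(encode_fusions([("TMPRSS2", "ERG")]))  # [1, 0, 0]
```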

At block 208, the computing device 130 generates input data for the prediction model based on the feature information and the mutation type data. In some embodiments, the computing device 130 generates the input data based on the feature information, the mutation type data, and attribute information of the subject to which the tumor to be tested belongs. For example, the computing device 130 generates a feature input matrix for the prediction model 116 based on the age information and gender information of the patient with the tumor to be tested, the stage type information of the tumor, the first data regarding amino acid variation, the second data regarding genetic variation, the third data regarding copy number variation, and the fourth data regarding fusion structural variation. In this way, single point mutation data, small-fragment insertion or deletion data, long-fragment CNV data, and fusion data at the DNA level are converted into quantitative data and input to the prediction model, which makes it possible to account for the relationships among the various DNA-level variations and their joint influence on the tumor, improving both the accuracy of predicting the primary focus and the guidance for selecting a treatment. For example, if a lesion is found in the lung, the lung may not be its primary site, and the lesion may have metastasized from another site; treating only the lung may then give an unsatisfactory therapeutic effect, whereas accurately determining the primary focus and treating it can achieve a more significant therapeutic effect.
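
The assembly at block 208 can be sketched as a simple concatenation of the individual feature blocks into one row of the feature input matrix. The block order, the two-slot gender encoding, and the function name are assumptions; the dimension reduction mentioned elsewhere in the disclosure is not shown.

```python
def build_input_vector(age, gender, stage, aa, gene_counts, cnv, fusions):
    """Concatenate all feature blocks into one flat input row.

    `cnv` is a per-gene [insertion, deletion] matrix and is flattened row by row.
    The block order is an assumption; it only needs to match between
    training samples and the test sample.
    """
    cnv_flat = [value for row in cnv for value in row]
    return [*age, *gender, *stage, *aa, *gene_counts, *cnv_flat, *fusions]


row = build_input_vector(
    age=(0, 1, 0, 0),         # 25 to 50 years old
    gender=(1, 0),            # hypothetical two-slot one-hot encoding
    stage=(0, 0, 1, 0, 0),    # stage three
    aa=[1, 0, 0],             # first data: common amino acid variations
    gene_counts=[3, 1, 0, 0], # second data: mutated sites per gene
    cnv=[[2, 0], [0, -2]],    # third data: insertion/deletion per gene
    fusions=[1, 0, 0],        # fourth data: common fusions
)
print(len(row))  # 25
```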

A specific manner of generating the input data of the prediction model is described further below with reference to fig. 5 and is not detailed here.

At block 210, the computing device 130 extracts feature values of the input data via a predictive model generated via machine learning model training on a plurality of training samples to predict the type of tumor under test based on the extracted feature values.

With respect to the predictive model 116, in some embodiments, the predictive model 116 may be constructed based on a random forest model or a deep learning network model.

As for the input of the prediction model 116, for example, the computing device 130 combines the acquired feature information of the tumor, the mutation type data, and the attribute information of the subject to which the tumor to be detected belongs to generate the data to be processed, and then reduces the dimension of the data to be processed by similarity-based filtering and/or random sampling, so as to generate the input feature matrix of the prediction model 116. A specific method for generating the input data of the prediction model 116 is described further below with reference to figs. 5 and 6 and is not detailed here.

With respect to the training samples of the prediction model 116, the computing device 130 may collect in advance a certain amount (e.g., ten thousand) of DNA sequencing comparison result information for tumors of patients whose tumor type is known, and filter out the comparison result information of rare tumor types or of samples with low tumor content. It then generates genome-level mutation type data based on the filtered DNA sequencing comparison result information, and merges these data with the staging characteristic information of the tumors and the attribute information of the corresponding patients into a sample data set. In some embodiments, the computing device 130 uses three quarters of the sample data set to train the predictive model 116 and the remaining quarter to test the predictive model 116, for cross-validation.
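The three-quarters / one-quarter partition of the sample data set could be sketched as below. The function name `split_three_quarters`, the shuffling step, and the fixed seed are assumptions made for illustration; the text does not state how samples are assigned to the two parts.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_three_quarters(samples: Sequence[T],
                         seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and return (training part, test part) in a 3:1 ratio."""
    shuffled = list(samples)
    # Assumption: samples are shuffled before splitting; the seed makes it repeatable.
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 3) // 4
    return shuffled[:cut], shuffled[cut:]


# Integers stand in for the records of the sample data set.
train_part, test_part = split_three_quarters(range(100))
```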

Table one below illustrates the prediction output data for a sample to be tested with the sample ID 100010ASM 1L 1 (whose actual tumor type is CRC). The output data of the prediction model 116 lists a plurality of tumor types and gives a prediction probability for each; the prediction probabilities of the plurality of tumor types sum to 1. The three tumor types with the highest prediction probabilities are defined as the primary prediction, the secondary prediction, and the tertiary prediction. For example, as shown in table one, the primary, secondary, and tertiary predictions for the sample with the sample ID 100010ASM 1L 1 are CRC (prediction probability about 0.9346), GC (prediction probability about 0.0110), and PAC (prediction probability about 0.0107), respectively.

Table one

Table two illustrates the prediction output data for a sample to be tested with the sample ID 100155AZD 1L 1 (whose actual tumor type is PAC). As shown in table two, the primary, secondary, and tertiary predictions for the sample with the sample ID 100155AZD 1L 1 are ECC (prediction probability about 0.2578), PAC (prediction probability about 0.2345), and GBC (prediction probability about 0.1548), respectively.

Table two

As the results in tables one and two show, the actual tumor type at the primary site is included among the primary, secondary, and tertiary predictions output by the prediction model 116.
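Selecting the primary, secondary, and tertiary predictions from the per-type probabilities, and checking whether the actual type is among them, might look like the following sketch. The helper names `top_three` and `rank_of` are assumptions introduced here. Only the three probabilities quoted in the text for the second sample are used, so the dictionary is incomplete and does not sum to 1.

```python
from typing import Dict, List, Optional, Tuple


def top_three(probabilities: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return the three (tumor type, probability) pairs with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:3]


def rank_of(actual: str, probabilities: Dict[str, float]) -> Optional[int]:
    """Return 1, 2 or 3 if `actual` is the primary, secondary or tertiary prediction."""
    for position, (tumor_type, _) in enumerate(top_three(probabilities), start=1):
        if tumor_type == actual:
            return position
    return None


# Values quoted in the text for the sample whose actual type is PAC; the
# remaining rows of the table are not reproduced in the text and are omitted.
table_two = {"PAC": 0.2345, "GBC": 0.1548, "ECC": 0.2578}
```

For this sample, `rank_of("PAC", table_two)` places the actual type at the secondary prediction, matching the description.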

In the above scheme, input data generated from the characteristic information of the tumor to be detected and from the DNA-level mutation type data derived from the sequencing comparison results are input into a prediction model trained on samples, and the type of the tumor to be detected is predicted based on the features extracted from the input data. The accuracy of predicting the tumor type at the primary site can thereby be improved.

In some embodiments, the prediction model 116 may be constructed based on a random forest model. The random forest model is an ensemble learning (Ensemble Learning) model. Fig. 3 schematically shows a prediction model 300 constructed based on a random forest model. As shown in fig. 3, the prediction model 300 uses bootstrap aggregation (bagging) to perform random sampling 320 (e.g., sampling with replacement) on a raw data set (e.g., input data 310 about a lesion to be detected), so as to draw N (N is a natural number) new data sets (e.g., a first training sample 330-1, a second training sample 330-2, through an Nth training sample 330-N) for classifier training 340. A plurality of classifiers (e.g., a first classifier 350-1, a second classifier 350-2, through an Nth classifier 350-N) are trained on the first training sample 330-1, the second training sample 330-2, through the Nth training sample 330-N, respectively. The classification results of the plurality of classifiers are then combined, for example by majority voting over the classifier outputs, to produce the final output 360 (e.g., the predicted type of the tumor to be detected).
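The bagging scheme of fig. 3 (sampling with replacement, one classifier per resampled data set, majority vote) can be illustrated with the minimal sketch below. It is not the disclosed implementation: the base learner is a 1-nearest-neighbour stand-in chosen to keep the example short, whereas a random forest would fit a decision tree on each bootstrap sample, and the toy feature vectors are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]          # (feature vector, tumor-type label)
Classifier = Callable[[Sequence[float]], str]


def bootstrap(data: Sequence[Sample], rng: random.Random) -> List[Sample]:
    """Draw len(data) samples from `data` with replacement."""
    return [rng.choice(data) for _ in data]


def train_nearest_neighbour(train: Sequence[Sample]) -> Classifier:
    """Stand-in base learner (assumption): label of the closest training sample."""
    def predict(x: Sequence[float]) -> str:
        def squared_distance(sample: Sample) -> float:
            return sum((a - b) ** 2 for a, b in zip(sample[0], x))
        return min(train, key=squared_distance)[1]
    return predict


def bagging_predict(data: Sequence[Sample], x: Sequence[float],
                    n_classifiers: int = 25, seed: int = 0) -> str:
    """Train one classifier per bootstrap sample and return the majority vote."""
    rng = random.Random(seed)
    votes = Counter(
        train_nearest_neighbour(bootstrap(data, rng))(x)
        for _ in range(n_classifiers)
    )
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three binary features per sample, two tumor types.
toy_data: List[Sample] = [
    ([1, 0, 0], "CRC"), ([1, 1, 0], "CRC"), ([1, 0, 0], "CRC"), ([1, 1, 0], "CRC"),
    ([0, 0, 1], "PAC"), ([0, 1, 1], "PAC"), ([0, 0, 1], "PAC"), ([0, 1, 1], "PAC"),
]
```

Calling `bagging_predict(toy_data, [1, 0, 0])` returns the label that most of the 25 resampled classifiers vote for.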

Because the prediction model 300 generates a plurality of training samples by randomly sampling the original input data, it can process high-dimensional, sparse input data regarding DNA mutation information without feature selection, and it therefore adapts well to the input data and is robust to noise. In the training process, the prediction model 300 makes classification decisions through a plurality of mutually independent classifiers and then takes the type receiving the most votes among the classification results as the output data of the prediction model, which helps to detect interactions among the features, so that the prediction results of the prediction model of the present disclosure have higher accuracy and better generalization. Table three below, for example, illustrates the prediction results of the prediction model 300. As shown in table three, the first column gives the tumor type (or cancer type) predicted by the predictive model 300, and the following columns give the prediction probabilities for each tumor type.

Table three

A method for generating second data regarding genetic variation according to an embodiment of the present disclosure will be described below with reference to fig. 4. Fig. 4 shows a flowchart of a method 400 for generating second data regarding genetic variation, according to an embodiment of the present disclosure. It should be understood that method 400 may be performed, for example, at the electronic device 700 depicted in fig. 7, or at the computing device 130 depicted in fig. 1. It should be understood that method 400 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 402, the computing device 130 counts the number of mutations that occur at a site on each gene based on the alignment result information.

At block 404, the computing device 130 counts the maximum number of mutations that occur at sites on each gene across a predetermined number of subjects. For example, the maximum number is the maximum number of mutations of that gene observed in the data of ten thousand individuals.

At block 406, the computing device 130 generates the second data regarding genetic variation based on the number of site mutations on each gene of the subject to be tested and the counted maximum number of site mutations on each gene. For example, the computing device 130 divides the number of site mutations on each gene of the subject to be tested by the maximum number of site mutations on that gene, which keeps each dimension of the feature matrix for the second data on a stable scale.
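Blocks 402 to 406 amount to scaling each gene's mutation count by a cohort-wide maximum. A minimal sketch follows, under stated assumptions: the function name `normalized_gene_burden` is hypothetical, the gene symbols and counts are illustrative only, and the handling of a zero maximum and the clamp to 1.0 are choices made here that the text does not specify.

```python
from typing import Dict


def normalized_gene_burden(sample_counts: Dict[str, int],
                           cohort_max: Dict[str, int]) -> Dict[str, float]:
    """Divide each gene's mutation count by the maximum count seen in the cohort."""
    features: Dict[str, float] = {}
    for gene, max_count in cohort_max.items():
        count = sample_counts.get(gene, 0)   # genes without mutations count as 0
        if max_count > 0:
            # Assumption: clamp to 1.0 if a new sample exceeds the cohort maximum.
            features[gene] = min(count / max_count, 1.0)
        else:
            # Assumption: a gene never mutated in the cohort gets feature value 0.
            features[gene] = 0.0
    return features


# Illustrative counts only; not data from the disclosure.
example = normalized_gene_burden({"TP53": 2, "KRAS": 1},
                                 {"TP53": 4, "KRAS": 1, "EGFR": 3})
```

Every feature then lies between 0 and 1, which is one way to read the statement that the normalization keeps each dimension of the feature matrix stable.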

A method for generating input data for a predictive model according to an embodiment of the disclosure will be described below in conjunction with fig. 5. Fig. 5 shows a flow diagram of a method 500 for generating input data for a predictive model according to an embodiment of the disclosure. It should be understood that method 500 may be performed, for example, at the electronic device 700 depicted in fig. 7, or at the computing device 130 depicted in fig. 1. It should be understood that method 500 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 502, the computing device 130 generates data to be processed based on the feature information, the mutation type data, and attribute information of the subject to which the tumor to be detected belongs. For example, for a sample to be tested, the computing device 130 combines the aforementioned feature information, mutation type data, and attribute information of the subject to which the tumor to be detected belongs to generate the data to be processed. The dimensionality of the data to be processed is generally high, including, for example, over 9000 features, and many of the features are similar to one another.

At block 504, the computing device 130 calculates a similarity between a plurality of features included in the data to be processed. For example, the computing device 130 calculates the similarity between each feature and the other features.

At block 506, the computing device 130 performs dimension reduction on the data to be processed, based on a comparison of the calculated similarity with a predetermined similarity threshold, to generate the input data. For example, if the computing device 130 determines that the calculated similarity between one feature and another feature in the data to be processed exceeds the predetermined similarity threshold (e.g., without limitation, 80%), it may filter out one of the two features, thereby reducing the dimension of the data to be processed, for example and without limitation, to 1/3 of the original feature dimension. The dimension reduction is performed because, if there are too many highly similar features, their combined weight in the prediction model's interpretation becomes too large; filtering out features that exceed the predetermined similarity threshold therefore helps to improve the prediction accuracy of the prediction model.
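Blocks 504 and 506 can be sketched as a greedy filter over feature columns. The text does not say which similarity measure is used, so the measure below (the fraction of samples on which two columns take the same value) is an assumption, as are the function names `similarity` and `filter_similar` and the greedy keep-first order.

```python
from typing import List, Sequence


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed measure: fraction of samples on which two feature columns agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def filter_similar(columns: Sequence[Sequence[float]],
                   threshold: float = 0.8) -> List[int]:
    """Return indices of kept columns, dropping any column whose similarity
    to an already kept column exceeds `threshold`."""
    kept: List[int] = []
    for index, column in enumerate(columns):
        if all(similarity(column, columns[k]) <= threshold for k in kept):
            kept.append(index)
    return kept


# Four hypothetical feature columns over five samples; column 1 duplicates column 0.
columns = [
    [1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
kept_indices = filter_similar(columns)
```

Here the duplicated column is removed and the other three are kept, so the feature dimension drops from 4 to 3.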

Research shows that the data produced by the similarity-based dimension reduction are still sparse and still contain information unrelated to mutations, which has little influence on the prediction result. In addition, the feature dimension remains high, so the training cost of the prediction model is high and the training efficiency is relatively low. The dimension-reduced data can therefore be processed further. In some embodiments, the method 500 of generating input data for a predictive model also includes the method illustrated in FIG. 6.

A method for generating input data according to an embodiment of the present disclosure will be described below in conjunction with fig. 6. Fig. 6 shows a flow diagram of a method 600 for generating input data according to an embodiment of the present disclosure. It should be understood that method 600 may be performed, for example, at the electronic device 700 depicted in fig. 7, or at the computing device 130 depicted in fig. 1. It should be understood that method 600 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 602, the computing device 130 performs a plurality of random samplings for the dimensionality reduced data based on each of a plurality of predetermined dimensionality values, respectively, to generate a plurality of sample data having characteristic dimensions of each of the predetermined dimensionality values. For example, the computing device 130 performs a plurality of random samplings for the dimension-reduced data based on each of the predetermined dimension values 500, 1000, 2000, respectively, and generates a plurality of sample data for each of the dimension values 500, 1000, 2000, respectively.

At block 604, the computing device 130 trains the predictive model on the plurality of sample data whose feature dimension is each of the predetermined dimension values, to generate a plurality of prediction results about the tumor type associated with each of the predetermined dimension values.

At block 606, the computing device 130 determines whether the variance value of the plurality of prediction results is less than or equal to a predetermined variance threshold.

At block 608, if the computing device 130 determines that the variance value of the plurality of prediction results is less than or equal to the predetermined variance threshold, it determines the input data for the predictive model based on the predetermined dimension value associated with the plurality of prediction results.

By these means, features that contribute little to the prediction of the tumor type are filtered out by random sampling, so that the training efficiency of the prediction model can be significantly improved while high prediction accuracy for the tumor type is maintained.
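The loop of blocks 602 to 608 can be sketched as follows, under several assumptions that go beyond the text: each "prediction result" is summarized as a single numeric score returned by a caller-supplied `evaluate` callback (hypothetical, standing in for training and testing the model on a feature subset), the population variance is used, and the smallest dimension value whose scores are stable is chosen.

```python
import random
from statistics import pvariance
from typing import Callable, List, Optional, Sequence


def pick_stable_dimension(n_features: int,
                          dims: Sequence[int],
                          evaluate: Callable[[List[int]], float],
                          repeats: int = 5,
                          var_threshold: float = 1e-3,
                          seed: int = 0) -> Optional[int]:
    """Return the smallest dimension value whose repeated random feature
    subsets give scores with variance <= var_threshold, or None."""
    rng = random.Random(seed)
    for dim in sorted(dims):
        if dim > n_features:
            continue  # cannot sample more features than exist
        scores = []
        for _ in range(repeats):
            # Randomly choose `dim` feature indices without repetition.
            subset = rng.sample(range(n_features), dim)
            scores.append(evaluate(subset))
        if pvariance(scores) <= var_threshold:
            return dim
    return None


# Hypothetical evaluator that always reports the same score, so any
# dimension value passes the variance check and the smallest is chosen.
chosen = pick_stable_dimension(3000, [2000, 500, 1000], lambda subset: 0.9)
```

In practice `evaluate` would train the prediction model on the sampled features and return a measure of its predictions; the constant evaluator above only exercises the control flow.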

FIG. 7 schematically illustrates a block diagram of an electronic device 700 suitable for implementing embodiments of the present disclosure. The device 700 may be a device for implementing the methods 200 and 400 to 600 shown in figs. 2 and 4 to 6, and the predictive model 300 shown in fig. 3. As shown in fig. 7, the device 700 includes a Central Processing Unit (CPU) 701 that may perform various appropriate actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 702 or computer program instructions loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The processing unit 701 performs the respective methods and processes described above, for example, the methods 200 and 400 to 600. For example, in some embodiments, the methods 200 and 400 to 600 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the CPU 701, one or more operations of the methods 200 and 400 to 600 described above may be performed. Alternatively, in other embodiments, the CPU 701 may be configured by any other suitable means (e.g., by way of firmware) to perform one or more of the acts of the methods 200 and 400 to 600.

It should be further appreciated that the present disclosure may be embodied as methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for carrying out various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

Computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

The above are merely alternative embodiments of the present disclosure and are not intended to limit the present disclosure, which may be modified and varied by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A method for predicting a tumor type, comprising:

acquiring characteristic information about a tumor to be detected;

obtaining the comparison result information of the genome sequencing sequence of the sample to be detected of the tumor to be detected and the reference genome sequence;

generating mutation type data on a plurality of predetermined mutation types based on the alignment result information;

generating input data for inputting a predictive model based on the feature information and the mutation type data; and

extracting feature values of the input data via the prediction model so as to predict a type of the tumor to be measured based on the extracted feature values, the prediction model being generated via machine learning model training on a plurality of training samples.

2. The method of claim 1, wherein generating mutation type data for a plurality of predetermined mutation types comprises at least two of:

generating first data regarding amino acid variations;

generating second data about the genetic variation;

generating third data regarding copy number variation; and

generating fourth data regarding fusion structural variation.

3. The method of claim 1, wherein generating input data for a predictive model based on the feature information and the mutation type data comprises:

generating the input data of the prediction model based on the feature information, the mutation type data, and attribute information of a subject to which the tumor to be detected belongs, the attribute information including at least one of age information and gender information about the subject, the feature information including stage type information about the tumor to be detected.

4. The method of claim 2, wherein generating first data regarding amino acid variations comprises:

acquiring all amino acid variation information on each gene based on the comparison result information; and

generating the first data regarding amino acid variations based on a comparison of all of the amino acid variation information on each of the genes to a predetermined set of amino acid variations, the predetermined set of amino acid variations including a plurality of amino acid variations having a variation probability of amino acid variation greater than or equal to a first predetermined probability threshold.

5. The method of claim 2, wherein generating second data about genetic variations comprises:

determining the number of the site mutations on each gene based on the comparison result information; and

generating the second data on genetic variation based on the number of mutations at the site on each of the genes.

6. The method of claim 2, wherein generating second data about genetic variations comprises:

counting the number of mutations at the locus of each gene based on the comparison result information;

counting the maximum number of mutations at the site on each gene in a predetermined amount of subjects; and

generating the second data about gene variation based on the number of site mutations on each gene of the object to be tested and the counted maximum number of site mutations on each gene.

7. The method of claim 2, wherein generating third data regarding copy number variation comprises:

determining whether at least one of an insertion and a deletion occurs for each gene based on the alignment result information to generate the third data on copy number variation.

8. The method of claim 2, wherein generating fourth data about fused structural variations comprises:

acquiring structural variation information about fusion based on the comparison result information; and

generating the fourth data on the fused structural variation based on a comparison of the fusion to a predetermined fused set, the predetermined fused set comprising a plurality of fusions having a probability of occurrence of fusion greater than or equal to a second predetermined probability threshold.

9. The method of claim 1, wherein generating input data for inputting the predictive model comprises:

generating data to be processed based on the characteristic information, the mutation type data and the attribute information of the object to which the tumor to be detected belongs;

calculating the similarity among a plurality of characteristics included in the data to be processed; and

performing dimensionality reduction processing on the data to be processed based on the comparison of the calculated similarity with a predetermined similarity threshold value to generate the input data.

10. The method of claim 9, wherein generating the input data comprises:

on the basis of each preset dimension value in a plurality of preset dimension values, randomly sampling for the data subjected to dimension reduction respectively for a plurality of times so as to generate a plurality of sample data with characteristic dimension being each preset dimension value;

training the predictive model to generate a plurality of predictions about the tumor type associated with each of the predetermined dimension values based on a plurality of sample data for which the feature dimension is said each predetermined dimension value;

determining whether a variance value of the plurality of prediction results is less than or equal to a predetermined variance threshold; and

in response to determining that the variance value of the plurality of prediction results is less than or equal to the predetermined variance threshold, determining the input data for the predictive model based on the predetermined dimension value associated with the plurality of prediction results.

11. The method of claim 1, wherein the predictive model is constructed based on a random forest model.

12. The method of claim 1, wherein the genomic sequencing sequence is obtained via one of whole genome sequencing, whole exome sequencing, and probe sequencing of a predetermined gene.

13. A computing device, comprising:

at least one processing unit;

at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the device to perform the steps of the method of any of claims 1 to 12.

14. A computer-readable storage medium, having stored thereon a computer program which, when executed by a machine, implements the method of any of claims 1-12.