KR20230114557A

KR20230114557A - Apparatus and method for deriving gene associated with homologous recombination deficiency, apparatus and method for generating homologous recombination deficiency determination model, apparatus and method for determining homologous recombination deficiency

Info

Publication number: KR20230114557A
Application number: KR1020220010873A
Authority: KR
Inventors: 이진구; 이재준; 김진호
Original assignee: 서울대학교산학협력단; 서울대학교병원
Priority date: 2022-01-25
Filing date: 2022-01-25
Publication date: 2023-08-01

Abstract

Disclosed are an apparatus and method for deriving genes associated with homologous recombination deficiency (HRD), an apparatus and method for generating a homologous recombination deficiency determination model, and an apparatus and method for determining homologous recombination deficiency. According to one aspect, the apparatus for deriving genes associated with homologous recombination deficiency includes: a data collection unit that collects transcriptome data, gene expression level data, and homologous recombination deficiency score data of patients with various types of cancer; a candidate gene determination unit that determines candidate genes from the collected data through differentially expressed gene analysis; a model generation unit that generates a key gene determination model by training a machine learning model using, as training data, the expression level data of the candidate genes, the transcriptome data, and the HRD score data among the collected data; and a key gene determination unit that determines key genes related to homologous recombination deficiency using the generated key gene determination model.

Description

Apparatus and method for deriving gene associated with homologous recombination deficiency, apparatus and method for generating homologous recombination deficiency determination model, apparatus and method for determining homologous recombination deficiency

The present invention relates to a technique for deriving genes associated with homologous recombination deficiency and determining homologous recombination deficiency using the derived genes.

DNA homologous recombination deficiency (HRD) is a phenomenon caused by mutations in genes that make up the chromosomal repair mechanism and by the resulting impairment of homologous recombination function, and it is observed in many cancer cells. Genes that directly cause homologous recombination deficiency include BRCA1 and BRCA2, and mutations in these genes are observed in certain cancers. Studies have reported a correlation between the occurrence of solid cancers such as breast cancer, ovarian cancer, and pancreatic cancer and the homologous recombination deficiency phenotype. In particular, cancer patients with the homologous recombination deficiency phenotype show a high survival probability and sensitivity to PARP-targeting inhibitors, so research on predicting this phenotype for use in precision diagnosis and personalized medicine is active.

Existing studies that attempt to predict homologous recombination deficiency rely on whole genome sequencing and whole exome sequencing of specific cancer cells, based on next generation sequencing. Studies that reduce the cost and time of whole genome analysis by applying a single nucleotide polymorphism (SNP) array have also been reported. The SNP-based analysis applies the concept of an 'HRD score' and interprets homologous recombination deficiency as three detailed phenotypes: loss of heterozygosity (LOH), telomeric allelic imbalance (TAI), and large-scale transitions (LST).

Meanwhile, research on homologous recombination deficiency using transcriptome (mRNA) data is scarce, and as a result it is difficult to determine homologous recombination deficiency in cells under specific conditions in which whole genome or single nucleotide polymorphism analysis is not possible.

Korean Patent Application Publication No. 10-2019-0029618 (2019.03.20.)

An object of the present invention is to provide an apparatus and method for deriving genes associated with homologous recombination deficiency, an apparatus and method for generating a homologous recombination deficiency determination model, and an apparatus and method for determining homologous recombination deficiency.

According to one aspect, an apparatus for deriving genes associated with homologous recombination deficiency includes: a data collection unit that collects transcriptome data, gene expression level data, and homologous recombination deficiency score data of patients with various types of cancer; a candidate gene determination unit that determines candidate genes from the collected data through differentially expressed gene analysis; a model generation unit that generates a key gene determination model by training a machine learning model using, as training data, the expression level data of the candidate genes, the transcriptome data, and the HRD score data among the collected data; and a key gene determination unit that determines key genes related to homologous recombination deficiency using the generated key gene determination model.

The machine learning model includes elastic net regression, lasso regression, ridge regression, gradient boosting, a support vector machine, and a multilayer perceptron.

The key gene determination unit determines a weight or an importance of each gene from the generated key gene determination model, and determines, as a key gene, a gene for which the absolute value of the determined weight, or the importance, is greater than or equal to a threshold.

The apparatus for deriving genes associated with homologous recombination deficiency further includes a preprocessing unit that normalizes the collected data.

According to another aspect, an apparatus for generating a homologous recombination deficiency determination model includes: a data collection unit that collects transcriptome data, gene expression level data, and homologous recombination deficiency score data of patients with various types of cancer; a candidate gene determination unit that determines candidate genes from the collected data through differentially expressed gene analysis; a first model generation unit that generates a key gene determination model by training a first machine learning model using the expression level data of the candidate genes, the transcriptome data, and the HRD score data among the collected data; a key gene determination unit that determines key genes related to homologous recombination deficiency using the generated key gene determination model; and a second model generation unit that generates a homologous recombination deficiency determination model by training a second machine learning model using, as training data, the expression level data of the key genes, the transcriptome data, and the HRD score data among the collected data.

The first machine learning model is gradient boosting, and the second machine learning model is elastic net regression.

According to another aspect, an apparatus for determining homologous recombination deficiency includes: a data acquisition unit that acquires key gene expression level data and transcriptome data of a cancer patient; and a homologous recombination deficiency determination unit that determines, from the acquired data, whether the cancer patient has homologous recombination deficiency using a homologous recombination deficiency determination model. The homologous recombination deficiency determination model is generated by training a machine learning model using, as training data, expression level data of key genes, transcriptome data, and homologous recombination deficiency score data of patients with various types of cancer. The key genes are genes related to homologous recombination deficiency for which the absolute value of the weight, or the importance, determined from a key gene determination model is greater than or equal to a threshold.

The key gene determination model is generated by collecting transcriptome data, gene expression level data, and homologous recombination deficiency score data of patients with various types of cancer, determining candidate genes from the collected data through differentially expressed gene analysis, and using, as training data, the expression level data of the candidate genes, the transcriptome data, and the homologous recombination deficiency score data among the collected data.

Key genes related to homologous recombination deficiency can be derived through machine learning based on transcriptome data obtained from various cancer types.

In addition, by generating and using a machine learning-based homologous recombination deficiency determination model built on the derived key genes, homologous recombination deficiency can be determined with high accuracy across various cancer types.

FIG. 1 is a diagram illustrating an apparatus for deriving genes associated with homologous recombination deficiency according to an exemplary embodiment.
FIG. 2 is a diagram illustrating an apparatus for generating a homologous recombination deficiency determination model according to an exemplary embodiment.
FIG. 3 is a diagram illustrating an apparatus for determining homologous recombination deficiency according to an exemplary embodiment.
FIG. 4 is a diagram illustrating a computing environment including a computing device suitable for use in exemplary embodiments.
FIG. 5 is a diagram illustrating a method for deriving genes associated with homologous recombination deficiency according to an exemplary embodiment.
FIG. 6 is a diagram illustrating a method for generating a homologous recombination deficiency determination model according to an exemplary embodiment.
FIG. 7 is a diagram illustrating a method for determining homologous recombination deficiency according to an exemplary embodiment.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the accompanying drawings. In assigning reference numerals to the components of each drawing, it should be noted that the same components are given the same numerals wherever possible, even when they appear in different drawings. In addition, in describing the present invention, a detailed description of a related known function or configuration will be omitted if it is judged that it would unnecessarily obscure the gist of the present invention.

Meanwhile, the steps may be performed in an order different from the stated order unless a specific order is clearly described in context. That is, the steps may be performed in the stated order, substantially simultaneously, or in the reverse order.

The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of a user or operator. Their definitions should therefore be based on the content of this specification as a whole.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. Singular expressions include plural expressions unless the context clearly indicates otherwise. Terms such as 'include' or 'have' are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In addition, the components in this specification are distinguished merely by the main function each one is responsible for. That is, two or more components may be combined into one component, or one component may be divided into two or more components by more finely subdivided function. Each component may also perform some or all of the functions of other components in addition to its own main function, and some of the main functions of a component may be carried out entirely by another component. Each component may be implemented in hardware (for example, a processor), in software, or in a combination of hardware and software.

FIG. 1 is a diagram illustrating an apparatus for deriving genes associated with homologous recombination deficiency according to an exemplary embodiment.

Referring to FIG. 1, an apparatus 100 for deriving genes associated with homologous recombination deficiency according to an exemplary embodiment may include a data collection unit 110, a preprocessing unit 120, a candidate gene determination unit 130, a model generation unit 140, and a key gene determination unit 150.

The data collection unit 110 may collect transcriptome data, gene expression level data, and homologous recombination deficiency (HRD) score data of patients with various types of cancer. Here, the transcriptome data may be mRNA sequence data.

According to an exemplary embodiment, the data collection unit 110 may collect the transcriptome data, gene expression level data, and HRD score data from an external database that stores transcriptome data, gene expression level data, and HRD score data of patients with various types of cancer. In doing so, the data collection unit 110 may use wired or wireless communication technology. Here, the wireless communication technology may include, but is not limited to, Bluetooth communication, Bluetooth Low Energy (BLE) communication, Near Field Communication (NFC), WLAN communication, Zigbee communication, Infrared Data Association (IrDA) communication, Wi-Fi Direct (WFD) communication, ultra-wideband (UWB) communication, Ant+ communication, Wi-Fi communication, Radio Frequency Identification (RFID) communication, 3G communication, 4G communication, and 5G communication.

According to an exemplary embodiment, the data collection unit 110 may receive transcriptome data, gene expression level data, and single nucleotide polymorphism (SNP) array data from an external database that stores transcriptome data, gene expression level data, and SNP array data of patients with various types of cancer. The data collection unit 110 may then calculate an HRD score based on the received SNP array data, thereby collecting the transcriptome data, gene expression level data, and HRD scores of patients with various types of cancer. Since methods for calculating an HRD score from SNP array data are known, a detailed description is omitted.
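The source does not say how the HRD score is computed from the SNP array data. As a minimal sketch only, the block below assumes the widely used convention of taking the unweighted sum of the LOH, TAI, and LST event counts. The `ScarCounts` type, the function name, and the example numbers are hypothetical, and the event counts are assumed to have been produced upstream by a separate tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScarCounts:
    """Per-sample genomic scar counts (hypothetical container, not from the source)."""
    loh: int  # loss-of-heterozygosity events
    tai: int  # telomeric allelic imbalance events
    lst: int  # large-scale transition events


def hrd_score(counts: ScarCounts) -> int:
    """Return the HRD score as the unweighted sum of the three counts.

    Assumption: the sum convention is not stated in the source document.
    """
    return counts.loh + counts.tai + counts.lst


example = ScarCounts(loh=12, tai=20, lst=15)  # made-up counts
print(hrd_score(example))  # 47
```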

The preprocessing unit 120 may preprocess the collected data.

According to an exemplary embodiment, the preprocessing unit 120 may normalize the transcriptome data and the gene expression level data. For example, the preprocessing unit 120 may normalize the transcriptome data using various normalization techniques such as transcripts per million (TPM), reads per kilobase of transcript per million mapped reads (RPKM), and trimmed mean of M-values (TMM). In addition, the preprocessing unit 120 may normalize the gene expression level data by taking its logarithm.
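The two normalization steps named above can be sketched as follows. This is an illustration for a single sample, not the preprocessing unit's actual implementation: the function names are hypothetical, gene lengths are assumed to be given in kilobases, and the pseudocount and base-2 logarithm are assumptions, since the source only says that a log is taken.

```python
import math


def tpm(read_counts, gene_lengths_kb):
    """Transcripts per million for one sample.

    Each count is divided by its gene length (reads per kilobase), then the
    rates are rescaled so that they sum to one million.
    """
    rates = [count / length for count, length in zip(read_counts, gene_lengths_kb)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [rate / total * 1_000_000 for rate in rates]


def log_transform(values, pseudocount=1.0):
    """log2(value + pseudocount); the pseudocount (an assumption) avoids log(0)."""
    return [math.log2(value + pseudocount) for value in values]


counts = [100, 200, 300]      # made-up read counts for three genes
lengths_kb = [1.0, 2.0, 3.0]  # made-up gene lengths in kilobases
normalized = log_transform(tpm(counts, lengths_kb))
print(normalized)
```

In this toy input the three genes have the same read density, so they receive the same TPM value even though their raw counts differ.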

The candidate gene determination unit 130 may determine candidate genes from the preprocessed data through differentially expressed gene (DEG) analysis. Here, a candidate gene may be a gene that is likely to be associated with homologous recombination deficiency. Since DEG analysis is a known method, a detailed description is omitted.
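Since the source leaves the DEG method unspecified, the following is only a simplified stand-in: samples are split into HRD-high and HRD-low groups by a score cutoff, and a gene is kept when both its mean log-expression difference and its Welch t-statistic are large. The grouping by cutoff, all thresholds, and the gene names are hypothetical. A production pipeline would more likely use a dedicated DEG tool with multiple-testing correction.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples (each needs at least two values)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def candidate_genes(log_expr, hrd_scores, score_cutoff, min_abs_diff, min_abs_t):
    """Return genes that differ between HRD-high and HRD-low samples.

    log_expr maps gene name -> list of log2 expression values, one per sample,
    in the same sample order as hrd_scores.
    """
    high = [i for i, s in enumerate(hrd_scores) if s >= score_cutoff]
    low = [i for i, s in enumerate(hrd_scores) if s < score_cutoff]
    selected = []
    for gene, values in log_expr.items():
        a = [values[i] for i in high]
        b = [values[i] for i in low]
        if abs(mean(a) - mean(b)) >= min_abs_diff and abs(welch_t(a, b)) >= min_abs_t:
            selected.append(gene)
    return selected


hrd_scores = [60, 55, 50, 10, 15, 20]  # made-up scores for six samples
log_expr = {
    "GENE_A": [8.0, 8.5, 9.0, 4.0, 4.5, 5.0],  # clearly higher in HRD-high samples
    "GENE_B": [6.0, 6.1, 5.9, 6.0, 5.9, 6.1],  # no group difference
}
print(candidate_genes(log_expr, hrd_scores, score_cutoff=42,
                      min_abs_diff=1.0, min_abs_t=3.0))  # ['GENE_A']
```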

According to an embodiment, the candidate gene determination unit 130 may divide the preprocessed data into a training set and a validation set according to a predetermined ratio (for example, 2:1), and may determine candidate genes related to homologous recombination deficiency in the training set through DEG analysis.
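A 2:1 split of this kind might look like the sketch below. The function name, the random shuffling, and the fixed seed are assumptions; the source specifies only the ratio.

```python
import random


def split_samples(sample_ids, train_parts=2, validation_parts=1, seed=0):
    """Shuffle sample IDs and split them train:validation by the given ratio."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = len(ids) * train_parts // (train_parts + validation_parts)
    return ids[:n_train], ids[n_train:]


train_ids, validation_ids = split_samples(range(9))
print(len(train_ids), len(validation_ids))  # 6 3
```

Running the DEG analysis on the training set only, as the paragraph above describes, keeps the validation samples from influencing which genes are selected.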

The model generation unit 140 may generate a key gene determination model through machine learning, using the candidate gene expression level data, the transcriptome data, and the HRD score data as training data. Here, a key gene may be a gene that has a greater influence on HRD score prediction than other genes. The machine learning algorithm may include elastic net regression, lasso regression, ridge regression, gradient boosting, a support vector machine, a multilayer perceptron, and the like, but these are only examples and the algorithm is not limited thereto.

According to an exemplary embodiment, the model generation unit 140 may train a machine learning model with the candidate gene expression level data and the transcriptome data as input and the corresponding HRD score data as the target. That is, the model generation unit 140 may train the machine learning model to predict the HRD score from the candidate gene expression level data and the transcriptome data. For example, the model generation unit 140 may train the machine learning model on the candidate gene expression level data, transcriptome data, and HRD score data of the training set, evaluate the trained model on the validation set, and adjust the hyperparameters of the machine learning model based on the evaluation result. The key gene determination model is finally generated through this hyperparameter adjustment process.
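The train, evaluate, and adjust loop described above can be sketched with a small pure-Python elastic net fitted by coordinate descent. Elastic net is used here only because it is one of the algorithms the document lists; the choice of `alpha` as the tuned hyperparameter, the candidate grid, the mean squared error metric, and the toy data are all assumptions made for illustration.

```python
def soft_threshold(value, limit):
    """Shrink value toward zero by limit (the L1 proximal step)."""
    if value > limit:
        return value - limit
    if value < -limit:
        return value + limit
    return 0.0


def fit_elastic_net(X, y, alpha, l1_ratio=0.5, n_iter=200):
    """Coordinate descent for elastic net; returns (weights, intercept)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - x_mean[j] for row in X] for j in range(p)]  # centered columns
    residual = [v - y_mean for v in y]  # equals centered y minus prediction
    w = [0.0] * p
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # constant feature carries no information
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - c * delta for r, c in zip(residual, col)]
                w[j] = new_w
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept


def predict(model, X):
    w, intercept = model
    return [intercept + sum(wj * xj for wj, xj in zip(w, row)) for row in X]


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def tune_alpha(X_train, y_train, X_val, y_val, alphas):
    """Fit one model per alpha and keep the one with the lowest validation MSE."""
    best = None
    for alpha in alphas:
        model = fit_elastic_net(X_train, y_train, alpha)
        error = mse(y_val, predict(model, X_val))
        if best is None or error < best[0]:
            best = (error, alpha, model)
    return best


# Toy data: the target depends only on the first feature (y = 3 * x0 + 10).
X_train = [[1.0, 0.5], [2.0, 0.1], [3.0, 0.9], [4.0, 0.3]]
y_train = [13.0, 16.0, 19.0, 22.0]
X_val = [[5.0, 0.2], [6.0, 0.7]]
y_val = [25.0, 28.0]
error, alpha, model = tune_alpha(X_train, y_train, X_val, y_val, [0.0, 1.0, 10.0])
print(alpha, model)
```

The fitted weights are what a later step can inspect to rank genes, which is why a linear model is convenient for this illustration.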

The key gene determination unit 150 may determine key genes related to homologous recombination deficiency using the generated key gene determination model. For example, the key gene determination unit 150 may determine the weight or importance of each gene from the key gene determination model, and may determine, as a key gene, a gene for which the absolute value of the determined weight, or the importance, is greater than or equal to a predetermined threshold. Here, the weight or importance may represent the degree to which each gene influences the prediction of the HRD score.
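The thresholding rule above is simple enough to sketch directly. The function name, gene names, scores, and threshold are hypothetical. Taking the absolute value covers signed regression weights, and it leaves non-negative importances (as produced by tree-based models) unchanged.

```python
def select_key_genes(gene_scores, threshold):
    """Return genes whose |weight| or importance is at least threshold.

    gene_scores maps gene name -> model weight or importance. The result is
    ordered from the most to the least influential gene.
    """
    kept = [gene for gene, score in gene_scores.items() if abs(score) >= threshold]
    return sorted(kept, key=lambda gene: -abs(gene_scores[gene]))


weights = {"GENE_A": 0.8, "GENE_B": -0.05, "GENE_C": -0.6}  # made-up weights
print(select_key_genes(weights, threshold=0.5))  # ['GENE_A', 'GENE_C']
```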

According to an exemplary embodiment, the apparatus 100 for deriving genes associated with homologous recombination deficiency can determine key genes related to homologous recombination deficiency through machine learning. As described below, the determined key genes are used to generate a homologous recombination deficiency determination model, which can improve the efficiency and accuracy of determining or predicting the HRD score.

FIG. 2 is a diagram illustrating an apparatus for generating a homologous recombination deficiency determination model according to an exemplary embodiment.

Referring to FIG. 2, an apparatus 200 for generating a homologous recombination deficiency determination model according to an exemplary embodiment may include a data collection unit 210, a preprocessing unit 220, a candidate gene determination unit 230, a first model generation unit 240, a key gene determination unit 250, and a second model generation unit 260. Here, the data collection unit 210, the preprocessing unit 220, the candidate gene determination unit 230, the first model generation unit 240, and the key gene determination unit 250 are the same as or similar to the data collection unit 110, the preprocessing unit 120, the candidate gene determination unit 130, the model generation unit 140, and the key gene determination unit 150 of FIG. 1, respectively, so overlapping descriptions are omitted.

The second model generation unit 260 may generate a homologous recombination deficiency determination model through machine learning, using as training data the expression level data of the key genes determined by the key gene determination unit 250, the transcriptome data, and the HRD score data. As described above, a key gene may be a gene that has a greater influence on HRD score prediction than other genes. The machine learning algorithm may include elastic net regression, lasso regression, ridge regression, gradient boosting, a support vector machine, a multilayer perceptron, and the like, but these are only examples and the algorithm is not limited thereto.

According to an exemplary embodiment, the second model generation unit 260 may train a machine learning model with the key gene expression level data and the transcriptome data as input and the corresponding HRD score data as the target. That is, the second model generation unit 260 may train the machine learning model to predict the HRD score from the key gene expression level data and the transcriptome data. For example, the second model generation unit 260 may train the machine learning model on the key gene expression level data, transcriptome data, and HRD score data of the training set, evaluate the trained model on the validation set, and adjust the hyperparameters of the machine learning model based on the evaluation result. The homologous recombination deficiency determination model is finally generated through this hyperparameter adjustment process.
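What distinguishes this second stage from the first is its input: only the columns for the key genes are kept before the second model is trained. A minimal sketch of that restriction step follows. The function name, the row-per-sample matrix layout, and the gene names are assumptions.

```python
def restrict_to_genes(expr_matrix, gene_names, key_genes):
    """Keep only the columns of expr_matrix that belong to key_genes.

    expr_matrix has one row per sample and one column per gene, with columns
    ordered as in gene_names. Output columns follow the order of key_genes.
    """
    column_of = {gene: j for j, gene in enumerate(gene_names)}
    indices = [column_of[gene] for gene in key_genes]
    return [[row[j] for j in indices] for row in expr_matrix]


gene_names = ["GENE_A", "GENE_B", "GENE_C"]
expr_matrix = [
    [8.0, 6.0, 2.0],  # sample 1 (made-up log expression values)
    [4.5, 5.9, 7.0],  # sample 2
]
print(restrict_to_genes(expr_matrix, gene_names, ["GENE_A", "GENE_C"]))
# [[8.0, 2.0], [4.5, 7.0]]
```

The reduced matrix can then be passed to whichever training routine is used for the second model, with the same training and validation samples as before.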

FIG. 3 is a diagram illustrating an apparatus for determining homologous recombination deficiency according to an exemplary embodiment.

Referring to FIG. 3, an apparatus 300 for determining homologous recombination deficiency according to an exemplary embodiment may include a data acquisition unit 310 and a homologous recombination deficiency determination unit 320.

The data acquisition unit 310 may acquire key gene expression level data and transcriptome data of a cancer patient. For example, the data acquisition unit 310 may acquire the key gene expression level data and transcriptome data of the cancer patient from an external device that measures and/or stores such data. In doing so, the data acquisition unit 310 may use wired or wireless communication technology.

The homologous recombination deficiency determination unit 320 may use a homologous recombination deficiency determination model to determine whether the cancer patient has homologous recombination deficiency from the key gene expression level data and transcriptome data acquired through the data acquisition unit 310. Here, the homologous recombination deficiency determination model may be the model generated by the apparatus 200 for generating a homologous recombination deficiency determination model of FIG. 2.

For example, the homologous recombination deficiency determination unit 320 may determine the HRD score of the cancer patient by inputting the patient's key gene expression level data and transcriptome data into the homologous recombination deficiency determination model.

FIG. 4 is a diagram illustrating a computing environment that includes a computing device suitable for use in exemplary embodiments. In the illustrated embodiment, each component may have functions and capabilities different from those described below, and the computing environment may include additional components beyond those described below.

The illustrated computing environment 400 may include a computing device 410. According to one embodiment, the computing device 410 may include one or more components of, for example, the homologous recombination deficiency-associated gene derivation apparatus 100, the homologous recombination deficiency determination model generation apparatus 200, and/or the homologous recombination deficiency determination apparatus 300 described with reference to FIGS. 1 to 3.

The computing device 410 may include at least one processor 411, a computer-readable storage medium 412, and a communication bus 413. The processor 411 may cause the computing device 410 to operate according to the exemplary embodiments described above. For example, the processor 411 may execute one or more programs 414 stored in the computer-readable storage medium 412. The one or more programs 414 may include one or more computer-executable instructions which, when executed by the processor 411, cause the computing device 410 to perform operations according to the exemplary embodiments.

The computer-readable storage medium 412 may store computer-executable instructions or program code, program data, and/or other suitable forms of information. The program 414 stored in the computer-readable storage medium 412 may include a set of instructions executable by the processor 411. According to one embodiment, the computer-readable storage medium 412 may be memory (volatile memory such as random access memory, non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, any other form of storage medium that can be accessed by the computing device 410 and can store the desired information, or a suitable combination thereof.

The communication bus 413 may interconnect the various components of the computing device 410, including the processor 411 and the computer-readable storage medium 412.

The computing device 410 may also include one or more input/output interfaces 415, which provide an interface for one or more input/output devices 420, and one or more network communication interfaces 416. The input/output interface 415 and the network communication interface 416 may be connected to the communication bus 413. The input/output device 420 may be connected to the other components of the computing device 410 through the input/output interface 415. The input/output device 420 may include, for example, input devices such as a pointing device (a mouse, a trackpad, or the like), a keyboard, a touch input device (a touchpad, a touchscreen, or the like), a voice or sound input device, various types of sensor devices, and/or an imaging device, and/or output devices such as a display device, a printer, a speaker, and/or a network card. The input/output device 420 may be included inside the computing device 410 as one of its components, or may be connected to the computing device 410 as a separate device.

FIG. 5 is a diagram illustrating a method for deriving homologous recombination deficiency-associated genes according to an exemplary embodiment. The method of FIG. 5 may be performed by the homologous recombination deficiency-associated gene derivation apparatus 100 of FIG. 1.

Referring to FIG. 5, the homologous recombination deficiency-associated gene derivation apparatus may collect transcriptome data, gene expression level data, and HRD score data of patients with various types of cancer (510). Here, the transcriptome data may be mRNA sequence data.

For example, the homologous recombination deficiency-associated gene derivation apparatus may collect the transcriptome data, gene expression level data, and HRD score data from an external database.

As another example, the homologous recombination deficiency-associated gene derivation apparatus may receive transcriptome data, gene expression level data, and single-nucleotide variation array data of patients with various types of cancer from an external database, and calculate HRD scores from the received single-nucleotide variation array data. In this way it may collect the transcriptome data, gene expression level data, and HRD scores of patients with various types of cancer.

The homologous recombination deficiency-associated gene derivation apparatus may preprocess the collected data (520). For example, the apparatus may normalize the transcriptome data using any of various normalization techniques, such as TPM (transcripts per million), RPKM (reads per kilobase per million mapped reads), and TMM (trimmed mean of M-values). The apparatus may also normalize the gene expression level data by taking its logarithm.
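The preprocessing step can be sketched roughly as follows. This is a minimal illustration and not the implementation used in the specification: it shows TPM normalization of one sample followed by a log transform. The read counts, the gene lengths, and the pseudocount of 1 added before the logarithm are assumptions made for the example.

```python
import math

def tpm(counts, lengths_kb):
    """Transcripts per million for one sample.

    counts: raw read counts per gene; lengths_kb: gene lengths in kilobases.
    """
    # Reads per kilobase: correct each count for gene length first.
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    # Scale so that the values of the sample sum to one million.
    scale = sum(rpk) / 1_000_000
    return [r / scale for r in rpk]

def log_transform(values, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount (an assumption) avoids log(0)."""
    return [math.log2(v + pseudocount) for v in values]

sample_counts = [100, 300, 600]      # hypothetical read counts
gene_lengths_kb = [1.0, 3.0, 2.0]    # hypothetical gene lengths
tpm_values = tpm(sample_counts, gene_lengths_kb)
log_values = log_transform(tpm_values)
```

RPKM and TMM differ in how the scaling factor is computed; only TPM is shown here to keep the sketch short.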

The homologous recombination deficiency-associated gene derivation apparatus may determine candidate genes from the preprocessed data through a differentially expressed gene (DEG) analysis technique (530). Here, a candidate gene is a gene that may be associated with homologous recombination deficiency.

For example, the homologous recombination deficiency-associated gene derivation apparatus may divide the preprocessed data into a training set and a validation set according to a predetermined ratio (for example, 2:1), and determine candidate genes associated with homologous recombination deficiency in the training set through the DEG analysis technique.
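A rough sketch of this step is given below. The specification does not say which DEG method is used, so the example makes its own assumptions: samples are grouped into high and low HRD by the median HRD score, each gene is scored with Welch's t-statistic, and genes with |t| of at least 2 are kept. The sample layout, gene names, and cutoff are all hypothetical.

```python
import random
import statistics

def split_train_validation(samples, ratio=(2, 1), seed=0):
    """Shuffle samples and split them by the given ratio (2:1 as in the text)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * ratio[0] // sum(ratio)
    return shuffled[:n_train], shuffled[n_train:]

def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

def candidate_genes(train, genes, t_cutoff=2.0):
    """Genes whose expression differs between high- and low-HRD samples.

    Grouping by the median HRD score and the |t| cutoff are illustrative
    assumptions, not details taken from the specification.
    """
    median_hrd = statistics.median(s["hrd"] for s in train)
    high = [s for s in train if s["hrd"] > median_hrd]
    low = [s for s in train if s["hrd"] <= median_hrd]
    selected = []
    for g in genes:
        t = welch_t([s["expr"][g] for s in high], [s["expr"][g] for s in low])
        if abs(t) >= t_cutoff:
            selected.append(g)
    return selected

# Hypothetical data: GENE_A tracks the HRD score, GENE_B does not.
genes = ["GENE_A", "GENE_B"]
samples = [
    {"hrd": 5.0 * i,
     "expr": {"GENE_A": 1.0 + 0.5 * i + 0.2 * (i % 2),
              "GENE_B": 3.0 + ((i * 7) % 5) * 0.1}}
    for i in range(12)
]
train, validation = split_train_validation(samples)
candidates = candidate_genes(train, genes)
```

Only the training set is passed to `candidate_genes`, so the validation set stays untouched for later model evaluation.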

The homologous recombination deficiency-associated gene derivation apparatus may generate a key gene determination model through machine learning, using the candidate gene expression level data, transcriptome data, and HRD score data as training data (540). Here, a key gene is a gene that has a greater influence on HRD score prediction than other genes. The machine learning algorithm may include elastic net regression, lasso regression, ridge regression, gradient boosting, a support vector machine, a multi-layer perceptron, and the like, but these are merely examples and the algorithm is not limited thereto.

For example, the homologous recombination deficiency-associated gene derivation apparatus may train a machine learning model to predict an HRD score from the candidate gene expression level data and transcriptome data. Specifically, the apparatus may train the machine learning model on the candidate gene expression level data, transcriptome data, and HRD score data of the training set, evaluate the trained model on the validation set, and adjust the hyperparameters of the model based on the evaluation result. The key gene determination model is finally obtained through this hyperparameter adjustment process.
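The train, evaluate, and adjust loop described above can be sketched as a small grid search. The example below is an illustration under stated assumptions rather than the method of the specification: it fits an elastic net by coordinate descent in plain Python, scores each hyperparameter setting by validation RMSE, and keeps the best one. The data, the hyperparameter grid, and all function names are hypothetical.

```python
import math

def soft_threshold(value, limit):
    """Shrink value toward zero by limit (the L1 part of the penalty)."""
    if value > limit:
        return value - limit
    if value < -limit:
        return value + limit
    return 0.0

def fit_elastic_net(X, y, alpha, l1_ratio, n_iter=200):
    """Coordinate descent on centred data; returns (weights, intercept)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    residual = [yi - y_mean for yi in y]  # all weights start at zero
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in Xc]
            z = sum(v * v for v in col) / n
            rho = sum(v * (r + v * w[j]) for v, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            residual = [r - v * (new_w - w[j]) for v, r in zip(col, residual)]
            w[j] = new_w
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept

def predict(X, w, intercept):
    return [intercept + sum(wj * xj for wj, xj in zip(w, row)) for row in X]

def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def tune(X_train, y_train, X_val, y_val, alphas, l1_ratios):
    """Return (validation RMSE, alpha, l1_ratio, weights, intercept) of the best setting."""
    best = None
    for alpha in alphas:
        for l1_ratio in l1_ratios:
            w, b = fit_elastic_net(X_train, y_train, alpha, l1_ratio)
            score = rmse(y_val, predict(X_val, w, b))
            if best is None or score < best[0]:
                best = (score, alpha, l1_ratio, w, b)
    return best

# Hypothetical data: the target depends on the first feature only (y = 2*x1 + 1).
X_train = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 6], [6, 2]]
y_train = [3, 5, 7, 9, 11, 13]
X_val = [[7, 4], [8, 7], [0, 3]]
y_val = [15, 17, 1]
best_rmse, best_alpha, best_l1_ratio, weights, intercept = tune(
    X_train, y_train, X_val, y_val, alphas=[0.01, 1.0, 10.0], l1_ratios=[0.5]
)
```

In practice a library implementation would normally replace the hand-written solver; it is written out here only so the sketch has no dependencies.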

The homologous recombination deficiency-associated gene derivation apparatus may determine key genes related to homologous recombination deficiency using the generated key gene determination model (550). For example, the apparatus may obtain the weight or importance of each gene from the key gene determination model and determine as key genes those genes for which the absolute value of the weight, or the importance, is greater than or equal to a predetermined threshold.
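The thresholding rule can be illustrated with a short sketch. The gene names, weights, and threshold below are made up for the example; in practice the weights would come from the trained key gene determination model (for instance, regression coefficients or feature importances).

```python
def select_key_genes(weights, threshold):
    """Return genes whose absolute weight (or importance) is at least `threshold`."""
    return [gene for gene, w in weights.items() if abs(w) >= threshold]

# Hypothetical per-gene weights; gene names are placeholders.
gene_weights = {"GENE_A": 0.82, "GENE_B": -0.47, "GENE_C": 0.03, "GENE_D": -0.01}
key_genes = select_key_genes(gene_weights, threshold=0.1)
```

Taking the absolute value keeps genes with strongly negative weights, which predict a lower HRD score, alongside those with strongly positive weights.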

FIG. 6 is a diagram illustrating a method for generating a homologous recombination deficiency determination model according to an exemplary embodiment. The method of FIG. 6 may be performed by the homologous recombination deficiency determination model generation apparatus 200 of FIG. 2.

Referring to FIG. 6, the homologous recombination deficiency determination model generation apparatus may collect transcriptome data, gene expression level data, and HRD score data of patients with various types of cancer (610) and preprocess the collected data (620). For example, the apparatus may normalize the transcriptome data using any of various normalization techniques, such as TPM (transcripts per million), RPKM (reads per kilobase per million mapped reads), and TMM (trimmed mean of M-values). The apparatus may also normalize the gene expression level data by taking its logarithm.

The homologous recombination deficiency determination model generation apparatus may determine candidate genes from the preprocessed data through a differentially expressed gene analysis technique (630), and generate a key gene determination model through machine learning, using the candidate gene expression level data, transcriptome data, and HRD score data as training data (640). For example, the homologous recombination deficiency determination model generation apparatus may train a machine learning model to predict an HRD score from the candidate gene expression level data and transcriptome data.

The homologous recombination deficiency determination model generation apparatus may determine key genes related to homologous recombination deficiency using the generated key gene determination model (650).

The homologous recombination deficiency determination model generation apparatus may generate a homologous recombination deficiency determination model through machine learning, using the expression level data of the key genes, transcriptome data, and HRD score data as training data (660). The machine learning algorithm may include elastic net regression, lasso regression, ridge regression, gradient boosting, a support vector machine, a multi-layer perceptron, and the like, but these are merely examples and the algorithm is not limited thereto.

For example, the homologous recombination deficiency determination model generation apparatus may train a machine learning model to predict an HRD score from the key gene expression level data and transcriptome data. Specifically, the apparatus may train the machine learning model on the key gene expression level data, transcriptome data, and HRD score data of the training set, evaluate the trained model on the validation set, and adjust the hyperparameters of the model based on the evaluation result. The homologous recombination deficiency determination model is finally obtained through this hyperparameter adjustment process.

FIG. 7 is a diagram illustrating a method for determining homologous recombination deficiency according to an exemplary embodiment. The method of FIG. 7 may be performed by the homologous recombination deficiency determination apparatus of FIG. 3.

Referring to FIG. 7, the homologous recombination deficiency determination apparatus may acquire key gene expression level data and transcriptome data of a cancer patient (710).

The homologous recombination deficiency determination apparatus may use the homologous recombination deficiency determination model to determine, from the acquired key gene expression level data and transcriptome data, whether the cancer patient has a homologous recombination deficiency (720).

For example, the homologous recombination deficiency determination apparatus may determine the HRD score of the cancer patient by inputting the patient's key gene expression level data and transcriptome data into the homologous recombination deficiency determination model.
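As a minimal sketch of this determination step, the example below assumes the trained model is linear, so the HRD score is a weighted sum of key gene expression values plus an intercept. The weights, the patient values, and the cutoff used to turn the score into a yes/no decision are all hypothetical; the specification does not state a cutoff.

```python
def predict_hrd_score(expression, weights, intercept):
    """Linear prediction of the HRD score from key gene expression values."""
    return intercept + sum(weights[g] * expression[g] for g in weights)

def is_hr_deficient(score, cutoff):
    """Treat a predicted score at or above the cutoff as HR-deficient (assumed rule)."""
    return score >= cutoff

# Hypothetical trained model and patient data; gene names are placeholders.
model_weights = {"GENE_A": 4.0, "GENE_B": -2.0}
model_intercept = 10.0
patient_expression = {"GENE_A": 6.0, "GENE_B": 1.5}

score = predict_hrd_score(patient_expression, model_weights, model_intercept)
deficient = is_hr_deficient(score, cutoff=30.0)  # cutoff is illustrative only
```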

Example

Using the key genes derived by the homologous recombination deficiency-associated gene derivation apparatus 100 of FIG. 1, six machine learning models (elastic net regression, lasso regression, ridge regression, gradient boosting, support vector machine, and multi-layer perceptron) were trained to generate six homologous recombination deficiency determination models. The predictive performance of the six models was evaluated on pan-cancer, ovarian cancer (OV), and triple-negative breast cancer (TNBC) data, yielding Table 1 below. R² and root mean square error (RMSE) were used as the performance metrics.

Referring to Table 1, when the models were compared by RMSE, no large performance difference between them was observed for pan-cancer. For ovarian cancer and triple-negative breast cancer, which have high homologous recombination deficiency levels, elastic net regression had the lowest RMSE (OV RMSE = 12.7, TNBC RMSE = 17.54) and the highest R² (OV R² = 0.55, TNBC R² = 0.54).
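For reference, the two metrics used in this evaluation can be computed as in the sketch below. The true and predicted HRD scores are invented for illustration and are unrelated to the values reported in Table 1.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: lower is better."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical true and predicted HRD scores.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]
example_rmse = rmse(y_true, y_pred)
example_r2 = r_squared(y_true, y_pred)
```

RMSE is expressed in the units of the HRD score, whereas R² is unitless, which is why the two are reported side by side.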

The embodiments described above may be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium may include any type of recording device that stores data readable by a computer system. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical disks. The computer-readable recording medium may also be distributed over computer systems connected through a network, so that the computer-readable code is written and executed in a distributed manner.

The present invention has been described above with a focus on its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention may be implemented in modified forms without departing from its essential characteristics. Accordingly, the scope of the present invention is not limited to the embodiments described above and should be construed to include various embodiments within a scope equivalent to what is set forth in the claims.

100: homologous recombination deficiency-associated gene derivation apparatus
200: homologous recombination deficiency determination model generation apparatus
300: homologous recombination deficiency determination apparatus
110, 210: data collection unit
120, 220: preprocessing unit
130, 230: candidate gene determination unit
140: model generation unit
150, 250: key gene determination unit
240: first model generation unit
260: second model generation unit
310: data acquisition unit
320: homologous recombination deficiency determination unit

Claims

An apparatus for deriving homologous recombination deficiency-associated genes, comprising:
a data collection unit configured to collect transcriptome data, gene expression level data, and homologous recombination deficiency (HRD) score data of patients with various types of cancer;
a candidate gene determination unit configured to determine candidate genes from the collected data through a differentially expressed gene analysis technique;
a model generation unit configured to generate a key gene determination model by training a machine learning model using, as training data, the expression level data of the candidate genes, the transcriptome data, and the HRD score data among the collected data; and
a key gene determination unit configured to determine key genes related to homologous recombination deficiency using the generated key gene determination model.

The apparatus for deriving homologous recombination deficiency-associated genes according to claim 1,
wherein the machine learning model includes elastic net regression, lasso regression, ridge regression, gradient boosting, a support vector machine, and a multi-layer perceptron.

The apparatus for deriving homologous recombination deficiency-associated genes according to claim 1,
wherein the key gene determination unit determines the weight or importance of each gene from the generated key gene determination model, and determines as a key gene a gene for which the absolute value of the determined weight, or the importance, is greater than or equal to a threshold.

The apparatus for deriving homologous recombination deficiency-associated genes according to claim 1,
further comprising a preprocessing unit configured to normalize the collected data.

An apparatus for generating a homologous recombination deficiency determination model, comprising:
a data collection unit configured to collect transcriptome data, gene expression level data, and homologous recombination deficiency (HRD) score data of patients with various types of cancer;
a candidate gene determination unit configured to determine candidate genes from the collected data through a differentially expressed gene analysis technique;
a first model generation unit configured to generate a key gene determination model by training a first machine learning model using the expression level data of the candidate genes, the transcriptome data, and the HRD score data among the collected data;
a key gene determination unit configured to determine key genes related to homologous recombination deficiency using the generated key gene determination model; and
a second model generation unit configured to generate a homologous recombination deficiency determination model by training a second machine learning model using, as training data, the expression level data of the key genes, the transcriptome data, and the HRD score data among the collected data.

The apparatus for generating a homologous recombination deficiency determination model according to claim 5,
wherein the first machine learning model is gradient boosting and the second machine learning model is elastic net regression.

An apparatus for determining homologous recombination deficiency, comprising:
a data acquisition unit configured to acquire key gene expression level data and transcriptome data of a cancer patient; and
a homologous recombination deficiency determination unit configured to determine, from the acquired data and using a homologous recombination deficiency determination model, whether the cancer patient has a homologous recombination deficiency,
wherein the homologous recombination deficiency determination model is generated by training a machine learning model using, as training data, expression level data of key genes, transcriptome data, and homologous recombination deficiency score data of patients with various types of cancer, and
wherein the key genes are genes related to homologous recombination deficiency for which the absolute value of the weight, or the importance, determined from a key gene determination model is greater than or equal to a threshold.

The apparatus for determining homologous recombination deficiency according to claim 7,
wherein the key gene determination model is generated by collecting transcriptome data, gene expression level data, and homologous recombination deficiency score data of patients with various types of cancer, determining candidate genes from the collected data through a differentially expressed gene analysis technique, and using, as training data, the expression level data of the candidate genes, the transcriptome data, and the homologous recombination deficiency score data among the collected data.