KR20190130446A

KR20190130446A - Method and system for performing molecular design using machine learning algorithms

Info

Publication number: KR20190130446A
Application number: KR1020180117878A
Authority: KR
Inventors: 피유쉬 타가드; 산티 판디안; 에스 크리쉬난 할리할란; 파람팔리 샤시세카라 아디가
Original assignee: 삼성전자주식회사
Priority date: 2018-04-24
Filing date: 2018-10-02
Publication date: 2019-11-22

Abstract

A method for designing molecules by using a machine learning algorithm includes representing molecular structures included in a dataset by using a simplified molecular input line entry system (SMILES), in which the SMILES uses a series of characters, converting a SMILES representation of the molecular structures into a binary representation, pre-training a stack of restricted Boltzmann machines (RBMs) by using the binary representation of the molecular structures, constructing a deep Boltzmann machine (DBM) by using the stack of the RBMs, determining limited molecular property data for a subset of the molecule structures in the dataset, training the DBM with the limited molecular property data, combining the pre-trained stack of the RBMs and the trained DBM in a Bayesian inference framework, and generating a sample of molecules with properties that a user wants by using the Bayesian inference framework.

Description

Method and system for performing molecular design using machine learning algorithms

The present invention relates to a method and system for designing molecules using machine learning algorithms. More specifically, it relates to the field of molecular design, and in particular to property-driven inverse molecular design using a deep learning Bayesian framework.

Existing mechanisms use evolutionary optimization methods for molecular design. To obtain structure-property correlations, they rely on molecular fingerprints that are derived from expert knowledge and hand-designed by experts. Existing mechanisms also use shallow supervised machine learning approaches to obtain structure-property correlations. However, such mechanisms require large data sets to reach acceptable accuracy, and they may also propose infeasible molecules.

In other existing mechanisms, machine learning methods are used to obtain structure-property correlations, which solves only the forward prediction problem. These methods also rely on expert-designed molecular fingerprints and supervised machine learning approaches.

In still other existing mechanisms, ranking-based methods are used to generate an optimal training set for machine learning techniques.

An objective is to provide a method and system for designing molecules using machine learning algorithms.

The technical problem to be solved by the present embodiments is not limited to the technical problems described above, and other technical problems may be inferred from the following embodiments.

According to one aspect, a method of designing molecules using a machine learning algorithm includes: representing, by a SMILES (Simplified Molecular Input Line Entry System) representation unit, molecular structures in a data set comprising the molecular structures using SMILES, which uses a series of characters; converting, by a binary representation unit, the SMILES representation of the molecular structures into a binary representation; pre-training, by a molecular structure generation unit, a stack of Restricted Boltzmann Machines (RBMs) using the binary representation of the molecular structures; constructing, by the molecular structure generation unit, a Deep Boltzmann Machine (DBM) using the stack of RBMs; determining, by the molecular structure generation unit, limited molecular property data by applying Density Functional Theory (DFT) to a subset of the molecular structures in the data set; training, by the molecular structure generation unit, the DBM with the limited molecular property data; combining, by the molecular structure generation unit, the pre-trained stack of RBMs and the trained DBM in a Bayesian inference framework in order to generate a sample of molecules having properties desired by a user; and generating, by the molecular structure generation unit, the sample of molecules having the properties desired by the user using the Bayesian inference framework.

According to another aspect, a system 100 for designing molecules using a machine learning algorithm includes: a SMILES representation unit that represents molecular structures in a data set comprising the molecular structures using SMILES, which uses a series of characters; a binary representation unit that converts the SMILES representation of the molecular structures into a binary representation; and a molecular structure generation unit that pre-trains a stack of Restricted Boltzmann Machines (RBMs) using the binary representation of the molecular structures, constructs a Deep Boltzmann Machine (DBM) using the stack of RBMs, determines limited molecular property data by applying Density Functional Theory (DFT) to a subset of the molecular structures in the data set, trains the DBM with the limited molecular property data, and combines the pre-trained stack of RBMs and the trained DBM in a Bayesian inference framework in order to generate a sample of molecules having properties desired by a user.

FIG. 1 is a block diagram illustrating various units of a system, according to one embodiment.
FIG. 2 is a flowchart illustrating a method of designing molecules, according to one embodiment.
FIG. 3 is a flow diagram illustrating a deep learning Bayesian framework for designing molecules, according to one embodiment.
FIG. 4 is a flow diagram illustrating the Simplified Molecular Input Line Entry System (SMILES) representation of molecules, according to one embodiment.
FIG. 5 is a flow diagram illustrating unsupervised learning of molecular structures using a Restricted Boltzmann Machine (RBM), according to one embodiment.
FIG. 6 is a schematic diagram illustrating the construction of a Deep Boltzmann Machine (DBM) from RBMs to predict the properties of a given molecule, according to one embodiment.
FIG. 7 is a flow diagram illustrating a Bayesian inference framework for the design of molecules, according to one embodiment.
FIG. 8 is a diagram illustrating an example of predicting molecules having a redox potential greater than 4.8 V, according to one embodiment.

The terms used in the present embodiments were chosen, as far as possible, from general terms currently in wide use, taking into account their functions in the embodiments. However, these terms may vary with the intention of those skilled in the art, precedent, the emergence of new technologies, and the like. In certain cases a term may have been arbitrarily selected, in which case its meaning is described in detail in the description of the corresponding embodiment. Therefore, the terms used in the present embodiments should be defined based on their meanings and on the overall content of the embodiments, rather than simply on their names.

In the description of the embodiments, when a part is said to be connected to another part, this includes not only a direct connection but also an electrical connection with another component in between. When a part is said to include a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, the terms "... unit" and "... module" used in the embodiments refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Terms such as "consists of" or "includes" used in the present embodiments should not be construed as necessarily including all of the components or steps described in the specification. Some of the components or steps may be omitted, or additional components or steps may be included.

The following description of the embodiments should not be construed as limiting the scope of rights, and whatever those skilled in the art can easily infer should be construed as falling within the scope of the embodiments. Hereinafter, embodiments given solely for illustration are described in detail with reference to the accompanying drawings.

One embodiment discloses a method and system for designing molecules using a machine learning algorithm.

One embodiment includes representing every molecule in a large data set of molecular structures using the Simplified Molecular Input Line Entry System (SMILES), which uses a series of characters to represent each molecule in the data set.

One embodiment may also include converting the SMILES representation of every molecule in the large data set of molecular structures into a binary representation.

One embodiment may also include pre-training a stack of Restricted Boltzmann Machines (RBMs) using the binary representation of the large data set of molecular structures, in order to construct a Deep Boltzmann Machine (DBM).

One embodiment may also include determining limited molecular property data for a subset of the molecular structures in the large data set, and training the DBM with the limited molecular property data. The limited molecular property data may be obtained by applying Density Functional Theory (DFT) to the subset of molecular structures.

One embodiment may also include combining the stack of pre-trained RBMs and the trained DBM, together with the limited molecular property data, in a Bayesian inference framework to generate a sample of molecules having properties desired by a user.

FIG. 1 is a block diagram illustrating various units of a system, according to one embodiment.

The system 100 may be, but is not limited to, at least one of a mobile phone, a smartphone, a tablet, a phablet, a personal digital assistant (PDA), a laptop, a computer, a wearable computing device, an Internet of Things (IoT) device, and a computing device.

One embodiment provides a system 100 for designing molecules.

The system 100 includes a SMILES representation unit 102, a binary representation unit 104, a molecular structure generation unit 106, a communication interface unit 108, and a memory 110.

The communication interface unit 108 may be configured to enable communication between the system 100 and an external molecular structure database. The external molecular structure database may include a large data set of molecular structures, together with experimental properties and calculated properties for a subset of the molecular structures in that data set.

The SMILES representation unit 102 may be configured to represent every molecule in the large data set of molecular structures using the Simplified Molecular Input Line Entry System (SMILES), which uses a series of characters to represent each molecule. For example, a benzene ring can be written in SMILES as C1=CC=CC=C1.

The binary representation unit 104 may be configured to convert the SMILES representation of every molecule in the large data set of molecular structures into a binary representation. Specifically, the binary representation unit 104 may be configured to convert each character of the SMILES representation into its corresponding ASCII code, which is then converted into a binary number.
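
The character-to-ASCII-to-binary conversion described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `smiles_to_bits` and `bits_to_smiles` are hypothetical, and the fixed width of 8 bits per character is an assumption, since the source does not state a bit width.

```python
def smiles_to_bits(smiles: str) -> list[int]:
    """Encode each SMILES character as its ASCII code, written as 8 bits.

    The 8-bit width is an assumption; the source only says that the ASCII
    representation is converted to binary.
    """
    bits: list[int] = []
    for ch in smiles:
        code = ord(ch)
        if code > 127:
            raise ValueError(f"non-ASCII character in SMILES: {ch!r}")
        bits.extend(int(b) for b in format(code, "08b"))
    return bits


def bits_to_smiles(bits: list[int]) -> str:
    """Invert smiles_to_bits: read the bit list back 8 bits at a time."""
    if len(bits) % 8 != 0:
        raise ValueError("bit length must be a multiple of 8")
    chars = []
    for start in range(0, len(bits), 8):
        byte = bits[start:start + 8]
        chars.append(chr(int("".join(str(b) for b in byte), 2)))
    return "".join(chars)


benzene = "C1=CC=CC=C1"
encoded = smiles_to_bits(benzene)   # 11 characters -> 88 bits
decoded = bits_to_smiles(encoded)   # round-trips to the original string
```

Being able to decode matters for a generative workflow, because sampled bit vectors have to be turned back into SMILES strings before they can be checked for chemical validity.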

The molecular structure generation unit 106 may be configured to pre-train a stack of Restricted Boltzmann Machines (RBMs) using the binary representation of the large data set of molecular structures, in order to construct a Deep Boltzmann Machine (DBM). The molecular structure generation unit 106 may also determine limited molecular property data for a subset of the molecular structures in the large data set, and train the DBM with the limited molecular property data. The limited molecular property data may be obtained by applying Density Functional Theory (DFT) to the subset of molecular structures.

In addition, the molecular structure generation unit 106 may combine the stack of pre-trained RBMs and the trained DBM, together with the limited molecular property data, in a Bayesian inference framework to generate a sample of molecules having properties desired by a user.

One embodiment may further include determining a sample of molecules having the desired properties and substructures using Markov Chain Monte Carlo (MCMC) sampling.

One embodiment may further include storing the parameters of the pre-trained RBMs in order to design improved materials for new applications.

One embodiment may further include predicting the properties of a given molecule using the trained DBM.

One embodiment may further include guiding the user in performing calculations or experiments on new molecules.

One embodiment may further include constructing valid molecular structures using the pre-trained RBMs.

Designing molecules with desired properties can play an important role in improving the performance and safety of engineering systems. For example, the performance and safety of lithium-ion batteries can be greatly improved by designing electrolytes with the desired redox stability and conductivity. One embodiment discloses a method and system for generating molecular structures having desired physical, chemical, optoelectronic, functional, and/or bioactive properties.

An initial population of molecules is provided as representations of the many molecules that make up the population. One or more physical, chemical, functional, and/or bioactive properties of these molecules may be obtained and analyzed to establish structure-property relationships.

Structural information may be digitally encoded using a SMILES-based graph-theoretic representation, and structure-property correlations may be handled using the DBM. Next, an inverse molecular design approach may be used to generate a set of molecules having one or more physical, chemical, optoelectronic, functional, and/or bioactive properties within a desired range.

In one embodiment, property-driven molecular/chemical design may be based on a Bayesian inference framework.

In addition, in one embodiment, automatic target validation through theoretical calculations may be used to improve the accuracy of this inverse prediction.

The method and system 100 disclosed in one embodiment may use a fully automated, Artificial Intelligence (AI) based approach that requires no human intervention.

In addition, the method and system 100 disclosed in one embodiment may use a semi-supervised learning approach with minimal property data.

The method and system 100 disclosed in one embodiment may be used for property-driven molecular design in a variety of applications. In addition, conditional Markov Chain Monte Carlo (MCMC) sampling allows the base molecular structure (backbone) to be modified for molecular design.

The memory 110 may be configured to store the SMILES representation of the large data set of molecular structures. The memory 110 may also be configured to store the binary representation of the molecular structures, and to store the sample of molecules having the properties desired by the user.

The memory 110 may include one or more computer-readable storage media. The memory 110 may also include non-volatile storage elements. Examples of such non-volatile storage elements include magnetic hard disks, optical disks, floppy disks, flash memory, electrically programmable memory (EPROM), and electrically erasable programmable read-only memory (EEPROM).

In addition, the memory 110 may be regarded as a non-transitory storage medium in one embodiment. "Non-transitory" may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term "non-transitory" should not be interpreted to mean that the memory 110 is non-movable. In one embodiment, a non-transitory storage medium may store data that can change over time (for example, in random access memory (RAM) or a cache).

FIG. 1 shows exemplary units of the system 100, but the system is not limited to these. In one embodiment, the system 100 may include fewer or more units. The labels and names of the units are used for illustrative purposes only and do not limit the scope of the embodiments. In addition, one or more units may be combined to perform the same or a substantially similar function in the system 100.

FIG. 2 is a flowchart illustrating a method of designing molecules, according to one embodiment.

In step 202, the method of designing molecules may include collecting or receiving a large data set of molecular structures from an external molecular structure database, which may be, for example, the PubChem database. The method allows the system 100 to collect the large data set of molecular structures from the external molecular structure database.

In step 204, the method of designing molecules includes representing the structure of every molecule using SMILES. This step allows the SMILES representation unit 102 to represent the structure of every molecule using SMILES, which uses a series of characters to represent a molecular structure. For example, a benzene ring can be written in SMILES as C1=CC=CC=C1.

In step 206, the method of designing molecules may include converting the SMILES representation of every molecule in the large data set of molecular structures into a binary representation. This step allows the binary representation unit 104 to perform the conversion. Each character of the SMILES representation may be converted into its corresponding ASCII code, which is then converted into a binary number.

In step 208, the method of designing molecules may include pre-training a stack of Restricted Boltzmann Machines (RBMs) using the binary representation of the large data set of molecular structures. The contrastive divergence algorithm may be used to pre-train the stack of RBMs. One embodiment allows the molecular structure generation unit 106 to pre-train the stack of RBMs using the binary representation of the large data set of molecular structures.
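
As a rough illustration of contrastive divergence pre-training, the following sketch trains a single binary RBM with one Gibbs step per update (CD-1). It is a minimal sketch under stated assumptions, not the patent's implementation: the class `RBM`, its method names, the learning rate, the layer sizes, and the toy data are all hypothetical, and a practical system would use a vectorized numerical library and mini-batches.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


class RBM:
    """Binary-binary restricted Boltzmann machine trained with CD-1."""

    def __init__(self, n_visible: int, n_hidden: int, rng: random.Random):
        self.rng = rng
        # W[i][j] couples visible unit i to hidden unit j.
        self.W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b = [0.0] * n_visible  # visible biases
        self.c = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        return [sigmoid(self.c[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.c))]

    def visible_probs(self, h):
        return [sigmoid(self.b[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b))]

    def sample(self, probs):
        return [1 if self.rng.random() < p else 0 for p in probs]

    def cd1_update(self, v0, lr: float = 0.1):
        """Apply one CD-1 parameter update for a single binary vector v0."""
        ph0 = self.hidden_probs(v0)                # positive phase
        h0 = self.sample(ph0)
        v1 = self.sample(self.visible_probs(h0))   # one Gibbs step
        ph1 = self.hidden_probs(v1)                # negative phase
        for i in range(len(v0)):
            for j in range(len(ph0)):
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            self.b[i] += lr * (v0[i] - v1[i])
        for j in range(len(ph0)):
            self.c[j] += lr * (ph0[j] - ph1[j])


# Toy binary vectors (hypothetical data, not real SMILES encodings).
data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
rbm = RBM(n_visible=6, n_hidden=3, rng=random.Random(0))
initial_W = [row[:] for row in rbm.W]
for _ in range(200):
    for v in data:
        rbm.cd1_update(v)
hidden = rbm.hidden_probs(data[0])  # hidden activations for the first vector
```

In greedy layer-wise pre-training, the hidden probabilities of one trained RBM would serve as the input data for the next RBM in the stack.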

In step 210, the method of designing molecules may include constructing a Deep Boltzmann Machine (DBM) by stacking the pre-trained RBMs together. One embodiment allows the molecular structure generation unit 106 to construct the DBM by stacking the pre-trained RBMs together.

In step 212, the method of designing molecules may include determining limited molecular property data for a subset of the molecular structures in the large data set. One embodiment allows the molecular structure generation unit 106 to determine this limited molecular property data. The limited molecular property data may be determined based on experimental results and on quantum calculations obtained by applying Density Functional Theory (DFT) to the subset of molecular structures.

In step 214, the method of designing molecules may include training the DBM with the limited molecular property data. One embodiment allows the molecular structure generation unit 106 to train the DBM with the limited molecular property data. The trained DBM can be used to predict the properties of a molecule given its molecular structure.

In step 216, the method of designing molecules includes combining the stack of pre-trained RBMs and the trained DBM, together with the limited molecular property data, in a Bayesian inference framework to generate a sample of molecules having properties desired by a user. One embodiment allows the molecular structure generation unit 106 to perform this combination.
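
One way to read this combination is as an application of Bayes' rule. The notation below is an assumption introduced for illustration (S for a binary-encoded molecular structure, Y for its property value, U for the user's target property range). The source does not state at this point which model supplies which term. A natural reading is that the unsupervised, pre-trained RBM stack plays the role of the prior over structures, and the property-trained DBM plays the role of the likelihood.

```latex
% Assumed notation: S = binary-encoded structure, Y = property value,
% U = user-specified target range for the property.
P(S \mid Y \in U)
  \;=\; \frac{P(Y \in U \mid S)\, P(S)}{P(Y \in U)}
  \;\propto\; P(Y \in U \mid S)\, P(S)
```

Because the normalizing constant P(Y ∈ U) does not depend on S, a sampler only needs the product of the likelihood and the prior to draw structures from the posterior.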

In step 218, the method of designing molecules includes determining the properties required by the user for a particular application. One embodiment allows the molecular structure generation unit 106 to determine these properties. The system 100 may be configured to receive the desired properties as user input, and the molecular structure generation unit 106 may be configured to determine, from the properties the user requested, the properties specified for the particular application. For example, desired reduction potential and oxidation potential values may be specified for the design of a stable electrolyte for a lithium-ion battery.

In step 220, the method of designing molecules may include determining whether a particular substructure is required. The substructure may be specified according to the user's requirements.

If, in step 220, the molecular structure generation unit 106 determines that no particular substructure is required, then in step 122 the molecular structure generation unit 106 may use the Markov Chain Monte Carlo (MCMC) method to determine a set of molecules with the desired properties from the posterior distribution of the Bayesian inference.

단계 220에서, 분자 구조 생성 유닛(106)이 특정 하위 구조가 필요하다고 결정하면, 단계 124에서 분자 구조 생성 유닛(106)은 이러한 분자를 얻기 위해 조건부 MCMC 샘플링을 사용하도록 구성될 수 있다. 예를 들어, 고분자의 사슬 끝에 고정된 에틸렌 옥사이드(ethylene oxide)와 같이 고정된 특성을 갖는 분자가 필요하다면, 조건부 MCMC 샘플링을 사용하여 고정된 특성을 갖는 분자를 얻을 수 있다.In step 220, if molecular structure generation unit 106 determines that a particular substructure is needed, then in step 124 molecular structure generation unit 106 may be configured to use conditional MCMC sampling to obtain such molecules. For example, if a molecule having fixed properties such as ethylene oxide immobilized at the chain end of the polymer is needed, conditional MCMC sampling can be used to obtain molecules with fixed properties.

The various operations, blocks, or steps in the flowchart 200 may be performed in a different order or simultaneously.

In addition, in one embodiment, some of the operations, blocks, or steps may be omitted, added, or modified without departing from the scope of the present disclosure.

FIG. 3 is a flowchart illustrating a deep learning Bayesian framework for designing molecules, according to one embodiment.

The deep learning Bayesian framework may perform a preprocessing stage 310 that includes steps 311, 312, 313, and 314.

In step 311, a large data set of molecular structures may be collected from an external molecular structure database.

In step 312, the structures of the molecules in the large data set may be represented using SMILES. SMILES uses a series of characters to represent a molecular structure.

In step 313, the SMILES representations of the molecules in the large data set may be converted to binary. Specifically, each character of a SMILES representation may be converted to its corresponding ASCII code, and the ASCII code may then be converted to an 8-bit binary number.
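As a minimal sketch of the conversion described in step 313, the following Python functions map each SMILES character to its ASCII code and then to eight bits. The function names and the most-significant-bit-first ordering are illustrative assumptions; the description states only that each character becomes an 8-bit binary number.

```python
def smiles_to_bits(smiles: str) -> list[int]:
    """Encode each SMILES character as its 8-bit ASCII code (MSB first)."""
    bits: list[int] = []
    for ch in smiles:
        code = ord(ch)
        if code > 127:
            raise ValueError(f"non-ASCII character in SMILES string: {ch!r}")
        # format(code, "08b") gives a zero-padded 8-character binary string.
        bits.extend(int(b) for b in format(code, "08b"))
    return bits


def bits_to_smiles(bits: list[int]) -> str:
    """Invert smiles_to_bits by reading the bit vector one byte at a time."""
    if len(bits) % 8 != 0:
        raise ValueError("bit vector length must be a multiple of 8")
    chars = []
    for start in range(0, len(bits), 8):
        byte = bits[start:start + 8]
        chars.append(chr(int("".join(str(b) for b in byte), 2)))
    return "".join(chars)


ethanol = "CCO"                      # SMILES string for ethanol
encoded = smiles_to_bits(ethanol)    # 3 characters -> 24 bits
decoded = bits_to_smiles(encoded)
```

Because the mapping is invertible, any bit vector produced later by sampling can be decoded back to a character string, although nothing in this sketch guarantees that the decoded string is chemically valid SMILES.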

In step 314, the system 100 may determine a property data set for a small subset of the data containing molecular structures.

In step 320, the 8-bit binary numbers may be used to train a machine learning model called a Restricted Boltzmann Machine (RBM) in an unsupervised manner. In addition, although not shown in the flowchart, another RBM may be trained in a supervised manner using the property data set determined from the small data subset. This other RBM may be a Gaussian-Bernoulli Restricted Boltzmann Machine (GBRBM).

In step 330, the system 100 may construct a Deep Boltzmann Machine (DBM) by combining the stack of RBMs trained on the binary representation of the molecules with the GBRBM trained on the property data set. The system 100 may then train the DBM, together with the combined RBM and GBRBM, in a supervised manner. Training the DBM yields information about the correlation between the structures and the properties of molecules, which makes it possible to predict the properties of a given molecule.

In step 340, the molecular structure generation unit 106 of the system 100 may be configured to combine the stack of pre-trained RBMs and the trained DBM, together with the limited molecular property data, in the Bayesian inference framework to generate a sample of molecules having the properties desired by the user.

The Bayesian framework uses the principle that the posterior probability is directly proportional to the product of the likelihood and the prior probability. The prior probability may represent existing knowledge about a given molecular structure, and it may indicate whether the given molecular structure is valid. The likelihood may represent the probability distribution of the desired properties and may be defined in terms of all the properties required for the new application. The likelihood may be determined by the DBM.
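The proportionality stated above can be written compactly as follows. The symbols are assumptions introduced here for illustration and do not appear in the description: S denotes a candidate molecular structure and Y* the set of property values desired by the user.

```latex
\underbrace{P(S \mid Y^{*})}_{\text{posterior}}
\;\propto\;
\underbrace{P(Y^{*} \mid S)}_{\text{likelihood (from the DBM)}}
\;\times\;
\underbrace{P(S)}_{\text{prior (from the RBM)}}
```

The normalizing constant is omitted because MCMC sampling only requires the posterior up to a constant factor.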

FIG. 4 is a flowchart illustrating the Simplified Molecular Input Line Entry System (SMILES) representation of molecules, according to one embodiment. SMILES may serve as the input for designing molecules with desired properties.

In step 410, molecules and their respective properties may be obtained from various databases, for example PubChem and KHAZANA.

In step 420, the molecules may be represented using SMILES, and the SMILES representation of a molecule makes it possible to identify the structure of the molecule.

In step 430, each character of the SMILES representation may be converted to an 8-bit binary number. Specifically, each character is first converted to its corresponding ASCII code, and the ASCII code is then converted to an 8-bit binary number. The 8-bit binary numbers may be used in the machine learning process.

FIG. 5 is a flowchart illustrating unsupervised learning of molecular structures using a Restricted Boltzmann Machine (RBM), according to one embodiment.

In one embodiment, the RBM may be used to estimate the probability that the structure of a candidate molecule, which is expected to have the properties desired by the user, is a valid molecular structure.

In step 510, the system 100 may represent the structures of molecules using SMILES. SMILES may use a set of characters to represent every molecule in a large data source such as the PubChem database.

In step 520, the system 100 may convert the SMILES representations of the molecular structures into a binary representation (a database of binary random variables).

In step 530, the system 100 may train the RBMs (that is, the DBN) using the binary representation of the molecular structures in order to obtain a probability density function over molecular structures. This probability density function may be used to determine whether a molecular structure is valid.
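The following is a minimal sketch of how a single binary RBM could be trained on such bit vectors with one step of contrastive divergence (CD-1), the algorithm named in the claims. The class name `TinyRBM`, the layer sizes, the learning rate, and the single-pattern training loop are all assumptions made for illustration; the description does not specify these details, and a real system would train a stack of RBMs on a large data set. The free energy is included because it gives an unnormalized negative log probability of a visible vector, which is one way a trained RBM can score how plausible a structure is.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyRBM:
    """Bernoulli-Bernoulli RBM with hypothetical sizes, trained by CD-1."""

    def __init__(self, n_visible: int, n_hidden: int, seed: int = 0):
        self.rng = random.Random(seed)
        self.W = [[self.rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b = [0.0] * n_visible  # visible biases
        self.c = [0.0] * n_hidden   # hidden biases

    def _hidden_input(self, v, j):
        return self.c[j] + sum(v[i] * self.W[i][j] for i in range(len(v)))

    def hidden_probs(self, v):
        return [sigmoid(self._hidden_input(v, j)) for j in range(len(self.c))]

    def visible_probs(self, h):
        return [sigmoid(self.b[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b))]

    def sample(self, probs):
        return [1 if self.rng.random() < p else 0 for p in probs]

    def cd1_update(self, v0, lr: float = 0.1):
        """One CD-1 step: data statistics minus one-step reconstruction statistics."""
        ph0 = self.hidden_probs(v0)
        v1 = self.sample(self.visible_probs(self.sample(ph0)))
        ph1 = self.hidden_probs(v1)
        for i in range(len(v0)):
            for j in range(len(ph0)):
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            self.b[i] += lr * (v0[i] - v1[i])
        for j in range(len(ph0)):
            self.c[j] += lr * (ph0[j] - ph1[j])

    def free_energy(self, v):
        """F(v) = -b.v - sum_j log(1 + exp(c_j + W_j.v)); lower means more probable."""
        visible_term = sum(bi * vi for bi, vi in zip(self.b, v))
        hidden_term = sum(math.log1p(math.exp(self._hidden_input(v, j)))
                          for j in range(len(self.c)))
        return -visible_term - hidden_term


pattern = [0, 1, 0, 0, 0, 0, 1, 1]       # 8-bit ASCII code of the character "C"
rbm = TinyRBM(n_visible=8, n_hidden=4)
for _ in range(200):
    rbm.cd1_update(pattern)
```

After training, `rbm.free_energy(pattern)` should be lower than the free energy of bit vectors the model never saw, which is the sense in which the RBM assigns a higher probability to familiar structures.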

FIG. 6 is a schematic diagram illustrating the construction of a Deep Boltzmann Machine (DBM) from Restricted Boltzmann Machines (RBMs) to predict the properties of a given molecule, according to one embodiment.

In one embodiment, the DBM may be used to predict the properties of a given molecular structure. The steps for predicting these properties are as follows.

Step 1: A multilayer Deep Belief Network (DBN) may be used to learn molecular structures. In one embodiment, molecular structures are represented using SMILES, where SMILES uses a series of characters to represent every molecule in a large data set of molecular structures. The embodiment may further include converting the SMILES representations of the molecular structures into a binary representation, and training the RBMs (that is, the DBN) using the binary representation of the molecular structures.

Step 2: A Restricted Boltzmann Machine (RBM) consisting of two layers may be used to learn the available properties of the molecular structures. Such a property may be, for example, conductivity. In one embodiment, the properties of the molecular structures may be used to train this RBM (that is, a Gaussian-Bernoulli Restricted Boltzmann Machine, GB-RBM).

Step 3: To correlate molecular structures with properties, the DBN may be connected to the RBM to construct a Deep Boltzmann Machine (DBM). The DBM may then be trained on the molecular structures and properties in order to predict property values for a given molecular structure.

FIG. 7 is a flowchart illustrating a Bayesian inference framework for the design of molecules, according to one embodiment.

In one embodiment, a Markov Chain Monte Carlo (MCMC) technique may be used to sample from the posterior distribution in order to generate, with the Bayesian inference framework, a sample of molecules having the properties desired by the user.

In step 710, the MCMC technique may start from an arbitrary molecular structure.

In step 720, the prior probability may be estimated using the RBM. The prior probability may indicate whether a given molecular structure is valid. The trained RBM may be used to obtain the prior probability of the molecular structure.

In step 730, the properties of the molecular structure may be predicted, and the trained DBM may be used for this prediction.

The likelihood of the predicted property values may be calculated by comparing the predicted property values with the property values specified by the user. The posterior probability may be calculated by multiplying the likelihood by the prior probability.
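The comparison of predicted and user-specified property values can be sketched as below. The description does not state the functional form of the likelihood, so the Gaussian form and the tolerance `sigma` are assumptions; `predicted` stands in for the output of the trained DBM and `log_prior` for the score from the trained RBM, both supplied here as plain numbers. Working in log space turns the product of likelihood and prior into a sum and avoids numerical underflow.

```python
import math


def log_likelihood(predicted: float, target: float, sigma: float = 0.1) -> float:
    """Gaussian log-likelihood of the target value given the prediction (assumed form)."""
    z = (predicted - target) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def log_posterior(log_prior: float, predicted: float, target: float,
                  sigma: float = 0.1) -> float:
    """Unnormalized log posterior: log likelihood plus log prior."""
    return log_likelihood(predicted, target, sigma) + log_prior


target_potential = 4.8   # desired property value, in volts
on_target = log_posterior(log_prior=-2.0, predicted=4.8, target=target_potential)
off_target = log_posterior(log_prior=-2.0, predicted=4.0, target=target_potential)
```

With equal priors, the candidate whose predicted value matches the target receives the higher posterior score, so `on_target` is greater than `off_target`.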

In step 740, a molecule predicted to have the properties desired by the user may be sampled from a proposal distribution. Specifically, the proposed molecular structure may be generated by sampling from a heated RBM, where the heated RBM is defined by multiplying the weights and biases of the trained RBM by a uniform random number.
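One possible reading of the heated RBM is sketched below. It assumes that a single scalar drawn uniformly from [0, 1) scales every weight and bias, and that a proposal is produced by one Gibbs sweep starting from the current bit vector; the description specifies neither choice, and the parameter values are hypothetical placeholders for a trained model.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def heat_rbm(W, b, c, rng):
    """Scale all weights and biases by one uniform random number u in [0, 1)."""
    u = rng.random()
    hot_W = [[u * w for w in row] for row in W]
    return hot_W, [u * x for x in b], [u * x for x in c], u


def propose(v, W, b, c, rng):
    """One Gibbs sweep (visible -> hidden -> visible) through a freshly heated RBM."""
    hW, hb, hc, _ = heat_rbm(W, b, c, rng)
    h = []
    for j in range(len(hc)):
        p = sigmoid(hc[j] + sum(v[i] * hW[i][j] for i in range(len(v))))
        h.append(1 if rng.random() < p else 0)
    new_v = []
    for i in range(len(hb)):
        p = sigmoid(hb[i] + sum(h[j] * hW[i][j] for j in range(len(h))))
        new_v.append(1 if rng.random() < p else 0)
    return new_v


rng = random.Random(0)
W = [[2.0, -1.0], [-1.5, 0.5], [0.5, 1.0]]   # hypothetical weights: 3 visible x 2 hidden
b = [0.2, -0.3, 0.1]                         # hypothetical visible biases
c = [0.0, 0.4]                               # hypothetical hidden biases
candidate = propose([1, 0, 1], W, b, c, rng)
```

Scaling the parameters by a factor below one flattens the RBM's distribution, much like raising a temperature, so proposals wander further from the current structure than samples from the unheated model would.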

In step 750, the prior probability of the proposed molecule may be obtained. The trained RBM may be used to obtain this prior probability.

In step 760, the properties of the proposed molecule may be predicted. The trained DBM may be used for this prediction. To obtain the likelihood of the proposed molecule, the properties predicted for the proposed molecule may be compared with the properties desired by the user. The posterior probability of the proposed molecule may be obtained by multiplying the likelihood by the prior probability.

In step 770, the acceptance probability of the proposed molecule may be calculated. The acceptance probability may be defined as the ratio of the posterior probability of the proposed molecule to that of the current molecule.

In step 780, the proposed molecule may be accepted with the probability given by the acceptance probability. This procedure may be repeated for a predetermined number of iterations.
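Steps 710 through 780 follow the shape of a Metropolis-type sampler, which can be sketched generically as below. The function `metropolis` and the toy target are assumptions for illustration: the toy log posterior simply rewards closeness to a fixed bit pattern, and the proposal flips one random bit, standing in for the DBM, RBM, and heated-RBM components of the description. Capping the ratio at one is the standard Metropolis rule and is assumed here. The ratio alone is an exact acceptance rule only for a symmetric proposal such as the bit flip used in this toy.

```python
import math
import random
from collections import Counter


def metropolis(initial, log_posterior, propose, n_iter, rng):
    """Accept each proposal with probability min(1, posterior ratio)."""
    current, current_lp = initial, log_posterior(initial)
    samples = []
    for _ in range(n_iter):
        candidate = propose(current, rng)
        candidate_lp = log_posterior(candidate)
        log_ratio = candidate_lp - current_lp   # log of the posterior ratio
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_lp = candidate, candidate_lp
        samples.append(current)
    return samples


TARGET = (1, 0, 1, 1)   # hypothetical "ideal" bit pattern for the toy posterior


def toy_log_posterior(state):
    """Penalize each bit that differs from TARGET (stand-in for likelihood x prior)."""
    return -2.0 * sum(s != t for s, t in zip(state, TARGET))


def flip_one_bit(state, rng):
    """Symmetric toy proposal: flip one randomly chosen bit."""
    k = rng.randrange(len(state))
    return state[:k] + (1 - state[k],) + state[k + 1:]


samples = metropolis((0, 0, 0, 0), toy_log_posterior, flip_one_bit,
                     n_iter=2000, rng=random.Random(0))
most_common_state = Counter(samples).most_common(1)[0][0]
```

In this toy run the chain spends most of its time at the target pattern, which mirrors how the sampler in the description concentrates on structures with a high posterior probability.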

FIG. 8 is a diagram illustrating an example of predicting molecules having a redox potential greater than 4.8 V, according to one embodiment.

The system 100 may use the Bayesian inference framework to generate molecules having a redox potential greater than 4.8 V.

For example, the Bayesian inference framework may predict a total of five valid molecular structures having a redox potential of 4.8 V or higher.

The process for designing molecules using the machine learning algorithm described above is summarized in Table 1 below.

Function | Method
Molecular descriptor | SMILES-based representation of molecular structure
Structure-property correlation | Deep Boltzmann Machine (DBM)
Validation of molecular structure | Restricted Boltzmann Machine (RBM)
Generation of molecules having the properties desired by the user | Bayesian inference using the Markov Chain Monte Carlo (MCMC) technique

According to Table 1, molecular structures may be represented using SMILES as the molecular descriptor. Information about the correlation between the structures and the properties of molecules may be obtained by training the Deep Boltzmann Machine (DBM).

In addition, the RBM may be used to verify whether a molecular structure is valid.

Finally, Bayesian inference using the Markov Chain Monte Carlo (MCMC) technique may be used to generate molecules having the properties desired by the user.

One embodiment may be implemented through at least one software program running on at least one hardware device, and may perform network management functions to control the components. The components shown in FIG. 1 may each be a hardware device or a combination of a hardware device and a software module.

A device according to the present embodiments may include a processor, a memory for storing and executing program data, permanent storage such as a disk drive, a communication port for communicating with external devices, and user interface devices such as a touch panel, keys, and buttons. Methods implemented as software modules or algorithms may be stored on a computer-readable recording medium as computer-readable code or program instructions executable on the processor. Computer-readable recording media include magnetic storage media (for example, read-only memory (ROM), random-access memory (RAM), floppy disks, and hard disks) and optically readable media (for example, CD-ROMs and Digital Versatile Discs (DVDs)). The computer-readable recording medium may be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed fashion. The medium is readable by a computer, may be stored in the memory, and may be executed by the processor.

The present embodiment may be described in terms of functional block components and various processing steps. Such functional blocks may be implemented by any number of hardware and/or software components that perform particular functions. For example, an embodiment may employ integrated circuit components, such as memory, processing, logic, and look-up table elements, that can carry out various functions under the control of one or more microprocessors or other control devices. Just as the components may be implemented with software programming or software elements, the present embodiment may be implemented in a programming or scripting language such as C, C++, Java, or assembler, with the various algorithms implemented as any combination of data structures, processes, routines, or other programming constructs. Functional aspects may be implemented as algorithms that run on one or more processors. In addition, the present embodiment may employ conventional techniques for electronic configuration, signal processing, and/or data processing. Terms such as "mechanism", "element", "means", and "configuration" are used broadly and are not limited to mechanical or physical components. These terms may include the meaning of a series of software routines operating in conjunction with a processor or the like.

The specific implementations described in the present embodiment are examples and do not limit the technical scope in any way. For brevity, descriptions of conventional electronic components, control systems, software, and other functional aspects of the systems may be omitted. In addition, the connecting lines or connecting members between components shown in the drawings illustrate functional connections and/or physical or circuit connections by way of example; in an actual device they may be replaced by, or supplemented with, various other functional, physical, or circuit connections.

In this specification (particularly in the claims), the term "the" and similar referring terms may cover both the singular and the plural. Where a range is stated, it includes each individual value falling within that range (unless stated otherwise), as if each individual value constituting the range were recited in the detailed description. Finally, unless an order is explicitly stated for the steps constituting a method, or stated to the contrary, the steps may be performed in any suitable order and are not necessarily limited to the order in which they are described.

The present invention has been described above with reference to its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a limiting sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the scope of equivalents thereof should be construed as being included in the present invention.

Claims

A method of designing molecules using a machine learning algorithm, the method comprising:
representing, by a Simplified Molecular Input Line Entry System (SMILES) representation unit, molecular structures included in a data set using SMILES, wherein SMILES uses a series of characters;
converting, by a binary representation unit, the SMILES representation of the molecular structures into a binary representation;
pre-training, by a molecular structure generation unit, a stack of Restricted Boltzmann Machines (RBMs) using the binary representation of the molecular structures;
constructing, by the molecular structure generation unit, a Deep Boltzmann Machine (DBM) using the stack of RBMs;
determining, by the molecular structure generation unit, limited molecular property data by applying Density Functional Theory (DFT) to a subset of the molecular structures in the data set;
training, by the molecular structure generation unit, the DBM with the limited molecular property data;
combining, by the molecular structure generation unit, the stack of pre-trained RBMs and the trained DBM in a Bayesian inference framework; and
generating, by the molecular structure generation unit, a sample of molecules having properties desired by a user using the Bayesian inference framework.

The method of claim 1, wherein the generating further comprises
determining, by the molecular structure generation unit, molecules having the properties and substructures desired by the user using conditional Markov Chain Monte Carlo (MCMC) sampling.

The method of claim 1, further comprising
storing, by the molecular structure generation unit, parameters of the pre-trained RBMs.

The method of claim 1, further comprising
predicting, by the molecular structure generation unit, properties of a given molecule using the trained DBM.

The method of claim 1, wherein
the stack of RBMs is pre-trained using a contrastive divergence algorithm.

The method of claim 1, further comprising
obtaining, by the molecular structure generation unit, a prior probability of a molecular structure using the pre-trained RBM to verify whether the molecular structure is valid.

A system for designing molecules using a machine learning algorithm, the system comprising:
a SMILES representation unit configured to represent molecular structures included in a data set using SMILES, wherein SMILES uses a series of characters;
a binary representation unit configured to convert the SMILES representation of the molecular structures into a binary representation; and
a molecular structure generation unit configured to pre-train a stack of Restricted Boltzmann Machines (RBMs) using the binary representation of the molecular structures, construct a Deep Boltzmann Machine (DBM) using the stack of RBMs, determine limited molecular property data by applying Density Functional Theory (DFT) to a subset of the molecular structures in the data set, train the DBM with the limited molecular property data, combine the stack of pre-trained RBMs and the trained DBM in a Bayesian inference framework, and generate a sample of molecules having properties desired by a user using the Bayesian inference framework.

The system of claim 7, wherein the molecular structure generation unit determines molecules having the desired properties and substructures using conditional Markov Chain Monte Carlo (MCMC) sampling.

The system of claim 7, wherein the molecular structure generation unit stores parameters of the pre-trained RBMs.

The system of claim 7, wherein the molecular structure generation unit predicts properties of a given molecule using the trained DBM.

The system of claim 7, wherein the molecular structure generation unit pre-trains the stack of RBMs using a contrastive divergence algorithm.

The system of claim 7, wherein the molecular structure generation unit obtains a prior probability of a molecular structure using the pre-trained RBM to verify whether the molecular structure is valid.