WO2024066489A1 - Configuration method for drug research and development database, and system

Info

Publication number: WO2024066489A1
Application number: PCT/CN2023/100464
Authority: WO
Inventors: 倪海洪; 罗子涵
Original assignee: 苏州雅深智慧科技有限公司
Priority date: 2022-09-30
Filing date: 2023-06-15
Publication date: 2024-04-04
Also published as: CN117854628A

Abstract

The present invention provides a configuration method for a drug research and development database, comprising: acquiring relevant data from public databases; processing, associating and matching the relevant data; retrieving and displaying the data that a user queries; reprocessing the protein crystal structure in which a ligand is located, so that the binding mode between the ligand and the protein is easier to understand and display; aligning the amino acid sequences of a plurality of targets so as to visually display the similarities and differences between the sequences; and processing a plurality of protein crystal structures so as to visually display the structural relationships between the proteins. The present invention can effectively improve the efficiency of drug research and development personnel and provide more research and development ideas, and can increase the degree of data integration across a large number of databases, thereby lowering the threshold for research and development and improving its efficiency.

Description

A configuration method and system for a drug research and development database

Technical Field

The present invention relates to the field of data supervision, and in particular to a configuration method and system for a drug research and development database.

Background Art

Drug research and development requires a substantial investment of manpower and material resources. The initial drug design stage in particular requires a large amount of data to support the design work of researchers. Owing to the particular nature of medicinal chemistry, the data relevant to drug design are scattered across various public databases, which makes them difficult for researchers to find and use. In addition, the tools that researchers use for drug design are usually separate client applications; exchanging data between these programs is difficult, and using them involves a certain technical threshold.

Improving data integration across a large number of databases, lowering the threshold for research and development, and improving research and development efficiency have long been a research focus and a technical challenge in this field.

Summary of the Invention

In view of the above problems, the present invention is proposed to provide a configuration method and system for a drug research and development database that overcome the above problems or at least partially solve them.

According to one aspect of the present invention, a configuration method for a drug research and development database is provided, comprising:

obtaining relevant data from public databases;

processing, associating and matching the relevant data;

retrieving and displaying the data that a user queries;

reprocessing the protein crystal structure in which a ligand is located, so that the binding mode between the ligand and the protein is easier to understand and display;

aligning the amino acid sequences of multiple targets, so as to visually display the similarities and differences between the sequences;

processing multiple protein crystal structures, so that the structural relationships between the proteins can be displayed intuitively.

Optionally, the relevant data specifically include:

drug data, target data, protein crystal structure data, indication data, biological activity data, lead compound data and mutation data.

Optionally, the method for processing the drug data includes:

obtaining, from each drug data table, the data that define the drug type, selecting the drugs marked as small molecules, and storing other types of drugs separately;

finding, among the small molecule drugs, the data that define the drug structure, and using SMILES directly as the means of identification;

importing the SMILES into the open-source module RDKit, and using RDKit to convert the SMILES into a unified RDkit_SMILES;

comparing all RDkit_SMILES directly, treating drugs from different data sources that have the same RDkit_SMILES as the same drug, and merging their data;

obtaining the DrugBank ID by matching against the data in the DrugBank database, and using it as the primary key of the data table to associate it with other data tables.

Optionally, the method for processing the target data includes:

obtaining the classification data of the targets from the target data tables;

classifying the targets according to the classification scheme;

for targets lacking classification information, marking the classification as TBD, pending confirmation of the classification by other means;

merging the target data by UniProt ID, and using the UniProt ID as the unique primary key to associate them with other data tables.

Optionally, the method for processing the protein crystal structure data includes:

extracting data from each protein three-dimensional data file to obtain the basic information of the file;

ignoring the PDB entries whose HEADER indicates that they are not proteins, and retaining only the protein three-dimensional data files that belong to proteins;

obtaining the UniProt ID from the detailed information of each protein three-dimensional data file, for association with the target data;

using the PDB ID of each protein three-dimensional data file as the primary key to associate it with other data tables.

Optionally, the method for processing the indication data includes:

obtaining indication data from the database, matching the data by synonyms, and merging indications that have the same name;

associating the indications with the drug data through the DrugBank ID, and with the clinical trial information in the Clinical Trials database through the NCT number.

Optionally, the method for processing the biological activity data includes:

obtaining biological activity data from the database, including compound data, target data, and the experimental results relating the two;

associating the compound data with the drug data through SMILES, and the target data with the target data table through UniProt ID, to facilitate subsequent retrieval.

Optionally, the method for processing the lead compound data specifically includes:

obtaining all compound data from the biological activity test data, screening the compound data, and selecting the data whose data type and data value both meet the requirements;

identifying the SMILES of the selected data, and merging the data of identical molecules;

after matching the molecules against the ChEMBL database, using the ChEMBL ID as the primary key to associate them with other data.

Optionally, the method for processing the mutation data specifically includes:

obtaining mutation data from the database, and classifying them into disease-related mutations and ligand-related mutations;

for disease-related mutations, associating the UniProt ID with the disease name, in addition to the mutation site information;

for ligand-related mutations, associating the UniProt ID with the ligand information;

after the data have been organized by UniProt ID, associating them with the targets through the UniProt ID.

The present invention also provides a configuration system for a drug research and development database, the configuration system comprising:

a data acquisition module, configured to obtain relevant data from public databases;

a data processing module, configured to process, associate and match the relevant data;

a retrieval and matching module, configured to retrieve and display the data that a user queries;

a ligand display module, configured to reprocess the protein crystal structure in which a ligand is located, so that the binding mode between the ligand and the protein is easier to understand and display;

a sequence alignment module, configured to align the amino acid sequences of multiple targets, so as to visually display the similarities and differences between the sequences;

a structure alignment module, configured to process multiple protein crystal structures, so that the structural relationships between the proteins can be displayed intuitively.

The configuration method for a drug research and development database provided by the present invention includes: obtaining relevant data from public databases; processing, associating and matching the relevant data; retrieving and displaying the data that a user queries; reprocessing the protein crystal structure in which a ligand is located, so that the binding mode between the ligand and the protein is easier to understand and display; aligning the amino acid sequences of multiple targets, so as to visually display the similarities and differences between the sequences; and processing multiple protein crystal structures, so that the structural relationships between the proteins can be displayed intuitively. The method can effectively improve the efficiency of drug research and development personnel and provide more research and development ideas. It also improves data integration across a large number of databases, lowers the threshold for research and development, and improves research and development efficiency.

The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the contents of the specification, and so that the above and other objects, features and advantages of the present invention are more apparent, specific embodiments of the present invention are set out below.

Brief Description of the Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a configuration method for a drug research and development database provided by an embodiment of the present invention;

FIG. 2 is a block diagram of a configuration system for a drug research and development database provided by an embodiment of the present invention;

FIG. 3 is a flowchart of the method for processing drug data provided by an embodiment of the present invention;

FIG. 4 is a flowchart of the method for processing target data provided by an embodiment of the present invention;

FIG. 5 is a flowchart of the method for processing protein crystal structure data provided by an embodiment of the present invention;

FIG. 6 is a flowchart of the method for processing indication data provided by an embodiment of the present invention;

FIG. 7 is a flowchart of the method for processing biological activity data provided by an embodiment of the present invention;

FIG. 8 is a flowchart of the method for processing lead compound data provided by an embodiment of the present invention;

FIG. 9 is a flowchart of the method for processing mutation data provided by an embodiment of the present invention;

FIG. 10 is a flowchart of the drug search provided by an embodiment of the present invention;

FIG. 11 is a flowchart of the target search provided by an embodiment of the present invention;

FIG. 12 is a flowchart of the indication search provided by an embodiment of the present invention;

FIG. 13 is a flowchart of the lead compound search provided by an embodiment of the present invention;

FIG. 14 is a flowchart of the ligand display provided by an embodiment of the present invention;

FIG. 15 is a flowchart of the sequence alignment provided by an embodiment of the present invention;

FIG. 16 is a flowchart of the structure alignment provided by an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.

The terms "comprising" and "having" and any variations thereof in the description, the claims and the drawings of the present invention are intended to cover a non-exclusive inclusion, for example, the inclusion of a series of steps or units.

The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.

As shown in FIG. 1, a configuration method for a drug research and development database includes:

reprocessing the protein crystal structure in which a ligand is located, so that the binding mode between the ligand and the protein is easier to understand and display;

As shown in FIG. 2, a configuration system for a drug research and development database includes:

a sequence alignment module, configured to align the amino acid sequences of multiple targets, so as to visually display the similarities and differences between the sequences;

The present invention covers the following types of data: drug data, target data, protein crystal structure data, indication data, biological activity data, lead compound data, and mutation data.

The present invention integrates the data as follows:

As shown in FIG. 3, drug data: from each drug data table, find the data that define the drug type (the column name is usually Drug Type), select the drugs marked as Small Molecule, and store other types of drugs separately. Among the small molecule drugs, find the data that define the drug structure, and use SMILES directly as the means of identification. Import the SMILES into the open-source module RDKit, and use RDKit to convert the SMILES into a unified RDkit_SMILES. Compare all RDkit_SMILES directly, treat drugs from different data sources that have the same RDkit_SMILES as the same drug, and merge their data. Obtain the DrugBank ID by matching against the data in the DrugBank database, use it as the primary key of the data table, and associate it with other data tables.
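The merge step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the record layout (`source`, `name`, `smiles`) and the function name `merge_by_canonical_smiles` are hypothetical, and the canonicalizer is passed in as a parameter so that the sketch runs without third-party packages. With RDKit installed, a canonicalizer could be built from `Chem.MolFromSmiles` and `Chem.MolToSmiles`, as noted in the comment.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional


def merge_by_canonical_smiles(
    records: Iterable[dict],
    canonicalize: Callable[[str], Optional[str]],
) -> Dict[str, List[dict]]:
    """Group drug records from different sources by canonical SMILES.

    Records whose SMILES cannot be canonicalized (canonicalize returns None)
    are skipped here; a real pipeline would probably log them for review.
    """
    merged: Dict[str, List[dict]] = defaultdict(list)
    for rec in records:
        key = canonicalize(rec["smiles"])
        if key is None:
            continue
        merged[key].append(rec)
    return dict(merged)


# Stand-in canonicalizer for demonstration only: it just strips whitespace.
# With RDKit, one would instead use something like:
#   from rdkit import Chem
#   def rdkit_canonical(s):
#       mol = Chem.MolFromSmiles(s)   # returns None for unparsable SMILES
#       return Chem.MolToSmiles(mol) if mol is not None else None
def toy_canonical(smiles: str) -> Optional[str]:
    s = smiles.strip()
    return s or None


# Hypothetical records from two made-up sources.
records = [
    {"source": "db_a", "name": "ethanol", "smiles": "CCO"},
    {"source": "db_b", "name": "ethyl alcohol", "smiles": " CCO "},
    {"source": "db_a", "name": "methane", "smiles": "C"},
]
merged = merge_by_canonical_smiles(records, toy_canonical)
```

Each key of `merged` then corresponds to one drug, and the list under it holds every source record to be combined before the DrugBank ID is looked up.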

As shown in FIG. 4, target data: obtain the classification data of the targets from each target data table, and classify the targets according to the classification scheme (Class A, Class B, Class C, Class D, Class F). For targets lacking classification information, mark the classification as "TBD", pending confirmation of the classification by other means. Merge all target data by UniProt ID, and use the UniProt ID as the unique primary key to associate them with other data tables.
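A minimal sketch of the classification and merge step is given below. The field names (`uniprot_id`, `target_class`, `name`), the function `build_target_table`, and the IDs in the example are all hypothetical, and the rule that a known class from any source replaces the "TBD" placeholder is an assumption; the source does not say how conflicting rows are reconciled.

```python
from typing import Dict, Iterable

# The five classes named in the description.
ALLOWED_CLASSES = {"Class A", "Class B", "Class C", "Class D", "Class F"}


def build_target_table(rows: Iterable[dict]) -> Dict[str, dict]:
    """Merge target rows keyed by UniProt ID, defaulting the class to 'TBD'."""
    table: Dict[str, dict] = {}
    for row in rows:
        uid = row["uniprot_id"]
        cls = row.get("target_class")
        if cls not in ALLOWED_CLASSES:
            cls = "TBD"  # missing or unrecognized classification
        entry = table.setdefault(
            uid, {"uniprot_id": uid, "target_class": "TBD", "names": []}
        )
        # Assumption: a known class from any source replaces the placeholder.
        if cls != "TBD":
            entry["target_class"] = cls
        name = row.get("name")
        if name and name not in entry["names"]:
            entry["names"].append(name)
    return table


# Placeholder IDs, not real UniProt accessions.
rows = [
    {"uniprot_id": "UP_X1", "name": "target one"},
    {"uniprot_id": "UP_X1", "name": "target 1", "target_class": "Class A"},
    {"uniprot_id": "UP_X2", "name": "target two"},
]
table = build_target_table(rows)
```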

As shown in FIG. 5, protein crystal structure data: first extract data from each PDB file to obtain the basic PDB information; ignore the PDB entries whose "HEADER" indicates that they are not proteins, and retain only the PDB entries that belong to proteins. Obtain the UniProt ID from the detailed information of each PDB file, for association with the target data. Use the PDB ID of each entry as the primary key to associate it with other data tables.
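The extraction step can be sketched with the fixed-column layout of the legacy PDB format, in which the HEADER record carries the classification and ID code and DBREF records carry cross-references to sequence databases. The sketch below is illustrative only: the set of classifications treated as non-protein is an assumption (the source does not list them), the function name `parse_pdb_summary` is hypothetical, and the ID and accession in the example are made-up placeholders.

```python
from typing import Iterable, List, Optional

# Assumption: classifications treated as "not a protein". The source does not
# enumerate them; this set is illustrative only.
NON_PROTEIN_CLASSES = {"DNA", "RNA", "DNA-RNA HYBRID"}


def parse_pdb_summary(lines: Iterable[str]) -> Optional[dict]:
    """Return PDB ID, classification and UniProt accessions, or None if skipped."""
    pdb_id = None
    classification = None
    uniprot_ids: List[str] = []
    for line in lines:
        record = line[:6]
        if record == "HEADER":
            classification = line[10:50].strip()  # columns 11-50
            pdb_id = line[62:66].strip()          # columns 63-66
        elif record == "DBREF " and line[26:32].strip() == "UNP":
            # Columns 34-41 hold the database accession. The DBREF1/DBREF2
            # records used for long accessions are not handled in this sketch.
            accession = line[33:41].strip()
            if accession and accession not in uniprot_ids:
                uniprot_ids.append(accession)
    if pdb_id is None or classification in NON_PROTEIN_CLASSES:
        return None
    return {
        "pdb_id": pdb_id,
        "classification": classification,
        "uniprot_ids": uniprot_ids,
    }


# Made-up example lines, built column by column.
header = "HEADER".ljust(10) + "TRANSFERASE".ljust(40) + "01-JAN-00".ljust(12) + "1ABC"
dbref = ("DBREF  1ABC A " + "   1" + "  " + " 100" + "  "
         + "UNP".ljust(6) + " " + "P12345".ljust(8))
summary = parse_pdb_summary([header, dbref])
```

The returned `pdb_id` would serve as the primary key of the structure table, and `uniprot_ids` as the link to the target table.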

As shown in FIG. 6, indication data: obtain indication data from the database, match the data by synonyms, and merge indications that have the same name. Associate the indications with the drug data through the DrugBank ID, and with the clinical trial information in the Clinical Trials database through the NCT number.
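One possible reading of the synonym merge is sketched below. The synonym dictionary, the field names, the function `merge_indications` and the placeholder IDs are all hypothetical; the source does not say where the synonyms come from or how names are normalized, so lower-casing is an assumption.

```python
from typing import Dict, Iterable


def merge_indications(rows: Iterable[dict], synonyms: Dict[str, str]) -> Dict[str, dict]:
    """Collapse synonymous indication names and collect their linked IDs."""
    merged: Dict[str, dict] = {}
    for row in rows:
        raw = row["indication"].strip().lower()
        name = synonyms.get(raw, raw)  # map a synonym to its preferred name
        entry = merged.setdefault(name, {"drugbank_ids": [], "nct_numbers": []})
        for key, field in (("drugbank_id", "drugbank_ids"),
                           ("nct_number", "nct_numbers")):
            value = row.get(key)
            if value and value not in entry[field]:
                entry[field].append(value)
    return merged


# Placeholder data; the IDs are not real DrugBank IDs or NCT numbers.
synonyms = {"high blood pressure": "hypertension"}
rows = [
    {"indication": "Hypertension", "drugbank_id": "DB_X1", "nct_number": "NCT_X1"},
    {"indication": "high blood pressure", "drugbank_id": "DB_X2"},
    {"indication": "Asthma", "drugbank_id": "DB_X3", "nct_number": "NCT_X2"},
]
indications = merge_indications(rows, synonyms)
```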

As shown in FIG. 7, biological activity data: obtain biological activity data from the database, including compound data, target data, and the experimental results relating the two. Associate the compound data with the drug data through SMILES, and the target data with the target data table through UniProt ID, to facilitate subsequent retrieval.

As shown in FIG. 8, lead compound data: obtain all compound data from the biological activity test data and screen them, selecting only the data whose data type is Ki, Kd, IC50 or EC50 and whose value does not exceed 1000 nM. Identify the SMILES of the selected data, and merge the data of identical molecules. After matching the molecules against the ChEMBL database, use the ChEMBL ID as the primary key to associate them with other data.
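The screening rule stated above (type in Ki, Kd, IC50, EC50 and value not exceeding 1000 nM) can be written as a simple filter. The record fields (`smiles`, `type`, `value`, `unit`) and the function name are hypothetical, and the sketch assumes that values have already been converted to nM; rows in any other unit are dropped rather than converted.

```python
from typing import Iterable, List

ALLOWED_TYPES = {"Ki", "Kd", "IC50", "EC50"}
MAX_VALUE_NM = 1000.0


def select_lead_candidates(measurements: Iterable[dict]) -> List[dict]:
    """Keep measurements of an allowed type with a value of at most 1000 nM."""
    return [
        m for m in measurements
        if m["type"] in ALLOWED_TYPES
        and m["unit"] == "nM"  # assumption: no unit conversion is attempted
        and m["value"] <= MAX_VALUE_NM
    ]


# Made-up measurements for illustration.
measurements = [
    {"smiles": "CCO", "type": "IC50", "value": 250.0, "unit": "nM"},
    {"smiles": "CCN", "type": "IC50", "value": 5000.0, "unit": "nM"},
    {"smiles": "CCC", "type": "Potency", "value": 10.0, "unit": "nM"},
    {"smiles": "CCl", "type": "Ki", "value": 1000.0, "unit": "nM"},
]
leads = select_lead_candidates(measurements)
```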

As shown in FIG. 9, mutation data: obtain mutation data from the database, and classify them into disease-related mutations and ligand-related mutations. For disease-related mutations, the UniProt ID is associated with the disease name in addition to the mutation site information. For ligand-related mutations, the UniProt ID is associated with the ligand information. After the data have been organized by UniProt ID, they are associated with the targets through the UniProt ID.

The functional modules of the present invention are: a retrieval and matching module, a ligand display module, a sequence alignment module, and a structure alignment module.

Retrieval and matching module:

As shown in FIG. 10, drug search: the user enters a SMILES string or a drug name to retrieve drug data. The backend uses the DrugBank ID in the drug data to match the related target data (UniProt ID), protein crystal structure data (PDB ID) and indication data, and all of the data are combined and displayed together.
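The key-based lookups in the drug search can be illustrated with an in-memory SQLite database. The schema below is a hypothetical sketch: table names, column names and all IDs are placeholders rather than the applicant's design. It also assumes that a user-supplied SMILES string has already been canonicalized in the same way as the stored `rdkit_smiles` column.

```python
import sqlite3

# Hypothetical schema; all table names, column names and IDs are placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE drug (drugbank_id TEXT PRIMARY KEY, name TEXT, rdkit_smiles TEXT);
    CREATE TABLE drug_target (drugbank_id TEXT, uniprot_id TEXT);
    CREATE TABLE structure (pdb_id TEXT PRIMARY KEY, uniprot_id TEXT);
    CREATE TABLE drug_indication (drugbank_id TEXT, indication TEXT);
    INSERT INTO drug VALUES ('DB_X1', 'exampledrug', 'CCO');
    INSERT INTO drug_target VALUES ('DB_X1', 'UP_X1');
    INSERT INTO structure VALUES ('PDBX', 'UP_X1');
    INSERT INTO drug_indication VALUES ('DB_X1', 'example indication');
    """
)


def column(conn, sql, params):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql, params)]


def search_drug(conn, query):
    """Look up a drug by name or canonical SMILES and gather linked records."""
    row = conn.execute(
        "SELECT drugbank_id, name FROM drug WHERE name = ? OR rdkit_smiles = ?",
        (query, query),
    ).fetchone()
    if row is None:
        return None
    drugbank_id, name = row
    return {
        "drugbank_id": drugbank_id,
        "name": name,
        "targets": column(
            conn,
            "SELECT uniprot_id FROM drug_target WHERE drugbank_id = ? "
            "ORDER BY uniprot_id",
            (drugbank_id,),
        ),
        "structures": column(
            conn,
            "SELECT s.pdb_id FROM structure s "
            "JOIN drug_target dt ON dt.uniprot_id = s.uniprot_id "
            "WHERE dt.drugbank_id = ? ORDER BY s.pdb_id",
            (drugbank_id,),
        ),
        "indications": column(
            conn,
            "SELECT indication FROM drug_indication WHERE drugbank_id = ? "
            "ORDER BY indication",
            (drugbank_id,),
        ),
    }


result = search_drug(conn, "exampledrug")
```

The target, indication and lead compound searches follow the same pattern, starting from the UniProt ID, the indication name or the ChEMBL ID instead of the DrugBank ID.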

As shown in FIG. 11, target search: the user enters a UniProt ID or a target name to retrieve target data. The backend uses the UniProt ID in the target data to match the related drug data (DrugBank ID), protein crystal structure data (PDB ID) and mutation data (UniProt ID), and finally displays them together.

As shown in FIG. 12, indication search: the user enters an indication name; after it is matched in the database, the associated drug data are retrieved through the indication and displayed together.

As shown in FIG. 13, lead compound search: the user enters a SMILES string or a ChEMBL ID to retrieve lead compound data. The ChEMBL ID in the lead compound data is used to match the related target data (UniProt ID), the other data of the related targets are read, and everything is finally displayed together.

As shown in FIG. 14, ligand display module: the user selects a protein crystal structure in the PDB display plug-in, and the system displays the list of ligands present in that crystal structure. The user then selects a ligand and enters a radius. The system receives these three pieces of information (the protein crystal structure name, the ligand name and the radius value), reads the protein crystal structure file from the database, performs the calculation with these three parameters, and loads the resulting amino acid residues into the PDB display plug-in, where they are highlighted.
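The source does not spell out the calculation. One common interpretation, assumed here, is to select every residue that has at least one atom within the given radius of any ligand atom. The sketch below works on already-parsed coordinates; the tuple layout and the function name `residues_near_ligand` are hypothetical, and the coordinates are made up.

```python
from typing import Iterable, List, Sequence, Tuple

Coord = Tuple[float, float, float]
# (chain ID, residue number, residue name, atom coordinates)
Atom = Tuple[str, int, str, Coord]


def residues_near_ligand(
    protein_atoms: Iterable[Atom],
    ligand_coords: Sequence[Coord],
    radius: float,
) -> List[Tuple[str, int, str]]:
    """Return residues with any atom within `radius` of any ligand atom."""
    r2 = radius * radius  # compare squared distances to avoid square roots
    hits = set()
    for chain, resseq, resname, (x, y, z) in protein_atoms:
        for lx, ly, lz in ligand_coords:
            if (x - lx) ** 2 + (y - ly) ** 2 + (z - lz) ** 2 <= r2:
                hits.add((chain, resseq, resname))
                break
    return sorted(hits)


# Made-up coordinates in angstroms.
protein_atoms = [
    ("A", 10, "ASP", (0.0, 0.0, 0.0)),
    ("A", 10, "ASP", (1.0, 0.0, 0.0)),
    ("A", 11, "GLY", (10.0, 0.0, 0.0)),
    ("B", 5, "HIS", (0.0, 3.0, 0.0)),
]
ligand_coords = [(0.0, 0.0, 2.0)]
nearby = residues_near_ligand(protein_atoms, ligand_coords, 4.0)
```

The returned residue identifiers are what a viewer plug-in would need in order to highlight the binding pocket.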

As shown in FIG. 15, sequence alignment module: the user enters the names of multiple targets; the sequence information of the corresponding targets is read from the database and used in the calculation, and the similarity results and the alignment are returned.
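The source does not name the alignment algorithm. As an illustration only, the sketch below aligns two sequences with the classic Needleman-Wunsch global alignment and reports percent identity; the scoring values are arbitrary assumptions, and aligning more than two targets at once would normally call for a dedicated multiple sequence alignment tool rather than this pairwise routine.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> Tuple[str, str]:
    """Globally align two sequences and return the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical, non-gap columns as a percentage of the alignment length."""
    if not aligned_a:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


# Short made-up peptide fragments.
aligned = needleman_wunsch("ACDEFG", "ACDFG")
identity = percent_identity(*aligned)
```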

As shown in FIG. 16, structure alignment module: the user enters the IDs of multiple protein crystal structures and selects the Chain ID, the cutoff value and the number of cycles. The specified protein crystal structures are read from the database, the calculation is performed with these parameters, the offset value is reported, and the aligned protein crystal structures are loaded into the PDB display plug-in as a file.

Beneficial effects: through the coordinated use of the above modules, the efficiency of drug research and development personnel can be effectively improved and more research and development ideas can be provided.

The specific embodiments above further describe the objects, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A configuration method for a drug research and development database, characterized in that the configuration method comprises:

obtaining relevant data from public databases;

processing, associating and matching the relevant data;

retrieving and displaying the data that a user queries;

reprocessing the protein crystal structure in which a ligand is located, so that the binding mode between the ligand and the protein is easier to understand and display;

aligning the amino acid sequences of multiple targets, so as to visually display the similarities and differences between the sequences;

processing multiple protein crystal structures, so that the structural relationships between the proteins can be displayed intuitively.

2. The configuration method for a drug research and development database according to claim 1, characterized in that the relevant data specifically include:

drug data, target data, protein crystal structure data, indication data, biological activity data, lead compound data and mutation data.

3. The configuration method for a drug research and development database according to claim 2, characterized in that the method for processing the drug data comprises:

obtaining, from each drug data table, the data that define the drug type, selecting the drugs marked as small molecules, and storing other types of drugs separately;

finding, among the small molecule drugs, the data that define the drug structure, and using SMILES directly as the means of identification;

importing the SMILES into the open-source module RDKit, and using RDKit to convert the SMILES into a unified RDkit_SMILES;

comparing all RDkit_SMILES directly, treating drugs from different data sources that have the same RDkit_SMILES as the same drug, and merging their data;

obtaining the DrugBank ID by matching against the data in the DrugBank database, and using it as the primary key of the data table to associate it with other data tables.

4. The configuration method for a drug research and development database according to claim 2, characterized in that the method for processing the target data comprises:

obtaining the classification data of the targets from the target data tables;

classifying the targets according to the classification scheme;

for targets lacking classification information, marking the classification as TBD, pending confirmation of the classification by other means;

merging the target data by UniProt ID, and using the UniProt ID as the unique primary key to associate them with other data tables.

5. The configuration method for a drug research and development database according to claim 2, characterized in that the method for processing the protein crystal structure data comprises:

extracting data from each protein three-dimensional data file to obtain the basic information of the file;

ignoring the PDB entries whose HEADER indicates that they are not proteins, and retaining only the protein three-dimensional data files that belong to proteins;

obtaining the UniProt ID from the detailed information of each protein three-dimensional data file, for association with the target data;

using the PDB ID of each protein three-dimensional data file as the primary key to associate it with other data tables.

6. The configuration method for a drug research and development database according to claim 2, characterized in that the method for processing the indication data comprises:

obtaining indication data from the database, matching the data by synonyms, and merging indications that have the same name;

associating the indications with the drug data through the DrugBank ID, and with the clinical trial information in the Clinical Trials database through the NCT number.

7. The configuration method for a drug research and development database according to claim 2, characterized in that the method for processing the biological activity data comprises:

obtaining biological activity data from the database, including compound data, target data, and the experimental results relating the two;

associating the compound data with the drug data through SMILES, and the target data with the target data table through UniProt ID, to facilitate subsequent retrieval.

8. The configuration method for a drug research and development database according to claim 2, characterized in that the lead compound data specifically include:

obtaining all compound data from the biological activity test data, screening the compound data, and selecting the data whose data type and data value both meet the requirements;

identifying the SMILES of the selected data, and merging the data of identical molecules;

after matching the molecules against the ChEMBL database, using the ChEMBL ID as the primary key to associate them with other data.

9. The configuration method for a drug research and development database according to claim 2, characterized in that the mutation data specifically include:

obtaining mutation data from the database, and classifying them into disease-related mutations and ligand-related mutations;

for disease-related mutations, associating the UniProt ID with the disease name, in addition to the mutation site information;

for ligand-related mutations, associating the UniProt ID with the ligand information;

after the data have been organized by UniProt ID, associating them with the targets through the UniProt ID.

10. A configuration system for a drug research and development database, characterized in that the configuration system comprises:

a data acquisition module, configured to obtain relevant data from public databases;

a data processing module, configured to process, associate and match the relevant data;

a retrieval and matching module, configured to retrieve and display the data that a user queries;

a ligand display module, configured to reprocess the protein crystal structure in which a ligand is located, so that the binding mode between the ligand and the protein is easier to understand and display;

a sequence alignment module, configured to align the amino acid sequences of multiple targets, so as to visually display the similarities and differences between the sequences;

a structure alignment module, configured to process multiple protein crystal structures, so that the structural relationships between the proteins can be displayed intuitively.