CN111354424B

CN111354424B - Prediction method and device for potential active molecules and computing equipment

Info

Publication number: CN111354424B
Application number: CN202010124320.3A
Authority: CN
Inventors: 杜杰文; 赖力鹏; 温书豪; 马健
Original assignee: Beijing Jingtai Technology Co ltd
Current assignee: Beijing Jingtai Technology Co ltd
Priority date: 2020-02-27
Filing date: 2020-02-27
Publication date: 2023-06-23
Anticipated expiration: 2040-02-27
Also published as: CN111354424A

Abstract

The invention discloses a method for predicting potentially active molecules, which is executed in a computing device and comprises the following steps: collecting structural data of a plurality of biologically active small molecule compounds; comparing the small molecule compounds pairwise, and identifying the common substructure and the remaining variable substructures of each two small molecule compounds, wherein the two variable substructures form an isostere pair; de-duplicating all generated isostere pairs to obtain an isostere set for each group; receiving a molecule to be optimized input by a user, and identifying the parent nucleus structure of the molecule and the groups to be replaced other than the parent nucleus structure; acquiring the isostere set of each group to be replaced, and replacing the group with each isostere in the set to obtain a plurality of candidate molecules; and calculating the drug-likeness of each candidate molecule, and selecting one or more candidate molecules with high drug-likeness to recommend to the user. The invention also discloses a corresponding prediction apparatus and computing device for potentially active molecules.

Description

Prediction method and device for potential active molecules and computing equipment

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, and a computing device for predicting a potentially active molecule.

Background

As is well known, drug development is a long process that suffers from long development cycles, low success rates and high costs. With advances in computer technology and big data, artificial intelligence is delivering great value across industries and is receiving considerable attention in the pharmaceutical industry. In new drug discovery, virtual screening can improve the enrichment of active molecules, and predicting the properties of compounds can save a great deal of manpower and material resources, shorten the research and development cycle, and accelerate the translation of research results; scientific research institutions and pharmaceutical companies have therefore attached great importance to it in recent years.

Computer-aided drug screening requires a large and diverse library of active molecules, yet constructing such a library is still complex, and its molecular diversity still needs to be improved. Moreover, researchers often need the system to automatically provide a number of candidates with reasonable properties, so as to improve the efficiency of lead optimization and to discover new drug structures, thereby advancing the drug development process.

Disclosure of Invention

In view of the above, the present invention proposes a method, apparatus and computing device for predicting potentially active molecules in an effort to solve, or at least alleviate, the above-identified problems.

According to one aspect of the present invention there is provided a method of predicting potentially active molecules, suitable for execution in a computing device, comprising the steps of: collecting structural data of a plurality of biologically active small molecule compounds; comparing the small molecule compounds pairwise, and identifying the common substructure and the remaining variable substructures of each two small molecule compounds, wherein the two variable substructures form an isostere pair; de-duplicating all generated isostere pairs to obtain an isostere set for each group; receiving a molecule to be optimized input by a user, and identifying the parent nucleus structure of the molecule and the groups to be replaced other than the parent nucleus structure; acquiring the isostere set of each group to be replaced, and replacing the group with each isostere in the set to obtain a plurality of candidate molecules; and calculating the drug-likeness of each candidate molecule, and selecting one or more candidate molecules with high drug-likeness to recommend to the user.

Optionally, in the method of predicting potentially active molecules according to the invention, the step of identifying, for any two small molecule compounds, the common substructure and the remaining variable substructures comprises: identifying the largest common substructure of the two small molecule compounds and extracting one or more common substructures within the largest common substructure; for any common substructure, extracting the two variable substructures of the two small molecule compounds other than the common substructure to form an isostere pair based on that common substructure; wherein the broken bond between the common substructure and the variable substructure is the splicing site of the corresponding isostere.

Optionally, in the method of predicting potentially active molecules according to the present invention, the step of identifying the common substructure and the remaining variable substructures of each two small molecule compounds further comprises: determining the broken bond between a common substructure and a variable substructure, and extending a predetermined number of chemical bonds from the end of the broken bond into the common substructure to extract the environmental structure of the two variable substructures.

Optionally, in the method of predicting potentially active molecules according to the present invention, the step of identifying the common substructure and the remaining variable substructures of each two small molecule compounds further comprises: converting the identified isostere pairs and environmental structures into SMILES (simplified molecular-input line-entry system) representations, wherein the splicing site of each isostere is labeled in the SMILES representation of the isostere pair.

Optionally, in the method of predicting potentially active molecules according to the present invention, the step of extracting one or more common substructures within the largest common substructure comprises: taking the largest common substructure as a common substructure; or selecting one or more chemical bonds within the largest common substructure to break, thereby generating a plurality of cleavage modes; for each cleavage mode, the fragment sets of the two small molecule compounds after cleavage are extracted respectively, and all fragments that appear in both fragment sets are together taken as a common substructure of the two small molecule compounds.

Optionally, in the method of predicting potentially active molecules according to the present invention, the broken chemical bond is an acyclic single bond.

Optionally, in the method of predicting potentially active molecules according to the invention, the step of de-duplicating all generated isostere pairs comprises: for any group, extracting the isostere pairs containing the group and the isostere of the group in each of those pairs, and de-duplicating all extracted isosteres of the group.

Optionally, in the method of predicting potentially active molecules according to the invention, the step of de-duplicating all extracted isosteres of the group comprises: if the isosteres extracted from two isostere pairs are the same, extracting the environmental structures of the two isostere pairs; if the two environmental structures are the same, one of the isosteres is removed; otherwise, neither is removed.

Optionally, in the method of predicting potentially active molecules according to the present invention, there are m groups to be replaced other than the parent nucleus structure, G_1 to G_m, with m ≤ 4, and each group to be replaced has a corresponding isostere set, wherein the number of isosteres of the i-th group to be replaced G_i is N_i; in this case, isostere replacement of the groups to be replaced is carried out by permutation and combination.

Optionally, in the method of predicting potentially active molecules according to the invention, the step of replacing the group to be replaced with each isostere in the set comprises: counting the total number of group replacements, generating a plurality of replacement tasks, and sending the replacement tasks to a plurality of servers for processing, wherein each replacement task comprises object-structure data and array-structure data; the object-structure data comprises the molecular structure to be optimized, the splicing sites and the information on the groups to be replaced, and the array-structure data comprises the isostere set information of each group to be replaced.

Optionally, in the method of predicting potentially active molecules according to the present invention, the drug-likeness QED is calculated as:

$$\mathrm{QED} = \exp\left(\frac{1}{n}\sum_{i=1}^{n}\ln d_i\right)$$

wherein n represents the total number of molecular properties, and d_i represents the desirability function of the i-th molecular property, which expresses the contribution of that property to the overall drug-likeness.
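A minimal sketch of the scoring and recommendation step, assuming QED takes its commonly used unweighted form, the geometric mean of the n desirability values d_i. The function names, candidate names and desirability values are hypothetical and serve only as illustration.

```python
import math


def qed_score(desirabilities):
    """Geometric mean of the desirability values d_i (assumed unweighted QED form)."""
    if not desirabilities:
        raise ValueError("at least one desirability value is required")
    if any(d <= 0 for d in desirabilities):
        raise ValueError("desirability values must be positive")
    n = len(desirabilities)
    return math.exp(sum(math.log(d) for d in desirabilities) / n)


def recommend(candidates, top_k=1):
    """Return the names of the top_k candidates, highest QED first."""
    ranked = sorted(candidates, key=lambda name: qed_score(candidates[name]), reverse=True)
    return ranked[:top_k]


# Hypothetical desirability values for two hypothetical candidate molecules.
example = {"cand_1": [0.9, 0.8, 0.7], "cand_2": [0.5, 0.5, 0.5]}
best = recommend(example, top_k=1)
```

Because the score is a geometric mean, a single property with a very low desirability pulls the whole score down, which is why candidates are ranked on the combined value rather than on any one property.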

Optionally, in the method of predicting potentially active molecules according to the present invention, the method further comprises the step of: storing the candidate molecules recommended to the user each time in a molecule library, so that candidate drugs can subsequently be screened from the molecule library during drug screening.

Optionally, in the method of predicting potentially active molecules according to the present invention, after collecting the structural data of the plurality of biologically active small molecule compounds, the method further comprises the steps of: removing erroneous structural data as well as inorganic small molecule and inorganic salt structures contained in the structural data, and converting the processed structural data into a SMILES (simplified molecular-input line-entry system) representation.

According to another aspect of the present invention there is provided an apparatus for predicting potentially active molecules, suitable for residing in a computing device, comprising: a data collection module adapted to collect structural data of a plurality of biologically active small molecule compounds; an isostere pair identification module adapted to compare the small molecule compounds pairwise and identify the common substructure and the remaining variable substructures of each two small molecule compounds, wherein the two variable substructures form an isostere pair; an isostere library generation module adapted to de-duplicate all generated isostere pairs to obtain an isostere set for each group; a molecule receiving module adapted to receive a molecule to be optimized input by a user, and to identify the parent nucleus structure of the molecule and the groups to be replaced other than the parent nucleus structure; a molecule replacement module adapted to acquire the isostere set of each group to be replaced and to replace the group with each isostere in the set to obtain a plurality of candidate molecules; and a molecule recommendation module adapted to calculate the drug-likeness of each candidate molecule and to select one or more candidate molecules with high drug-likeness to recommend to the user.

According to yet another aspect of the present invention, there is provided a computing device comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, which when executed by the processors implement the steps of the method of predicting potentially active molecules as described above.

According to yet another aspect of the present invention, there is provided a readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, implement the steps of the method of predicting potentially active molecules as described above.

According to the technical solution of the invention, a method for generating candidate isosteres that features a large isostere library, high computation speed and high diversity is provided, and the generated candidates are automatically scored so that potentially active molecules with excellent physicochemical properties are recommended to the user. The invention can greatly improve the efficiency of lead optimization for drug researchers, helps discover new drug structures, advances the drug development process, and can also be applied to fields such as scaffold hopping to generate new candidate drug molecules.

The foregoing is only an overview of the technical solution of the present invention. In order that the technical means of the invention may be understood more clearly and implemented in accordance with the specification, and in order to make the above and other objects, features and advantages of the invention more readily apparent, specific embodiments of the invention are set forth below.

Drawings

To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which set forth the various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to fall within the scope of the claimed subject matter. The above, as well as additional objects, features, and advantages of the present disclosure will become more apparent from the following detailed description when read in conjunction with the accompanying drawings. Like reference numerals generally refer to like parts or elements throughout the present disclosure.

FIG. 1 illustrates a block diagram of a computing device 100, according to one embodiment of the invention;

FIG. 2 illustrates a flow chart of a method 200 of predicting potentially active molecules according to one embodiment of the invention;

FIG. 3 shows a schematic diagram of an isostere pair generation method according to one embodiment of the invention; and

FIG. 4 shows a block diagram of a prediction apparatus 400 for potentially active molecules according to one embodiment of the invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

FIG. 1 is a block diagram of a computing device 100 according to one embodiment of the invention. In a basic configuration 102, computing device 100 typically includes a system memory 106 and one or more processors 104. The memory bus 108 may be used for communication between the processor 104 and the system memory 106.

Depending on the desired configuration, the processor 104 may be any type of processor including, but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of caches, such as a first level cache 110 and a second level cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations, the memory controller 118 may be an internal part of the processor 104.

Depending on the desired configuration, system memory 106 may be any type of memory including, but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The system memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some implementations, the application 122 may be arranged to operate on the operating system with program data 124. In the computing device 100 according to the present invention, program data 124 contains instructions for performing the method 200 of predicting potentially active molecules.

Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to basic configuration 102 via bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices such as a display or speakers via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communication with one or more other computing devices 162 via one or more communication ports 164 over a network communication link.

The network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated-line network, and wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR) or other wireless media. The term computer readable media as used herein may include both storage media and communication media.

Computing device 100 may be implemented as a server, such as a file server, a database server, an application server or a web server, or as part of a small-sized portable (or mobile) electronic device, such as a cellular telephone, a Personal Digital Assistant (PDA), a personal media player device, a wireless web-browsing device, a personal headset device, an application-specific device, or a hybrid device that may include any of the above functions. Computing device 100 may also be implemented as a personal computer including desktop and notebook computer configurations. In some embodiments, the computing device 100 is configured to perform the method 200 of predicting potentially active molecules.

FIG. 2 illustrates a flow chart of a method 200 of predicting potentially active molecules according to one embodiment of the invention. The method 200 is performed in a computing device, such as the computing device 100, in order to recommend potentially active molecules to a user. As shown in FIG. 2, the method starts at step S210.

In step S210, structural data of a plurality of biologically active small molecule compounds are collected. The structural data of active molecules may be obtained from any currently public database, and the invention is not limited in this respect. Typically, structural data of about 2 million reported biologically active small molecules can be collected.

According to one embodiment, after collecting the structural data of the plurality of biologically active small molecule compounds, the method further comprises the steps of: removing erroneous structural data as well as inorganic small molecule and inorganic salt structures contained in the structural data, and converting the processed structural data into a SMILES (simplified molecular-input line-entry system) representation. As shown in the following table, erroneous structures generally refer to structures that the computing device cannot recognize or cannot convert to SMILES; the inorganic components contained in the structural data include inorganic molecules such as HCl and HBr and inorganic salts such as sodium and potassium salts. Of course, the following table gives only a few exemplary structures; the data may actually contain a wide variety of inorganic molecules or inorganic salts, and the present invention is not limited in this regard.
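The clean-up described above can be sketched as follows, assuming each record is already a SMILES string in which components are separated by ".". The fragment list and function names are hypothetical and deliberately tiny; checking whether a structure can actually be parsed would need a cheminformatics toolkit and is not shown.

```python
# Hypothetical, deliberately small list of inorganic components written as SMILES.
INORGANIC_FRAGMENTS = {"Cl", "Br", "[Na+]", "[K+]", "[Cl-]", "[Br-]"}


def strip_inorganic(smiles):
    """Drop inorganic components from a dot-separated SMILES; None if nothing is left."""
    kept = [part for part in smiles.split(".") if part and part not in INORGANIC_FRAGMENTS]
    return ".".join(kept) if kept else None


def clean_library(records):
    """Skip unreadable records and purely inorganic entries, strip salts from the rest."""
    cleaned = []
    for smiles in records:
        if not isinstance(smiles, str) or not smiles.strip():
            continue  # unreadable record, treated here as erroneous structure data
        stripped = strip_inorganic(smiles.strip())
        if stripped is not None:
            cleaned.append(stripped)
    return cleaned


raw = ["CCN.Cl", "[Na+].[Cl-]", "", "c1ccccc1C(=O)N"]
cleaned = clean_library(raw)
```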

Subsequently, in step S220, the small molecule compounds are compared pairwise, and the common substructure and the remaining variable substructures of each two small molecule compounds are identified, wherein the two variable substructures form an isostere pair.

Here, a molecular pair matching method is used to identify all possible generalized isostere pairs in the molecular library, and an isostere library is constructed. When a pair of molecules differs only in a small substructure and most of their substructures are identical, the identical substructures are referred to as the common substructure (constant), and the differing substructures are referred to as variable substructures (variable). These differing variable substructures may be regarded as isosteres in a broad sense. Strictly defined, an isostere pair includes not only the structures of the molecule pair, but also the transformation between their variable substructures.

According to one embodiment, the step of identifying the common substructure and the remaining variable substructures in this step may comprise: the largest common substructure of the two small molecule compounds is identified and one or more common substructures are extracted within the largest common substructure. For any common substructure, two variable substructures of the two small molecule compounds are extracted except for the common substructure to form a pair of isostere pairs based on the common substructure. The broken bond of the common substructure and the variable substructure is the splicing site of the corresponding isostere.

In particular, when one or more common substructures are extracted within the largest common substructure, in one implementation, the largest common substructure may be taken directly as a common substructure. In another implementation, one or more chemical bonds within the largest common substructure may be selected for breaking, resulting in a plurality of cleavage modes. For each cleavage mode, the fragment sets of the two small molecule compounds after cleavage are extracted respectively, and all fragments that appear in both fragment sets are together taken as a common substructure of the two small molecule compounds. Preferably, the broken chemical bond is an acyclic single bond.
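The enumeration of cleavage modes can be sketched as choosing every non-empty combination of cleavable bonds. The bond labels and the `max_cuts` limit below are hypothetical; the source does not state an upper bound on how many bonds are broken at once.

```python
from itertools import combinations


def cleavage_modes(cleavable_bonds, max_cuts=3):
    """Yield every non-empty combination of at most max_cuts bonds to break."""
    upper = min(max_cuts, len(cleavable_bonds))
    for k in range(1, upper + 1):
        yield from combinations(cleavable_bonds, k)


# Hypothetical labels for two acyclic single bonds in a molecule.
bonds = ["aryl-C(=O)", "N-C"]
modes = list(cleavage_modes(bonds))
```

Each tuple in `modes` is one cleavage mode; fragmenting both compounds under the same mode gives the two fragment sets that are then compared.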

For example, compound A and compound B in FIG. 3 are a pair of molecules in the molecular library. The main difference between them lies in the atom attached to the nitrogen atom: in compound A it is a hydrogen atom, and in compound B it is a carbon atom. The constant in the upper right box in FIG. 3 is the largest common substructure of compound A and compound B, i.e., one common substructure in the molecular pair matching algorithm, which is generated by breaking the N-C bond or the N-H bond. Apart from this largest common substructure, the remaining two substructures are regarded as variable substructures, namely variable A and variable B, and form an isostere pair. The cleavage sites of the broken bonds are in fact the splicing sites of the isostere pair.

Of course, within the largest common substructure, the two compounds may also have several alternative common substructures, obtained by breaking one or more bonds within the largest common substructure. Typically, only acyclic single bonds are broken, so the bonds within the benzene ring and the carbonyl bond in FIG. 3 are not broken. The cleavable bonds are the bond between the benzene ring and the carbonyl carbon, the N-H bond and the N-C bond; selecting one or more of these bonds by permutation and combination yields a plurality of cleavage modes, each of which gives a corresponding fragment set. For cleavage mode 1, suppose the fragment set of compound A is {I_1, I_2, I_3} and the fragment set of compound B is {I_1, I_2, I_4}; then the fragments I_1 and I_2 together form a common substructure of the two compounds, while the fragments I_3 and I_4, as the two variable substructures, form an isostere pair.
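The fragment-set comparison in this example amounts to a set intersection and two set differences. A minimal sketch, with fragments represented by hypothetical string labels rather than real structures:

```python
def split_pair(fragments_a, fragments_b):
    """Split two fragment sets into the common part and the two variable parts."""
    constant = fragments_a & fragments_b   # fragments shared by both compounds
    variable_a = fragments_a - constant    # what only compound A has
    variable_b = fragments_b - constant    # what only compound B has
    return constant, variable_a, variable_b


# Fragment labels taken from the example in the text.
constant, variable_a, variable_b = split_pair({"I1", "I2", "I3"}, {"I1", "I2", "I4"})
```

Here `constant` holds I1 and I2, and the leftover fragments I3 and I4 are the two members of the isostere pair.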

Taking the lower right box in FIG. 3 as an example, simultaneously breaking the bond between the benzene ring and the carbonyl carbon and the N-C bond yields two fragment structures that together form a common substructure (constant); the parts of the two compounds other than these two fragments serve as variable substructures, namely variable A and variable B, and form an isostere pair. Similarly, breaking only the bond between the benzene ring and the carbonyl carbon also yields a common substructure and a corresponding isostere pair. Thus, through k (k ≥ 1) cleavage modes, a pair of compounds can yield k common substructures and k corresponding isostere pairs.

In addition, the molecular pair matching algorithm can represent the transformation between the two members of an isostere pair, such as the arrow in the upper left box in FIG. 3, which represents the transformation of the molecular structure from variable A to variable B. The transformation can also be expressed as a SMILES expression of the two variable substructures, such as [H][*:1]>>C[*:1] shown in FIG. 3.

In addition, in step S220, the broken bond (i.e., the cleavage site) between the common substructure and the variable substructure may also be determined, and a predetermined number of chemical bonds may be extended from the end of the broken bond into the common substructure to extract the environmental structure of the two variable substructures. The predetermined number may be 2; the corresponding environmental structure is labeled "Environment at radius" in FIG. 3, i.e., the structure obtained by extending two chemical bonds from the cleavage site into the common substructure. In the upper right box of FIG. 3, the common substructure is produced by breaking the N-H and N-C bonds, so two chemical bonds are extended from the nitrogen atom at the end of the broken bond, giving the environmental structure in which the cleavage site is located. The environmental structure represents the chemical bonding context of the isostere and serves as a reference for judging whether isosteres are identical or duplicated.

It will be appreciated that, for a common substructure composed of a plurality of fragments, one sub-environment structure is obtained from the cleavage site of each fragment; when a fragment is too short to extend the predetermined number of chemical bonds, the fragment itself is taken as the sub-environment structure. All sub-environment structures within a common substructure together form the environmental structure of an isostere pair.

According to one embodiment of the invention, step S220 may also convert the identified isostere pairs and environmental structures into SMILES representations, wherein the splicing site of each isostere is labeled in the SMILES representation of the isostere pair. In general, given the SMILES representations of a pair of molecules, one can obtain the SMILES representation of every isostere pair of that molecule pair, the SMILES representation of the environmental structure of each isostere pair, and the SMILES representation of the transformation of each isostere pair, wherein the SMILES representation of an isostere pair contains the splicing sites of the isosteres (i.e., the cleavage sites that generated them).
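Extending a fixed number of bonds from the cleavage site is a bounded breadth-first walk over the common substructure. A minimal sketch, assuming the common substructure is given as a hypothetical adjacency dictionary of atom labels; a real implementation would operate on a molecule object from a cheminformatics toolkit.

```python
from collections import deque


def environment(adjacency, attachment_atom, radius=2):
    """Return the atoms within `radius` bonds of the attachment atom."""
    distance = {attachment_atom: 0}
    queue = deque([attachment_atom])
    while queue:
        atom = queue.popleft()
        if distance[atom] == radius:
            continue  # do not walk past the predetermined number of bonds
        for neighbour in adjacency.get(atom, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    return set(distance)


# Hypothetical atom labels for a benzamide-like common substructure cut at N.
constant = {
    "N": ["C7"], "C7": ["N", "O8", "c1"], "O8": ["C7"],
    "c1": ["C7", "c2", "c6"], "c2": ["c1", "c3"], "c3": ["c2", "c4"],
    "c4": ["c3", "c5"], "c5": ["c4", "c6"], "c6": ["c5", "c1"],
}
env = environment(constant, "N", radius=2)
```

With a radius of 2 the walk stops at the carbonyl oxygen and the first ring carbon. A fragment shorter than the radius is simply returned whole.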

Subsequently, in step S230, a deduplication process is performed on all the generated isostere pairs, resulting in an isostere set for each group.

As described above, extracting the differing substructures in each pair of molecules can generate multiple isosteres, and comparing the collected molecules pairwise generates a massive number of isostere pairs; these isostere pairs need to be de-duplicated to generate the isostere library.

Specifically, for any group, an isostere pair having the group, and an isostere of the group in the pair are extracted, and all isosteres of the extracted group are subjected to a deduplication treatment. Since the group R exists in a plurality of isostere pairs, the isostere in the plurality of isostere pairs is extracted and subjected to a deduplication treatment, thereby obtaining an isostere set of the group R.

Further, the de-duplication may be performed with reference to the environmental structure of the isostere pair. Specifically, if the isosteres extracted from two isostere pairs are the same, the environmental structures of the two isostere pairs are extracted. If the two environmental structures are the same, one of the isosteres is removed; otherwise, neither is removed.

Subsequently, in step S240, a molecule to be optimized input by the user is received, and the parent nucleus structure of the molecule and the groups to be replaced other than the parent nucleus structure are identified. The parent nucleus structure may be annotated manually; alternatively, a parent nucleus identification model may be trained by deep learning on the parent nucleus structures of many molecules, and the parent nucleus structure of the molecule to be optimized is then identified automatically by the model. Generally, the training set of the model comprises a plurality of molecules and their labeled parent nucleus structures, and the trained model is obtained by learning from these data. The structure and parameters of the model may be set by those skilled in the art as required, and the invention is not limited in this respect.
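A minimal sketch of environment-aware de-duplication, assuming each isostere pair is stored as a record of two variable substructures plus one environment string. The record layout and the example strings are hypothetical.

```python
def build_isostere_sets(pair_records):
    """Map each group to its de-duplicated list of (isostere, environment) entries."""
    library = {}
    seen = set()
    for variable_a, variable_b, env in pair_records:
        # Each pair contributes in both directions: A is an isostere of B and vice versa.
        for group, isostere in ((variable_a, variable_b), (variable_b, variable_a)):
            key = (group, isostere, env)
            if key in seen:
                continue  # same isostere in the same environment: drop the duplicate
            seen.add(key)
            library.setdefault(group, []).append((isostere, env))
    return library


# Hypothetical records: (variable A, variable B, environment).
records = [
    ("[*:1][H]", "[*:1]C", "env1"),
    ("[*:1][H]", "[*:1]C", "env1"),  # exact duplicate, removed
    ("[*:1][H]", "[*:1]C", "env2"),  # same isostere, different environment, kept
    ("[*:1][H]", "[*:1]F", "env1"),
]
library = build_isostere_sets(records)
```

Keeping the environment in the key means that two identical isosteres are merged only when they were observed in the same chemical context.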

Subsequently, in step S250, a set of isosteres of the groups to be replaced is obtained, and each isostere in the set is used to replace the groups to be replaced, resulting in a plurality of candidate molecules.

In general, there may be one or more groups to be replaced other than the parent nucleus structure, for example m groups G_1 to G_m with m ≤ 4. Each group to be replaced has a corresponding isostere set, wherein the number of isosteres of the i-th group to be replaced G_i is N_i. In this case, isostere replacement of the groups is carried out by permutation and combination, so that the molecule has a total of N_1 × N_2 × … × N_m isostere replacement schemes.
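The count of replacement schemes is the product of the isostere set sizes, and the schemes themselves form a Cartesian product. A minimal sketch with hypothetical isostere labels; the limit of four groups follows the text, while the function names are illustrative.

```python
from itertools import product
from math import prod

MAX_GROUPS = 4  # the text limits the number of groups to be replaced to m <= 4


def count_schemes(isostere_sets):
    """Total number of replacement schemes, N_1 x N_2 x ... x N_m."""
    if len(isostere_sets) > MAX_GROUPS:
        raise ValueError("at most 4 groups to be replaced are supported")
    return prod(len(s) for s in isostere_sets)


def enumerate_schemes(isostere_sets):
    """List every combination that picks one isostere per group."""
    return list(product(*isostere_sets))


# Hypothetical isostere labels for two groups, so N_1 = 3 and N_2 = 2.
sets = [["F", "Cl", "C"], ["O", "N"]]
total = count_schemes(sets)
schemes = enumerate_schemes(sets)
```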

According to one embodiment, the step of replacing the groups to be replaced with each isostere in the set comprises: counting the total number of replacements for the plurality of groups, generating a plurality of replacement tasks, and sending the plurality of replacement tasks to a plurality of servers for processing, wherein each replacement task comprises object structure data and array structure data. The object structure data comprises the molecular structure to be optimized, the splice sites and the information on the groups to be replaced, and the array structure data comprises the isostere set information of each group to be replaced. Isostere replacement tasks can be allocated by group: based on the control-variable method, the m groups G₁ to Gₘ are ordered sequentially, their corresponding isostere sets are likewise ordered sequentially, and a plurality of tasks are generated by sequential permutation and combination and distributed to the multi-core CPU.

Here, the invention provides a suitable fragment splicing method and data structure for storing the input molecule, fragments, splice sites and isostere fragments, and isostere replacement is performed on the input molecule by a parallel splicing method on a multi-core CPU. When isosteres are generated, their splice sites are labeled, and the splice sites of the input molecular structure are labeled as well, so that a cheminformatics software package can be used to splice the molecular structures directly. When a molecular structure G₁-A-G₂ is composed of three parts, where G₁ and G₂ both represent groups to be replaced and A is the parent nucleus structure of the molecule to be retained, the isostere sets corresponding to G₁ and G₂ are retrieved from the database by exact structure matching, the numbers of corresponding isosteres being N₁ and N₂ respectively. Splicing the combinations of the N₁ and N₂ isosteres onto the parent nucleus A generates N₁ × N₂ new molecules. If a group to be replaced matches no isostere, that group is not replaced.
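The G₁-A-G₂ enumeration can be sketched with plain strings. This is a toy illustration under stated assumptions: the function name `enumerate_candidates`, the `{G1}`-style placeholders standing in for labeled splice sites, and the group labels are all hypothetical, and the strings are opaque tokens rather than valid chemical structures. A real implementation would splice molecular graphs with a cheminformatics package.

```python
from itertools import product


def enumerate_candidates(core: str,
                         groups: dict[str, str],
                         isostere_sets: dict[str, list[str]]) -> list[str]:
    """Splice every combination of isosteres onto the retained core.

    core          -- template such as "{G1}-A-{G2}"; placeholders mark splice sites
    groups        -- maps each site name to the original group at that site
    isostere_sets -- maps a group to its isostere set (exact-match lookup)
    """
    sites = sorted(groups)
    choices = []
    for site in sites:
        options = isostere_sets.get(groups[site])
        # A group with no matching isostere set is left unchanged.
        choices.append(options if options else [groups[site]])
    return [core.format(**dict(zip(sites, combo))) for combo in product(*choices)]
```

For example, with two isosteres for the group at G1 and three for the group at G2, the function returns 2 × 3 = 6 candidate strings; if the G2 group has no entry in `isostere_sets`, it returns 2 candidates that keep the original G2 group.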

When the number of groups to be replaced is large, a large number of new molecules are generated and the memory load becomes excessive. To improve replacement efficiency and save memory, the group replacement tasks are distributed to the multi-core CPU in batches for computation. The data used by each task is stored in the form of objects and arrays. For example, when the molecule G₁-A-G₂ is replaced, a plurality of tasks can be generated in a load-balanced manner. In each task, the G₁-A-G₂ molecule is stored in an object structure that includes the splice sites and the information on the groups to be replaced, and the isostere structures corresponding to the groups G₁ and G₂ are stored in arrays. The tasks are distributed to the multi-core CPU in sequence, and the results are aggregated after the tasks are completed. According to one embodiment, the load-balanced manner generates a number of tasks given by a formula that is not reproduced in this text, in which the bracket operator denotes rounding up.
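The batching idea can be sketched as follows. The patent's task-count formula is not reproduced in this text, so the sketch assumes one plausible form, ceil(total / batch_size); the parameter `batch_size`, the class `ReplacementTask` and both function names are hypothetical. Each task carries the molecule as an object (a dict here) and the isostere sets as arrays (lists here), plus the index range of combinations it is responsible for.

```python
import math
from dataclasses import dataclass
from itertools import islice, product


@dataclass
class ReplacementTask:
    molecule: dict                    # object-structured data: structure, splice sites, groups
    isostere_arrays: list[list[str]]  # array-structured data: one isostere list per group
    start: int                        # first combination index handled by this task
    stop: int                         # one past the last combination index


def make_tasks(molecule: dict,
               isostere_arrays: list[list[str]],
               batch_size: int) -> list[ReplacementTask]:
    """Split the N_1 x ... x N_m combinations into ceil(total / batch_size) tasks."""
    total = math.prod(len(arr) for arr in isostere_arrays)
    n_tasks = math.ceil(total / batch_size)  # assumed formula, rounding up
    return [
        ReplacementTask(molecule, isostere_arrays,
                        k * batch_size, min((k + 1) * batch_size, total))
        for k in range(n_tasks)
    ]


def run_task(task: ReplacementTask) -> list[tuple[str, ...]]:
    """Materialize only this task's slice of the combination space."""
    return list(islice(product(*task.isostere_arrays), task.start, task.stop))
```

Because each task materializes only its own slice, peak memory per worker is bounded by `batch_size` rather than by the full product. The tasks are independent, so they could be handed to a process pool such as `concurrent.futures.ProcessPoolExecutor` and the partial results concatenated afterwards.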

Subsequently, in step S260, the drug-likeness (QED, Quantitative Estimate of Drug-likeness) of each candidate molecule is calculated, and one or more candidate molecules with high drug-likeness are selected and recommended to the user.

The Quantitative Estimate of Drug-likeness (QED) is a quantitative measure of how drug-like a molecule is. The empirical rules of the QED evaluation method reflect the underlying distribution of molecular properties, including molecular weight, the oil-water partition coefficient, topological polar surface area, the numbers of hydrogen bond donors and acceptors, the numbers of aromatic rings and rotatable bonds, and the presence of structural alerts. A structural alert is a molecule, functional group or molecular substructure associated with a specific adverse effect, for example skin irritation or corrosion caused by administering a drug that contains a particular group; such structures require special attention during drug development.

According to one embodiment, the calculation formula of the drug-likeness QED is:

$$\mathrm{QED} = \exp\left(\frac{1}{n}\sum_{i=1}^{n}\ln d_i\right)$$

where n represents the total number of molecular properties and dᵢ represents the desirability function of the i-th molecular property, which expresses the contribution of that property to the overall drug-likeness. QED ranges from 0 to 1, where 0 indicates that all physicochemical properties are unfavorable for druggability and 1 indicates that all physicochemical properties are favorable for druggability.
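A minimal sketch of the scoring and selection step, assuming the unweighted geometric-mean form of QED, exp((1/n) Σ ln dᵢ), and assuming the per-property desirability values dᵢ have already been computed elsewhere. The function names `qed` and `recommend` and the parameter `top_k` are hypothetical.

```python
import math
from typing import Sequence


def qed(desirabilities: Sequence[float]) -> float:
    """Geometric mean of the desirability values d_i, each assumed to lie in [0, 1]."""
    if not desirabilities:
        raise ValueError("at least one desirability value is required")
    if any(d < 0.0 or d > 1.0 for d in desirabilities):
        raise ValueError("each desirability must lie in [0, 1]")
    if any(d == 0.0 for d in desirabilities):
        return 0.0  # limit of the geometric mean when any factor is zero
    return math.exp(sum(math.log(d) for d in desirabilities) / len(desirabilities))


def recommend(candidates: dict[str, Sequence[float]],
              top_k: int = 1) -> list[tuple[str, float]]:
    """Rank candidate molecules by QED and return the top_k highest-scoring ones."""
    scored = [(name, qed(ds)) for name, ds in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
```

Because the score is a geometric mean, a single very poor property pulls the overall value down sharply, whereas an arithmetic mean would let good properties mask it.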

Through the above steps, the user only needs to input the molecular structure to be optimized and mark the fragments to be replaced in order to obtain recommended isostere-replaced molecules with high drug-likeness, together with their physicochemical properties.

Optionally, after step S260, the method 200 may further comprise the following step: the candidate molecules recommended to the user each time are stored in a molecule library, so that candidate drugs can subsequently be screened from the molecule library during drug screening.

Fig. 4 shows a block diagram of a potentially active molecule prediction apparatus 400, which may reside in a computing device, such as computing device 100, according to one embodiment of the invention. As shown in fig. 4, the apparatus 400 includes a data collection module 410, an isostere pair identification module 420, an isostere library generation module 430, a molecule reception module 440, a molecule replacement module 450, and a molecule recommendation module 460.

The data collection module 410 collects structural data of a plurality of small molecule compounds having biological activity. According to one embodiment, the data collection module 410 may also remove erroneous structural data, as well as the inorganic small molecule and inorganic salt structures contained in the structural data, from the collected data, and convert the processed structural data into a Simplified Molecular Input Line Entry System (SMILES) representation. The data collection module 410 may perform processing corresponding to that described above for step S210, and a detailed description is not repeated here.

The isostere pair identification module 420 compares the small molecule compounds pairwise and identifies the common substructure and the remaining variable substructures of every two small molecule compounds, where the two variable substructures constitute an isostere pair. Specifically, the isostere pair identification module 420 identifies the maximum common substructure of the two small molecule compounds and extracts one or more common substructures within the maximum common substructure; for any common substructure, the two variable substructures of the two small molecule compounds other than the common substructure are extracted to form an isostere pair based on that common substructure.

According to one embodiment, the isostere pair identification module 420 may also determine the broken bond between the common substructure and a variable substructure, and extend a predetermined number of chemical bonds from the end of the broken bond into the common substructure to extract the environmental structures of the two variable substructures. In turn, the isostere pair identification module 420 may convert the identified isostere pair and environmental structures into Simplified Molecular Input Line Entry System (SMILES) representations, in which the splice site of each isostere is labeled. The isostere pair identification module 420 may perform processing corresponding to that described above for step S220, and a detailed description is not repeated here.
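Extending a predetermined number of bonds from the broken bond into the common substructure amounts to a bounded breadth-first search over the molecular graph. The sketch below assumes a hypothetical representation in which atoms are integer indices and `adjacency` maps each atom to its bonded neighbors; the function name and the parameter `radius` (the predetermined number of bonds) are illustrative.

```python
from collections import deque


def extract_environment(adjacency: dict[int, set[int]],
                        common_atoms: set[int],
                        attachment_atom: int,
                        radius: int) -> set[int]:
    """Return the common-substructure atoms within `radius` bonds of the attachment atom.

    attachment_atom is the end of the broken bond that lies in the common substructure.
    """
    if attachment_atom not in common_atoms:
        raise ValueError("attachment atom must belong to the common substructure")
    environment = {attachment_atom}
    frontier = deque([(attachment_atom, 0)])
    while frontier:
        atom, dist = frontier.popleft()
        if dist == radius:
            continue  # do not expand beyond the predetermined number of bonds
        for neighbor in adjacency[atom]:
            # Stay inside the common substructure; never walk into the variable part.
            if neighbor in common_atoms and neighbor not in environment:
                environment.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return environment
```

For a chain 0-1-2-3-4 in which atom 4 is the variable substructure and atom 3 is the attachment atom, a radius of 2 yields the atoms {1, 2, 3}. A larger radius makes the environment more specific, so fewer isosteres are treated as duplicates.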

The isostere library generation module 430 performs deduplication processing on all generated isostere pairs to obtain an isostere set for each group. Specifically, for any group, the isostere library generation module 430 extracts the isostere pairs containing the group, and the isostere of the group in each such pair, and performs deduplication processing on all the extracted isosteres of the group. If the isosteres extracted from two isostere pairs are the same, the isostere library generation module 430 further extracts the environmental structures of the two isostere pairs. If the environmental structures of the two isosteres are also the same, one of the isosteres is removed; otherwise, neither isostere is removed. The isostere library generation module 430 may perform processing corresponding to that described above for step S230, and a detailed description is not repeated here.

The molecule receiving module 440 receives a molecule to be optimized input by a user, and identifies the parent nucleus structure of the molecule and the groups to be replaced other than the parent nucleus structure. In general, the groups to be replaced other than the parent nucleus structure are m groups G₁ to Gₘ, where m is less than or equal to 4, and each group to be replaced has a corresponding isostere set, where the number of isosteres of the i-th group to be replaced Gᵢ is Nᵢ. In this case, isostere replacement of the groups to be replaced is realized by permutation and combination. The molecule receiving module 440 may perform processing corresponding to that described above for step S240, and a detailed description is not repeated here.

The molecule replacement module 450 obtains the isostere set of each group to be replaced, and replaces the groups to be replaced with each isostere in the set respectively, so as to obtain a plurality of candidate molecules. Specifically, the molecule replacement module 450 counts the total number of replacements for the plurality of groups, generates a plurality of replacement tasks, and sends the plurality of replacement tasks to a plurality of servers for processing, wherein each replacement task comprises object structure data and array structure data. The object structure data comprises the molecular structure to be optimized, the splice sites and the information on the groups to be replaced, and the array structure data comprises the isostere set information of each group to be replaced. The molecule replacement module 450 may perform processing corresponding to that described above for step S250, and a detailed description is not repeated here.

The molecule recommendation module 460 calculates the drug-likeness of each candidate molecule and selects one or more candidate molecules with high drug-likeness for recommendation to the user. Optionally, the molecule recommendation module 460 may also store the candidate molecules recommended to the user each time in a molecule library, so that candidate drugs can subsequently be screened from the molecule library during drug screening. The molecule recommendation module 460 may perform processing corresponding to that described above for step S260, and a detailed description is not repeated here.

According to the technical scheme of the invention, a tool for isostere generation with high computation speed and high structural diversity is provided. The multi-core CPU computation mode is simple, convenient and easy to scale; the isostere fragment library is very large, on the order of millions of entries, so the generated isostere-replaced molecular structures are highly diverse; and the drug-likeness evaluation can recommend candidates with reasonable properties from a large number of isostere-replaced molecules. The invention can greatly improve the efficiency with which drug researchers optimize lead compounds, facilitates the discovery of new drug structures, and advances drug research and development.

The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions of the methods and apparatus of the present invention, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy diskettes, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.

In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to execute the prediction method of potentially active molecules of the present invention according to instructions in said program code stored in the memory.

A8, the method of A7, wherein the step of performing a deduplication process on all isosteres of the extracted group comprises: extracting the environmental structures of two isostere pairs if the isosteres extracted from the two isostere pairs are the same; and, if the environmental structures of the two isosteres are also the same, removing one of the isosteres, and otherwise removing neither isostere. A9, the method of any one of A1 to A8, wherein the groups to be replaced other than the parent nucleus structure are m groups G₁ to Gₘ, where m is less than or equal to 4, and each group to be replaced has a corresponding isostere set, where the number of isosteres of the i-th group to be replaced Gᵢ is Nᵢ; in this case, isostere replacement of the groups to be replaced is realized by permutation and combination.

A10, the method of A9, wherein the step of replacing the group to be replaced with each isostere in the set comprises: counting the total number of replacements for the plurality of groups, generating a plurality of replacement tasks, and sending the replacement tasks to a plurality of servers for processing, wherein each replacement task comprises object structure data and array structure data; the object structure data comprises the molecular structure to be optimized, the splice sites and the information on the groups to be replaced, and the array structure data comprises the isostere set information of each group to be replaced. A11, the method of any one of A1-A10, wherein the calculation formula of the drug-likeness QED is as follows:

$$\mathrm{QED} = \exp\left(\frac{1}{n}\sum_{i=1}^{n}\ln d_i\right)$$

where n represents the total number of molecular properties and dᵢ represents the desirability function of the i-th molecular property.

A12, the method of any one of A1-A11, further comprising the step of: storing the candidate molecules recommended to the user each time in a molecule library, so that candidate drugs can subsequently be screened from the molecule library during drug screening. A13, the method of any one of A1-A12, wherein after collecting the structural data of a plurality of small molecule compounds having biological activity, the method further comprises the steps of: removing erroneous structural data and the inorganic small molecule and inorganic salt structures contained in the structural data, and converting the processed structural data into a Simplified Molecular Input Line Entry System (SMILES) representation.

By way of example, and not limitation, readable media comprise readable storage media and communication media. The readable storage medium stores information such as computer readable instructions, data structures, program modules, or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.

In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with examples of the invention. The required structure for a construction of such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present invention.

In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into a plurality of sub-modules.

Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.

Furthermore, some of the embodiments are described herein as methods or combinations of method elements that may be implemented by a processor of a computer system or by other means of performing the functions. Thus, a processor with the necessary instructions for implementing the described method or method element forms a means for implementing the method or method element. Furthermore, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is for carrying out the functions performed by the elements for carrying out the objects of the invention.

As used herein, unless otherwise specified the use of the ordinal terms "first," "second," "third," etc., to describe a general object merely denote different instances of like objects, and are not intended to imply that the objects so described must have a given order, either temporally, spatially, in ranking, or in any other manner.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of the above description, will appreciate that other embodiments are contemplated within the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is defined by the appended claims.

Claims

1. A method of predicting potentially active molecules, adapted to be executed in a computing device, comprising the steps of:

collecting structural data of a plurality of small molecule compounds with biological activity;

comparing the small molecule compounds pairwise, and identifying a common substructure and the remaining variable substructures of every two small molecule compounds, wherein the two variable substructures form an isostere pair;

performing de-duplication treatment on all generated isostere pairs to obtain an isostere set of each group;

receiving a molecule to be optimized input by a user, and identifying a parent nucleus structure of the molecule and groups to be replaced except the parent nucleus structure;

acquiring an isostere set of the groups to be replaced, and replacing the groups to be replaced by adopting each isostere in the set to obtain a plurality of candidate molecules; and

calculating the drug-likeness of each candidate molecule, and selecting one or more candidate molecules with high drug-likeness to recommend to the user.

2. The method of claim 1, wherein for any two small molecule compounds, the step of identifying the common substructure and the remaining variable substructures of every two small molecule compounds comprises:

identifying a maximum common substructure of the two small molecule compounds and extracting one or more common substructures within the maximum common substructure;

for any common substructure, extracting the two variable substructures of the two small molecule compounds other than the common substructure to form an isostere pair based on the common substructure;

the broken bond of the common substructure and the variable substructure is the splicing site of the corresponding isostere.

3. The method of claim 1 or 2, wherein the step of identifying the common substructure and the remaining variable substructures of every two small molecule compounds further comprises:

determining a broken bond between a common substructure and a variable substructure, and extending a predetermined number of chemical bonds from the end of the broken bond into the common substructure to extract the environmental structures of the two variable substructures.

4. The method of claim 1 or 2, wherein the step of identifying the common substructure and the remaining variable substructures of every two small molecule compounds further comprises:

converting the identified isostere pairs and environmental structures into Simplified Molecular Input Line Entry System (SMILES) representations, wherein the splice site of each isostere is labeled in the SMILES representation of the isostere pair.

5. The method of claim 2, wherein the extracting one or more common substructures within the maximum common substructure comprises:

taking the maximum common substructure as a common substructure; or alternatively

breaking one or more chemical bonds in the maximum common substructure to generate a plurality of cleavage modes;

for each cleavage mode, extracting the fragment sets of the two small molecule compounds after cleavage respectively, and taking the union of all identical fragments in the two fragment sets as a common substructure of the two small molecule compounds.

6. The method of claim 5, wherein the broken chemical bond is an acyclic single bond.

7. The method of claim 1 or 2, wherein the step of deduplicating all generated isostere pairs comprises:

for any group, an isostere pair having the group, and an isostere of the group in the isostere pairs are extracted, and all isosteres of the extracted group are subjected to a deduplication treatment.

8. The method of claim 7, wherein the step of deduplicating all isosteres of the extracted group comprises:

extracting the environmental structures of two isostere pairs if the isosteres extracted from the two isostere pairs are the same;

if the environmental structures of the two isosteres are also the same, removing one of the isosteres, and otherwise removing neither isostere.

9. The method of claim 1 or 2, wherein,

the groups to be replaced other than the parent nucleus structure are m groups G₁ to Gₘ, where m is less than or equal to 4, and each group to be replaced has a corresponding isostere set, where the number of isosteres of the i-th group to be replaced Gᵢ is Nᵢ; in this case, isostere replacement of the groups to be replaced is realized by permutation and combination.

10. The method of claim 9, wherein the step of replacing the group to be replaced with each isostere in the set comprises:

counting the total replacement times of the groups, generating a plurality of replacement tasks, and sending the replacement tasks to a plurality of servers for processing, wherein each replacement task comprises object structure data and array structure data;

the object structure data comprises a molecular structure to be optimized, splicing sites and group information to be replaced, and the array structure data comprises isostere set information of each group to be replaced.

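11. The method according to claim 1 or 2, wherein the calculation formula of the drug-likeness QED is:

$$\mathrm{QED} = \exp\left(\frac{1}{n}\sum_{i=1}^{n}\ln d_i\right)$$

where n represents the total number of molecular properties and dᵢ represents the desirability function of the i-th molecular property.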

12. The method according to claim 1 or 2, further comprising the step of:

storing the candidate molecules recommended to the user each time in a molecule library, so that candidate drugs can subsequently be screened from the molecule library during drug screening.

13. The method according to claim 1 or 2, wherein after collecting structural data of a plurality of small molecule compounds having biological activity, further comprising the steps of:

removing erroneous structural data and the inorganic small molecule and inorganic salt structures contained in the structural data, and converting the processed structural data into a Simplified Molecular Input Line Entry System (SMILES) representation.

14. A predictive device for potentially active molecules, adapted to reside in a computing device, comprising:

the data collection module is suitable for collecting structural data of a plurality of small molecular compounds with biological activity;

an isostere pair identification module adapted to compare the small molecule compounds pairwise and identify a common substructure and the remaining variable substructures of every two small molecule compounds, wherein the two variable substructures form an isostere pair;

the isostere library generation module is suitable for carrying out de-duplication treatment on all generated isostere pairs to obtain an isostere set of each group;

the molecule receiving module is suitable for receiving a molecule to be optimized input by a user, and identifying a parent nucleus structure of the molecule and groups to be replaced except the parent nucleus structure;

the molecule replacement module is suitable for acquiring an isostere set of the groups to be replaced, and replacing the groups to be replaced by adopting each isostere in the set respectively to obtain a plurality of candidate molecules; and

the molecule recommendation module is suitable for calculating the drug-likeness of each candidate molecule and selecting one or more candidate molecules with high drug-likeness to recommend to a user.

15. A computing device, comprising:

a memory;

one or more processors;

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-13.

16. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-13.