CN111354424A

CN111354424A - Prediction method and device of potential active molecules and computing equipment

Info

Publication number: CN111354424A
Application number: CN202010124320.3A
Authority: CN
Inventors: 杜杰文; 赖力鹏; 温书豪; 马健
Original assignee: Beijing Jingpai Technology Co ltd
Current assignee: Beijing Jingpai Technology Co ltd
Priority date: 2020-02-27
Filing date: 2020-02-27
Publication date: 2020-06-30
Anticipated expiration: 2040-02-27
Also published as: CN111354424B

Abstract

The invention discloses a method for predicting potentially active molecules, which is executed in a computing device and comprises the following steps: collecting structural data of a plurality of small molecule compounds with biological activity; comparing the small molecule compounds pairwise and identifying the common substructure and the remaining variable substructure of every two small molecule compounds, wherein the two variable substructures form an isostere pair; deduplicating all the generated isostere pairs to obtain an isostere set for each group; receiving a molecule to be optimized input by a user, and identifying the parent nucleus structure of the molecule and the groups to be replaced other than the parent nucleus structure; acquiring the isostere set of each group to be replaced, and replacing the group with each isostere in the set to obtain a plurality of candidate molecules; and calculating the drug-likeness of each candidate molecule, and selecting one or more candidate molecules with high drug-likeness to recommend to the user. The invention also discloses a corresponding device for predicting potentially active molecules and a corresponding computing device.

Description

Prediction method and device of potential active molecules and computing equipment

Technical Field

The invention relates to the technical field of computers, in particular to a method, a device and a computing device for predicting potential active molecules.

Background

As is well known, drug development is a long process that suffers from long development cycles, low success rates and high costs. With advances in computer technology and the development of big data technology, artificial intelligence is proving highly valuable in many industries and is receiving wide attention in the pharmaceutical industry. In new drug discovery, virtual screening can improve the enrichment of active molecules, and predicting the properties of compounds can save a large amount of manpower and material resources, shorten the drug development cycle and accelerate the translation of research results. Virtual screening has therefore attracted great attention from research institutions and pharmaceutical companies in recent years.

Computer-aided drug screening requires large and diverse libraries of active molecules, but constructing such libraries is still complex, and their molecular diversity still needs to be improved. Moreover, researchers often need a system that automatically provides multiple candidates with reasonable properties, so as to improve the efficiency of lead optimization, discover new drug structures and advance the drug development process.

Disclosure of Invention

In view of the above, the present invention proposes a method, a device and a computing device for predicting potentially active molecules, in an attempt to solve, or at least alleviate, the problems described above.

According to one aspect of the present invention, there is provided a method for predicting potentially active molecules, adapted to be executed in a computing device, comprising the steps of: collecting structural data of a plurality of small molecule compounds with biological activity; comparing the small molecule compounds pairwise and identifying the common substructure and the remaining variable substructure of every two small molecule compounds, wherein the two variable substructures form an isostere pair; deduplicating all generated isostere pairs to obtain an isostere set for each group; receiving a molecule to be optimized input by a user, and identifying the parent nucleus structure of the molecule and the groups to be replaced other than the parent nucleus structure; acquiring the isostere set of each group to be replaced, and replacing the group with each isostere in the set to obtain a plurality of candidate molecules; and calculating the drug-likeness of each candidate molecule, and selecting one or more candidate molecules with high drug-likeness to recommend to the user.

Optionally, in the method for predicting potentially active molecules according to the present invention, the step of identifying the common substructure and the remaining variable substructures of every two small molecule compounds comprises: identifying the largest common substructure of the two small molecule compounds, and extracting one or more common substructures from within the largest common substructure; and, for any common substructure, extracting from the two small molecule compounds the two variable substructures other than the common substructure, so as to form an isostere pair based on that common substructure; wherein the positions of the broken bonds between the common substructure and the variable substructures are the splicing sites of the corresponding isosteres.

Optionally, in the method for predicting potentially active molecules according to the present invention, the step of identifying the common substructure and the remaining variable substructures of every two small molecule compounds further comprises: determining the broken bond between the common substructure and the variable substructure, and extending a predetermined number of chemical bonds from the end of the broken bond towards the common substructure to extract the environment structure of the two variable substructures.

Optionally, in the method for predicting potentially active molecules according to the present invention, the step of identifying the common substructure and the remaining variable substructures of every two small molecule compounds further comprises: converting the identified isostere pair and the environment structure into SMILES (Simplified Molecular Input Line Entry System) representations, wherein the SMILES representation of the isostere pair is labeled with the splicing site of each isostere.

Optionally, in the method for predicting potentially active molecules according to the present invention, the step of extracting one or more common substructures within the largest common substructure comprises: taking the largest common substructure as a common substructure; or breaking one or more chemical bonds within the largest common substructure to generate a plurality of fragmentation modes, and, for each fragmentation mode, extracting the fragment sets of the two small molecule compounds after fragmentation and taking the union of all identical fragments in the two fragment sets as a common substructure of the two small molecule compounds.

Optionally, in the method for predicting potentially active molecules according to the present invention, the broken chemical bond is an acyclic single bond.

Optionally, in the method for predicting potentially active molecules according to the present invention, the step of deduplicating all generated isostere pairs comprises: for any group, extracting the isostere pairs that contain the group and the isostere of the group in each such pair, and deduplicating all the extracted isosteres of the group.

Optionally, in the method for predicting potentially active molecules according to the present invention, the step of deduplicating all the extracted isosteres of the group comprises: if the isosteres extracted from two isostere pairs are the same, extracting the environment structures of the two isostere pairs; if the two environment structures are also the same, removing one of the isosteres; otherwise, keeping both.

Optionally, in the method for predicting potentially active molecules according to the present invention, there are m groups to be replaced other than the parent nucleus structure, G₁ to G_m, with m ≤ 4; each group to be replaced has a corresponding isostere set, the i-th group to be replaced G_i having N_i isosteres; in this case, isosteric replacement of the groups to be replaced is carried out by permutation and combination.

Optionally, in the method for predicting potentially active molecules according to the present invention, the step of replacing the group to be replaced with each isostere in the set comprises: counting the total number of group replacements, generating a plurality of replacement tasks and sending the replacement tasks to a plurality of servers for processing, wherein each replacement task comprises object structure data and array structure data; the object structure data comprise the structure of the molecule to be optimized, the splicing sites and information on the groups to be replaced, and the array structure data comprise the isostere set information of each group to be replaced.

Optionally, in the method for predicting potentially active molecules according to the present invention, the drug-likeness QED is calculated as:

$$\mathrm{QED} = \exp\left(\frac{1}{n}\sum_{i=1}^{n}\ln d_i\right)$$

wherein n represents the total number of molecular attributes, and d_i represents the desirability function of the i-th molecular attribute, i.e. the contribution of that attribute to the overall drug-likeness.
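As an illustration of the scoring step, the following minimal sketch assumes the unweighted QED, i.e. the geometric mean of the n desirability values d_i. The function names `qed_score` and `recommend`, and the idea of passing precomputed desirability values per candidate, are assumptions made for this example; the text does not specify how the individual d_i are obtained.

```python
import math
from typing import Dict, List, Sequence


def qed_score(desirabilities: Sequence[float]) -> float:
    """Unweighted QED: exp of the mean log-desirability (a geometric mean).

    Assumption: each d_i has already been computed and lies in (0, 1].
    """
    if not desirabilities:
        raise ValueError("at least one molecular attribute is required")
    if any(d <= 0.0 or d > 1.0 for d in desirabilities):
        raise ValueError("each desirability must lie in (0, 1]")
    n = len(desirabilities)
    return math.exp(sum(math.log(d) for d in desirabilities) / n)


def recommend(candidates: Dict[str, Sequence[float]], top_k: int = 1) -> List[str]:
    """Rank candidate molecules by QED and return the top_k identifiers."""
    ranked = sorted(candidates, key=lambda name: qed_score(candidates[name]), reverse=True)
    return ranked[:top_k]
```

Because the score is a geometric mean, a single very poor attribute pulls the overall value down sharply, which is why a candidate with uniformly moderate attributes can outrank one with a single bad attribute.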

Optionally, the method for predicting potentially active molecules according to the present invention further comprises the step of: storing the candidate molecules recommended to the user each time in a molecule library, so that candidate drugs can later be screened from the molecule library during drug screening.

Optionally, in the method for predicting potentially active molecules according to the present invention, after collecting structural data of a plurality of small molecule compounds having biological activity, the method further comprises the steps of: removing erroneous structure data as well as the inorganic small molecule and inorganic salt structures contained in the structure data, and converting the processed structure data into SMILES representations.

According to another aspect of the present invention, there is provided a device for predicting potentially active molecules, adapted to reside in a computing device, comprising: a data collection module adapted to collect structural data of a plurality of small molecule compounds having biological activity; an isostere pair recognition module adapted to compare the small molecule compounds pairwise and identify the common substructure and the remaining variable substructure of every two small molecule compounds, wherein the two variable substructures form an isostere pair; an isostere library generation module adapted to deduplicate all generated isostere pairs to obtain an isostere set for each group; a molecule receiving module adapted to receive a molecule to be optimized input by a user and identify the parent nucleus structure of the molecule and the groups to be replaced other than the parent nucleus structure; a molecule replacement module adapted to acquire the isostere set of each group to be replaced and replace the group with each isostere in the set to obtain a plurality of candidate molecules; and a molecule recommending module adapted to calculate the drug-likeness of each candidate molecule and select one or more candidate molecules with high drug-likeness to recommend to the user.

According to yet another aspect of the present invention, there is provided a computing device comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs, when executed by the processors, implement the steps of the method for predicting potentially active molecules as described above.

According to a further aspect of the invention, there is provided a readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a computing device, carry out the steps of the method for predicting potentially active molecules as described above.

The technical solution provided by the invention has the advantages that the isostere library is large, the calculation is fast, the generated candidates are highly diverse, and the generated candidates are scored automatically, so that potentially active molecules with excellent physicochemical properties are recommended to the user. The invention can greatly improve the efficiency of lead optimization by drug researchers, helps to discover new drug structures and advance the drug research and development process, and can also be applied to fields such as scaffold hopping to generate new candidate drug molecules.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.

FIG. 1 shows a block diagram of a computing device 100, according to one embodiment of the invention;

FIG. 2 shows a flow diagram of a method 200 for predicting potentially active molecules according to one embodiment of the invention;

FIG. 3 shows a schematic diagram of an isostere pair generation method according to one embodiment of the invention; and

FIG. 4 shows a block diagram of a device 400 for predicting potentially active molecules according to one embodiment of the invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

FIG. 1 is a block diagram of a computing device 100 according to one embodiment of the invention. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.

Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.

Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some embodiments, application 122 may be arranged to operate with program data 124 on an operating system. The program data 124 comprises instructions, and in the computing device 100 according to the invention the program data 124 comprises instructions for performing the method 200 for predicting potentially active molecules.

Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150, which may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.

A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or a dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.

Computing device 100 may be implemented as a server, such as a file server, a database server, an application server, a web server, etc., or as part of a small-form-factor portable (or mobile) electronic device, such as a cellular telephone, a personal digital assistant (PDA), a personal media player device, a wireless web-browsing device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer, including both desktop and notebook computer configurations. In some embodiments, the computing device 100 is configured to perform a method 200 for predicting potentially active molecules.

FIG. 2 shows a flow diagram of a method 200 for predicting potentially active molecules according to one embodiment of the invention. The method 200 is executed in a computing device, such as the computing device 100, to recommend potentially active molecules to a user. As shown in FIG. 2, the method begins at step S210.

In step S210, structural data of a plurality of small molecule compounds having biological activity are collected. The structural data of the active molecules can be obtained from any currently published database, and the present invention is not limited in this respect. Typically, about 200 million publicly reported structures of small molecules with biological activity can be collected.

According to one embodiment, after the structural data of a plurality of small molecule compounds having biological activity are collected, the method further comprises the steps of: removing erroneous structure data as well as the inorganic small molecule and inorganic salt structures contained in the structure data, and converting all the processed structure data into the Simplified Molecular Input Line Entry System (SMILES) representation. As shown in the table below, erroneous data generally refers to structures that the computing device cannot recognize or cannot convert to SMILES; inorganic molecules contained in the structure data include, for example, HCl and HBr; and inorganic salts include, for example, sodium salts and potassium salts. Of course, the table below gives only a few exemplary structures; the data may actually contain a wide variety of inorganic molecules or inorganic salts, and the present invention is not limited thereto.
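The cleaning step can be sketched as follows. This is a deliberately simplified, standard-library-only illustration: it treats a record as erroneous when its brackets or parentheses are unbalanced, and it treats any dot-separated SMILES component without a carbon atom as an inorganic molecule or counter-ion. Both rules and the name `clean_record` are assumptions for this example; a production pipeline would more likely rely on a cheminformatics toolkit for parsing and salt stripping.

```python
import re
from typing import Optional

# Atom-like tokens: a bracket atom, a two-letter organic-subset atom, or one letter.
_ATOM_TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[A-Za-z]")
# Element symbol at the start of a bracket atom, after an optional isotope number.
_BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def _is_balanced(smiles: str) -> bool:
    """Crude validity check: parentheses and square brackets must pair up."""
    depth, in_bracket = 0, False
    for ch in smiles:
        if ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif ch == "(" and not in_bracket:
            depth += 1
        elif ch == ")" and not in_bracket:
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and not in_bracket


def _has_carbon(fragment: str) -> bool:
    """True if the SMILES fragment contains an aliphatic or aromatic carbon."""
    for token in _ATOM_TOKEN.findall(fragment):
        if token.startswith("["):
            match = _BRACKET_ELEMENT.match(token)
            if match and match.group(1) in ("C", "c"):
                return True
        elif token in ("C", "c"):
            return True
    return False


def clean_record(smiles: str) -> Optional[str]:
    """Return the organic part of a SMILES record, or None if it is rejected."""
    if not smiles or not _is_balanced(smiles):
        return None  # erroneous structure data
    organic = [part for part in smiles.split(".") if part and _has_carbon(part)]
    return ".".join(organic) if organic else None
```

For example, `clean_record("CCN.Cl")` keeps only the amine and drops the HCl component, while a record consisting solely of `[Na+].[Cl-]` is rejected. Note that this sketch does not neutralize charges left on the organic component.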

Subsequently, in step S220, the small molecule compounds are compared pairwise, and the common substructure and remaining variable substructure of every two small molecule compounds are identified, wherein the two variable substructures constitute an isostere pair.

Here, a matched molecular pair (MMP) method is used to identify all possible generalized isostere pairs in the molecule library and to construct the isostere library. When a pair of molecules differs only slightly and most of their substructures are the same, the identical substructures are called common substructures (constants) and the differing substructures are called variable substructures (variables). These differing variable substructures can be regarded as isosteres in a broad sense. Strictly speaking, an isostere pair may include not only the structures of the molecule pair but also the transformation of the variable substructure between them.

According to one embodiment, identifying the common substructures and remaining variable substructures in this step may comprise: identifying the largest common substructure of the two small molecule compounds and extracting one or more common substructures within the largest common substructure. For any common substructure, the two variable substructures of the two small molecule compounds other than the common substructure are extracted to form an isostere pair based on that common substructure. The positions of the broken bonds between the common substructure and the variable substructures are the splicing sites of the corresponding isosteres.

In particular, when one or more common substructures are extracted within the largest common substructure, in one implementation the largest common substructure may be taken directly as a common substructure. In another implementation, one or more chemical bonds within the largest common substructure may be broken, generating multiple fragmentation modes. For each fragmentation mode, the fragment sets of the two small molecule compounds after fragmentation are extracted, and the union of all identical fragments in the two fragment sets is taken as a common substructure of the two small molecule compounds. Preferably, the broken chemical bond is an acyclic single bond.

For example, compound A and compound B in FIG. 3 are a pair of molecules in the molecule library. The main difference between the two molecules is the atom attached to the terminal nitrogen atom: compound A ends with a hydrogen atom and compound B ends with a carbon atom. The constant in the upper right box of FIG. 3 is the largest common substructure of compound A and compound B, i.e. a common substructure in the matched molecular pair algorithm, which is generated by breaking the N-H bond of compound A or the N-C bond of compound B. Apart from the largest common substructure, the remaining two substructures can be regarded as variable substructures, namely variable A and variable B, which form an isostere pair. The cleavage sites of the broken bonds are in fact the splicing sites of the isosteres.

Of course, within the largest common substructure, the two compounds may also have multiple common substructure representations, which can be obtained by breaking one or more bonds within the largest common substructure. Typically, only acyclic single bonds are broken, so the bonds within the benzene ring and the carbonyl double bond in FIG. 3 are not broken. The breakable bonds include the bond connecting the benzene ring to the carbonyl carbon, the N-H bond and the N-C bond, among others. One or more of these bonds can be selected, multiple fragmentation modes are obtained by permutation and combination, and each fragmentation mode yields a corresponding set of fragments. For fragmentation mode 1, assume that the fragment set of compound A is {I₁, I₂, I₃} and the fragment set of compound B is {I₁, I₂, I₄}; then fragments I₁ and I₂ together form the common substructure of the two compounds, while fragments I₃ and I₄, as the two variable substructures, form an isostere pair.

Taking the lower right box in FIG. 3 as an example, simultaneously breaking the bond between the benzene ring and the carbonyl carbon and the N-C bond yields two fragment structures that together form a common substructure (constant); the structures other than these two fragments in the two compounds serve as the variable substructures, variable A and variable B, which form an isostere pair. Likewise, breaking only the bond between the benzene ring and the carbonyl carbon produces another common substructure and its corresponding isostere pair. Thus, for a pair of compounds, k common substructures and the corresponding k isostere pairs can be obtained from k (k ≥ 1) fragmentation modes.
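The fragment-set comparison can be illustrated with a small sketch. It assumes fragments are already available as canonical string identifiers (such as canonical SMILES with labeled cut points), so that identical fragments compare equal. The function names `cleavage_modes` and `isostere_pair`, the `max_cuts` parameter, and the rule that exactly one differing fragment must remain on each side are assumptions made for this example.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Iterator, Optional, Sequence, Tuple


def cleavage_modes(breakable_bonds: Sequence[str], max_cuts: int = 2) -> Iterator[Tuple[str, ...]]:
    """Enumerate fragmentation modes: every choice of 1..max_cuts breakable bonds."""
    for n_cuts in range(1, max_cuts + 1):
        yield from combinations(breakable_bonds, n_cuts)


def isostere_pair(
    fragments_a: Iterable[str], fragments_b: Iterable[str]
) -> Optional[Tuple[FrozenSet[str], str, str]]:
    """Split two fragment sets into (common substructure, variable A, variable B).

    The common substructure is the set of fragments present in both molecules.
    Returns None when the two molecules do not differ by exactly one fragment each.
    """
    set_a, set_b = set(fragments_a), set(fragments_b)
    constant = set_a & set_b
    variable_a = set_a - constant
    variable_b = set_b - constant
    if not constant or len(variable_a) != 1 or len(variable_b) != 1:
        return None
    return frozenset(constant), variable_a.pop(), variable_b.pop()
```

With the fragment sets from the example in the text, `isostere_pair({"I1", "I2", "I3"}, {"I1", "I2", "I4"})` yields the common substructure `{"I1", "I2"}` and the isostere pair `("I3", "I4")`. Running it once per fragmentation mode gives the k common substructures and k isostere pairs described above.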

In addition, the matched molecular pair algorithm can also represent the transformation between a pair of isosteres, such as the arrow representation in the upper left box in FIG. 3, which represents the transformation of the molecular structure from variable A to variable B. The transformation can also be expressed by the SMILES of the two variable substructures, such as [H][*:1]>>C[*:1] as shown in FIG. 3.

Furthermore, in step S220, the broken bond (i.e., the cleavage site) between the common substructure and the variable substructure may also be determined, and a predetermined number of chemical bonds may be extended from the end of the broken bond towards the common substructure to extract the environment structure of the two variable substructures. For example, when the predetermined number is 2, the corresponding environment structure is the one labeled "Environment at radius 2" in FIG. 3, i.e. the structure obtained by extending two chemical bonds from the cleavage site into the common substructure. In the upper right box of FIG. 3, the common substructure is obtained by cutting the N-H and N-C bonds, so two chemical bonds are extended from the end of the broken bond, starting at the nitrogen atom, to obtain the environment structure of the cleavage site. The environment structure represents the chemical bonding context of the isostere and serves as a reference when judging whether two isosteres are identical or duplicated.

It will be appreciated that for a common substructure made up of a plurality of fragments, a sub-environment structure may be obtained from the cleavage site of each fragment, and that a fragment itself may be regarded as a sub-environment structure when it is too short to extend the corresponding number of chemical bonds. All sub-environment structures within a common substructure together form the environment structure of an isostere pair.
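Extracting the environment structure amounts to a bounded breadth-first search over the atoms of the common substructure. The sketch below assumes the common substructure is given as an adjacency mapping from atom index to neighbor indices, and that the attachment atom (the atom on the common-substructure side of the broken bond) is known. The representation and the name `environment_atoms` are illustrative assumptions, not part of the described method.

```python
from collections import deque
from typing import Dict, List, Set


def environment_atoms(adjacency: Dict[int, List[int]], attachment_atom: int, radius: int = 2) -> Set[int]:
    """Atoms reachable from the attachment atom within `radius` bonds.

    If the fragment is shorter than the radius, the whole fragment is returned,
    which matches treating a short fragment as its own sub-environment.
    """
    distance = {attachment_atom: 0}
    queue = deque([attachment_atom])
    while queue:
        atom = queue.popleft()
        if distance[atom] == radius:
            continue  # do not extend beyond the predetermined number of bonds
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    return set(distance)


# Benzamide-like common substructure: N(0)-C(1)(=O(2))-c(3), ring atoms 3..8.
benzamide = {
    0: [1], 1: [0, 2, 3], 2: [1],
    3: [1, 4, 8], 4: [3, 5], 5: [4, 6], 6: [5, 7], 7: [6, 8], 8: [7, 3],
}
```

Starting from the nitrogen (atom 0) with radius 2, the environment covers the nitrogen, the carbonyl carbon, the carbonyl oxygen and the ring carbon bonded to the carbonyl. For a multi-fragment common substructure, the function would be called once per cleavage site and the results combined.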

According to an embodiment of the invention, step S220 may also convert the identified isostere pair and the environment structure into SMILES representations, wherein the SMILES representation of the isostere pair is labeled with the splicing site of each isostere. In general, once the SMILES representations of a pair of molecules are known, the SMILES representations of all isostere pairs in the molecule pair, of the environment structure of each isostere pair, and of the transformation of each isostere pair can be output, wherein the SMILES representation of an isostere pair contains the splicing sites of the isosteres (i.e., the cleavage sites that generated them).

Subsequently, in step S230, all the generated isostere pairs are subjected to deduplication processing to obtain an isostere set for each group.

As described above, extracting the differing substructures in each pair of molecules generates isosteres, and comparing the collected molecules pairwise generates a large number of isostere pairs. These isostere pairs need to be deduplicated to generate the isostere library.

Specifically, for any group, the isostere pairs containing the group and the isostere of the group in each such pair are extracted, and all the extracted isosteres of the group are deduplicated. For example, a group R occurs in a plurality of isostere pairs, so the isosteres of R in these pairs are extracted and deduplicated to obtain the isostere set of the group R.

Further, the deduplication may take the environment structure of the isostere pair into account. Specifically, if the isosteres extracted from two isostere pairs are the same, the environment structures of the two isostere pairs are extracted. If the two environment structures are also the same, one of the isosteres is removed; otherwise, both are kept.
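Environment-aware deduplication can be sketched with a set keyed on both the isostere and its environment. The example assumes each isostere pair arrives as a tuple of three canonical strings (variable A, variable B, environment) and that a pair contributes an entry to both of its members; the tuple layout and the name `build_isostere_library` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

IsosterePair = Tuple[str, str, str]  # (variable A, variable B, environment)


def build_isostere_library(pairs: Iterable[IsosterePair]) -> Dict[str, List[Tuple[str, str]]]:
    """Group isosteres by the group they replace, dropping exact duplicates.

    Two entries count as duplicates only if both the isostere and the
    environment structure match, so the same isostere seen in a different
    environment is kept.
    """
    library: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for variable_a, variable_b, environment in pairs:
        library[variable_a].add((variable_b, environment))
        library[variable_b].add((variable_a, environment))
    return {group: sorted(entries) for group, entries in library.items()}
```

Using a set per group makes the removal of duplicates implicit, and sorting the entries gives a stable order, which is convenient when replacement tasks are later generated from the library.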

Subsequently, in step S240, a molecule to be optimized input by a user is received, and the parent nucleus structure of the molecule and the groups to be replaced other than the parent nucleus structure are identified. The parent nucleus structure may be specified manually, or a parent nucleus recognition model may be trained by deep learning on the parent nucleus structures of a plurality of molecules, so that the parent nucleus structure of the molecule to be optimized is identified automatically by the model. Generally, the training set of the model includes a plurality of molecules and their labeled parent nucleus structures, and the trained model is obtained by learning from these data. The structure and parameters of the model can be set by those skilled in the art as needed, and the invention is not limited in this respect.

Subsequently, in step S250, an isostere set of the to-be-replaced groups is obtained, and each isostere in the set is used to replace the to-be-replaced group, so as to obtain a plurality of candidate molecules.

In general, there may be one or more groups to be replaced other than the parent nucleus, for example G₁ to G_m with m ≤ 4. Each group to be replaced has a corresponding isostere set, where the i-th group to be replaced, G_i, has N_i isosteres. When isosteric replacement of multiple groups is carried out by permutation and combination, the molecule has N₁ × N₂ × … × N_m isosteric replacement schemes in total.
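As a small worked illustration of the count N₁ × N₂ × … × N_m, the sketch below computes the number of replacement schemes and enumerates them lazily. The function names and the example fragment strings are hypothetical; the limit of four groups follows the m ≤ 4 condition stated above.

```python
from itertools import product
from math import prod

MAX_GROUPS = 4  # m <= 4, as stated in the description

def count_schemes(isostere_sets):
    """Return N_1 * N_2 * ... * N_m for the given isostere sets."""
    if not 1 <= len(isostere_sets) <= MAX_GROUPS:
        raise ValueError("expected between 1 and 4 groups to be replaced")
    return prod(len(s) for s in isostere_sets)

def enumerate_schemes(isostere_sets):
    """Lazily yield one tuple of isosteres per replacement scheme."""
    return product(*isostere_sets)

sets = [["F", "Cl", "Br"], ["OC", "N"]]   # N_1 = 3, N_2 = 2
total = count_schemes(sets)
schemes = list(enumerate_schemes(sets))
print(total, schemes[0])
```

Because `itertools.product` is lazy, the full list only needs to be materialized when the total is small.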

According to one embodiment, the step of replacing the group to be replaced with each isostere of the set comprises: counting the total number of replacements over the plurality of groups, generating a plurality of replacement tasks, and sending the replacement tasks to a plurality of servers for processing, wherein each replacement task comprises object structure data and array structure data. The object structure data comprise the molecular structure to be optimized, the splicing sites, and information on the groups to be replaced; the array structure data comprise the isostere set information of each group to be replaced. The isostere replacement tasks can be distributed by group: based on a controlled variable method, the m groups G₁ to G_m are ordered sequentially, the corresponding isostere sets are ordered likewise, and a plurality of tasks are generated by permutation and combination in that order and distributed to the multi-core CPU.
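One possible shape for such a replacement task is sketched below. The class names `MoleculeObject` and `ReplacementTask`, their fields, and the rule "one task per isostere of G₁, with the remaining groups varying" are assumptions made for illustration; the description does not fix the exact task layout.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MoleculeObject:
    """Object data: the molecule, its splicing sites, and the groups to replace."""
    smiles: str
    splice_sites: Dict[str, int]   # group label -> index of the core atom it attaches to
    groups: Dict[str, str]         # group label -> current group, in G1..Gm order

@dataclass
class ReplacementTask:
    molecule: MoleculeObject
    isostere_arrays: List[List[str]]   # array data: one array per group, in G1..Gm order

def make_tasks(molecule, isostere_sets):
    """Assumed reading of the controlled-variable idea: hold G1 fixed per task."""
    labels = list(molecule.groups)                      # keeps G1..Gm order
    first, *rest = (list(isostere_sets[label]) for label in labels)
    return [ReplacementTask(molecule, [[iso], *[list(r) for r in rest]])
            for iso in first]

molecule = MoleculeObject(
    smiles="Clc1ccc(OC)cc1",
    splice_sites={"G1": 1, "G2": 4},
    groups={"G1": "Cl", "G2": "OC"},
)
isostere_sets = {"G1": ["F", "Br", "C"], "G2": ["OCC", "N"]}
tasks = make_tasks(molecule, isostere_sets)
print(len(tasks), tasks[0].isostere_arrays)
```

Each task carries the same molecule object but a different slice of the combinatorial space, so the tasks can be processed independently and their results merged afterwards.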

Here, the present invention provides a suitable fragment splicing method and data structure to store the input molecule, fragments, splicing sites, and isostere fragments, and adopts parallel splicing on a multi-core CPU to perform isostere replacement on the input molecule. The splicing sites are marked when the isosteres are generated, and the input molecular structure is likewise marked with its splicing sites, so that atom-level splicing of the molecular structures can be carried out directly with a cheminformatics software package. When a molecular structure, e.g. G₁-A-G₂, consists of three parts, where G₁ and G₂ both denote groups to be replaced and A is to be preserved as the parent nucleus structure of the molecule, the isostere sets corresponding to G₁ and G₂ are retrieved from a database by exact structure matching; they contain N₁ and N₂ isosteres respectively. Splicing the N₁ and N₂ isosteres onto the parent nucleus A in combination generates N₁ × N₂ new molecules. If no isostere is matched for a group to be replaced, that group is not replaced.
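The G₁-A-G₂ splicing can be sketched with a toy string model. This is only an illustration under stated assumptions: the parent nucleus is written as a template with `{G1}` and `{G2}` placeholders at the splicing sites, every fragment is written so that its first atom is the attachment atom, and the function name `splice` and the example library are hypothetical. A real implementation would use a cheminformatics toolkit with explicitly marked splicing sites rather than string substitution.

```python
from itertools import product

def splice(core_template, groups, library):
    """Generate candidate strings by substituting isosteres into the core.

    core_template: parent nucleus with '{G1}', '{G2}', ... at the splicing sites.
    groups:        {label: current group}, in G1..Gm order.
    library:       {group: [isosteres]} looked up by exact match.
    """
    labels = list(groups)
    choices = []
    for label in labels:
        original = groups[label]
        # If no isostere is matched, the group is left unreplaced.
        choices.append(library.get(original) or [original])
    return [core_template.format(**dict(zip(labels, combo)))
            for combo in product(*choices)]

core = "c1cc({G1})ccc1{G2}"                 # parent nucleus A with two sites
groups = {"G1": "Cl", "G2": "OC"}
library = {"Cl": ["F", "Br"], "OC": ["OCC", "N", "SC"]}
candidates = splice(core, groups, library)   # N1 x N2 = 2 x 3 candidates
print(len(candidates), candidates[0])
```

With a library that has no entry for `OC`, the same call keeps that group and varies only G₁, which mirrors the fallback rule in the paragraph above.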

In the combination mode, when there are many groups to be replaced, a large number of new molecules are generated and the memory load becomes excessive. To improve replacement efficiency and save memory, the group replacement tasks are distributed to the multi-core CPU in batches for calculation. The data used by each task are stored in the form of objects and arrays. For example, when replacing the G₁-A-G₂ molecule, a plurality of tasks can be generated in a load-balanced manner. In each task, the G₁-A-G₂ molecule is stored in an object structure that contains the splicing sites and the information on the groups to be replaced, while the isostere structures corresponding to G₁ and G₂ are stored in arrays. The tasks are distributed to the multi-core CPU in sequence, and the results are collected after the tasks complete. According to one embodiment, the number of tasks generated under load balancing is given by a formula [formula not reproduced] in which ⌈·⌉ denotes rounding up.
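The batching idea can be sketched as below. Because the task-count formula is not reproduced in this text, the sketch simply assumes a fixed `batch_size` (a hypothetical parameter) and uses ceiling division to obtain the number of tasks; it should not be read as the formula of the embodiment.

```python
from itertools import islice, product
from math import ceil, prod

def batched_schemes(isostere_sets, batch_size):
    """Yield the replacement schemes in batches of at most batch_size."""
    total = prod(len(s) for s in isostere_sets)
    n_tasks = ceil(total / batch_size)        # rounding up, as in the text
    combos = product(*isostere_sets)          # lazy: never holds all schemes at once
    for _ in range(n_tasks):
        yield list(islice(combos, batch_size))

sets = [["F", "Cl", "Br"], ["OCC", "N", "SC"]]    # 3 x 3 = 9 schemes
batches = list(batched_schemes(sets, batch_size=4))
print([len(b) for b in batches])
```

Each batch could then be submitted to a worker pool such as `concurrent.futures.ProcessPoolExecutor`, with the per-batch results concatenated at the end.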

Subsequently, in step S260, the drug-like property (QED) of each candidate molecule is calculated, and one or more candidate molecules with high drug-like property are selected from the candidates and recommended to the user.

The quantitative estimate of drug-likeness (QED) is a quantitative measure of drug-like property. The QED evaluation method is empirical and reflects the underlying distribution of molecular properties including molecular weight, oil-water partition coefficient, topological polar surface area, number of hydrogen bond donors, number of hydrogen bond acceptors, number of aromatic rings, number of rotatable bonds, and the presence of warning structures (structural alerts). A warning structure is a functional group or molecular substructure associated with a specific adverse effect. For example, a drug containing a particular group may cause skin irritation or corrosion after administration; such a group is a warning structure and needs special attention during drug development.

According to one embodiment, the drug-like property QED is calculated as:

$$\mathrm{QED} = \exp\left(\frac{1}{n}\sum_{i=1}^{n}\ln d_i\right)$$

wherein n represents the total number of molecular attributes and d_i represents the desirability function of the i-th molecular attribute, i.e. the contribution of that attribute to the overall drug-like property. The QED ranges from 0 to 1, where 0 indicates that all physicochemical properties are unfavorable for druggability and 1 indicates that all physicochemical properties are favorable.
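A minimal sketch of the scoring and recommendation step follows, assuming the unweighted form QED = exp((1/n) Σ ln d_i), i.e. the geometric mean of the desirability values. The function names `qed` and `recommend`, the candidate labels, and the desirability values are invented for illustration; in practice the d_i would come from fitted desirability functions of the molecular properties.

```python
from math import exp, log

def qed(desirabilities):
    """Geometric mean of the desirability values d_i, each in [0, 1]."""
    if not desirabilities:
        raise ValueError("at least one desirability value is required")
    if any(d < 0 or d > 1 for d in desirabilities):
        raise ValueError("desirability values must lie in [0, 1]")
    if any(d == 0 for d in desirabilities):
        return 0.0                      # limit of the geometric mean; avoids log(0)
    return exp(sum(log(d) for d in desirabilities) / len(desirabilities))

def recommend(candidates, top_k=1):
    """candidates: {name: [d_1, ..., d_n]}; return the top_k names by QED."""
    scored = sorted(((qed(d), name) for name, d in candidates.items()), reverse=True)
    return [(name, round(score, 3)) for score, name in scored[:top_k]]

candidates = {
    "A": [1.0, 1.0, 1.0],
    "B": [0.5, 0.5, 0.5],
    "C": [0.9, 0.4, 0.1],
}
top = recommend(candidates, top_k=2)
print(top)
```

Because the score is a geometric mean, a single very poor property (as in candidate C) pulls the overall value down sharply, which is why it ranks below the uniformly moderate candidate B.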

Through the above steps, the user only needs to input the molecular structure to be optimized and mark the fragments to be replaced in order to obtain recommended isostere-replaced molecules with high drug-like property together with their physicochemical attributes.

Optionally, after step S260, the method 200 may further include the step of storing the candidate molecules recommended to the user each time in a molecule library, so that candidate drugs can later be screened from the molecule library during drug screening.

Fig. 4 shows a block diagram of an apparatus 400 for predicting potential active molecules according to one embodiment of the invention, which may reside in a computing device such as computing device 100. As shown in Fig. 4, the apparatus 400 includes a data collection module 410, an isostere pair identification module 420, an isostere library generation module 430, a molecule receiving module 440, a molecule replacement module 450, and a molecule recommendation module 460.

The data collection module 410 collects structural data for a plurality of small molecule compounds having biological activity. According to one embodiment, the data collection module 410 may further remove erroneous structural data, as well as inorganic small molecule and inorganic salt structures contained in the structural data, from the collected data, and convert all the processed structural data into a simplified molecular-input line-entry system (SMILES) representation. The data collection module 410 may perform processing corresponding to that described above for step S210, and the detailed description is not repeated here.

The isostere pair identification module 420 compares the small molecule compounds pairwise to identify the common substructure and the remaining variable substructure of every two small molecule compounds, wherein the two variable substructures form an isostere pair. Specifically, the isostere pair identification module 420 identifies the largest common substructure of the two small molecule compounds and extracts one or more common substructures within the largest common substructure; for any common substructure, the two variable substructures of the two small molecule compounds other than the common substructure are extracted to form an isostere pair based on that common substructure.

According to one embodiment, the isostere pair identification module 420 may also determine the broken bond between the common substructure and the variable substructure and extend a predetermined number of chemical bonds from the end of the broken bond into the common substructure to extract the environmental structure of the two variable substructures. Further, the isostere pair identification module 420 may convert the identified isostere pairs and the environmental structure into a SMILES representation in which the splicing site of each isostere is labeled. The isostere pair identification module 420 may perform processing corresponding to that described above for step S220, and the detailed description is not repeated here.
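The environment extraction described above amounts to a depth-limited traversal of the molecular graph. The sketch below is an illustration under assumptions: the molecule is given as a plain adjacency dictionary, the function name `environment_atoms` is hypothetical, and distance is counted from the core-side atom of the broken bond (so that atom itself is at distance 0).

```python
from collections import deque

def environment_atoms(adjacency, core_atoms, attachment_atom, max_bonds):
    """Return core atoms within max_bonds bonds of the attachment atom.

    adjacency:       {atom index: [neighbor indices]} for the whole molecule.
    core_atoms:      set of atom indices in the common substructure.
    attachment_atom: core-side end of the broken bond.
    """
    distance = {attachment_atom: 0}
    queue = deque([attachment_atom])
    while queue:
        atom = queue.popleft()
        if distance[atom] == max_bonds:
            continue                      # do not extend past the bond limit
        for neighbor in adjacency[atom]:
            # Only walk into the common substructure, never the variable part.
            if neighbor in core_atoms and neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    return set(distance)

# Chlorobenzene-like graph: atom 0 is the variable group, atoms 1-6 form the ring.
adjacency = {0: [1], 1: [0, 2, 6], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5, 1]}
core_atoms = {1, 2, 3, 4, 5, 6}
env1 = environment_atoms(adjacency, core_atoms, attachment_atom=1, max_bonds=1)
env2 = environment_atoms(adjacency, core_atoms, attachment_atom=1, max_bonds=2)
print(sorted(env1), sorted(env2))
```

A larger bond limit gives a more specific environment, so fewer isostere pairs are treated as duplicates during deduplication.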

The isostere library generation module 430 performs deduplication on all generated isostere pairs to obtain an isostere set for each group. Specifically, for any group, the isostere library generation module 430 extracts every isostere pair containing the group together with the group's isostere in that pair, and deduplicates all the extracted isosteres of the group. If the isosteres extracted from two isostere pairs are the same, the isostere library generation module 430 further extracts the environmental structures of the two isostere pairs. If the two isosteres have the same environmental structure, one of them is removed; otherwise, both are retained. The isostere library generation module 430 may perform processing corresponding to that described above for step S230, and the detailed description is not repeated here.

The molecule receiving module 440 receives a molecule to be optimized input by a user, and identifies a parent nucleus structure of the molecule and a group to be replaced other than the parent nucleus structure. In general, the groups to be replaced other than the parent nucleus are G₁ to G_m with m ≤ 4. Each group to be replaced has a corresponding isostere set, where the i-th group to be replaced, G_i, has N_i isosteres, and the isosteric replacement of the groups to be replaced is carried out by permutation and combination. The molecule receiving module 440 may perform processing corresponding to that described above for step S240, and the detailed description is not repeated here.

The molecule replacement module 450 obtains the isostere set of the to-be-replaced groups, and replaces the to-be-replaced groups with each isostere in the set to obtain a plurality of candidate molecules. Specifically, the molecule replacement module 450 counts the total replacement times of the plurality of groups, generates a plurality of replacement tasks, and sends the replacement tasks to a plurality of servers for processing, wherein each replacement task includes object structure data and array structure data. The object data comprise the molecular structure to be optimized, splicing sites and group information to be replaced, and the array structure data comprise the isostere set information of each group to be replaced. The molecule replacement module 450 may perform a process corresponding to the process described above in step S250, and a detailed description thereof will not be repeated.

The molecule recommending module 460 calculates the drug-like property of each candidate molecule, and selects one or more candidate molecules with high drug-like property from the candidate molecules to recommend to the user. Optionally, the molecule recommendation module 460 may also store the candidate molecules recommended to the user each time in a molecule library for subsequent screening of candidate drugs from the molecule library in a drug screening. The molecule recommendation module 460 may perform the processing corresponding to the processing described above in step S260, and the detailed description thereof is omitted.

The technical scheme of the invention provides a tool that computes quickly and generates structurally diverse isosteres. The multi-core CPU computing mode is simple, convenient, and easy to scale; the isostere fragment library is very large, on the order of millions of entries, so the isostere-replaced molecular structures generated are highly diverse; and the drug-like property evaluation allows candidates with reasonable properties to be recommended from the large number of isostere-replaced molecules. The invention can improve the efficiency with which drug researchers optimize lead compounds, facilitates the discovery of new drug structures, and advances drug research and development.

The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.

In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store program code; the processor is configured to perform the method of predicting potential active molecules of the invention according to instructions in the program code stored in the memory.

A8. The method of A7, wherein the step of de-duplicating all the extracted isosteres of the group comprises: if the isosteres extracted from two isostere pairs are the same, extracting the environmental structures of the two isostere pairs; if the two isosteres have the same environmental structure, removing one of them, and otherwise retaining both.

A9. The method of any one of A1-A8, wherein the groups to be replaced other than the parent nucleus are G₁ to G_m with m ≤ 4, each group to be replaced has a corresponding isostere set, the i-th group to be replaced, G_i, has N_i isosteres, and the isosteric replacement of the groups to be replaced is carried out by permutation and combination.

A10. The method of A9, wherein the step of replacing the group to be replaced with each isostere of the set comprises: counting the total number of replacements over the plurality of groups, generating a plurality of replacement tasks, and sending the replacement tasks to a plurality of servers for processing, wherein each replacement task comprises object structure data and array structure data; the object structure data comprise the molecular structure to be optimized, the splicing sites, and information on the groups to be replaced, and the array structure data comprise the isostere set information of each group to be replaced.

A11. The method of any one of A1-A10, wherein the QED is calculated by:

$$\mathrm{QED} = \exp\left(\frac{1}{n}\sum_{i=1}^{n}\ln d_i\right)$$

A12. The method of any one of A1-A11, further comprising the step of: storing the candidate molecules recommended to the user each time in a molecule library, so that candidate drugs can later be screened from the molecule library during drug screening.

A13. The method of any one of A1-A12, further comprising, after collecting structural data of a plurality of small molecule compounds with biological activity, the step of: removing erroneous structural data and the inorganic small molecule and inorganic salt structures contained in the structural data, and converting the processed structural data into a simplified molecular-input line-entry system (SMILES) representation.

By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.

In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.

As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense with respect to the scope of the invention, as defined in the appended claims.

Claims

1. A method of predicting potential active molecules, adapted to be executed in a computing device, comprising the steps of:

collecting structural data of a plurality of small molecule compounds with biological activity;

pairwise comparison is carried out on the small molecule compounds, and the common substructure and the remaining variable substructure of every two small molecule chemicals are identified, wherein the two variable substructures form a pair of isosteres;

carrying out duplicate removal treatment on all generated isostere pairs to obtain an isostere set of each group;

receiving a molecule to be optimized input by a user, and identifying a parent nucleus structure of the molecule and a group to be replaced except the parent nucleus structure;

acquiring an isostere set of the group to be replaced, and replacing the group to be replaced by each isostere in the set to obtain a plurality of candidate molecules; and

calculating the drug-like property of each candidate molecule, and selecting one or more candidate molecules with high drug-like property from the candidate molecules to recommend to the user.

2. The method of claim 1, wherein for any two small molecule compounds, the step of identifying a common substructure and remaining variable substructures for each two small molecule chemicals comprises:

identifying the largest common substructure of the two small molecule compounds, and extracting one or more common substructures from the largest common substructure;

for any common substructure, extracting the two variable substructures of the two small molecule compounds other than the common substructure to form an isostere pair based on the common substructure;

wherein the positions of the broken bonds between the common substructure and the variable substructures are the splicing sites of the corresponding isosteres.

3. The method of claim 1 or 2, wherein the step of identifying the common substructure and the remaining variable substructures of every two small molecule chemicals further comprises:

determining a broken bond between the common substructure and the variable substructure and extending a predetermined number of chemical bonds from the end of the broken bond towards the common substructure to extract the environmental structure of the two variable substructures.

4. The method of any one of claims 1-3, wherein the step of identifying a common substructure and remaining variable substructures for every two small molecule chemicals further comprises:

converting the identified isostere pairs and the environmental structure into a simplified molecular-input line-entry system (SMILES) representation, wherein the SMILES representation of an isostere pair is labeled with the splicing site of each isostere.

5. The method of any one of claims 2-4, wherein the extracting one or more common substructures within the largest common substructure comprises:

taking the maximum common substructure as a common substructure; or

breaking one or more chemical bonds within the largest common substructure to generate a plurality of fragmentation modes;

for each fragmentation mode, extracting fragment sets of the two small molecule compounds after fragmentation respectively, and taking a union set of all the same fragments in the two fragment sets as a common substructure of the two small molecule compounds.

6. The method of claim 5, wherein the broken chemical bond is an acyclic single bond.

7. The method of any one of claims 1-6, wherein the step of de-duplicating all generated isostere pairs comprises:

for any group, extracting every isostere pair containing the group together with the group's isostere in that pair, and de-duplicating all the extracted isosteres of the group.

8. An apparatus for predicting potential active molecules, adapted to reside in a computing device, comprising:

a data collection module adapted to collect structural data of a plurality of small molecule compounds having biological activity;

an isostere pair identification module adapted to compare the small molecule compounds pairwise and identify the common substructure and the remaining variable substructure of every two small molecule compounds, wherein the two variable substructures form an isostere pair;

an isostere library generation module adapted to de-duplicate all generated isostere pairs to obtain an isostere set of each group;

the molecule receiving module is suitable for receiving a molecule to be optimized input by a user and identifying a parent nucleus structure of the molecule and a group to be replaced except the parent nucleus structure;

the molecule replacement module is suitable for acquiring the isostere set of the group to be replaced and respectively replacing the group to be replaced by each isostere in the set to obtain a plurality of candidate molecules; and

a molecule recommendation module adapted to calculate the drug-like property of each candidate molecule, and to select one or more candidate molecules with high drug-like property from the candidate molecules to recommend to the user.

9. A computing device, comprising:

a memory;

one or more processors;

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-7.

10. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-7.