WO2021203865A9 - Method, apparatus, electronic device, and storage medium for detecting molecular binding sites - Google Patents

Method, apparatus, electronic device, and storage medium for detecting molecular binding sites

Info

Publication number
WO2021203865A9
Authority
WO
WIPO (PCT)
Prior art keywords
site
feature
location
layer
target
Prior art date
Application number
PCT/CN2021/078263
Other languages
English (en)
French (fr)
Other versions
WO2021203865A1 (zh)
Inventor
李贤芝
陈广勇
王平安
张胜誉
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2021545445A: JP7246813B2 (ja)
Priority to KR1020217028480A: KR102635777B1 (ko)
Priority to EP21759220.3A: EP3920188A4 (en)
Publication of WO2021203865A1 (zh)
Priority to US17/518,953: US20220059186A1 (en)
Publication of WO2021203865A9 (zh)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/20 Identification of molecular entities, parts thereof or of chemical compositions
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/69 Microscopic objects, e.g. biological cells or cellular parts
    • G06V20/695 Preprocessing, e.g. image segmentation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/80 Data visualisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Definitions

  • This application relates to the field of computer technology, and in particular to a method, device, electronic equipment and storage medium for detecting molecular binding sites.
  • The binding sites of a protein molecule are the positions on the molecule that bind to other molecules, commonly known as "protein binding pockets". Determining the binding sites of protein molecules is of great significance for analyzing protein structure and function. How to detect the binding sites in protein molecules accurately is therefore an important research direction.
  • The embodiments of the present application provide a molecular binding site detection method, device, electronic equipment, and storage medium that can improve the accuracy of the molecular binding site detection process.
  • The technical solution is as follows.
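  • In one aspect, a method for detecting molecular binding sites is provided. The method is applied to an electronic device and includes:
  • obtaining the three-dimensional coordinates of at least one site in the target molecule to be detected, where the target molecule is a chemical molecule whose binding sites are to be detected;
  • determining the first target point and the second target point corresponding to each site, where the first target point of a site is the center point of all sites contained in the target spherical space; the target spherical space is the spherical space centered on that site with the target length as its radius; and the second target point of a site is the intersection of the outer surface of the target spherical space with the forward extension of the vector that starts at the origin and points to the site;
  • extracting, based on the three-dimensional coordinates of the at least one site, the at least one first target point, and the at least one second target point, a location feature with rotation-invariant characteristics from the three-dimensional coordinates of the at least one site, where the location feature characterizes the location information of the at least one site in the target molecule;
  • calling the site detection model to perform prediction on the extracted location feature, obtaining at least one predicted probability for the at least one site, where each predicted probability characterizes the likelihood that a site is a binding site;
  • determining, based on the at least one predicted probability, the binding sites among the at least one site in the target molecule.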
  • In one aspect, a device for detecting molecular binding sites is provided, which includes:
  • an obtaining module, used to obtain the three-dimensional coordinates of at least one site in the target molecule to be detected, where the target molecule is a chemical molecule whose binding sites are to be detected;
  • a first determining module, used to determine the first target point and the second target point corresponding to each site, where the first target point of a site is the center point of all sites contained in the target spherical space; the target spherical space is the spherical space centered on that site with the target length as its radius; and the second target point of a site is the intersection of the outer surface of the target spherical space with the forward extension of the vector that starts at the origin and points to the site;
  • an extraction module, used to extract, based on the three-dimensional coordinates of the at least one site, the at least one first target point, and the at least one second target point, a location feature with rotation-invariant characteristics from the three-dimensional coordinates of the at least one site, where the location feature characterizes the location information of the at least one site in the target molecule;
  • a prediction module, used to call the site detection model to perform prediction on the extracted location feature, obtaining at least one predicted probability for the at least one site, where each predicted probability characterizes the likelihood that a site is a binding site;
  • a second determining module, configured to determine, based on the at least one predicted probability, the binding sites among the at least one site in the target molecule.
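  • The last step above (determining binding sites from the predicted probabilities) can be sketched as follows. This is only an illustration: the text here does not state the decision rule, so the fixed threshold, its default value of 0.5, and the function name `select_binding_sites` are all assumptions.

```python
from typing import List, Sequence


def select_binding_sites(probabilities: Sequence[float],
                         threshold: float = 0.5) -> List[int]:
    """Return the indices of sites whose predicted probability exceeds the threshold.

    The threshold rule and its 0.5 default are assumptions made for this sketch.
    """
    return [i for i, p in enumerate(probabilities) if p > threshold]


# One predicted probability per site, as produced by a site detection model.
predicted = [0.91, 0.12, 0.55, 0.49]
print(select_binding_sites(predicted))
```

  • Other rules, such as keeping a fixed number of top-scoring sites, would fit the same interface.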
  • In one aspect, an electronic device is provided. The electronic device includes one or more processors and one or more memories; the one or more memories store at least one piece of program code, which is loaded and executed by the one or more processors to implement the molecular binding site detection method in any of the above possible implementations.
  • In one aspect, a storage medium is provided. At least one piece of program code is stored in the storage medium, and the program code is loaded and executed by a processor to implement the molecular binding site detection method in any of the above possible implementations.
  • The first target point and the second target point corresponding to each site are determined. Based on the three-dimensional coordinates of each site, each first target point, and each second target point, location features with rotation-invariant characteristics are extracted from the three-dimensional coordinates of each site. The site detection model is called to perform prediction on the extracted location features, yielding the predicted probability that each site is a binding site, and the binding sites of the target molecule are then determined from the predicted probabilities.
  • Each first target point and each second target point is related to its site and is spatially representative to some degree. The three-dimensional coordinates of each site, each first target point, and each second target point can therefore be used to construct a location feature that fully reflects the detailed structure of the target molecule and is rotation invariant. This avoids the loss of detail caused by designing voxel features for the target molecule, so binding site detection based on the location feature can make full use of the position information of the target molecule's detailed structure, which improves the accuracy of the molecular binding site detection process.
  • FIG. 1 is a schematic diagram of an implementation environment of a method for detecting a molecular binding site provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a method for detecting molecular binding sites according to an embodiment of the present application.
  • FIG. 3 is a flowchart of a method for detecting a molecular binding site provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a first target point and a second target point provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a graph convolutional neural network provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an edge convolutional layer provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a molecular binding site detection device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • The term "at least one" means one or more, and "multiple" means two or more; for example, multiple first positions means two or more first positions.
  • Artificial Intelligence (AI) uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
  • In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies.
  • Artificial intelligence software technology mainly includes audio processing technology, computer vision technology, natural language processing technology, and machine learning/deep learning.
  • Machine Learning is an interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and many other subjects.
  • Machine learning technology specializes in the study of how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills, and reorganize the existing knowledge structure to continuously improve its own performance.
  • Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning usually include technologies such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
  • Binding sites are the sites on the current molecule that bind to other molecules, commonly known as "binding pockets" or "binding pocket sites".
  • Predicting the binding sites of protein molecules can also help in designing suitable drug molecules. Analyzing the role of protein molecules is of great value in treating various diseases: analyzing the structure and function of protein molecules can reveal the pathogenesis of certain diseases, and in turn guide the search for drug targets and the development of new drugs.
  • Predicting the binding sites of a protein molecule is therefore significant not only for revealing the structure and function of the protein molecule itself; by revealing that structure and function, it can further reveal the pathogenesis of certain diseases, and so guide the search for drug targets and the research and development of new drugs.
  • The molecular binding site detection method in the embodiments of this application is used to detect the binding sites of a target molecule, and the target molecule is not limited to the protein molecules mentioned above.
  • For example, the target molecule is an ATP (Adenosine Triphosphate) molecule.
  • The embodiments of the present application do not specifically limit the type of target molecule.
  • Protein binding pockets: the binding sites on a protein molecule that bind to other molecules.
  • Point cloud data: a collection of data points in a given coordinate system. The data for each point carries rich information, including the point's three-dimensional coordinates, color, intensity value, and time.
  • A three-dimensional laser scanner is usually used to collect point cloud data.
  • DCNN: Deep Convolutional Neural Network.
  • The structure of a DCNN includes an input layer, hidden layers, and an output layer.
  • The hidden layers usually include convolutional layers, pooling layers, and fully connected layers.
  • The function of a convolutional layer is to extract features from the input data.
  • A convolutional layer contains multiple convolution kernels, and each element of a kernel corresponds to a weight coefficient and a bias. After feature extraction in the convolutional layer, the output feature map is passed to the pooling layer for feature selection and filtering.
  • The fully connected layer is located at the end of the hidden layers of the convolutional neural network.
  • In the fully connected layer the feature map loses its spatial topology; it is flattened into a vector and passed through an activation function to the output layer.
  • The objects studied by DCNN must have a regular spatial structure, such as images and voxels.
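  • To illustrate what a "regular spatial structure" demands of molecular data, the sketch below bins a small point cloud into a voxel grid. It is a toy example, not any published voxel feature: the function name `voxelize`, the occupancy-count representation, and the cell size are assumptions. Sites that fall into the same cell become indistinguishable, which is the kind of detail loss attributed to voxel features elsewhere in this description.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def voxelize(points: List[Point], cell_size: float) -> Dict[Cell, int]:
    """Map each point to an integer grid cell and count the points per cell."""
    grid: Dict[Cell, int] = {}
    for x, y, z in points:
        cell = (math.floor(x / cell_size),
                math.floor(y / cell_size),
                math.floor(z / cell_size))
        grid[cell] = grid.get(cell, 0) + 1
    return grid


sites = [(0.1, 0.2, 0.3), (0.4, 0.9, 0.8), (2.5, 0.1, 0.0)]
grid = voxelize(sites, cell_size=1.0)
# The first two sites share one cell, so their individual positions are lost.
print(grid)
```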
  • GCN: Graph Convolutional Neural Network.
  • GCN is a deep learning method for graph data. A GCN constructs graph data with points and edges from the input data and uses multiple hidden layers to extract high-dimensional features for each point; these features encode the graph connections between the point and its surrounding points. The expected output is finally obtained through the output layer. GCN has achieved success in many tasks such as e-commerce recommendation systems, new drug development, and point cloud analysis.
  • GCN network structures include Spectral CNN (spectral convolutional neural network), Graph Attention Network, Graph Recurrent Attention Network, Dynamic Graph CNN (dynamic graph convolutional neural network, DGCNN), and others.
  • The traditional GCN is not rotation invariant.
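  • The idea of building a graph over points and letting each point's feature absorb information from its surrounding points can be sketched in a few lines. This is a simplified illustration, not the network described in this application: connecting each point to its k nearest neighbors and averaging scalar features without learned weights are both assumptions, and the names `knn_edges` and `aggregate` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def knn_edges(points: Sequence[Point], k: int) -> List[List[int]]:
    """For each point, list the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        neighbors.append(others[:k])
    return neighbors


def aggregate(features: Sequence[float],
              neighbors: List[List[int]]) -> List[float]:
    """One weight-free graph layer: average a point's feature with its neighbors'."""
    out = []
    for i, nbrs in enumerate(neighbors):
        values = [features[i]] + [features[j] for j in nbrs]
        out.append(sum(values) / len(values))
    return out


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
edges = knn_edges(points, k=1)
print(edges)
print(aggregate([1.0, 3.0, 5.0], edges))
```

  • A real graph convolutional layer would replace the plain average with learned transformations and stack several such layers.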
  • MLP: Multilayer Perceptron.
  • DCNN is used to detect the binding sites (protein binding pockets) of protein molecules.
  • DCNN has shown good performance in image and video analysis, recognition, and processing, so attempts have been made to migrate DCNN to the task of identifying protein binding pockets.
  • Although the traditional DCNN has achieved success in many tasks, the objects it studies must have a regular spatial structure, such as image pixels or molecular voxels. It cannot be applied directly to the many kinds of real-world data that lack a regular spatial structure (for example, protein molecules).
  • To migrate DCNN to the detection of protein binding pockets, technicians must manually design a feature with a regular spatial structure for the protein molecule and use it as the input of the DCNN.
  • The DeepSite network is the first DCNN proposed for detecting protein binding pockets.
  • Manually designed features (essentially a substructure) are extracted from the protein molecule, and a multi-layer convolutional neural network predicts whether the substructure of the input protein molecule belongs to a pocket binding site.
  • Technicians have also proposed a new feature extractor that extracts features from the shape of the protein molecule and the energy of the binding site; the output features are represented as 3D voxels (that is, voxel features) and input into the DCNN.
  • FRSite is also a DCNN that detects protein binding pockets.
  • DeepDrug3D is also a DCNN that detects protein binding pockets. It converts protein molecules directly into 3D voxels as the input of the DCNN to predict protein binding pockets.
  • The embodiments of the present application provide a method for detecting molecular binding sites, which is used to detect the binding sites of a target molecule.
  • Taking a protein molecule as the target molecule as an example, the method starts from the point cloud data of the protein molecule (including the three-dimensional coordinates), so the site detection model can fully explore the structural organization of the protein molecule and automatically extract the biological features that are most efficient and most useful for pocket detection. Protein binding pockets can therefore be identified accurately from the point cloud data of protein molecules.
  • The traditional graph convolutional neural network is not rotation invariant, whereas a protein molecule can rotate arbitrarily in three-dimensional space. If the adopted network structure is not rotation invariant, the pocket detection results for the same protein molecule before and after rotation may differ greatly, which greatly reduces the detection accuracy for protein binding pockets.
  • The three-dimensional coordinate points in the point cloud data of the protein molecule are therefore transformed into rotation-invariant representations (that is, location features) such as angles and lengths. The rotation-invariant location features replace the rotation-sensitive three-dimensional coordinate points as the system input, which makes the network structure of the site detection model rotation invariant: the detection result for the protein binding pocket does not change with the orientation of the input protein point cloud data. This is of breakthrough significance for the detection process.
  • Fig. 1 is a schematic diagram of an implementation environment of a method for detecting a molecular binding site provided by an embodiment of the present application.
  • The implementation environment includes a terminal 101 and a server 102, both of which are electronic devices.
  • The terminal 101 is used to provide the point cloud data of the target molecule.
  • For example, the terminal 101 is the control terminal of a three-dimensional laser scanner: the scanner collects data on the target molecule and exports the collected point cloud data to the control terminal, which generates a detection request carrying the point cloud data of the target molecule.
  • The detection request asks the server 102 to detect the binding sites of the target molecule. In response, the server 102 detects the binding sites based on the point cloud data of the target molecule, determines the binding sites of the target molecule, and returns them to the control terminal.
  • The control terminal can send the point cloud data of the entire target molecule to the server 102, which enables the server 102 to perform a more comprehensive analysis of the target molecule's structure.
  • Alternatively, because the point cloud data includes additional attributes such as color, intensity value, and time in addition to the three-dimensional coordinates of each site, the control terminal can send only the three-dimensional coordinates of each site to the server 102, which reduces the amount of data transmitted.
  • The terminal 101 and the server 102 are connected through a wired or wireless network.
  • The server 102 provides the molecular binding site detection service. After receiving a detection request from any terminal, the server 102 parses the request to obtain the point cloud data of the target molecule, extracts the rotation-invariant location feature of each site based on the three-dimensional coordinates of each site in the point cloud data, uses the location features as the input of the site detection model, performs binding site prediction, and obtains the binding sites of the target molecule.
  • The server 102 includes at least one of a single server, multiple servers, a cloud computing platform, or a virtualization center. In some embodiments, the server 102 undertakes the primary computing work and the terminal 101 the secondary computing work; or the server 102 undertakes the secondary computing work and the terminal 101 the primary computing work; or the terminal 101 and the server 102 compute collaboratively using a distributed computing architecture.
  • The description above takes as an example the case where the terminal 101 and the server 102 complete molecular binding site detection by communicating with each other.
  • The terminal 101 can also complete the detection of molecular binding sites independently.
  • In that case, after collecting the point cloud data of the target molecule, the terminal 101 directly executes the prediction process based on the site detection model, using the three-dimensional coordinates of each site in the point cloud data, to predict the binding sites of the target molecule. This is similar to the prediction process on the server 102 and is not repeated here.
  • The terminal 101 generally refers to one of multiple terminals.
  • The device type of the terminal 101 includes, but is not limited to, at least one of: a smartphone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer.
  • A smartphone is used as an example of the terminal.
  • The number of terminals 101 may be larger or smaller. For example, there may be only one terminal 101, or dozens or hundreds of terminals 101, or more. The embodiments of the present application do not limit the number or the device type of the terminals 101.
  • Fig. 2 is a flowchart of a method for detecting molecular binding sites provided by an embodiment of the present application. Referring to FIG. 2, the method is applied to an electronic device, and this embodiment includes the following steps.
  • The electronic device acquires the three-dimensional coordinates of at least one site in the target molecule to be detected, where the target molecule is a chemical molecule whose binding sites are to be detected.
  • The target molecule is any chemical molecule whose binding sites are to be detected, such as a protein molecule, an ATP (Adenosine Triphosphate) molecule, an organic polymer molecule, or a small organic molecule. The embodiments of the present application do not specifically limit the type of target molecule.
  • The three-dimensional coordinates of the at least one site are expressed as point cloud data: at least one three-dimensional coordinate point in a given coordinate system, taken together, describes the structure of the target molecule.
  • Compared with 3D voxels, point cloud data occupies less storage space. Because 3D voxels depend on a feature extraction method, some detailed structures of the target molecule are easily lost during feature extraction, whereas point cloud data can describe the detailed structure of the target molecule.
  • Three-dimensional coordinate points are very sensitive to rotation. Taking a protein molecule as an example, after the same protein point cloud is rotated, the three-dimensional coordinate values of every site change. If the three-dimensional coordinates of each site were input directly into the site detection model for feature extraction and binding site prediction, the same model might extract different biological features from the input before and after rotation, and so predict different binding sites. In other words, precisely because three-dimensional coordinate points are not rotation invariant, the site detection model could predict different binding sites for the same protein molecule before and after rotation, and the accuracy of the molecular binding site detection process could not be guaranteed.
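  • The rotation sensitivity described above is easy to check numerically. The following sketch rotates two sample points about the z-axis and compares quantities before and after; the sample coordinates, the choice of axis, and the helper names are arbitrary assumptions. The raw coordinates change, while a length and an angle between the two position vectors stay the same.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def rotate_z(p: Vec, theta: float) -> Vec:
    """Rotate a point about the z-axis by theta radians."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)


def norm(p: Vec) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in p))


def angle(a: Vec, b: Vec) -> float:
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    cos_ab = max(-1.0, min(1.0, dot / (norm(a) * norm(b))))
    return math.acos(cos_ab)


a, b = (1.0, 2.0, 3.0), (-2.0, 0.5, 1.0)
ra, rb = rotate_z(a, math.pi / 3), rotate_z(b, math.pi / 3)

print("coordinates changed:", a != ra)
print("length preserved:", math.isclose(norm(a), norm(ra)))
print("angle preserved:", math.isclose(angle(a, b), angle(ra, rb)))
```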
  • The electronic device separately determines the first target point and the second target point corresponding to each site. The first target point of any site is the center point of all sites included in a target spherical space, where the target spherical space is the spherical space with that site as the center of the sphere and a target length as the radius. The second target point of any site is the intersection of the forward extension line of the vector that starts at the origin and points to the site with the outer surface of the target spherical space.
  • Each site uniquely corresponds to one first target point and one second target point.
  • For a given site, the first target point is the center point of all sites of the target molecule contained in the target spherical space that takes the site as the center of the sphere and the target length as the radius. This center point is a point in space calculated as the average of the three-dimensional coordinates of all sites contained in the target spherical space, so the first target point is not necessarily an actual site in the point cloud data of the target molecule. The target length is any value greater than 0.
  • The second target point is the intersection of the forward extension line of the vector that starts at the origin and points to the site with the outer surface of the target spherical space. The origin is the origin of the three-dimensional coordinate system in which the target molecule is located. A vector is drawn from the origin to the site; its direction points from the origin to the site, and its length equals the modulus length of the site. The forward extension line of this vector has a unique intersection with the outer surface of the target spherical space, and this intersection is the second target point. Like the first target point, the second target point is not necessarily an actual site in the point cloud data of the target molecule.
  • Based on the three-dimensional coordinates of the at least one site, the at least one first target point, and the at least one second target point, the electronic device extracts a rotation-invariant location feature from the three-dimensional coordinates of the at least one site. The location feature is used to characterize the location information of the at least one site in the target molecule.
  • The location feature of each site is obtained from the three-dimensional coordinates of the site, its first target point, and its second target point, and it is not affected by the rotation angle of the target molecule. Using the location features instead of the three-dimensional coordinates as the input of the site detection model therefore avoids the loss of detection accuracy caused by the lack of rotation invariance of the three-dimensional coordinates described in step 201 above.
  • The electronic device calls the site detection model to perform prediction processing on the extracted location features, obtaining at least one predicted probability for the at least one site, where each predicted probability characterizes the possibility that one site is a binding site.
  • The site detection model is used to detect the binding sites of the target molecule. It is a classification model that handles the task of classifying whether each site in the target molecule is a binding site.
  • The site detection model includes a graph convolutional neural network or another deep learning network. The embodiments of the present application do not specifically limit the type of the site detection model.
  • The electronic device inputs the location feature of each site into the site detection model, and the site detection model predicts binding sites based on these location features. Within the site detection model, the biological features of the target molecule are first extracted from the location features of the sites, and binding site detection is then performed on those biological features to obtain the predicted probability of each site.
  • the electronic device determines a binding site within the at least one site in the target molecule based on the at least one predicted probability.
  • The electronic device determines the sites whose predicted probability is greater than a probability threshold as binding sites, or sorts the sites in descending order of predicted probability and determines the sites ranked within the top target number as binding sites.
  • The probability threshold is any value greater than or equal to 0 and less than or equal to 1, and the target number is any integer greater than or equal to 1. For example, when the target number is 3, the electronic device sorts the sites in descending order of predicted probability and determines the top 3 sites as binding sites.
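The two selection rules described above can be sketched in a few lines of Python. This is only an illustration: the function name `select_binding_sites` and its parameters `threshold` and `top_k` are hypothetical names chosen here, not terms from the application.

```python
def select_binding_sites(probs, threshold=None, top_k=None):
    """Return indices of predicted binding sites.

    Exactly one of `threshold` or `top_k` must be given (both names are
    illustrative and do not come from the source text).
    """
    if (threshold is None) == (top_k is None):
        raise ValueError("give exactly one of threshold or top_k")
    if threshold is not None:
        # Rule 1: keep every site whose probability exceeds the threshold.
        return [i for i, p in enumerate(probs) if p > threshold]
    # Rule 2: sort sites by descending probability and keep the first top_k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:top_k]


probs = [0.10, 0.85, 0.40, 0.92, 0.77]
print(select_binding_sites(probs, threshold=0.5))  # [1, 3, 4]
print(select_binding_sites(probs, top_k=3))        # [3, 1, 4]
```

The threshold rule returns a variable number of sites, while the ranking rule always returns the target number of sites (or fewer if the molecule has fewer sites).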
  • In the method provided in the embodiments of the present application, the three-dimensional coordinates of each site in the target molecule are obtained, the first target point and the second target point corresponding to each site are determined, and rotation-invariant location features are extracted from the three-dimensional coordinates of each site based on the three-dimensional coordinates of each site, each first target point, and each second target point. The site detection model is called to perform prediction on the extracted location features to obtain the predicted probability that each site is a binding site, and the binding sites of the target molecule are determined based on the predicted probabilities.
  • Because the location features are constructed from the three-dimensional coordinates of each site, each first target point, and each second target point, they fully reflect the detailed structure of the target molecule and are rotation invariant. This avoids the loss of detail caused by designing voxel features for the target molecule, so the location information of the detailed structure of the target molecule can be fully used when binding sites are detected from the location features, which improves the accuracy of the molecular binding site detection process.
  • FIG. 3 is a flowchart of a method for detecting molecular binding sites provided by an embodiment of the present application. Referring to FIG. 3, this embodiment is applied to an electronic device, and the description takes a terminal as an example of the electronic device. This embodiment includes the following steps.
  • The terminal obtains the three-dimensional coordinates of at least one site in the target molecule to be detected, where the target molecule is a chemical molecule whose binding sites are to be detected.
  • the above step 300 is similar to the above step 201, and will not be repeated here.
  • For a site in the target molecule, the terminal determines the first target point and the second target point corresponding to the site based on the three-dimensional coordinates of the site.
  • Each site uniquely corresponds to one first target point. The first target point is the center point of all sites included in the target spherical space, which is the spherical space with the site as the center of the sphere and the target length as the radius. This center point is a point in space obtained by averaging the three-dimensional coordinates of all sites included in the target spherical space, so the first target point is not necessarily a site that actually exists in the point cloud data of the target molecule. The target length is specified by a technician and is any value greater than 0.
  • Each site uniquely corresponds to one second target point. The second target point is the intersection of the forward extension line of the vector that starts at the origin and points to the site with the outer surface of the target spherical space. Taking the origin as the starting point, a vector pointing to the site is drawn; its direction points from the origin to the site, and its length equals the modulus length of the site. The forward extension line of this vector has a unique intersection with the outer surface of the target spherical space, and this intersection is the second target point. The second target point is likewise not necessarily a site that actually exists in the point cloud data of the target molecule.
  • In the process of determining the first target point and the second target point, the terminal first determines the target spherical space with the site as the center of the sphere and the target length as the radius, then determines, from the at least one site of the target molecule, all sites located in the target spherical space, and determines the center point of those sites as the first target point. In some embodiments, when determining the center point, the average of the three-dimensional coordinates of all sites located in the target spherical space is taken as the three-dimensional coordinates of the center point, that is, of the first target point. Further, the terminal determines the vector that starts at the origin and points to the site, and determines the intersection of the forward extension line of this vector with the outer surface of the target spherical space as the second target point.
  • FIG. 4 is a schematic diagram of a first target point and a second target point provided by an embodiment of the present application.
  • In one example, the point cloud data of a protein molecule contains the three-dimensional coordinates of $N$ ($N \geq 1$) sites, that is, it consists of $N$ three-dimensional coordinate points $\{p_1, p_2, \dots, p_N\}$ stacked together. The origin is origin(0, 0, 0); $p_i = (x_i, y_i, z_i)$ denotes the three-dimensional coordinates of the $i$-th site, where $x_i$, $y_i$, and $z_i$ are its coordinate values on the x, y, and z axes, and $i$ is an integer greater than or equal to 1 and less than or equal to $N$. The structure of the protein molecule can be described through this point cloud data.
  • For the $i$-th site, a target spherical space 401 is determined with $p_i$ as the center of the sphere and $r$ as the radius. The center point $m_i$ of all sites contained in the target spherical space 401 is determined as the first target point 402: the average $x$ coordinate of all sites contained in the target spherical space 401 is taken as the $x$ coordinate of $m_i$, and the $y$ and $z$ coordinates are obtained in the same way. A vector pointing to $p_i$ is drawn from the origin, and the intersection point $s_i$ of its forward extension line with the outer surface of the target spherical space 401 is determined as the second target point 403.
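The construction of the two target points can be illustrated with a short, standard-library-only Python sketch. The function names are hypothetical, and treating sites that lie exactly on the sphere surface as "contained" is an assumption made here; the text does not say how boundary sites are handled.

```python
import math


def first_target_point(sites, i, r):
    """Centroid m_i of all sites inside the ball of radius r centred on sites[i]."""
    p = sites[i]
    # The site itself is always inside its own ball, so `inside` is never empty.
    inside = [q for q in sites if math.dist(p, q) <= r]
    n = len(inside)
    return tuple(sum(q[k] for q in inside) / n for k in range(3))


def second_target_point(p, r):
    """Point s_i where the ray from the origin through p, extended past p, leaves the ball."""
    norm = math.hypot(*p)
    if norm == 0.0:
        raise ValueError("direction is undefined for a site located at the origin")
    # Move a distance r beyond p along the unit vector p / |p|.
    return tuple(c + r * c / norm for c in p)


sites = [(1.0, 0.0, 0.0), (1.0, 0.5, 0.0), (3.0, 0.0, 0.0)]
m0 = first_target_point(sites, 0, r=1.0)   # (1.0, 0.25, 0.0)
s0 = second_target_point(sites[0], r=1.0)  # (2.0, 0.0, 0.0)
print(m0, s0)
```

In this toy cloud the third site lies outside the ball around the first site, so the first target point is the average of the first two sites only.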
  • The terminal constructs a global location feature of the site based on the three-dimensional coordinates of the site, the first target point, and the second target point. The global location feature is used to characterize the spatial location information of the site in the target molecule.
  • In some embodiments, the global location feature includes at least one of: the modulus length of the site, the distance between the site and the first target point, the distance between the first target point and the second target point, the cosine value of a first included angle, or the cosine value of a second included angle. The first included angle is the angle formed between the first line segment and the second line segment, and the second included angle is the angle formed between the second line segment and the third line segment. The first line segment is formed between the site and the first target point, the second line segment is formed between the first target point and the second target point, and the third line segment is formed between the site and the second target point.
  • For example, the terminal obtains the modulus length of the site, the distance between the site and the first target point, the distance between the first target point and the second target point, the cosine value of the first included angle, and the cosine value of the second included angle, constructs a five-dimensional vector from these five items of data, and uses the five-dimensional vector as the global location feature of the site.
  • In other embodiments, the global location feature includes at least one of: the modulus length of the site, the distance between the site and the first target point, the distance between the first target point and the second target point, the first included angle, or the second included angle. That is, the cosine values of the first and second included angles are not taken, and the angles themselves are used directly as elements of the global location feature.
  • Based on the first target point 402 (denoted $m_i$) and the second target point 403 (denoted $s_i$) determined in step 301 above, the terminal acquires the following five items of data: the modulus length $dp_i = \sqrt{x_i^2 + y_i^2 + z_i^2}$ of the site $p_i$; the distance $dpm_i$ between the site $p_i$ and the first target point $m_i$; the distance $dsm_i$ between the first target point $m_i$ and the second target point $s_i$; the cosine value $\cos(\alpha_i)$ of the first included angle $\alpha_i$; and the cosine value $\cos(\beta_i)$ of the second included angle $\beta_i$.
  • The first included angle $\alpha_i$ is the angle formed between the first line segment and the second line segment, where the first line segment is formed between the site $p_i$ and the first target point $m_i$, and the second line segment is formed between the first target point $m_i$ and the second target point $s_i$. The second included angle $\beta_i$ is the angle formed between the second line segment and the third line segment, where the third line segment is formed between the site $p_i$ and the second target point $s_i$. The first included angle $\alpha_i$ and the second included angle $\beta_i$ are two interior angles of the triangle $\triangle m_i s_i p_i$.
  • These five items of data form the five-dimensional global location feature of the site $p_i$: $[dp_i; dpm_i; dsm_i; \cos(\alpha_i); \cos(\beta_i)]$.
  • The modulus length $dp_i$ of the site $p_i$ does not change when the molecule is rotated, so inputting the modulus length of $p_i$ into the site detection model in place of its three-dimensional coordinates solves the problem that three-dimensional coordinate points lack rotation invariance. The modulus length alone, however, only fixes the distance of $p_i$ from the origin, so the terminal additionally extracts four items of data, $[dpm_i; dsm_i; \alpha_i; \beta_i]$. Neither the distance quantities $dp_i$, $dpm_i$, and $dsm_i$ nor the angle quantities $\alpha_i$ and $\beta_i$ change when the protein molecule is rotated, so they are all rotation invariant.
  • The global location feature is used in place of the three-dimensional coordinates $(x_i, y_i, z_i)$ to indicate the position of the site $p_i$ in the point cloud space coordinate system; that is, based on the global location feature, the site $p_i$ can be accurately located in the point cloud space coordinate system. The global location feature therefore retains the location information of the site $p_i$ to the greatest extent while being rotation invariant.
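A minimal sketch of the five-dimensional global location feature, with a numerical check that it is unchanged by a rotation about the origin. The helper names are hypothetical. The check rotates the site and both target points together, which is justified because the target points are defined only through distances and the origin and therefore rotate with the molecule. The sketch assumes the first target point does not coincide with the site; if the site is the only site in its target spherical space, the angles are undefined and the code would divide by zero.

```python
import math


def cos_angle(vertex, a, b):
    """Cosine of the angle at `vertex` between segments vertex-a and vertex-b."""
    u = [x - v for x, v in zip(a, vertex)]
    w = [x - v for x, v in zip(b, vertex)]
    dot = sum(x * y for x, y in zip(u, w))
    return dot / (math.hypot(*u) * math.hypot(*w))


def global_feature(p, m, s):
    """[dp, dpm, dsm, cos(first angle), cos(second angle)] for site p, targets m and s."""
    origin = (0.0, 0.0, 0.0)
    return [
        math.dist(origin, p),  # modulus length of the site
        math.dist(p, m),       # site to first target point
        math.dist(m, s),       # first target point to second target point
        cos_angle(m, p, s),    # first angle: at m, between segments m-p and m-s
        cos_angle(s, m, p),    # second angle: at s, between segments s-m and s-p
    ]


def rotate_z(q, theta):
    """Rotate point q about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * q[0] - s * q[1], s * q[0] + c * q[1], q[2])


p, m, s = (1.0, 0.0, 0.0), (1.0, 0.25, 0.0), (2.0, 0.0, 0.0)
before = global_feature(p, m, s)
after = global_feature(*(rotate_z(q, 0.7) for q in (p, m, s)))
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(before, after)))  # True
```

The raw coordinates of the three points change under the rotation, while all five feature values agree up to floating-point rounding.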
  • Within a spherical space of radius 1, the distance quantities $dp_i$, $dpm_i$, and $dsm_i$ all lie between 0 and 1, while the first included angle $\alpha_i$ and the second included angle $\beta_i$ lie between 0 and $\pi$ ($\alpha_i, \beta_i \in [0, \pi]$). Taking the cosine of the two angles yields $\cos(\alpha_i)$ and $\cos(\beta_i)$ with values between -1 and 1, so the data input into the site detection model has a uniform value range, which gives the site detection model more stable training and prediction performance.
  • The terminal constructs at least one local location feature between the site and at least one neighborhood point of the site, based on the three-dimensional coordinates of the site, the first target point, the second target point, and the at least one neighborhood point. Each local location feature is used to characterize the relative location information between the site and one neighborhood point.
  • The neighborhood points of the site are the K sites in the target molecule closest to the site, where K is greater than or equal to 1. Alternatively, the neighborhood points are the sites contained in a target neighborhood of the site, for example a spherical or columnar neighborhood centered on the site. The embodiments of the present application do not limit this.
  • In some embodiments, for any neighborhood point, the local location feature between the site and the neighborhood point includes at least one of: the distance between the neighborhood point and the site, the distance between the neighborhood point and the first target point, the distance between the neighborhood point and the second target point, the cosine value of a third included angle, the cosine value of a fourth included angle, or the cosine value of a fifth included angle. The third included angle is the angle formed between the fourth line segment and the fifth line segment, the fourth included angle is the angle formed between the fifth line segment and the sixth line segment, and the fifth included angle is the angle formed between the sixth line segment and the fourth line segment. The fourth line segment is formed between the neighborhood point and the site, the fifth line segment is formed between the neighborhood point and the first target point, and the sixth line segment is formed between the neighborhood point and the second target point.
  • For example, for any neighborhood point, the terminal obtains the distance between the neighborhood point and the site, the distance between the neighborhood point and the first target point, the distance between the neighborhood point and the second target point, the cosine value of the third included angle, the cosine value of the fourth included angle, and the cosine value of the fifth included angle, constructs a six-dimensional vector from these six items of data, and uses the six-dimensional vector as one local location feature of the site. The same operation is performed for all neighborhood points to obtain the local location features of the site relative to all of its neighborhood points.
  • In other embodiments, the local location feature between the site and the neighborhood point includes at least one of: the distance between the neighborhood point and the site, the distance between the neighborhood point and the first target point, the distance between the neighborhood point and the second target point, the third included angle, the fourth included angle, or the fifth included angle. That is, the cosine values of the third, fourth, and fifth included angles are not taken, and the angles themselves are used directly as elements of the local location feature.
  • Based on the first target point 402 (denoted $m_i$) and the second target point 403 (denoted $s_i$) determined in step 301 above, assume that the $j$-th neighborhood point of the site $p_i$ is $p_{ij}$ ($j \geq 1$). A tetrahedron can then be constructed from the site $p_i$, the first target point $m_i$, the second target point $s_i$, and the neighborhood point $p_{ij}$. The side lengths of the tetrahedron include the distance $dpp_{ij}$ between the neighborhood point $p_{ij}$ and the site $p_i$ (the length of the fourth line segment), the distance $dpm_{ij}$ between the neighborhood point $p_{ij}$ and the first target point $m_i$ (the length of the fifth line segment), and the distance between the neighborhood point $p_{ij}$ and the second target point $s_i$ (the length of the sixth line segment).
  • Taking the cosine of the third, fourth, and fifth included angles gives three cosine values. These three cosine values, together with the three distances, form a six-dimensional vector that serves as the local location feature between the site $p_i$ and the neighborhood point $p_{ij}$.
  • The local location feature describes the relative positional relationship between the site $p_i$ and the neighborhood point $p_{ij}$ in the point cloud space coordinate system. Together, the global location feature and the local location features describe the location information of the site $p_i$ in the point cloud space coordinate system of the protein molecule more comprehensively and accurately.
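The six-dimensional local location feature can be sketched in the same way. The function names are hypothetical; all three angles are measured at the neighborhood point, since each of the fourth, fifth, and sixth line segments has the neighborhood point as one endpoint. The sketch assumes the neighborhood point is distinct from the site and from both target points.

```python
import math


def cos_angle(vertex, a, b):
    """Cosine of the angle at `vertex` between segments vertex-a and vertex-b."""
    u = [x - v for x, v in zip(a, vertex)]
    w = [x - v for x, v in zip(b, vertex)]
    dot = sum(x * y for x, y in zip(u, w))
    return dot / (math.hypot(*u) * math.hypot(*w))


def local_feature(p, m, s, q):
    """Six-dimensional local feature between site p and its neighborhood point q."""
    return [
        math.dist(q, p),     # fourth line segment: neighborhood point to site
        math.dist(q, m),     # fifth line segment: neighborhood point to first target point
        math.dist(q, s),     # sixth line segment: neighborhood point to second target point
        cos_angle(q, p, m),  # third angle: between fourth and fifth segments
        cos_angle(q, m, s),  # fourth angle: between fifth and sixth segments
        cos_angle(q, s, p),  # fifth angle: between sixth and fourth segments
    ]


p, m, s = (1.0, 0.0, 0.0), (1.0, 0.25, 0.0), (2.0, 0.0, 0.0)
q = (1.0, 0.5, 0.5)  # one neighborhood point of p (toy values)
feat = local_feature(p, m, s, q)
print(len(feat))  # 6
```

Like the global feature, every entry is a distance or an angle cosine between points that rotate together, so the vector does not depend on the orientation of the molecule.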
  • The terminal obtains the location feature of the site based on the global location feature and the at least one local location feature.
  • The terminal has obtained a five-dimensional global location feature and at least one six-dimensional local location feature. Each local location feature is spliced with the global location feature to obtain an eleven-dimensional location feature component, and the matrix formed by all the location feature components is determined as the location feature of the site.
  • In the above process, the terminal extracts the location feature of the site based on the three-dimensional coordinates of the site, the first target point, and the second target point. In some embodiments, the location feature is equivalent to the global location feature alone: after obtaining the global location feature in step 302, the terminal does not perform steps 303-304 and does not obtain local location features, and instead inputs the global location feature of each site directly into the site detection model. This simplifies the binding site detection method and reduces the amount of calculation in the binding site detection process.
  • For the site $p_i$, there are a corresponding first target point $m_i$, a second target point $s_i$, and K ($K \geq 1$) neighborhood points. Step 302 above extracts a 5-dimensional (5-dim) global location feature $[dp_i; dpm_i; dsm_i; \cos(\alpha_i); \cos(\beta_i)]$, and step 303 above extracts K 6-dimensional (6-dim) local location features, one for each of the K neighborhood points. Splicing each local location feature with the global location feature gives K 11-dimensional location feature components, which form a rotation-invariant $[K \times 11]$-dimensional location feature.
  • The location feature is expressed as follows:
  • $$\begin{bmatrix} G_i & L_{i1} \\ G_i & L_{i2} \\ \vdots & \vdots \\ G_i & L_{iK} \end{bmatrix}$$
  • The left side of the matrix is the global location feature $G_i$ of the site $p_i$, which indicates the position of $p_i$ in the point cloud space. The right side of the matrix holds the K local location features $L_{i1}, \dots, L_{iK}$ between $p_i$ and its K neighborhood points $p_{i1}, \dots, p_{iK}$, which indicate the relative positions between $p_i$ and those neighborhood points.
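Assembling the location feature of one site is a row-wise concatenation. The sketch below uses a hypothetical function name and made-up numeric values purely to show the resulting shape.

```python
def location_feature(global_feat, local_feats):
    """Splice the 5-dim global feature onto each 6-dim local feature: K rows of 11."""
    return [list(global_feat) + list(local) for local in local_feats]


G = [1.0, 0.25, 1.03, 0.24, 0.97]        # illustrative values only
L = [
    [0.5, 0.25, 1.1, 1.0, 0.4, 0.4],     # neighborhood point 1
    [0.7, 0.6, 1.2, 0.3, 0.5, 0.2],      # neighborhood point 2
]
F = location_feature(G, L)
print(len(F), len(F[0]))  # 2 11
```

Every row starts with the same global feature, so the matrix carries both the absolute position of the site and its position relative to each neighborhood point.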
  • the terminal repeats the foregoing steps 301 to 304 on at least one site in the target molecule to obtain the location feature of the at least one site.
  • In the above process, based on the three-dimensional coordinates of the at least one site, the at least one first target point, and the at least one second target point, the terminal extracts rotation-invariant location features from the three-dimensional coordinates of the at least one site. The location features characterize the location information of the at least one site in the target molecule. In other words, from the three-dimensional coordinates of each site the terminal constructs a location feature that fully characterizes the location information of the site and is rotation invariant, and that therefore has high feature expression capability.
  • The terminal inputs the location feature of the at least one site into the input layer of the graph convolutional neural network, and the input layer outputs the graph data of the at least one site. The graph data represents the location features of the sites in the form of a graph.
  • Here the site detection model is described taking a graph convolutional neural network as an example. The graph convolutional neural network includes an input layer, at least one edge convolution (EdgeConv) layer, and an output layer. The input layer is used to extract the graph data of each site, the at least one edge convolution layer is used to extract the global biological features of the sites, and the output layer is used to perform feature fusion and probability prediction.
  • In some embodiments, the input layer of the graph convolutional neural network includes a multi-layer perceptron and a pooling layer. The terminal inputs the location feature of the at least one site into the multi-layer perceptron of the input layer, which maps the location feature to a first feature of the at least one site, the dimension of the first feature being greater than that of the location feature. The first feature is then input into the pooling layer of the input layer, which reduces its dimensionality to obtain the graph data of the at least one site.
  • The pooling layer is either a max pooling layer, which applies a max pooling operation to the first feature, or an average pooling layer, which applies an average pooling operation to the first feature. The embodiments of the present application do not specifically limit the type of the pooling layer.
  • In this process, the multi-layer perceptron maps the input location feature to the output first feature, which is equivalent to raising the dimension of the location feature and extracting a high-dimensional first feature. Reducing the dimensionality of the first feature through the pooling layer is equivalent to filtering and selecting the first feature: some unimportant information is filtered out, and the graph data is obtained.
  • FIG. 5 is a schematic diagram of the principle of a graph convolutional neural network provided by an embodiment of the present application. Referring to FIG. 5, assume that $[N \times 3]$-dimensional point cloud data 500 of a protein molecule is given. A rotation-invariant representation extractor (similar to step 301 above) converts the point cloud data into an $[N \times K \times 11]$-dimensional rotation-invariant representation 501, which is the location feature of each site. A multi-layer perceptron then extracts the $[N \times K \times 32]$-dimensional first feature 502 from the representation 501, and a max pooling layer applies max pooling to the first feature 502 along the K dimension, converting it into $[N \times 32]$-dimensional graph data 503.
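The shape flow of the input layer, from $[N \times K \times 11]$ to $[N \times K \times 32]$ to $[N \times 32]$, can be traced with a small pure-Python sketch. It makes several simplifying assumptions that are not in the source: the multi-layer perceptron is reduced to a single shared linear layer with ReLU, the weights are random placeholders rather than trained values, and the function names are hypothetical.

```python
import random


def linear_relu(row, weights, bias):
    """One shared perceptron layer: ReLU(W @ row + b)."""
    return [max(0.0, sum(w * x for w, x in zip(w_row, row)) + b)
            for w_row, b in zip(weights, bias)]


def input_layer(features, weights, bias):
    """[N][K][11] location features -> [N][32] graph data."""
    graph_data = []
    for site_rows in features:  # the K rows belonging to one site
        lifted = [linear_relu(r, weights, bias) for r in site_rows]  # [K][32]
        # Max pooling along the K (neighborhood) axis, channel by channel.
        graph_data.append([max(col) for col in zip(*lifted)])
    return graph_data


random.seed(0)
N, K, D_IN, D_OUT = 4, 3, 11, 32
weights = [[random.uniform(-1, 1) for _ in range(D_IN)] for _ in range(D_OUT)]
bias = [0.0] * D_OUT
features = [[[random.random() for _ in range(D_IN)] for _ in range(K)]
            for _ in range(N)]
graph_data = input_layer(features, weights, bias)
print(len(graph_data), len(graph_data[0]))  # 4 32
```

Because the same weights are applied to every row and the pooling is a channel-wise maximum, the output for a site does not depend on the order in which its K neighborhood points are listed.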
  • the terminal inputs the graph data of at least one location into the at least one edge convolution layer in the graph convolutional neural network, and performs feature extraction on the graph data of the at least one location through the at least one edge convolution layer to obtain the The global biological characteristics of at least one site.
  • the terminal executes the following sub-steps 3071-3074.
  • For each edge convolution layer in the at least one edge convolution layer, the terminal uses the edge convolution layer to perform feature extraction on the edge convolution feature output by the previous edge convolution layer, and inputs the extracted edge convolution feature into the next edge convolution layer.
  • In some embodiments, each edge convolution layer includes a multi-layer perceptron and a pooling layer. For any edge convolution layer, a cluster graph is constructed from the edge convolution feature output by the previous edge convolution layer; the cluster graph is input into the multi-layer perceptron of the edge convolution layer, which maps the cluster graph to an intermediate feature of the cluster graph; the intermediate feature is input into the pooling layer of the edge convolution layer, which reduces its dimensionality; and the dimension-reduced intermediate feature is input into the next edge convolution layer.
  • In constructing the cluster graph, the KNN (k-Nearest Neighbor) algorithm is applied to the edge convolution feature output by the previous edge convolution layer, in which case the resulting cluster graph is called a KNN graph. The K-means algorithm can also be used to construct the cluster graph. The embodiments of the present application do not specifically limit the method of constructing the cluster graph.
  • The pooling layer is either a max pooling layer, which applies a max pooling operation to the intermediate feature, or an average pooling layer, which applies an average pooling operation to the intermediate feature. The embodiments of the present application do not specifically limit the type of the pooling layer.
  • FIG. 6 is a schematic structural diagram of an edge convolution layer provided by an embodiment of the present application. Referring to FIG. 6, in any edge convolution layer, a cluster graph (KNN graph) is created by the KNN algorithm from the $[N \times C]$-dimensional edge convolution feature 601 output by the previous edge convolution layer. The multi-layer perceptron (MLPs) extracts high-dimensional features from the cluster graph, mapping the $[N \times C]$-dimensional edge convolution feature 601 to an $[N \times K \times C']$-dimensional intermediate feature 602. The pooling layer reduces the dimensionality of the intermediate feature 602 to obtain an $[N \times C']$-dimensional edge convolution feature 603 (the dimension-reduced intermediate feature), which is input into the next edge convolution layer.
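One edge convolution layer can be sketched as KNN graph construction, a shared perceptron over the edges, and max pooling over the K edges of each site. The text does not specify what is fed to the perceptron for each edge; the sketch assumes the common choice of concatenating the center feature with the offset to the neighbor, and again uses a single linear layer with ReLU and random placeholder weights. All names are hypothetical.

```python
import math
import random


def knn_indices(feats, k):
    """For each row, the indices of its k nearest other rows (Euclidean distance)."""
    result = []
    for i, fi in enumerate(feats):
        others = sorted((j for j in range(len(feats)) if j != i),
                        key=lambda j: math.dist(fi, feats[j]))
        result.append(others[:k])
    return result


def edge_conv(feats, k, weights, bias):
    """[N][C] -> KNN graph -> shared layer on edges [N][K][C'] -> max pool [N][C']."""
    neighbors = knn_indices(feats, k)
    out = []
    for i, fi in enumerate(feats):
        edge_feats = []
        for j in neighbors[i]:
            # Assumed edge input: center feature followed by the offset to neighbor j.
            edge = fi + [b - a for a, b in zip(fi, feats[j])]
            edge_feats.append(
                [max(0.0, sum(w * x for w, x in zip(w_row, edge)) + bb)
                 for w_row, bb in zip(weights, bias)])
        # Pool over the K edges of site i, channel by channel.
        out.append([max(col) for col in zip(*edge_feats)])
    return out


random.seed(0)
N, C, C_OUT, K = 6, 32, 64, 3
feats = [[random.random() for _ in range(C)] for _ in range(N)]
weights = [[random.uniform(-1, 1) for _ in range(2 * C)] for _ in range(C_OUT)]
bias = [0.0] * C_OUT
out = edge_conv(feats, K, weights, bias)
print(len(out), len(out[0]))  # 6 64
```

The neighbors are found in feature space rather than in the original coordinate space, so the graph is rebuilt from whatever the previous layer produced.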
  • The terminal performs the above operations for each edge convolution layer in the at least one edge convolution layer, with the edge convolution feature output by the previous edge convolution layer serving as the input of the next edge convolution layer. Passing through the at least one edge convolution layer is thus equivalent to performing a series of higher-dimensional feature extractions on the graph data of the at least one site.
  • In one example, referring to FIG. 5 and taking a graph convolutional neural network with two edge convolution layers as an example, the terminal inputs the $[N \times 32]$-dimensional graph data 503 into the first edge convolution layer, which outputs the $[N \times 64]$-dimensional edge convolution feature 504. The terminal then inputs the edge convolution feature 504 into the second edge convolution layer, which outputs the $[N \times 128]$-dimensional edge convolution feature 505, and the following step 3072 is executed.
  • The terminal splices the graph data of the at least one site with the at least one edge convolution feature output by the at least one edge convolution layer to obtain a second feature.
  • Splicing the graph data of each site with the edge convolution feature output by each edge convolution layer is equivalent to obtaining residual features of the at least one edge convolution layer. In the process of extracting the global biological feature, this takes into account not only the edge convolution feature output by the last edge convolution layer, but also the initially input graph data of each site and the edge convolution feature output by each intermediate edge convolution layer, which helps improve the expressive ability of the global biological feature.
  • Splicing here means connecting the graph data and the edge convolution features output by the edge convolution layers directly along the feature dimension. For example, if there is one edge convolution layer, the $[N \times 32]$-dimensional graph data and the $[N \times 64]$-dimensional edge convolution feature are spliced to obtain an $[N \times 96]$-dimensional second feature.
  • In one example, referring to FIG. 5 and taking a graph convolutional neural network with two edge convolution layers as an example, the terminal splices the $[N \times 32]$-dimensional graph data 503, the $[N \times 64]$-dimensional edge convolution feature 504 output by the first edge convolution layer, and the $[N \times 128]$-dimensional edge convolution feature 505 output by the second edge convolution layer to obtain an $[N \times 224]$-dimensional second feature.
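The splicing step is a per-site concatenation along the feature dimension, so the widths simply add up (32 + 64 + 128 = 224). A minimal sketch with constant placeholder values, where the function name `splice` is a hypothetical choice:

```python
from itertools import chain


def splice(*tables):
    """Concatenate several [N][C_k] feature tables along the channel axis."""
    return [list(chain.from_iterable(rows)) for rows in zip(*tables)]


N = 5
graph_data = [[0.1] * 32 for _ in range(N)]    # output of the input layer
edge_feat_1 = [[0.2] * 64 for _ in range(N)]   # first edge convolution layer
edge_feat_2 = [[0.3] * 128 for _ in range(N)]  # second edge convolution layer
second_feature = splice(graph_data, edge_feat_1, edge_feat_2)
print(len(second_feature), len(second_feature[0]))  # 5 224
```

The number of sites N is unchanged; only the per-site feature width grows.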
  • the terminal inputs the second feature into the multi-layer perceptron, and maps the second feature through the multi-layer perceptron to obtain the third feature.
  • the process of feature mapping by the terminal through the multi-layer perceptron is similar to the process of feature mapping through the multi-layer perceptron in the previous steps, and will not be repeated here.
  • the terminal inputs the third feature to the pooling layer, and reduces the dimensionality of the third feature through the pooling layer to obtain the global biological feature.
  • the pooling layer is a maximum pooling layer (max pooling layer), in which a maximum pooling operation is performed on the third feature, or an average pooling layer, in which an average pooling operation is performed on the third feature; the embodiment of the present application does not specifically limit the type of the pooling layer.
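  • The mapping and pooling steps can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation: the multi-layer perceptron is reduced to a single layer with a ReLU activation and random weights, the pooling is assumed to run across the N sites, and the 1024-dimensional output is chosen only to match the global biological feature 506 of FIG. 5.

```python
import random


def mlp_layer(rows, weights, bias):
    """Apply one perceptron layer, relu(x @ W + b), to every row."""
    columns = list(zip(*weights))  # one weight column per output unit
    return [
        [max(0.0, sum(x * w for x, w in zip(row, col)) + b)
         for col, b in zip(columns, bias)]
        for row in rows
    ]


def max_pool(rows):
    """Keep the largest value of each feature column across all sites."""
    return [max(col) for col in zip(*rows)]


def average_pool(rows):
    """Average each feature column across all sites."""
    return [sum(col) / len(col) for col in zip(*rows)]


rng = random.Random(0)
N, D_IN, D_OUT = 4, 224, 1024  # hypothetical sizes
second_feature = [[rng.random() for _ in range(D_IN)] for _ in range(N)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(D_OUT)] for _ in range(D_IN)]
bias = [0.0] * D_OUT

third_feature = mlp_layer(second_feature, weights, bias)  # [N x 1024]
global_feature = max_pool(third_feature)                  # [1 x 1024]
print(len(third_feature), len(third_feature[0]), len(global_feature))
```

  • Swapping `max_pool` for `average_pool` gives the average pooling variant; either way the result no longer depends on the order in which the sites are listed.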
  • the terminal fuses the global biological feature, the graph data of the at least one site, and the edge convolution feature output by the at least one edge convolution layer, and inputs the feature obtained by the fusion into the output layer of the graph convolutional neural network; the output layer performs probability fitting on the feature obtained by the fusion to obtain at least one predicted probability.
  • a predicted probability is used to characterize the possibility that a site belongs to a binding site.
  • the fused feature is input to the multi-layer perceptron in the output layer and mapped by the multi-layer perceptron to obtain the at least one predicted probability.
  • the mapping process of the multi-layer perceptron is similar to the mapping process of the multi-layer perceptron in the previous steps, and will not be repeated here.
  • the terminal fuses the global biological feature, the graph data of each site, and the edge convolution feature output by each edge convolution layer, and finally uses a multi-layer perceptron to perform probability fitting on the fused feature to obtain the predicted probability that each site belongs to the binding site.
  • the above fusion process directly splices the global biological feature, the graph data of each site, and the edge convolution feature output by each edge convolution layer.
  • In FIG. 5, taking a graph convolutional neural network that includes two edge convolution layers as an example, the terminal splices the [N×32]-dimensional graph data 503, the [N×64]-dimensional edge convolution feature 504 output by the first edge convolution layer, the [N×128]-dimensional edge convolution feature 505 output by the second edge convolution layer, and the [1×1024]-dimensional global biological feature 506 to obtain a [1×1248]-dimensional fusion feature 507, inputs the [1×1248]-dimensional fusion feature 507 into the multi-layer perceptron MLPs, and uses the MLPs to fit, for each site, the predicted probability that the site belongs to the binding site. The final detection result is an [N×1]-dimensional array 508, and each value in the array 508 represents the predicted probability that one site belongs to the binding site.
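  • One way to read the fusion and probability fitting is sketched below. It assumes that the single 1024-dimensional global biological feature is appended to the features of every site, so that each site gets its own 1248-dimensional fused row; the text itself only gives the per-row size. The single linear layer with a sigmoid is a stand-in for the MLPs, and all weights and inputs are random placeholders.

```python
import math
import random


def fuse(graph_data, edge_features, global_feature):
    """Build one fused row per site: graph data + edge features + global feature."""
    fused = []
    for i, row in enumerate(graph_data):
        fused_row = list(row)
        for feature in edge_features:
            fused_row.extend(feature[i])
        fused_row.extend(global_feature)  # same global vector for every site
        fused.append(fused_row)
    return fused


def fit_probabilities(fused, weights, bias):
    """Stand-in for the output-layer MLPs: linear map followed by a sigmoid."""
    return [
        1.0 / (1.0 + math.exp(-(sum(x * w for x, w in zip(row, weights)) + bias)))
        for row in fused
    ]


rng = random.Random(0)
N = 6  # hypothetical number of sites


def rand_rows(n, d):
    return [[rng.random() for _ in range(d)] for _ in range(n)]


graph_data = rand_rows(N, 32)            # 503
edge_1 = rand_rows(N, 64)                # 504
edge_2 = rand_rows(N, 128)               # 505
global_feature = rand_rows(1, 1024)[0]   # 506

fused = fuse(graph_data, [edge_1, edge_2], global_feature)
weights = [rng.uniform(-0.01, 0.01) for _ in range(1248)]
probabilities = fit_probabilities(fused, weights, 0.0)  # one value per site (508)
print(len(fused[0]), len(probabilities))  # 1248 6
```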
  • this task since it is necessary to predict whether each site in the input protein molecule is a binding site, this task
  • the above describes the process in which the terminal calls the site detection model to perform prediction processing on the extracted location features to obtain at least one predicted probability of the at least one site.
  • the site detection model may also be another deep learning network; the embodiment of the present application does not specifically limit the type of the site detection model.
  • the terminal determines a binding site in the at least one site in the target molecule based on the at least one predicted probability.
  • the terminal determines a site whose predicted probability is greater than the probability threshold as a binding site, or the terminal sorts the sites in descending order of predicted probability and determines the sites ranked within the top target number as binding sites.
  • the probability threshold is any value greater than or equal to 0 and less than or equal to 1, and the target number is any integer greater than or equal to 1. For example, when the target number is 3, the electronic device sorts the sites in descending order of predicted probability and determines the 3 top-ranked sites as binding sites.
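  • Both selection rules can be sketched in a few lines. The function names and the example probabilities are hypothetical; only the threshold rule and the top-ranked rule come from the text.

```python
def sites_above_threshold(probabilities, threshold):
    """Indices of sites whose predicted probability exceeds the threshold."""
    return [i for i, p in enumerate(probabilities) if p > threshold]


def top_ranked_sites(probabilities, target_number):
    """Indices of the target_number sites with the highest predicted probability."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: probabilities[i], reverse=True)
    return order[:target_number]


probabilities = [0.10, 0.85, 0.40, 0.92, 0.67]  # made-up values
print(sites_above_threshold(probabilities, 0.5))  # [1, 3, 4]
print(top_ranked_sites(probabilities, 3))         # [3, 1, 4]
```

  • The threshold rule can return any number of sites, including none, whereas the ranking rule always returns exactly the target number (or all sites, if there are fewer).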
  • the method provided in the embodiments of the present application obtains the three-dimensional coordinates of each site in the target molecule, determines the first target point and the second target point corresponding to each site, extracts, based on the three-dimensional coordinates of each site, each first target point, and each second target point, location features with rotation invariance from the three-dimensional coordinates of each site, calls the site detection model to perform prediction processing on the extracted location features to obtain the predicted probability of whether each site belongs to the binding site, and determines the binding site of the target molecule based on the predicted probability.
  • because location features that fully reflect the detailed structure of the target molecule and are rotation-invariant are constructed from the three-dimensional coordinates of each site, each first target point, and each second target point, the loss of detail caused by designing voxel features for the target molecule is avoided. The position information of the detailed structure of the target molecule can therefore be fully used when the binding site is detected based on the location features, which improves the accuracy of the molecular binding site detection process.
  • the graph convolutional neural network from deep learning is used to extract the biological features of protein molecules, instead of having a technician manually design voxel features as biological features, so that biological features with higher expressive ability can be obtained and better binding site recognition accuracy can be achieved. The prediction of binding sites can be completed on a GPU (Graphics Processing Unit), which can meet real-time detection requirements. In addition, because the location features of each site are rotation-invariant, stable prediction results can still be generated through the graph convolutional neural network even when the protein molecule is rotated, which improves the accuracy and stability of the entire binding site detection process.
  • FIG. 7 is a schematic structural diagram of a molecular binding site detection device provided by an embodiment of the present application. Referring to FIG. 7, the device includes an acquisition module 701, a first determination module 702, an extraction module 703, a prediction module 704, and a second determination module 705.
  • the acquisition module 701 is configured to obtain the three-dimensional coordinates of at least one site in the target molecule to be detected, the target molecule being a chemical molecule whose binding site is to be detected;
  • the first determination module 702 is configured to determine the first target point and the second target point corresponding to each site, wherein the first target point of any site is the center point of all sites included in the target spherical space, the target spherical space is a spherical space with the site as the center of the sphere and the target length as the radius, and the second target point of any site is the intersection of the outer surface of the target spherical space with the forward extension line of the vector that starts at the origin and points to the site;
  • the extraction module 703 is configured to extract, based on the three-dimensional coordinates of the at least one site, at least one first target point, and at least one second target point, a location feature with rotation invariance from the three-dimensional coordinates of the at least one site, the location feature being used to characterize the location information of the at least one site in the target molecule;
  • the prediction module 704 is configured to call the site detection model to perform prediction processing on the extracted location feature to obtain at least one predicted probability of the at least one site, where one predicted probability is used to characterize the possibility that a site belongs to a binding site;
  • the second determination module 705 is configured to determine a binding site within the at least one site in the target molecule based on the at least one predicted probability.
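  • The two target points handled by the first determination module can be sketched geometrically. The sketch assumes that "center point" means the arithmetic mean (centroid) of the sites inside the sphere, that the site itself is one of the listed sites, and that the site does not lie at the origin; the function names and coordinates are illustrative.

```python
import math


def first_target_point(site, sites, radius):
    """Centroid of all sites lying inside the sphere of the given radius around site."""
    inside = [p for p in sites if math.dist(p, site) <= radius]
    return tuple(sum(c) / len(inside) for c in zip(*inside))


def second_target_point(site, radius):
    """Point where the ray from the origin through site leaves the sphere around site."""
    norm = math.hypot(*site)
    if norm == 0:
        raise ValueError("the direction from the origin is undefined for a site at the origin")
    return tuple(c + radius * c / norm for c in site)


sites = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
site, radius = sites[1], 1.5  # hypothetical target length

print(first_target_point(site, sites, radius))  # centroid of the three nearby sites
print(second_target_point(site, radius))        # (3.5, 0.0, 0.0)
```

  • The far site (5, 5, 5) falls outside the sphere and does not affect the first target point, so the first target point summarizes only the local neighborhood of the site.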
  • the device provided in the embodiment of the present application obtains the three-dimensional coordinates of each site in the target molecule, determines the first target point and the second target point corresponding to each site, extracts, based on the three-dimensional coordinates of each site, each first target point, and each second target point, location features with rotation invariance from the three-dimensional coordinates of each site, calls the site detection model to perform prediction processing on the extracted location features to obtain the predicted probability of whether each site belongs to the binding site, and determines the binding site of the target molecule based on the predicted probability.
  • because location features that fully reflect the detailed structure of the target molecule and are rotation-invariant are constructed from the three-dimensional coordinates of each site, each first target point, and each second target point, the loss of detail caused by designing voxel features for the target molecule is avoided. The position information of the detailed structure of the target molecule can therefore be fully used when the binding site is detected based on the location features, which improves the accuracy of the molecular binding site detection process.
  • the extraction module 703 includes:
  • the extraction unit is configured to, for any site in the at least one site, extract, based on the three-dimensional coordinates of the site, the first target point corresponding to the site, and the second target point corresponding to the site, a location feature with rotation invariance from the three-dimensional coordinates of the site.
  • the extraction unit is used to:
  • a local location feature is used to characterize the relative location information between the site and a neighboring point;
  • the location feature of the site is acquired.
  • the global location feature includes at least one of: the modulus length of the site, the distance between the site and the first target point, the distance between the first target point and the second target point, the cosine value of the first included angle, or the cosine value of the second included angle, wherein the first included angle is the angle formed between the first line segment and the second line segment, the second included angle is the angle formed between the second line segment and the third line segment, the first line segment is the line segment formed between the site and the first target point, the second line segment is the line segment formed between the first target point and the second target point, and the third line segment is the line segment formed between the site and the second target point.
  • the local location feature between the site and the neighborhood point includes at least one of: the distance between the neighborhood point and the site, the distance between the neighborhood point and the first target point, the distance between the neighborhood point and the second target point, the cosine value of the third included angle, the cosine value of the fourth included angle, or the cosine value of the fifth included angle, wherein the third included angle is the angle formed between the fourth line segment and the fifth line segment, the fourth included angle is the angle formed between the fifth line segment and the sixth line segment, the fifth included angle is the angle formed between the sixth line segment and the fourth line segment, the fourth line segment is the line segment formed between the neighborhood point and the site, the fifth line segment is the line segment formed between the neighborhood point and the first target point, and the sixth line segment is the line segment formed between the neighborhood point and the second target point.
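  • The distances and cosines listed above can be computed directly from coordinates, and a quick check shows why they survive rotation. The sketch assumes that each included angle is measured at the endpoint shared by its two line segments and that a degenerate (zero-length) segment yields a cosine of 0; the point values and function names are illustrative.

```python
import math


def _cos_at(vertex, a, b):
    """Cosine of the angle at vertex between the segments vertex-a and vertex-b."""
    u = [x - v for x, v in zip(a, vertex)]
    w = [x - v for x, v in zip(b, vertex)]
    nu, nw = math.hypot(*u), math.hypot(*w)
    if nu == 0 or nw == 0:
        return 0.0  # assumption: degenerate segment
    return sum(x * y for x, y in zip(u, w)) / (nu * nw)


def global_location_feature(p, m, s):
    """p: site, m: first target point, s: second target point."""
    return [math.hypot(*p), math.dist(p, m), math.dist(m, s),
            _cos_at(m, p, s),   # first included angle (segments p-m and m-s)
            _cos_at(s, m, p)]   # second included angle (segments m-s and p-s)


def local_location_feature(q, p, m, s):
    """q: neighborhood point of site p."""
    return [math.dist(q, p), math.dist(q, m), math.dist(q, s),
            _cos_at(q, p, m),   # third included angle
            _cos_at(q, m, s),   # fourth included angle
            _cos_at(q, s, p)]   # fifth included angle


def rotate_z(point, angle):
    """Rotate a point about the z axis through the origin."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


p, m, s, q = (2.0, 0.0, 0.0), (4 / 3, 1 / 3, 0.0), (3.5, 0.0, 0.0), (1.0, 1.0, 0.0)
before = global_location_feature(p, m, s) + local_location_feature(q, p, m, s)
rp, rm, rs, rq = (rotate_z(x, 0.7) for x in (p, m, s, q))
after = global_location_feature(rp, rm, rs) + local_location_feature(rq, rp, rm, rs)
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(before, after)))  # True
```

  • Raw coordinates change under the rotation, but every entry here is a length or an angle between points that rotate together, so the feature vector is unchanged up to floating-point error.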
  • the site detection model is a graph convolutional neural network, and the graph convolutional neural network includes an input layer, at least one edge convolution layer, and an output layer;
  • the prediction module 704 includes:
  • the input and output unit is configured to input the location feature of the at least one site into the input layer of the graph convolutional neural network and output the graph data of the at least one site through the input layer, the graph data being used to represent the location feature of the site in the form of a graph;
  • the feature extraction unit is configured to input the graph data of the at least one site into the at least one edge convolution layer in the graph convolutional neural network, and perform feature extraction on the graph data of the at least one site through the at least one edge convolution layer to obtain the global biological feature of the at least one site;
  • the probability fitting unit is configured to fuse the global biological feature, the graph data of the at least one site, and the edge convolution feature output by the at least one edge convolution layer, input the fused feature into the output layer of the graph convolutional neural network, and perform probability fitting on the fused feature through the output layer to obtain the at least one predicted probability.
  • the input and output unit is used to:
  • the location feature of the at least one site is input into the multi-layer perceptron in the input layer, and the location feature of the at least one site is mapped through the multi-layer perceptron to obtain the first feature of the at least one site.
  • the dimension of the first feature is greater than the dimension of the location feature;
  • the first feature of the at least one site is input into the pooling layer in the input layer, and the dimensionality of the first feature of the at least one site is reduced through the pooling layer to obtain the graph data of the at least one site.
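  • The input layer can be sketched as a perceptron layer followed by pooling. The data layout is an assumption made for illustration: each site's location feature is taken to be a K×D table with one row per neighborhood point, the perceptron lifts each row from D to 32 values, and max pooling over the K rows leaves one 32-dimensional vector per site. The sizes K, D, and N and the random weights are hypothetical.

```python
import random


def mlp_layer(rows, weights, bias):
    """Apply one perceptron layer, relu(x @ W + b), to every row."""
    columns = list(zip(*weights))
    return [
        [max(0.0, sum(x * w for x, w in zip(row, col)) + b)
         for col, b in zip(columns, bias)]
        for row in rows
    ]


def input_layer(location_features, weights, bias):
    """Map each site's K x D location feature to one pooled vector of graph data."""
    graph_data = []
    for per_site in location_features:
        first_feature = mlp_layer(per_site, weights, bias)             # K x 32
        graph_data.append([max(col) for col in zip(*first_feature)])   # 32
    return graph_data


rng = random.Random(0)
N, K, D, D_OUT = 5, 8, 6, 32  # hypothetical sizes
location_features = [[[rng.random() for _ in range(D)] for _ in range(K)]
                     for _ in range(N)]
weights = [[rng.uniform(-1.0, 1.0) for _ in range(D_OUT)] for _ in range(D)]
bias = [0.0] * D_OUT

graph_data = input_layer(location_features, weights, bias)
print(len(graph_data), len(graph_data[0]))  # 5 32
```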
  • the feature extraction unit includes:
  • the extraction input subunit is configured to, for any edge convolution layer in the at least one edge convolution layer, perform feature extraction on the edge convolution feature output by the previous edge convolution layer, and input the extracted edge convolution feature to the next edge convolution layer;
  • the splicing subunit is configured to splice the graph data of the at least one site and the at least one edge convolution feature output by the at least one edge convolution layer to obtain a second feature;
  • the mapping subunit is used to input the second feature into the multilayer perceptron, and map the second feature through the multilayer perceptron to obtain the third feature;
  • the dimensionality reduction subunit is used for inputting the third feature to the pooling layer, and performing dimensionality reduction on the third feature through the pooling layer to obtain the global biological feature.
  • the extraction input subunit is used to:
  • for any edge convolution layer in the at least one edge convolution layer, a cluster graph is constructed based on the edge convolution feature output by the previous edge convolution layer;
  • the intermediate feature is input to the pooling layer in the edge convolution layer, the dimensionality of the intermediate feature is reduced through the pooling layer, and the dimensionality-reduced intermediate feature is input to the next edge convolution layer.
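  • One edge convolution layer can be sketched as graph construction, a perceptron on each edge, and pooling over each site's neighbors. Several choices here are assumptions borrowed from common dynamic graph CNN practice rather than taken from the text: the cluster graph is a k-nearest-neighbor graph in feature space, the per-edge input is the site's feature concatenated with the neighbor difference, and the perceptron is a single ReLU layer with random weights.

```python
import math
import random


def knn_graph(features, k):
    """For each site, the indices of its k nearest sites in feature space."""
    graph = []
    for i, fi in enumerate(features):
        others = (j for j in range(len(features)) if j != i)
        graph.append(sorted(others, key=lambda j: math.dist(fi, features[j]))[:k])
    return graph


def edge_conv(features, k, weights, bias):
    """Build the graph, map every edge through one perceptron layer, max-pool per site."""
    out = []
    for i, neighbours in enumerate(knn_graph(features, k)):
        xi = features[i]
        intermediate = []
        for j in neighbours:
            edge = list(xi) + [a - b for a, b in zip(features[j], xi)]
            intermediate.append([
                max(0.0, sum(e * w[c] for e, w in zip(edge, weights)) + bias[c])
                for c in range(len(bias))
            ])
        out.append([max(col) for col in zip(*intermediate)])  # pool over neighbours
    return out


rng = random.Random(0)
N, D_IN, D_OUT, K = 6, 4, 8, 3  # hypothetical sizes
features = [[rng.random() for _ in range(D_IN)] for _ in range(N)]
weights = [[rng.uniform(-1.0, 1.0) for _ in range(D_OUT)] for _ in range(2 * D_IN)]
bias = [0.0] * D_OUT

edge_feature = edge_conv(features, K, weights, bias)
print(len(edge_feature), len(edge_feature[0]))  # 6 8
```

  • Because the graph is rebuilt from the previous layer's output, stacking calls to `edge_conv` lets each layer group sites by learned similarity instead of only by spatial distance.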
  • the probability fitting unit is used to:
  • the feature obtained by the fusion is input to the multi-layer perceptron in the output layer, and the feature obtained by the fusion is mapped by the multi-layer perceptron to obtain the at least one predicted probability.
  • the second determining module 705 is configured to:
  • a site with a predicted probability greater than the probability threshold is determined as a binding site.
  • when the molecular binding site detection device provided in the above embodiment detects the binding site of the target molecule, the division into the above functional modules is used only as an example. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the electronic device is divided into different functional modules to complete all or part of the functions described above.
  • the molecular binding site detection device provided in the above embodiment and the molecular binding site detection method embodiment belong to the same concept. For the specific implementation process, please refer to the molecular binding site detection method embodiment, which will not be repeated here.
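  • The module division of FIG. 7 can be pictured as five interchangeable callables wired into one pipeline. The class name, field names, and the trivial stand-in functions below are hypothetical placeholders chosen only to make the wiring runnable; they do not implement the actual modules.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class BindingSiteDetector:
    acquire: Callable[[object], List[Point]]                             # module 701
    determine_targets: Callable[[List[Point]], Tuple[list, list]]        # module 702
    extract: Callable[[List[Point], list, list], list]                   # module 703
    predict: Callable[[list], List[float]]                               # module 704
    decide: Callable[[List[float]], List[int]]                           # module 705

    def detect(self, molecule) -> List[int]:
        sites = self.acquire(molecule)
        first_points, second_points = self.determine_targets(sites)
        features = self.extract(sites, first_points, second_points)
        probabilities = self.predict(features)
        return self.decide(probabilities)


# Placeholder modules, for wiring only.
detector = BindingSiteDetector(
    acquire=lambda molecule: list(molecule),
    determine_targets=lambda sites: (sites, sites),
    extract=lambda sites, firsts, seconds: [math.hypot(*p) for p in sites],
    predict=lambda features: [min(1.0, f / 10.0) for f in features],
    decide=lambda probs: [i for i, p in enumerate(probs) if p > 0.5],
)

print(detector.detect([(1.0, 0.0, 0.0), (0.0, 6.0, 0.0), (0.0, 0.0, 9.0)]))  # [1, 2]
```

  • Keeping each module behind a plain callable mirrors the statement that functions may be redistributed among modules as needed: any one of them can be replaced without touching the others.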
  • FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. Referring to FIG. 8, the description takes the case where the electronic device is a terminal 800 as an example.
  • the terminal 800 is a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop computer, or a desktop computer.
  • the terminal 800 may also be called user equipment, portable terminal, laptop terminal, desktop terminal and other names.
  • the terminal 800 includes a processor 801 and a memory 802.
  • the processor 801 includes one or more processing cores, such as a 4-core processor, an 8-core processor, and so on.
  • the processor 801 is implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
  • the processor 801 includes a main processor and a co-processor.
  • the main processor is a processor used to process data in the awake state, and is also called a CPU (Central Processing Unit);
  • the coprocessor is a low-power processor used to process data in the standby state.
  • the processor 801 is integrated with a GPU (Graphics Processing Unit), and the GPU is used to render and draw the content that needs to be displayed on the display screen.
  • the processor 801 includes an AI (Artificial Intelligence) processor, and the AI processor is used to process computing operations related to machine learning.
  • the memory 802 includes one or more computer-readable storage media, which are non-transitory.
  • the memory 802 further includes a high-speed random access memory and a non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 802 is used to store at least one instruction, and the at least one instruction is used to be executed by the processor 801 to implement the following molecular binding site detection steps:
  • the first target point of any site is the center point of all sites included in the target spherical space, the target spherical space is a spherical space with the site as the center of the sphere and the target length as the radius, and the second target point of any site is the intersection of the outer surface of the target spherical space with the forward extension line of the vector that starts at the origin and points to the site;
  • the site detection model is called to perform prediction processing on the extracted location feature to obtain at least one predicted probability of the at least one site, where one predicted probability is used to characterize the possibility that a site belongs to a binding site;
  • based on the at least one predicted probability, a binding site within the at least one site in the target molecule is determined.
  • the extracting, based on the three-dimensional coordinates of the at least one site, the at least one first target point, and the at least one second target point, a location feature with rotation invariance from the three-dimensional coordinates of the at least one site includes:
  • for any site in the at least one site, extracting, based on the three-dimensional coordinates of the site, the first target point corresponding to the site, and the second target point corresponding to the site, a location feature with rotation invariance from the three-dimensional coordinates of the site.
  • the extracting, based on the three-dimensional coordinates of the site, the first target point corresponding to the site, and the second target point corresponding to the site, a location feature with rotation invariance from the three-dimensional coordinates of the site includes:
  • a local location feature is used to characterize the relative location information between the site and a neighboring point;
  • the location feature of the site is acquired.
  • the global location feature includes at least one of: the modulus length of the site, the distance between the site and the first target point, the distance between the first target point and the second target point, the cosine value of the first included angle, or the cosine value of the second included angle, wherein the first included angle is the angle formed between the first line segment and the second line segment, the second included angle is the angle formed between the second line segment and the third line segment, the first line segment is the line segment formed between the site and the first target point, the second line segment is the line segment formed between the first target point and the second target point, and the third line segment is the line segment formed between the site and the second target point.
  • the local location feature between the site and the neighborhood point includes at least one of: the distance between the neighborhood point and the site, the distance between the neighborhood point and the first target point, the distance between the neighborhood point and the second target point, the cosine value of the third included angle, the cosine value of the fourth included angle, or the cosine value of the fifth included angle, wherein the third included angle is the angle formed between the fourth line segment and the fifth line segment, the fourth included angle is the angle formed between the fifth line segment and the sixth line segment, the fifth included angle is the angle formed between the sixth line segment and the fourth line segment, the fourth line segment is the line segment formed between the neighborhood point and the site, the fifth line segment is the line segment formed between the neighborhood point and the first target point, and the sixth line segment is the line segment formed between the neighborhood point and the second target point.
  • the site detection model is a graph convolutional neural network, and the graph convolutional neural network includes an input layer, at least one edge convolution layer, and an output layer;
  • the calling the site detection model to perform prediction processing on the extracted location feature to obtain at least one predicted probability of the at least one site includes:
  • the graph data is used to represent the location feature of the site in the form of a graph;
  • the global biological feature, the graph data of the at least one site, and the edge convolution feature output by the at least one edge convolution layer are fused, the fused feature is input to the output layer of the graph convolutional neural network, and the output layer performs probability fitting on the fused feature to obtain the at least one predicted probability.
  • the inputting the location feature of the at least one location into the input layer of the graph convolutional neural network, and outputting the graph data of the at least one location through the input layer includes:
  • the location feature of the at least one site is input into the multi-layer perceptron in the input layer, and the location feature of the at least one site is mapped through the multi-layer perceptron to obtain the first feature of the at least one site.
  • the dimension of the first feature is greater than the dimension of the location feature;
  • the first feature of the at least one site is input into the pooling layer in the input layer, and the dimensionality of the first feature of the at least one site is reduced through the pooling layer to obtain the graph data of the at least one site.
  • the performing feature extraction on the graph data of the at least one site through the at least one edge convolution layer to obtain the global biological feature of the at least one site includes:
  • for any edge convolution layer in the at least one edge convolution layer, feature extraction is performed on the edge convolution feature output by the previous edge convolution layer, and the extracted edge convolution feature is input to the next edge convolution layer;
  • the third feature is input to the pooling layer, and the dimensionality of the third feature is reduced through the pooling layer to obtain the global biological feature.
  • the performing, for any edge convolution layer of the at least one edge convolution layer, feature extraction on the edge convolution feature output by the previous edge convolution layer, and inputting the extracted edge convolution feature to the next edge convolution layer includes:
  • for any edge convolution layer in the at least one edge convolution layer, a cluster graph is constructed based on the edge convolution feature output by the previous edge convolution layer;
  • the intermediate feature is input to the pooling layer in the edge convolution layer, the dimensionality of the intermediate feature is reduced through the pooling layer, and the dimensionality-reduced intermediate feature is input to the next edge convolution layer.
  • the input of the fused features into the output layer of the graph convolutional neural network, and the probability fitting of the fused features through the output layer to obtain the at least one predicted probability includes:
  • the feature obtained by the fusion is input to the multi-layer perceptron in the output layer, and the feature obtained by the fusion is mapped by the multi-layer perceptron to obtain the at least one predicted probability.
  • the determining a binding site in the at least one site in the target molecule based on the at least one predicted probability includes:
  • a site with a predicted probability greater than the probability threshold is determined as a binding site.
  • the terminal 800 may optionally further include: a peripheral device interface 803 and at least one peripheral device.
  • the processor 801, the memory 802, and the peripheral device interface 803 are connected by a bus or signal line.
  • Each peripheral device is connected to the peripheral device interface 803 through a bus, a signal line or a circuit board.
  • the peripheral device includes: a display screen 804.
  • the peripheral device interface 803 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 801 and the memory 802.
  • the display screen 804 is used to display a UI (User Interface, user interface).
  • the UI includes graphics, text, icons, videos, and any combination of them.
  • the display screen 804 also has the ability to collect touch signals on or above the surface of the display screen 804.
  • the touch signal is input to the processor 801 as a control signal for processing.
  • the display screen 804 is also used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
  • the structure shown in FIG. 8 does not constitute a limitation on the terminal 800; the terminal 800 may include more or fewer components than shown in the figure, combine some components, or adopt a different component arrangement.
  • a computer-readable storage medium is further provided, for example a memory including at least one program code, where the at least one program code can be executed by a processor in a terminal to complete the following molecular binding site detection steps:
  • the first target point of any site is the center point of all sites included in the target spherical space, the target spherical space is a spherical space with the site as the center of the sphere and the target length as the radius, and the second target point of any site is the intersection of the outer surface of the target spherical space with the forward extension line of the vector that starts at the origin and points to the site;
  • the site detection model is called to perform prediction processing on the extracted location feature to obtain at least one predicted probability of the at least one site, where one predicted probability is used to characterize the possibility that a site belongs to a binding site;
  • based on the at least one predicted probability, a binding site within the at least one site in the target molecule is determined.
  • for any site in the at least one site, extracting, based on the three-dimensional coordinates of the site, the first target point corresponding to the site, and the second target point corresponding to the site, a location feature with rotation invariance from the three-dimensional coordinates of the site.
  • the extracting, based on the three-dimensional coordinates of the site, the first target point corresponding to the site, and the second target point corresponding to the site, a location feature with rotation invariance from the three-dimensional coordinates of the site includes:
  • a local location feature is used to characterize the relative location information between the site and a neighboring point;
  • the location feature of the site is acquired.
  • the global location feature includes at least one of: the modulus length of the site, the distance between the site and the first target point, the distance between the first target point and the second target point, the cosine value of the first included angle, or the cosine value of the second included angle, wherein the first included angle is the angle formed between the first line segment and the second line segment, the second included angle is the angle formed between the second line segment and the third line segment, the first line segment is the line segment formed between the site and the first target point, the second line segment is the line segment formed between the first target point and the second target point, and the third line segment is the line segment formed between the site and the second target point.
  • the local location feature between the site and the neighborhood point includes at least one of: the distance between the neighborhood point and the site, the distance between the neighborhood point and the first target point, the distance between the neighborhood point and the second target point, the cosine value of the third included angle, the cosine value of the fourth included angle, or the cosine value of the fifth included angle, wherein the third included angle is the angle formed between the fourth line segment and the fifth line segment, the fourth included angle is the angle formed between the fifth line segment and the sixth line segment, the fifth included angle is the angle formed between the sixth line segment and the fourth line segment, the fourth line segment is the line segment formed between the neighborhood point and the site, the fifth line segment is the line segment formed between the neighborhood point and the first target point, and the sixth line segment is the line segment formed between the neighborhood point and the second target point.
  • the site detection model is a graph convolutional neural network;
  • the graph convolutional neural network includes an input layer, at least one edge convolution layer, and an output layer;
  • the calling of the site detection model to perform prediction processing on the extracted location feature to obtain at least one predicted probability of the at least one site includes:
  • the graph data is used to represent the location feature of a site in the form of a graph;
  • the global biological feature, the graph data of the at least one site, and the edge convolution features output by the at least one edge convolution layer are fused, the fused feature is input to the output layer of the graph convolutional neural network, and
  • the output layer performs probability fitting on the fused feature to obtain the at least one predicted probability.
  • the inputting of the location feature of the at least one site into the input layer of the graph convolutional neural network, and the outputting of the graph data of the at least one site through the input layer, includes:
  • the location feature of the at least one site is input into the multi-layer perceptron in the input layer, and the location feature of the at least one site is mapped through the multi-layer perceptron to obtain the first feature of the at least one site;
  • the dimension of the first feature is greater than the dimension of the location feature;
  • the first feature of the at least one site is input into the pooling layer in the input layer, and the dimensionality of the first feature of the at least one site is reduced through the pooling layer to obtain the graph data of the at least one site.
  • the feature extraction on the graph data of the at least one site through the at least one edge convolution layer to obtain the global biological feature of the at least one site includes:
  • for any edge convolution layer in the at least one edge convolution layer, feature extraction is performed on the edge convolution feature output by the previous edge convolution layer, and the extracted edge convolution feature is input to the next edge convolution layer;
  • the third feature is input to the pooling layer, and the dimensionality of the third feature is reduced through the pooling layer to obtain the global biological feature.
  • the performing, for any edge convolution layer of the at least one edge convolution layer, of feature extraction on the edge convolution feature output by the previous edge convolution layer, and the inputting of the extracted edge convolution feature to the next edge convolution layer, includes:
  • for any edge convolution layer in the at least one edge convolution layer, a cluster graph is constructed based on the edge convolution feature output by the previous edge convolution layer;
  • the intermediate feature is input to the pooling layer in the edge convolution layer, the dimensionality of the intermediate feature is reduced through the pooling layer, and the dimensionality-reduced intermediate feature is input to the next edge convolution layer.
  • the inputting of the fused feature into the output layer of the graph convolutional neural network, and the probability fitting of the fused feature through the output layer to obtain the at least one predicted probability, includes:
  • the fused feature is input to the multi-layer perceptron in the output layer, and the fused feature is mapped by the multi-layer perceptron to obtain the at least one predicted probability.
  • the determining of a binding site within the at least one site in the target molecule based on the at least one predicted probability includes:
  • a site with a predicted probability greater than the probability threshold is determined as a binding site.
  • the above-mentioned computer-readable storage medium is a ROM (Read-Only Memory), a RAM (Random-Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, or an optical data storage device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Chemical & Material Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Medicinal Chemistry (AREA)

Abstract

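This application discloses a molecular binding site detection method and apparatus, an electronic device, and a storage medium, and belongs to the field of computer technology. The application obtains the three-dimensional coordinates of each site in a target molecule, determines a first target point and a second target point for each site, extracts rotation-invariant location features from the three-dimensional coordinates of the sites, and calls a site detection model to make predictions from the extracted location features, yielding for each site a predicted probability that it is a binding site. Binding sites are then determined from the predicted probabilities. Because the first target point and the second target point are tied to each site and are spatially representative, they make it possible to construct rotation-invariant location features that fully reflect the detailed structure of the target molecule. This avoids the loss of detail incurred when voxel features are designed for the target molecule and improves the accuracy of molecular binding site detection.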

Description

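Molecular binding site detection method and apparatus, electronic device, and storage medium
This application claims priority to Chinese Patent Application No. 202010272124.0, filed on April 9, 2020 and entitled "Molecular binding site detection method and apparatus, electronic device, and storage medium", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computer technology, and in particular to a molecular binding site detection method and apparatus, an electronic device, and a storage medium.
Background
With the development of computer technology, how to detect the binding sites of protein molecules by computer has become a hot topic in biomedicine. A binding site of a protein molecule is a location on the protein molecule at which it binds to another molecule, commonly called a "protein binding pocket". Determining the binding sites of a protein molecule is important for analyzing protein structure and function, so accurately detecting the binding sites in a protein molecule is an important research direction.
Summary
The embodiments of this application provide a molecular binding site detection method and apparatus, an electronic device, and a storage medium that can improve the accuracy of molecular binding site detection. The technical solution is as follows.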
一方面,提供了一种分子结合位点检测方法,应用于电子设备,该方法包括:
获取待检测的目标分子中至少一个位点的三维坐标,该目标分子为待检测结合位点的化学分子;
分别确定每个位点对应的第一目标点和第二目标点,其中,任一个位点的第一目标点为目标球形空间内所包括的所有位点的中心点,该目标球形空间是以该任一个位点为球心、以目标长度为半径的球形空间,任一个位点的第二目标点为以原点为起点、指向该位点的向量的正向延长线与该目标球形空间的外表面的交点;
基于该至少一个位点、至少一个第一目标点以及至少一个第二目标点的三维坐标,提取该至少一个位点的三维坐标中具有旋转不变特性的位置特征,该位置特征用于表征该至少一个位点在该目标分子中所处的位置信息;
调用位点检测模型对提取到的该位置特征进行预测处理,以得到该至少一个位点的至少一个预测概率,其中,一个预测概率用于表征一个位点属于结合位点的可能性;
基于该至少一个预测概率,确定该目标分子中该至少一个位点内的结合位点。
一方面,提供了一种分子结合位点检测装置,该装置包括:
获取模块,用于获取待检测的目标分子中至少一个位点的三维坐标,该目标分子为待检测结合位点的化学分子;
第一确定模块,用于分别确定每个位点对应的第一目标点和第二目标点,其中,任一个位点的第一目标点为目标球形空间内所包括的所有位点的中心点,该目标球形空间是以该任一个位点为球心、以目标长度为半径的球形空间,任一个位点的第二目标点为以原点为起点、指向该位点的向量的正向延长线与该目标球形空间的外表面的交点;
提取模块,用于基于该至少一个位点、至少一个第一目标点以及至少一个第二目标点的三维坐标,提取该至少一个位点的三维坐标中具有旋转不变特性的位置特征,该位置特征用于表征该至少一个位点在该目标分子中所处的位置信息;
预测模块,用于调用位点检测模型对提取到的该位置特征进行预测处理,以得到该至少一个位点的至少一个预测概率,其中,一个预测概率用于表征一个位点属于结合位点的可能性;
第二确定模块,用于基于该至少一个预测概率,确定该目标分子中该至少一个位点内的结合位点。
一方面,提供了一种电子设备,该电子设备包括一个或多个处理器和一个或多个存储器,该一个或多个存储器中存储有至少一条程序代码,该至少一条程序代码由该一个或多个处理器加载并执行以实现如上述任一种可能实现方式的分子结合位点检测方法。
一方面,提供了一种存储介质,该存储介质中存储有至少一条程序代码,该至少一条程序代码由处理器加载并执行以实现如上述任一种可能实现方式的分子结合位点检测方法。
本申请实施例提供的技术方案带来的有益效果至少包括:
通过获取目标分子中各个位点的三维坐标,确定出各个位点分别对应的第一目标点和第二目标点,基于各个位点、各个第一目标点和各个第二目标点的三维坐标,提取出各个位点的三维坐标中具有旋转不变特性的位置特征,调用位点检测模型对提取到的位置特征进行预测,得到各个位点是否属于结合位点的预测概率,从而基于预测概率确定出目标分子的结合位点,由于第一目标点和第二目标点是与各个位点相关的且具有一定空间代表性的点,因此借助各个位点、各个第一目标点和各个第二目标点的三维坐标,构造出能够全面体现出目标分子细节结构的、具有旋转不变特性的位置特征,从而避免了为目标分子设计体素特征所带来的细节损失,使得基于位置特征进行结合位点检测时,能够充分利用目标分子的细节结构的位置信息,提升了分子结合位点检测过程的准确性。
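Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of this application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an implementation environment of a molecular binding site detection method provided by an embodiment of this application;
FIG. 2 is a flowchart of a molecular binding site detection method provided by an embodiment of this application;
FIG. 3 is a flowchart of a molecular binding site detection method provided by an embodiment of this application;
FIG. 4 is a schematic diagram of a first target point and a second target point provided by an embodiment of this application;
FIG. 5 is a schematic diagram of the principle of a graph convolutional neural network provided by an embodiment of this application;
FIG. 6 is a schematic structural diagram of an edge convolution layer provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of a molecular binding site detection apparatus provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of this application.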
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
本申请中术语“第一”“第二”等字样用于对作用和功能基本相同的相同项或相似项进行区分,应理解,“第一”、“第二”、“第n”之间不具有逻辑或时序上的依赖关系,也不对数量和执行顺序进行限定。
本申请中术语“至少一个”是指一个或多个,“多个”的含义是指两个或两个以上,例如,多个第一位置是指两个或两个以上的第一位置。
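Artificial Intelligence (AI) is the theory, methods, technologies, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline that spans a wide range of fields and includes both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating/interaction systems, and mechatronics. AI software technologies mainly include audio processing, computer vision, natural language processing, and machine learning/deep learning.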
本申请实施例提供的技术方案涉及到人工智能领域的机器学习技术,机器学习(Machine Learning,ML)是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。机器学习技术专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、示教学习等技 术。
随着机器学习技术的研究和进步,机器学习技术在多个领域展开了广泛的研究和应用,本申请实施例提供的技术方案涉及到机器学习技术在生物医学领域的应用,具体地,涉及到一种基于人工智能的分子结合位点检测方法,结合位点是指在当前分子上与其他分子相互绑定的各类位点,俗称为“结合口袋”、“结合口袋位点”。
以蛋白质分子为例进行说明,随着生物学和医学中针对重要蛋白质分子的结构知识的不断增长,预测蛋白质分子的结合位点成为一个越来越重要的热点议题,通过预测蛋白质分子的结合位点能够更好地揭示出蛋白质的分子功能。由于生物过程都是通过蛋白质分子的相互作用来实现的,因此要想完全理解或要操纵生物过程,就需要技术人员揭开蛋白质分子相互作用的背后机制,其中,比如生物过程包括DNA(DeoxyriboNucleic Acid,脱氧核糖核酸)合成、信号传导、生命代谢等,而研究蛋白质分子相互作用机制的第一步就是要识别出蛋白质分子的相互作用位点(也即结合位点)。因此,预测蛋白质分子的结合位点能够辅助技术人员后续对蛋白质分子结构和功能的分析。
进一步地,预测蛋白质分子的结合位点还能为设计出合理的药物分子提供帮助:蛋白质分子的作用分析在各种疾病的治疗方面具有极大的推进作用,通过对蛋白质分子结构和功能的分析,能够揭示出某些疾病的发病机理,进而为寻找某些药物的靶点和新药研发具有指导作用。
因此,预测蛋白质分子的结合位点不但对揭示蛋白质分子自身的结构和功能具有重大意义,而且通过揭示蛋白质分子自身的结构和功能,还能够进一步地在病理学上揭示出某些疾病的发病机理,从而指导药物的靶点的寻找、指导新药研发。
需要说明的是,本申请实施例的分子结合位点检测方法用于检测出目标分子的结合位点,但目标分子并不局限于上述蛋白质分子,该目标分子是ATP(Adenosine TriphosPhate,腺苷三磷酸)分子、有机聚合物分子、有机小分子等化学分子,本申请实施例不对目标分子的类型进行具体限定。
以下,对本申请实施例所涉及的术语进行解释。
蛋白质结合口袋:位于蛋白质分子上的与其他分子相互绑定的各类结合位点。
点云数据(point cloud data):在某个坐标系下的点的数据集合。每个点的数据包含丰富的信息,包括该点的三维坐标、颜色、强度值、时间等,通常利用三维激光扫描仪进行数据采集获取点云数据。
深度卷积神经网络(Deep Convolutional Neural Network,DCNN):DCNN是一类包含卷积计算且具有深度结构的前馈神经网络,是深度学习的代表算法之一。DCNN的结构包括输入层、隐含层和输出层。隐含层中通常包括卷积层(convolutional layer)、池化层(pooling layer)和全连接层(fully-connected layer)。卷积层的功能是对输入数据进行特征提取,卷积层内部包含多个卷积核,组成卷积核的每个元素都对应一个权重系数和一个偏差量。在卷积层进行 特征提取后,输出的特征图会被传递至池化层进行特征选择和过滤。全连接层位于卷积神经网络隐含层的最后部分。特征图会在全连接层中失去空间拓扑结构,被展开为向量并通过激励函数传递给输出层。DCNN研究的对象必须具有规则的空间结构,比如图像、体素等。
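Graph Convolutional Network (GCN): a GCN is a method for deep learning on graph data. It builds graph data with nodes and edges from the input and uses multiple hidden layers to extract a high-dimensional feature for each node, where the feature implicitly encodes the graph connectivity between that node and the surrounding nodes. An output layer then produces the expected result. GCNs have been successful in many tasks, such as e-commerce recommender systems, new drug development, and point cloud analysis. GCN architectures include the Spectral CNN, the Graph Attention Network, the Graph Recurrent Attention Network, and the Dynamic Graph CNN (DGCNN). A conventional GCN is not rotation invariant.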
多层感知器(Multilayer Perceptron,MLP):MLP是一种前向结构的人工神经网络,能够将一组输入向量映射到一组输出向量。
在相关技术中,以蛋白质分子为例,利用DCNN进行蛋白质分子的结合位点(蛋白质结合口袋)的检测,DCNN近年来在图像和视频的分析、识别、处理等领域均表现出良好的性能,因此尝试将DCNN迁移至识别蛋白质结合口袋这一任务中。虽然传统的DCNN在很多任务上取得了成功,但是DCNN研究的对象必须具有规则的空间结构,比如图像的像素、分子的体素等,对于现实生活中很多并不具有规则的空间结构的数据(比如蛋白质分子),要想将DCNN迁移到蛋白质结合口袋的检测过程中,那么技术人员必须为蛋白质分子手动设计出一个具有规则的空间结构的特征,以此作为DCNN的输入。例如,在检测蛋白质结合口袋时,针对蛋白质分子设计出一个体素特征,再将该体素特征输入到深度卷积神经网络DCNN中,通过DCNN来预测输入的体素特征所对应的分子结构是否为蛋白质结合口袋,这一过程视为利用DCNN处理一个二分类问题。
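In one example, the DeepSite network was the first DCNN proposed for detecting protein binding pockets. Features (essentially substructures) are designed by hand from the protein molecule as the input of the DCNN, and a multi-layer convolutional neural network predicts whether an input substructure of the protein molecule belongs to a pocket binding site. In another example, a new feature extractor was later proposed: features are extracted from two aspects, the shape of the protein molecule and the energy of the binding site, and the output features are fed to the DCNN as 3D voxels (that is, voxel features). Similarly, in another example, FRSite is also a DCNN for detecting protein binding pockets. It extracts voxel features from the protein molecule as the input of the DCNN and uses a fast convolutional neural network to detect binding sites. Likewise, in another example, DeepDrug3D is also a DCNN for detecting protein binding pockets. It converts the protein molecule directly into 3D voxels as the input of the DCNN and then predicts protein binding pockets.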
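However, the voxel-based DCNN detection methods above are severely limited by voxel resolution and cannot handle finer protein structures. Moreover, they all require voxel features to be designed by hand as the input of the DCNN. Although these voxel features are carefully designed, there is still no guarantee that they fully capture the important information implicit in the protein molecule. As a result, the final detection of protein binding pockets is often limited by the method used to extract the designed voxel features.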
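In view of this, the embodiments of this application provide a molecular binding site detection method for detecting the binding sites of a target molecule. Taking a protein molecule as the target molecule, the point cloud data of the protein molecule (including three-dimensional coordinates) is used directly as the system input, and a site detection model such as a graph convolutional neural network explores it autonomously. The site detection model can fully explore the organizational structure of the protein molecule and automatically extract efficient biological features that are most useful for binding pocket detection, so protein binding pockets can be accurately identified from the point cloud data of the protein molecule.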
进一步地,相较于传统的图卷积神经网络而言,由于传统的图卷积神经网络不具备旋转不变特性,而蛋白质分子能够在三维空间中进行任意旋转,如果采用的网络结构不具备旋转不变特性,那么相同的蛋白质分子在经过旋转前后的口袋检测结果可能会有很大不同,这将大大降低蛋白质结合口袋的检测准确率。而本申请实施例通过将蛋白质分子的点云数据中三维坐标点转化为旋转不变的表征(也即位置特征),比如角度、长度等,将具有旋转不变性的位置特征取代旋转变化的三维坐标点作为***输入,使得位点检测模型的网络结构具有旋转不变特性,也即是说,蛋白质结合口袋的检测结果不随着输入蛋白质点云数据的方向而发生改变,这对于蛋白质结合口袋的检测过程具有突破性的意义。下面,将对本申请实施例的应用场景进行详述说明。
图1是本申请实施例提供的一种分子结合位点检测方法的实施环境示意图。参见图1,在该实施环境中包括终端101和服务器102,终端101和服务器102均为一种电子设备。
终端101用于提供目标分子的点云数据,比如,终端101是三维激光扫描仪的控制终端,通过三维激光扫描仪对目标分子进行数据采集,将采集到的点云数据导出至该控制终端,通过控制终端生成携带目标分子的点云数据的检测请求,该检测请求用于请求服务器102检测目标分子的结合位点,使得服务器102响应于检测请求,基于目标分子的点云数据对目标分子进行结合位点的检测工作,确定出目标分子的结合位点,将该目标分子的结合位点返回至该控制终端。
在上述过程中,控制终端将整个目标分子的点云数据均发送至服务器102,能够使得服务器102对目标分子进行更加全面的分子结构分析。在一些实施例中,由于点云数据除了各个位点的三维坐标之外还包括颜色、强度值、时间等附加属性,在一些实施例中,控制终端仅将目标分子中至少一个位点的三维坐标发送至服务器102,从而能够节约数据传输过程的通信量。
终端101和服务器102通过有线网络或无线网络相连。
服务器102用于提供分子结合位点的检测服务,服务器102在接收到任一终端的检测请求之后,解析该检测请求,得到目标分子的点云数据,基于点云数据中各个位点的三维坐标,提取出各个位点具有旋转不变性的位置特征,将该位置特征作为位点检测模型的输入,执行预测结合位点的操作,得到目标分子的结合位点。
在一些实施例中,服务器102包括一台服务器、多台服务器、云计算平台或者虚拟化中心中的至少一种。在一些实施例中,服务器102承担主要计算工作,终端101承担次要计算工作;或者,服务器102承担次要计算工作,终端101承担主要计算工作;或者,终端101和服务器102两者之间采用分布式计算架构进行协同计算。
在上述过程中,以终端101和服务器102通过通信交互完成分子结合位点检测为例进行说明,在一些实施例中,终端101也能够独立完成分子结合位点的检测工作,此时终端101采集到目标分子的点云数据之后,直接基于点云数据中各个位点的三维坐标,执行基于位点检测模型的预测处理,预测出目标分子的结合位点,与服务器102的预测过程类似,这里不做赘述。
在一些实施例中,终端101泛指多个终端中的一个,终端101的设备类型包括但不限于:智能手机、平板电脑、电子书阅读器、MP3(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)播放器、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、膝上型便携计算机或者台式计算机中的至少一种。以下实施例,以终端包括智能手机来进行举例说明。
本领域技术人员知晓,上述终端101的数量更多或更少。比如上述终端101仅为一个,或者上述终端101为几十个或几百个,或者更多数量。本申请实施例对终端101的数量和设备类型不加以限定。
图2是本申请实施例提供的一种分子结合位点检测方法的流程图。参见图2,该方法应用于电子设备,该实施例包括下述步骤。
201、电子设备获取待检测的目标分子中至少一个位点的三维坐标,该目标分子为待检测结合位点的化学分子。
其中,目标分子是任一待检测结合位点的化学分子,比如蛋白质分子、ATP(Adenosine TriphosPhate,腺苷三磷酸)分子、有机聚合物分子、有机小分子等,本申请实施例不对目标分子的类型进行具体限定。
在一些实施例中,该至少一个位点的三维坐标通过点云数据的形式表示,由某一坐标系内的至少一个三维坐标点堆叠在一起来描述目标分子的结构。相较于3D体素的表示形式,点云数据占用的存储空间更小,并且由于3D体素依赖于特征提取方式,在特征提取过程中容易丢失掉目标分子中一些细节结构,因此点云数据还能够描述出目标分子的细节结构。
由于三维坐标点是一类对于旋转非常敏感的数据,以蛋白质分子为例,相同的蛋白质点云在经过旋转之后,各个位点的三维坐标值会发生改变,因此,如果直接将各个位点的三维坐标输入到位点检测模型中进行特征提取和结合位点预测,由于坐标值在旋转前后会发生改变,那么同一位点检测模型针对旋转前后的输入,有可能分别会提取出不同的生物学特征,从而预测出不同的结合位点,也即是说,正是由于三维坐标点不具备旋转不变性,那么会导致位点检测模型对同一蛋白质分子在旋转前后预测出不同的结合位点,导致无法保障分子结 合位点检测过程的准确性。
202、电子设备分别确定每个位点对应的第一目标点和第二目标点,其中,任一个位点的第一目标点为目标球形空间内所包括的所有位点的中心点,该目标球形空间是以该任一个位点为球心、以目标长度为半径的球形空间,任一个位点的第二目标点为以原点为起点、指向该位点的向量的正向延长线与该目标球形空间的外表面的交点。
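Each site corresponds to exactly one first target point and one second target point. For a given site, the first target point is the center point of all sites of the target molecule contained in the target spherical space centered on the site with the target length as its radius. This center point is a spatial point obtained by averaging the three-dimensional coordinates of all sites inside the target spherical space, so the first target point is not necessarily a site that actually exists in the point cloud data of the target molecule. The target length is any value greater than 0. The second target point is the intersection of the outer surface of the target spherical space with the forward extension of the vector that starts at the origin and points to the site, where the origin is the origin of the three-dimensional coordinate system in which the target molecule lies. The vector is drawn from the origin to the site, and its length equals the modulus of the site. The forward extension of this vector has exactly one intersection with the outer surface of the target spherical space, and this intersection is the second target point. Likewise, the second target point is not necessarily a site that actually exists in the point cloud data of the target molecule.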
203、电子设备基于该至少一个位点、至少一个第一目标点以及至少一个第二目标点的三维坐标,提取该至少一个位点的三维坐标中具有旋转不变特性的位置特征,该位置特征用于表征该至少一个位点在该目标分子中所处的位置信息。
在上述步骤203中,通过各个位点、各个第一目标点和各个第二目标点的三维坐标获取各个位点的位置特征,也即是,该位置特征不受目标分子的旋转角度的影响,以位置特征替换三维坐标来作为位点检测模型的输入,能够避免上述步骤201中所涉及的由于三维坐标不具备旋转不变性而导致检测准确性下降的问题。
204、电子设备调用位点检测模型对提取到的位置特征进行预测处理,以得到该至少一个位点的至少一个预测概率,其中,一个预测概率用于表征一个位点属于结合位点的可能性。
其中,该位点检测模型用于检测目标分子的结合位点,在一些实施例中,位点检测模型属于一种分类模型,用来处理目标分子中各个位点是否属于结合位点这一分类任务,在一些实施例中,该位点检测模型包括图卷积神经网络,或者包括其他的深度学习网络,本申请实施例不对位点检测模型的类型进行具体限定。
在上述步骤204中,电子设备将各个位点的位置特征输入位点检测模型,由位点检测模型基于各个位点的位置特征进行结合位点的预测操作,在一些实施例中,在位点检测模型中,先基于各个位点的位置特征提取出目标分子的生物学特征,再基于目标分子的生物学特征进行结合位点的检测,得到各个位点的预测概率。
205、电子设备基于该至少一个预测概率,确定该目标分子中该至少一个位点内的结合位点。
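In the above process, the electronic device determines the sites whose predicted probability is greater than a probability threshold as binding sites, or sorts the sites in descending order of predicted probability and determines the top-ranked target number of sites as binding sites. The probability threshold is any value greater than or equal to 0 and less than or equal to 1, and the target number is any integer greater than or equal to 1. For example, when the target number is 3, the electronic device sorts the sites in descending order of predicted probability and determines the top 3 sites as binding sites.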
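The two selection rules described above (probability threshold, or top-ranked target number) can be sketched in a few lines of Python. The function name `select_binding_sites` and its parameters are illustrative assumptions, not names from the application.

```python
def select_binding_sites(probs, threshold=None, top_n=None):
    """Return the indices of the sites judged to be binding sites.

    Hypothetical helper: `probs` holds one predicted probability per site.
    Exactly one of `threshold` or `top_n` is expected.
    """
    if threshold is not None:
        # Rule 1: keep every site whose probability exceeds the threshold.
        return [i for i, p in enumerate(probs) if p > threshold]
    if top_n is not None:
        # Rule 2: sort sites by descending probability and keep the first top_n.
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        return order[:top_n]
    raise ValueError("provide either threshold or top_n")


probs = [0.10, 0.85, 0.40, 0.92, 0.55]
print(select_binding_sites(probs, threshold=0.5))  # [1, 3, 4]
print(select_binding_sites(probs, top_n=3))        # [3, 1, 4]
```

The threshold rule returns a variable number of sites, while the top-ranked rule always returns a fixed number, which may suit cases where the number of pocket sites is known in advance.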
本申请实施例提供的方法,通过获取目标分子中各个位点的三维坐标,确定出各个位点分别对应的第一目标点和第二目标点,基于各个位点、各个第一目标点和各个第二目标点的三维坐标,提取出各个位点的三维坐标中具有旋转不变特性的位置特征,调用位点检测模型对提取到的位置特征进行预测,得到各个位点是否属于结合位点的预测概率,从而基于预测概率确定出目标分子的结合位点,由于第一目标点和第二目标点是与各个位点相关的且具有一定空间代表性的点,因此借助各个位点、各个第一目标点和各个第二目标点的三维坐标,构造出能够全面体现出目标分子细节结构的、具有旋转不变特性的位置特征,从而避免了为目标分子设计体素特征所带来的细节损失,使得基于位置特征进行结合位点检测时,能够充分利用目标分子的细节结构的位置信息,提升了分子结合位点检测过程的准确性。
图3是本申请实施例提供的一种分子结合位点检测方法的流程图。参见图3,该实施例应用于电子设备,以电子设备为终端为例进行说明,该实施例包括下述步骤。
300、终端获取待检测的目标分子中至少一个位点的三维坐标,该目标分子为待检测结合位点的化学分子。
上述步骤300与上述步骤201类似,这里不做赘述。
301、对于该至少一个位点中任一位点,终端基于该位点的三维坐标,确定该位点对应的第一目标点和第二目标点。
其中,每个位点均唯一对应于一个第一目标点,对每个位点而言,其第一目标点是指:以该位点为球心、以目标长度为半径的目标球形空间内所包含的所有位点的中心点,该目标球形空间是指以该位点为球心、以目标长度为半径的球形空间,这个中心点是基于目标球形空间内所包含的所有位点的三维坐标进行平均值计算而得到的一个空间点,因此第一目标点并不一定是目标分子的点云数据中真实存在的位点,其中,该目标长度由技术人员进行指定,目标长度为任一大于0的数值。
其中,每个位点均唯一对应于一个第二目标点,对每个位点而言,其第二目标点是指:以原点为起点、指向该位点的向量的正向延长线与该目标球形空间的外表面的交点,以原点为起点引出一条指向该位点的向量,该向量的方向从原点指向该位点,该向量的长度等于该位点的模长,该向量的正向延长线与目标球形空间的外表面具有唯一的一个交点,这个交点即为第二目标点,同理,第二目标点也并不一定是目标分子的点云数据中真实存在的位点。
在上述过程中,终端在确定第一目标点和第二目标点的过程中,先确定以该位点为球心、以目标长度为半径的目标球形空间,再从目标分子的至少一个位点中确定位于该目标球形空间内的所有位点,将位于目标球形空间内所有位点的中心点确定为第一目标点,在一些实施 例中,在确定上述中心点时,获取位于该目标球形空间内的所有位点的三维坐标,将位于该目标球形空间内的所有位点的三维坐标的平均值坐标确定为上述中心点的三维坐标,也即是第一目标点的三维坐标。进一步地,确定以原点为起点、指向该位点的向量,将该向量的正向延长线与目标球形空间外表面的交点确定为第二目标点。
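FIG. 4 is a schematic diagram of a first target point and a second target point provided by an embodiment of this application. Referring to FIG. 4, in one example, assume the point cloud data of a protein molecule includes the three-dimensional coordinates of N (N ≥ 1) sites. The point cloud data is then formed by stacking N three-dimensional coordinate points $\{p_i = (x_i, y_i, z_i)\}_{i=1}^{N}$, where the origin is origin(0, 0, 0), $p_i$ denotes the three-dimensional coordinates of the i-th site, $x_i$, $y_i$, $z_i$ denote the coordinates of the i-th site on the x, y, and z axes, and i is an integer greater than or equal to 1 and less than or equal to N. The point cloud data describes the structure of the protein molecule. For the i-th site 400, in the target spherical space 401 centered on $p_i$ with radius $r$, the center point $m_i$ of all sites contained in the target spherical space 401 is determined as the first target point 402. Specifically, the mean of the x coordinates of all sites contained in the target spherical space 401 is the x coordinate of $m_i$, the mean of their y coordinates is the y coordinate of $m_i$, and the mean of their z coordinates is the z coordinate of $m_i$. The intersection $s_i$ of the outer surface of the target spherical space 401 with the forward extension of the vector that starts at the origin and points to $p_i$ is determined as the second target point 403.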
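As a minimal sketch of the construction above, the following Python computes the first target point $m_i$ as the coordinate mean of the sites inside the sphere, and the second target point $s_i$ as $p_i + r \cdot p_i / \|p_i\|_2$, which is where the ray from the origin through $p_i$ leaves the sphere. The function names are hypothetical, and raising an error for a site located exactly at the origin is an assumption, since the text does not cover that case.

```python
import math


def first_target_point(points, i, r):
    """Mean of all sites within distance r of site i (site i itself included)."""
    p = points[i]
    inside = [q for q in points if math.dist(p, q) <= r]
    return tuple(sum(q[k] for q in inside) / len(inside) for k in range(3))


def second_target_point(p, r):
    """Intersection of the ray origin -> p (extended past p) with the sphere
    of radius r centered on p, i.e. p + r * p / |p|."""
    norm = math.hypot(*p)
    if norm == 0.0:
        # Assumption: the direction is undefined for a site at the origin.
        raise ValueError("site at the origin has no second target point")
    return tuple(c + r * c / norm for c in p)


points = [(0.0, 0.0, 0.5), (0.0, 0.125, 0.5), (0.5, 0.5, 0.5)]
m0 = first_target_point(points, 0, r=0.25)
s0 = second_target_point(points[0], r=0.25)
print(m0)  # (0.0, 0.0625, 0.5)
print(s0)  # (0.0, 0.0, 0.75)
```

In this toy example the third site lies outside the sphere around the first site, so only the first two sites contribute to the mean.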
302、终端基于该位点、该第一目标点以及该第二目标点的三维坐标,构建该位点的全局位置特征,该全局位置特征用于表征该位点在目标分子内所处的空间位置信息。
在一些实施例中,该全局位置特征包括:该位点的模长、该位点与该第一目标点之间的距离、该第一目标点与该第二目标点之间的距离、第一夹角的余弦值或者第二夹角的余弦值中至少一项,其中,该第一夹角为第一线段与第二线段之间所构成的夹角,该第二夹角为该第二线段与第三线段之间所构成的夹角,该第一线段为该位点与该第一目标点之间所构成的线段,该第二线段为该第一目标点与该第二目标点之间所构成的线段,该第三线段为该位点与该第二目标点之间所构成的线段。
在一些实施例中,终端获取该位点的模长、该位点与该第一目标点之间的距离、该第一目标点与该第二目标点之间的距离、第一夹角的余弦值以及第二夹角的余弦值,基于上述五项数据构建一个五维向量,将该五维向量作为该位点的全局位置特征。
在一些实施例中,该全局位置特征包括:该位点的模长、该位点与该第一目标点之间的距离、该第一目标点与该第二目标点之间的距离、第一夹角的角度或者第二夹角的角度中至少一项。也即是说,不对第一夹角和第二夹角取余弦值,直接将第一夹角和第二夹角的角度作为全局位置特征中的元素。
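In one example, referring to FIG. 4, for the i-th site 400 (denoted $p_i$), in the target spherical space 401 centered on $p_i$ with radius $r$, the first target point 402 (denoted $m_i$) and the second target point 403 (denoted $s_i$) are determined through step 301 above, and the terminal obtains the following five items of data.
1) The modulus of the site $p_i$: $dp_i = \|p_i\|_2$.
2) The distance between the site $p_i$ and the first target point $m_i$: $dpm_i = \|p_i - m_i\|_2$.
3) The distance between the first target point $m_i$ and the second target point $s_i$: $dsm_i = \|s_i - m_i\|_2$.
4) The cosine $\cos(\alpha_i)$ of the first included angle $\alpha_i$, where $\alpha_i$ is the angle formed between the first line segment and the second line segment, the first line segment is the line segment between the site $p_i$ and the first target point $m_i$, and the second line segment is the line segment between the first target point $m_i$ and the second target point $s_i$.
5) The cosine $\cos(\beta_i)$ of the second included angle $\beta_i$, where $\beta_i$ is the angle formed between the second line segment and the third line segment, and the third line segment is the line segment between the site $p_i$ and the second target point $s_i$.
As can be seen from FIG. 4, the first included angle $\alpha_i$ and the second included angle $\beta_i$ are two interior angles of the triangle $\triangle m_i s_i p_i$. Based on the five items 1) to 5) above, the terminal can construct a five-dimensional vector as the global location feature of the site $p_i$: $[dp_i;\ dpm_i;\ dsm_i;\ \cos(\alpha_i);\ \cos(\beta_i)]$.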
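The five-dimensional global location feature can be sketched as follows, taking $\alpha_i$ as the interior angle at $m_i$ and $\beta_i$ as the interior angle at $s_i$, and $dsm_i$ as the distance between the two target points, as in the definition of the global location feature. The helper names are hypothetical, and returning a cosine of 0.0 when a segment has zero length is an assumption made only to keep the sketch total.

```python
import math


def _sub(a, b):
    """Vector a - b."""
    return tuple(x - y for x, y in zip(a, b))


def _cos_angle(u, v):
    """Cosine of the angle between u and v (0.0 if either has zero length,
    which is an assumption for the degenerate case)."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def global_feature(p, m, s):
    """[dp, dpm, dsm, cos(alpha), cos(beta)] for site p and target points m, s."""
    dp = math.hypot(*p)                              # modulus of the site
    dpm = math.dist(p, m)                            # site to first target point
    dsm = math.dist(s, m)                            # first to second target point
    cos_alpha = _cos_angle(_sub(p, m), _sub(s, m))   # angle at m
    cos_beta = _cos_angle(_sub(m, s), _sub(p, s))    # angle at s
    return [dp, dpm, dsm, cos_alpha, cos_beta]


p, m, s = (0.0, 0.0, 0.5), (0.0, 0.25, 0.5), (0.0, 0.0, 0.75)
feature = global_feature(p, m, s)
print(feature)
```

For this input the triangle has a right angle at the site, so both cosines equal $\sqrt{1/2}$ up to floating-point rounding.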
基于上述示例进行分析,在给定点云中任一位点p i的情况下,如果直接将位点p i的三维坐标点(x i,y i,z i)输入到位点检测模型中,那么由于三维坐标点不具有旋转不变性,会导致位点检测模型针对同一蛋白质分子预测出不同的结合位点检测结果,降低结合位点检测过程的准确性。
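In some embodiments, assume that only the modulus $dp_i = \|p_i\|_2$ of the site $p_i$ is used as its location feature. Because the modulus is rotation invariant, feeding the modulus of the site to the site detection model in place of its three-dimensional coordinates solves the problem that three-dimensional coordinates are not rotation invariant. However, knowing only the modulus of the site $p_i$ does not pin down where the site lies in the point cloud coordinate system, so using the modulus alone as the location feature loses some of the positional information between the sites of the protein molecule.
In some embodiments, assume that in addition to the modulus $dp_i$ of the site $p_i$, the terminal extracts four more items of data $[dpm_i;\ dsm_i;\ \alpha_i;\ \beta_i]$. Neither the distances $dp_i$, $dpm_i$, $dsm_i$ nor the angles $\alpha_i$ and $\beta_i$ change when the protein molecule rotates, so they are rotation invariant. From these items, the five-dimensional vector $[dp_i;\ dpm_i;\ dsm_i;\ \cos(\alpha_i);\ \cos(\beta_i)]$ is constructed as the global location feature, which replaces the three-dimensional coordinates $(x_i, y_i, z_i)$ to represent the position of the site $p_i$ in the point cloud coordinate system. That is, the position of the site $p_i$ in the point cloud coordinate system can be located precisely from the global location feature, so the global location feature preserves the positional information of the site to the greatest extent while being rotation invariant.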
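The rotation-invariance argument can be checked numerically. The sketch below rotates a small, made-up point cloud about the z axis through the origin and recomputes the five global quantities for one site. All names and coordinates are illustrative assumptions, and the check only covers a rotation about the origin.

```python
import math


def rotate_z(p, angle):
    """Rotate point p about the z axis (through the origin) by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)


def features(points, i, r):
    """[dp, dpm, dsm, cos(alpha), cos(beta)] for site i of the point cloud."""
    p = points[i]
    inside = [q for q in points if math.dist(p, q) <= r]
    m = tuple(sum(q[k] for q in inside) / len(inside) for k in range(3))
    norm = math.hypot(*p)
    s = tuple(c + r * c / norm for c in p)

    def cos_at(vertex, a, b):
        # Cosine of the angle at `vertex` between the segments to a and to b.
        u = [x - y for x, y in zip(a, vertex)]
        w = [x - y for x, y in zip(b, vertex)]
        return sum(x * y for x, y in zip(u, w)) / (math.hypot(*u) * math.hypot(*w))

    return [norm, math.dist(p, m), math.dist(s, m), cos_at(m, p, s), cos_at(s, m, p)]


points = [(0.2, 0.1, 0.5), (0.3, 0.2, 0.4), (0.1, 0.3, 0.6), (0.9, 0.0, 0.1)]
rotated = [rotate_z(q, 1.0) for q in points]

before = features(points, 0, r=0.3)
after = features(rotated, 0, r=0.3)
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(before, after)))  # True
print(points[0] == rotated[0])  # False
```

The raw coordinates of the site change under the rotation, while the five feature values agree up to floating-point error.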
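It should be noted that the point cloud data of the protein molecule is normalized in advance into a spherical space centered on the origin with radius 1, so the distances $dp_i$, $dpm_i$, and $dsm_i$ take values between 0 and 1, whereas the first included angle $\alpha_i$ and the second included angle $\beta_i$ take values between 0 and π ($\alpha_i, \beta_i \in [0, \pi]$). Taking the cosine of each angle yields $\cos(\alpha_i)$ and $\cos(\beta_i)$, which are bounded between -1 and 1. This keeps the data fed to the site detection model in a comparable value range, which gives the site detection model more stable training and prediction performance.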
303、终端基于该位点、该第一目标点、该第二目标点以及该位点的至少一个邻域点的三维坐标,构建该位点与该至少一个邻域点之间的至少一个局部位置特征,一个局部位置特征用于表征该位点与一个邻域点之间的相对位置信息。
在一些实施例中,该位点的邻域点指该目标分子中与该位点最邻近的K个点,K大于或 等于1,或者,该位点的邻域点指该位点的目标邻域内所包括的点,例如,该目标邻域为以该位点为中心的球状邻域、柱状邻域等,本申请实施例对此不作限定。
在一些实施例中,对于该位点的至少一个邻域点中任一邻域点,该位点与该邻域点之间的局部位置特征包括:该邻域点与该位点之间的距离、该邻域点与该第一目标点之间的距离、该邻域点与该第二目标点之间的距离、第三夹角的余弦值、第四夹角的余弦值或者第五夹角的余弦值中至少一项,其中,该第三夹角为第四线段与第五线段之间所构成的夹角,该第四夹角为该第五线段与第六线段之间所构成的夹角,该第五夹角为该第六线段与该第四线段之间所构成的夹角,该第四线段为该邻域点与该位点之间所构成的线段,该第五线段为该邻域点与该第一目标点之间所构成的线段,该第六线段为该邻域点与该第二目标点之间所构成的线段。
在一些实施例中,对于该位点的至少一个邻域点中任一邻域点,终端获取该邻域点与该位点之间的距离、该邻域点与第一目标点之间的距离、该邻域点与第二目标点之间的距离、第三夹角的余弦值、第四夹角的余弦值以及第五夹角的余弦值,基于上述六项数据构建一个六维向量,将该六维向量作为该位点的一个局部位置特征,进一步地,对所有的邻域点执行类似的操作,得到该位点相对于所有邻域点的局部位置特征。
在一些实施例中,对于该位点的至少一个邻域点中任一邻域点,该位点与该邻域点之间的局部位置特征包括:该邻域点与该位点之间的距离、该邻域点与该第一目标点之间的距离、该邻域点与该第二目标点之间的距离、第三夹角的角度、第四夹角的角度或者第五夹角的角度中至少一项。也即是说,不对第三夹角、第四夹角和第五夹角取余弦值,直接将第三夹角、第四夹角和第五夹角的角度作为局部位置特征中的元素。
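In one example, referring to FIG. 4, for the i-th site 400 (denoted $p_i$), in the target spherical space 401 centered on $p_i$ with radius $r$, the first target point 402 (denoted $m_i$) and the second target point 403 (denoted $s_i$) can be determined through step 301 above. Assume there is a j-th neighborhood point $p_{ij}$ (j ≥ 1) of the i-th site $p_i$. The site $p_i$, the first target point $m_i$, the second target point $s_i$, and the neighborhood point $p_{ij}$ form a tetrahedron. The edge lengths of the tetrahedron include the distance $dpp_{ij}$ between the neighborhood point $p_{ij}$ and the site $p_i$ (the length of the fourth line segment), the distance $dpm_{ij}$ between the neighborhood point $p_{ij}$ and the first target point $m_i$ (the length of the fifth line segment), and the distance $dps_{ij}$ between the neighborhood point $p_{ij}$ and the second target point $s_i$ (the length of the sixth line segment). The angles of the tetrahedron include the third included angle, the fourth included angle, and the fifth included angle, where the third included angle is the angle formed between the fourth line segment $dpp_{ij}$ and the fifth line segment $dpm_{ij}$, the fourth included angle is the angle formed between the fifth line segment $dpm_{ij}$ and the sixth line segment $dps_{ij}$, and the fifth included angle is the angle formed between the sixth line segment $dps_{ij}$ and the fourth line segment $dpp_{ij}$.
Further, the cosine of each of the third, fourth, and fifth included angles is taken, and the six-dimensional vector $[dpp_{ij};\ dpm_{ij};\ dps_{ij};\ \cos(\text{third angle});\ \cos(\text{fourth angle});\ \cos(\text{fifth angle})]$ is constructed as the local location feature between the site $p_i$ and the neighborhood point $p_{ij}$. The local location feature describes the relative positional relationship between the site $p_i$ and the neighborhood point $p_{ij}$ in the point cloud coordinate system. Together, the global location feature and the local location features characterize the position of the site $p_i$ in the point cloud coordinate system of the protein molecule more completely and precisely.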
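A minimal sketch of the six-dimensional local location feature follows. All three angles share the neighborhood point as their vertex, since each is formed by two of the segments that start at the neighborhood point. The function name is hypothetical, and the sketch assumes the neighborhood point does not coincide with the site or with either target point (otherwise a segment has zero length and the cosine is undefined).

```python
import math


def local_feature(p, m, s, q):
    """Six-dim feature of neighborhood point q relative to site p and its
    target points m and s: three distances, then three cosines at vertex q."""
    def from_q(a):
        # Vector from the neighborhood point q to point a.
        return [x - y for x, y in zip(a, q)]

    def cos_between(u, v):
        return sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))

    to_p, to_m, to_s = from_q(p), from_q(m), from_q(s)
    return [
        math.hypot(*to_p),          # dpp: neighborhood point to site
        math.hypot(*to_m),          # dpm: neighborhood point to first target point
        math.hypot(*to_s),          # dps: neighborhood point to second target point
        cos_between(to_p, to_m),    # third angle (fourth vs fifth segment)
        cos_between(to_m, to_s),    # fourth angle (fifth vs sixth segment)
        cos_between(to_s, to_p),    # fifth angle (sixth vs fourth segment)
    ]


# Toy geometry: the three segments from q are mutually perpendicular unit vectors.
p, m, s, q = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)
print(local_feature(p, m, s, q))  # [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```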
304、终端基于该全局位置特征和该至少一个局部位置特征,获取该位点的位置特征。
在上述步骤302中,终端获取到一个五维的全局位置特征,在上述步骤303中,终端获取到至少一个六维的局部位置特征,对每个局部位置特征,将该局部位置特征均与全局位置特征进行拼接,能够得到一个十一维的位置特征分量,将所有的位置特征分量所构成的矩阵确定为该位点的位置特征。
在上述步骤302-304中,对目标分子的每个位点,终端能够基于该位点、该第一目标点以及该第二目标点的三维坐标,提取到该位点的位置特征。在本申请实施例中,仅以位置特征包括全局位置特征和局部位置特征为例进行说明,在一些实施例中,位置特征等同于全局位置特征,也即是说,终端在执行步骤302中获取全局位置特征的操作之后不执行上述步骤303-304,直接将各个位点的全局位置特征输入到位点检测模型中,不获取各个位点的局部位置特征,能够简化结合位点检测方法的流程,降低结合位点检测过程的计算量。
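In one example, for the i-th site $p_i$ of the target molecule, there are a first target point $m_i$, a second target point $s_i$, and K (K ≥ 1) neighborhood points $p_{i1}, \dots, p_{iK}$ corresponding to the site $p_i$. Step 302 above yields one 5-dimensional (5-dim) global location feature $G_i = [dp_i;\ dpm_i;\ dsm_i;\ \cos(\alpha_i);\ \cos(\beta_i)]$, and step 303 above yields K 6-dimensional (6-dim) local location features $L_{i1}, \dots, L_{iK}$, one for each of the K neighborhood points. Each local location feature is concatenated with the global location feature to obtain K 11-dimensional location feature components, which form a rotation-invariant location feature of dimension [K×11], expressed as follows:
$$\begin{bmatrix} G_i & L_{i1} \\ G_i & L_{i2} \\ \vdots & \vdots \\ G_i & L_{iK} \end{bmatrix}$$
In the matrix form above, the left side is the global location feature $G_i$ of the site $p_i$, which represents the position of the site $p_i$ in the point cloud space, and the right side holds the K local location features $L_{i1}$ to $L_{iK}$ between the site $p_i$ and its K neighborhood points $p_{i1}$ to $p_{iK}$, which represent the relative positions between the site $p_i$ and those neighborhood points.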
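Assembling the [K×11] matrix is a row-wise concatenation: the same global feature is repeated on the left of every row, and each row ends with a different local feature. A short sketch with made-up numbers (the function name and the values are illustrative assumptions):

```python
def position_feature(global_feat, local_feats):
    """Build the K x 11 location feature: each row is [G_i, L_ij]."""
    return [list(global_feat) + list(local) for local in local_feats]


G = [0.5, 0.25, 0.35, 0.7, 0.7]               # 5-dim global feature (made-up values)
L = [
    [0.1, 0.2, 0.3, 0.0, 0.5, -0.5],          # 6-dim local feature, neighbor 1
    [0.2, 0.1, 0.4, 0.3, 0.2, 0.1],           # 6-dim local feature, neighbor 2
]
F = position_feature(G, L)
print(len(F), len(F[0]))  # 2 11
```

Stacking this matrix over all N sites gives the [N×K×11] input used by the network.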
305、终端对目标分子中至少一个位点重复执行上述步骤301-304,得到该至少一个位点的位置特征。
在上述步骤301-305中,终端能够基于至少一个位点、至少一个第一目标点以及至少一个第二目标点的三维坐标,提取到该至少一个位点的三维坐标中具有旋转不变特性的位置特征,该位置特征用于表征该至少一个位点在目标分子中所处的位置信息,换言之,终端通过各个位点的三维坐标,构造出一个能够充分表征出各个位点的位置信息且具有旋转不变性的位置特征,具有较高的特征表达能力。
306、终端将该至少一个位点的位置特征输入图卷积神经网络中的输入层,通过该输入层输出该至少一个位点的图数据,该图数据用于以图的形式表示位点的位置特征。
在本申请实施例中,以位点检测模型为图卷积神经网络为例进行说明,该图卷积神经网络包括输入层、至少一个边卷积(EdgeConv)层以及输出层,该输入层用于提取各个位点的图数据,该至少一个边卷积层用于提取各个位点的全局生物学特征,该输出层用于进行特征融合和概率预测。
在一些实施例中,该图卷积神经网络的输入层中包括多层感知器和池化层,终端将该至少一个位点的位置特征输入该输入层中的多层感知器,通过该多层感知器对该至少一个位点的位置特征进行映射,得到该至少一个位点的第一特征,该第一特征的维度大于该位置特征的维度,将该至少一个位点的第一特征输入该输入层中的池化层,通过该池化层对该至少一个位点的第一特征进行降维,得到该至少一个位点的图数据。
在一些实施例中,该池化层是最大池化层(max pooling layer),在最大池化层中对第一特征进行最大池化操作,或者该池化层是均值池化层(average pooling layer),在均值池化层中对第一特征进行均值池化操作,本申请实施例不对池化层的类型进行具体限定。
在上述过程中,多层感知器将输入的位置特征映射到输出的第一特征,相当于对位置特征进行升维,提取出高维的第一特征,通过池化层对第一特征进行降维,相当于第一特征进行了过滤和选择,滤去了一些不重要的信息,得到图数据。
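FIG. 5 is a schematic diagram of the principle of a graph convolutional neural network provided by an embodiment of this application. Referring to FIG. 5, assume that [N×3]-dimensional point cloud data 500 of a protein molecule is given. A rotation-invariant representation extractor (similar to step 301) converts the point cloud data into an [N×K×11]-dimensional rotation-invariant representation 501, which is the location feature of each site. Multilayer perceptrons (MLPs) then extract an [N×K×32]-dimensional first feature 502 from the original [N×K×11]-dimensional rotation-invariant representation 501, and a max pooling layer max-pools the [N×K×32]-dimensional first feature 502 along the K dimension, converting it into [N×32]-dimensional graph data 503.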
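The shape flow of the input layer for a single site ([K×11] to [K×32] to [32]) can be sketched in plain Python. The sketch uses one linear layer with a ReLU and random, untrained weights as a stand-in for the MLPs. The layer count, activation, and weights are assumptions, since the text only fixes the dimensions.

```python
import random


def linear_relu(x, weights, bias):
    """One perceptron layer: ReLU(W x + b)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def input_layer(position_feature, weights, bias):
    """Lift each of the K rows to a higher dimension, then max-pool over K."""
    lifted = [linear_relu(row, weights, bias) for row in position_feature]  # K x 32
    return [max(column) for column in zip(*lifted)]                          # 32


rng = random.Random(0)
K, D_IN, D_OUT = 4, 11, 32
W = [[rng.uniform(-1, 1) for _ in range(D_IN)] for _ in range(D_OUT)]  # untrained
b = [0.0] * D_OUT
feat = [[rng.random() for _ in range(D_IN)] for _ in range(K)]         # one site

node = input_layer(feat, W, b)
print(len(node))  # 32
```

Applying `input_layer` to each of the N sites yields the [N×32] graph data. Max pooling over K makes the result independent of the order in which the neighborhood points are listed.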
307、终端将至少一个位点的图数据输入该图卷积神经网络中的至少一个边卷积层,通过该至少一个边卷积层对该至少一个位点的图数据进行特征提取,得到该至少一个位点的全局生物学特征。
在一些实施例中,在提取全局生物学特征的过程中,终端执行下述子步骤3071-3074。
3071、对于该至少一个边卷积层中任一边卷积层,终端通过该边卷积层对上一边卷积层所输出的边卷积特征进行特征提取,将提取到的边卷积特征输入下一边卷积层。
在一些实施例中,每个边卷积层中均包括多层感知器和池化层,对于任一边卷积层,基于上一边卷积层所输出的边卷积特征构建聚类图;将该聚类图输入该边卷积层中的多层感知器,通过该多层感知器对该聚类图进行映射,得到该聚类图的中间特征;将该中间特征输入该边卷积层中的池化层,通过该池化层对该中间特征进行降维,将降维后的中间特征输入到下一边卷积层中。
在一些实施例中,在构建聚类图的过程中,对上一卷积层所输出的边卷积特征通过KNN(k-Nearest Neighbor最近邻)算法构建聚类图,此时构建出的聚类图称为KNN图,当然,也能够利用K均值算法构建聚类图,本申请实施例不对构建聚类图的方法进行具体限定。
在一些实施例中,该池化层是最大池化层(max pooling layer),在最大池化层中对中间特征进行最大池化操作,或者是均值池化层(average pooling layer),在均值池化层中对中间特征进行均值池化操作,本申请实施例不对池化层的类型进行具体限定。
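FIG. 6 is a schematic structural diagram of an edge convolution layer provided by an embodiment of this application. Referring to FIG. 6, in any edge convolution layer, for the [N×C]-dimensional edge convolution feature 601 output by the previous edge convolution layer, a cluster graph (KNN graph) is built with the KNN algorithm, and multilayer perceptrons (MLPs) extract high-dimensional features from the cluster graph, mapping the [N×C]-dimensional edge convolution feature 601 to an [N×K×C']-dimensional intermediate feature 602. A pooling layer reduces the dimensionality of the [N×K×C']-dimensional intermediate feature 602 to obtain an [N×C']-dimensional edge convolution feature 603 (the dimensionality-reduced intermediate feature), which is input to the next edge convolution layer.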
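A rough sketch of one edge convolution layer, following the three stages above (KNN graph, perceptron mapping, pooling over K). The edge feature used here, the node feature concatenated with the difference to each neighbor, follows the usual Dynamic Graph CNN formulation and is an assumption, as the text does not spell out how the cluster graph is fed to the MLPs. The weights are random and untrained, and a single linear layer with ReLU stands in for the MLPs.

```python
import math
import random


def knn_indices(features, k):
    """For every node, the indices of its k nearest other nodes in feature space."""
    result = []
    for i, fi in enumerate(features):
        others = sorted((j for j in range(len(features)) if j != i),
                        key=lambda j: math.dist(fi, features[j]))
        result.append(others[:k])
    return result


def edge_conv(features, k, weights, bias):
    """[N x C] -> [N x C']: build a KNN graph, map every edge, max-pool over K."""
    neighbors = knn_indices(features, k)
    output = []
    for i, fi in enumerate(features):
        mapped_edges = []
        for j in neighbors[i]:
            # Assumed edge feature: [x_i, x_j - x_i], of length 2C.
            edge = list(fi) + [a - c for a, c in zip(features[j], fi)]
            mapped_edges.append([max(0.0, sum(w * e for w, e in zip(row, edge)) + b0)
                                 for row, b0 in zip(weights, bias)])
        output.append([max(column) for column in zip(*mapped_edges)])  # pool over K
    return output


rng = random.Random(0)
N, C, C_OUT, K = 5, 4, 6, 2
x = [[rng.random() for _ in range(C)] for _ in range(N)]
W = [[rng.uniform(-1, 1) for _ in range(2 * C)] for _ in range(C_OUT)]
b = [0.0] * C_OUT

y = edge_conv(x, K, W, b)
print(len(y), len(y[0]))  # 5 6
```

Because the graph is rebuilt from the current features in every layer, stacking several such layers lets the neighborhoods change from layer to layer.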
在上述过程中,终端对至少一个边卷积层中每个边卷积层均执行上述操作,上一边卷积层输出的边卷积特征作为下一边卷积层的输入,从而通过该至少一个边卷积层,相当于对该至少一个位点的图数据进行了一系列更高维度的特征提取。
在一个示例中,请参考图5,以图卷积神经网络中包括2个边卷积层为例,终端将[N×32]维的图数据503输入到第一个边卷积层中,通过第一个边卷积层输出[N×64]维的边卷积特征504,终端将[N×64]维的边卷积特征504输入到第二个边卷积层中,通过第二个边卷积层输出[N×128]维的边卷积特征505,执行下述步骤3072。
3072、终端将该至少一个位点的图数据以及该至少一个边卷积层所输出的至少一个边卷积特征进行拼接,得到第二特征。
在上述过程中,终端对各个位点的图数据以及每个边卷积层所输出的边卷积层特征进行拼接,得到第二特征,该第二特征相当于该至少一个边卷积层的残差特征,从而能够在提取全局生物学特征的过程中,不仅考虑到最后一个边卷积层所输出的边卷积特征,而且能够考虑到最初输入的各个位点的图数据以及中间的每个边卷积层所输出的边卷积特征,有利于提升全局生物学特征的表达能力。
需要说明的是,这里所说的拼接是指将图数据与各个边卷积层所输出的边卷积特征直接在维度上相连,例如,假设边卷积层个数为1,那么将[N×32]维的图数据和[N×64]维的边卷积特征进行拼接,得到[N×96]维的第二特征。
在一个示例中,请参考图5,以图卷积神经网络中包括2个边卷积层为例,终端将[N×32]维的图数据503、第一个边卷积层输出的[N×64]维的边卷积特征504以及第二个边卷积层输出的[N×128]维的边卷积特征505进行拼接,得到[N×224]维的第二特征。
3073、终端将该第二特征输入多层感知器,通过该多层感知器对该第二特征进行映射,得到第三特征。
在上述过程中,终端通过多层感知器进行特征映射的过程,与前述各个步骤中通过多层 感知器进行特征映射的过程类似,这里不做赘述。
3074、终端将该第三特征输入池化层,通过该池化层对该第三特征进行降维,得到全局生物学特征。
在一些实施例中,该池化层是最大池化层(max pooling layer),在最大池化层中对第三特征进行最大池化操作,或者是均值池化层(average pooling layer),在均值池化层中对第三特征进行均值池化操作,本申请实施例不对池化层的类型进行具体限定。
在一个示例中,请参考图5,将[N×224]维的第二特征依次输入多层感知器MLPs和最大池化层,得到一个蛋白质点云的[1×1024]维的全局生物学特征506,执行下述步骤308。
308、终端将该全局生物学特征、该至少一个位点的图数据以及该至少一个边卷积层所输出的边卷积特征进行融合,将融合得到的特征输入该图卷积神经网络的输出层,通过该输出层对该融合得到的特征进行概率拟合,得到至少一个预测概率。
其中,一个预测概率用于表征一个位点属于结合位点的可能性。
在一些实施例中,在对融合得到的特征进行概率拟合的过程中,将融合得到的特征输入该输出层中的多层感知器,通过该多层感知器对该融合得到的特征进行映射,得到该至少一个预测概率。多层感知器的映射过程与前述各个步骤中多层感知器的映射过程类似,这里不做赘述。
在上述过程中,终端对全局生物学特征、各个位点的图数据以及各个边卷积层输出的边卷积特征进行融合,最终利用多层感知器对融合得到的特征进行概率拟合,拟合出每个位点属于结合位点的预测概率,在一些实施例中,上述融合过程是直接将全局生物学特征、各个位点的图数据以及各个边卷积层输出的边卷积特征进行拼接。
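In one example, referring to FIG. 5, taking a graph convolutional neural network with 2 edge convolution layers as an example, the terminal concatenates the [N×32]-dimensional graph data 503, the [N×64]-dimensional edge convolution feature 504 output by the first edge convolution layer, the [N×128]-dimensional edge convolution feature 505 output by the second edge convolution layer, and the [1×1024]-dimensional global biological feature 506 to obtain a [1×1248]-dimensional fused feature 507. The [1×1248]-dimensional fused feature 507 is input to the multilayer perceptrons (MLPs), which fit, for every site, the predicted probability that the site is a binding site. The final detection result is an [N×1]-dimensional array 508, in which each value is the predicted probability that one site is a binding site. In the above process, because it must be predicted for each site of the input protein molecule whether that site is a binding site, this task is treated as a point-wise segmentation task.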
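The dimension bookkeeping of the fusion step (32 + 64 + 128 + 1024 = 1248) can be sketched as below. The sketch assumes the single global feature vector is repeated for every site, so that each site gets its own 1248-dimensional fused row, and it replaces the output MLPs with one linear unit followed by a sigmoid. Both choices are illustrative assumptions.

```python
import math


def fuse(graph_data, edge_feats, global_feat):
    """Per-site concatenation of graph data, every edge convolution output,
    and the (repeated) global feature."""
    fused = []
    for i in range(len(graph_data)):
        row = list(graph_data[i])
        for layer_output in edge_feats:
            row += list(layer_output[i])
        row += list(global_feat)
        fused.append(row)
    return fused


def site_probability(row, weights, bias):
    """Stand-in for the output MLPs: a single linear unit with a sigmoid."""
    z = sum(w * v for w, v in zip(weights, row)) + bias
    return 1.0 / (1.0 + math.exp(-z))


N = 3
graph_data = [[0.0] * 32 for _ in range(N)]
edge1 = [[0.0] * 64 for _ in range(N)]
edge2 = [[0.0] * 128 for _ in range(N)]
global_feat = [0.0] * 1024

fused = fuse(graph_data, [edge1, edge2], global_feat)
print(len(fused), len(fused[0]))                        # 3 1248
print(site_probability(fused[0], [0.0] * 1248, 0.0))    # 0.5
```

Applying `site_probability` to every fused row gives one value per site, matching the [N×1] output array described above.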
在上述步骤306-308中,以位点检测模型为图卷积神经网络为例,示出了终端调用位点检测模型对提取到的位置特征进行预测处理,以得到该至少一个位点的至少一个预测概率的过程,在一些实施例中该位点检测模型是其他的深度学习网络,本申请实施例不对位点检测模型的类型进行具体限定。
309、终端基于该至少一个预测概率,确定该目标分子中该至少一个位点内的结合位点。
在上述过程中,终端从该至少一个位点中,将预测概率大于概率阈值的位点确定为结合位点,或者,终端按照预测概率从大到小的顺序对位点进行排序,将排序位于前目标数量的 位点确定为结合位点。
其中,该概率阈值是任一大于或等于0且小于或等于1的数值,该目标数量是任一大于或等于1的整数。例如,当目标数量为3时,电子设备按照预测概率从大到小的顺序对位点进行排序,将排序位于前3的位点确定为结合位点。
本申请实施例提供的方法,通过获取目标分子中各个位点的三维坐标,确定出各个位点分别对应的第一目标点和第二目标点,基于各个位点、各个第一目标点和各个第二目标点的三维坐标,提取出各个位点的三维坐标中具有旋转不变特性的位置特征,调用位点检测模型对提取到的位置特征进行预测,得到各个位点是否属于结合位点的预测概率,从而基于预测概率确定出目标分子的结合位点,由于第一目标点和第二目标点是与各个位点相关的且具有一定空间代表性的点,因此借助各个位点、各个第一目标点和各个第二目标点的三维坐标,构造出能够全面体现出目标分子细节结构的、具有旋转不变特性的位置特征,从而避免了为目标分子设计体素特征所带来的细节损失,使得基于位置特征进行结合位点检测时,能够充分利用目标分子的细节结构的位置信息,提升了分子结合位点检测过程的准确性。
In the embodiments of this application, deep learning is used […] stability.
上述所有可选技术方案,采用任意结合形成本申请的可选实施例,在此不再一一赘述。
图7是本申请实施例提供的一种分子结合位点检测装置的结构示意图,请参考图7,该装置包括获取模块701、第一确定模块702、提取模块703、预测模块704和第二确定模块705。
获取模块701,用于获取待检测的目标分子中至少一个位点的三维坐标,该目标分子为待检测结合位点的化学分子;
第一确定模块702,用于分别确定每个位点对应的第一目标点和第二目标点,其中,任一个位点的第一目标点为目标球形空间内所包括的所有位点的中心点,所述目标球形空间是以所述任一个位点为球心、以目标长度为半径的球形空间,任一个位点的第二目标点为以原点为起点、指向所述位点的向量的正向延长线与所述目标球形空间的外表面的交点;
提取模块703,用于基于该至少一个位点、至少一个第一目标点以及至少一个第二目标点的三维坐标,提取该至少一个位点的三维坐标中具有旋转不变特性的位置特征,该位置特征用于表征该至少一个位点在该目标分子中所处的位置信息;
预测模块704,用于调用位点检测模型对提取到的该位置特征进行预测处理,以得到该 至少一个位点的至少一个预测概率,其中,一个预测概率用于表征一个位点属于结合位点的可能性;
第二确定模块705,用于基于该至少一个预测概率,确定该目标分子中该至少一个位点内的结合位点。
本申请实施例提供的装置,通过获取目标分子中各个位点的三维坐标,确定出各个位点分别对应的第一目标点和第二目标点,基于各个位点、各个第一目标点和各个第二目标点的三维坐标,提取出各个位点的三维坐标中具有旋转不变特性的位置特征,调用位点检测模型对提取到的位置特征进行预测,得到各个位点是否属于结合位点的预测概率,从而基于预测概率确定出目标分子的结合位点,由于第一目标点和第二目标点是与各个位点相关的且具有一定空间代表性的点,因此借助各个位点、各个第一目标点和各个第二目标点的三维坐标,构造出能够全面体现出目标分子细节结构的、具有旋转不变特性的位置特征,从而避免了为目标分子设计体素特征所带来的细节损失,使得基于位置特征进行结合位点检测时,能够充分利用目标分子的细节结构的位置信息,提升了分子结合位点检测过程的准确性。
在一种可能实施方式中,基于图7的装置组成,该提取模块703包括:
提取单元,用于对于该至少一个位点中任一位点,基于该位点、该位点对应的第一目标点以及该位点对应的第二目标点的三维坐标,提取该位点的三维坐标中具有旋转不变特性的位置特征。
在一种可能实施方式中,该提取单元用于:
基于该位点、该第一目标点以及该第二目标点的三维坐标,构建该位点的全局位置特征,该全局位置特征用于表征该位点在目标分子内所处的空间位置信息;
基于该位点、该第一目标点、该第二目标点以及该位点的至少一个邻域点的三维坐标,构建该位点与该至少一个邻域点之间的至少一个局部位置特征,一个局部位置特征用于表征该位点与一个邻域点之间的相对位置信息;
基于该全局位置特征和该至少一个局部位置特征,获取该位点的位置特征。
在一种可能实施方式中,该全局位置特征包括:该位点的模长、该位点与该第一目标点之间的距离、该第一目标点与该第二目标点之间的距离、第一夹角的余弦值或者第二夹角的余弦值中至少一项,其中,该第一夹角为第一线段与第二线段之间所构成的夹角,该第二夹角为该第二线段与第三线段之间所构成的夹角,该第一线段为该位点与该第一目标点之间所构成的线段,该第二线段为该第一目标点与该第二目标点之间所构成的线段,该第三线段为该位点与该第二目标点之间所构成的线段。
在一种可能实施方式中,对于该至少一个邻域点中任一邻域点,该位点与该邻域点之间的局部位置特征包括:该邻域点与该位点之间的距离、该邻域点与该第一目标点之间的距离、该邻域点与该第二目标点之间的距离、第三夹角的余弦值、第四夹角的余弦值或者第五夹角的余弦值中至少一项,其中,该第三夹角为第四线段与第五线段之间所构成的夹角,该第四夹角为该第五线段与第六线段之间所构成的夹角,该第五夹角为该第六线段与该第四线段之 间所构成的夹角,该第四线段为该邻域点与该位点之间所构成的线段,该第五线段为该邻域点与该第一目标点之间所构成的线段,该第六线段为该邻域点与该第二目标点之间所构成的线段。
在一种可能实施方式中,该位点检测模型为图卷积神经网络,该图卷积神经网络包括输入层、至少一个边卷积层以及输出层;
基于图7的装置组成,该预测模块704包括:
输入输出单元,用于将该至少一个位点的位置特征输入图卷积神经网络中的输入层,通过该输入层输出该至少一个位点的图数据,该图数据用于以图的形式表示位点的位置特征;
特征提取单元,用于将至少一个位点的图数据输入该图卷积神经网络中的至少一个边卷积层,通过该至少一个边卷积层对该至少一个位点的图数据进行特征提取,得到该至少一个位点的全局生物学特征;
概率拟合单元,用于将该全局生物学特征、该至少一个位点的图数据以及该至少一个边卷积层所输出的边卷积特征进行融合,将融合得到的特征输入该图卷积神经网络的输出层,通过该输出层对该融合得到的特征进行概率拟合,得到该至少一个预测概率。
在一种可能实施方式中,该输入输出单元用于:
将该至少一个位点的位置特征输入该输入层中的多层感知器,通过该多层感知器对该至少一个位点的位置特征进行映射,得到该至少一个位点的第一特征,该第一特征的维度大于该位置特征的维度;
将该至少一个位点的第一特征输入该输入层中的池化层,通过该池化层对该至少一个位点的第一特征进行降维,得到该至少一个位点的图数据。
在一种可能实施方式中,基于图7的装置组成,该特征提取单元包括:
提取输入子单元,用于对于该至少一个边卷积层中任一边卷积层,对上一边卷积层所输出的边卷积特征进行特征提取,将提取到的边卷积特征输入下一边卷积层;
拼接子单元,用于将该至少一个位点的图数据以及该至少一个边卷积层所输出的至少一个边卷积特征进行拼接,得到第二特征;
映射子单元,用于将该第二特征输入多层感知器,通过该多层感知器对该第二特征进行映射,得到第三特征;
降维子单元,用于将该第三特征输入池化层,通过该池化层对该第三特征进行降维,得到该全局生物学特征。
在一种可能实施方式中,该提取输入子单元用于:
对于该至少一个边卷积层中任一边卷积层,基于上一边卷积层所输出的边卷积特征构建聚类图;
将该聚类图输入该边卷积层中的多层感知器,通过该多层感知器对该聚类图进行映射,得到该聚类图的中间特征;
将该中间特征输入该边卷积层中的池化层,通过该池化层对该中间特征进行降维,将降 维后的中间特征输入到下一边卷积层中。
在一种可能实施方式中,该概率拟合单元用于:
将融合得到的特征输入该输出层中的多层感知器,通过该多层感知器对该融合得到的特征进行映射,得到该至少一个预测概率。
在一种可能实施方式中,该第二确定模块705用于:
从该至少一个位点中,将预测概率大于概率阈值的位点确定为结合位点。
上述所有可选技术方案,采用任意结合形成本申请的可选实施例,在此不再一一赘述。
需要说明的是:上述实施例提供的分子结合位点检测装置在检测目标分子的结合位点时,仅以上述各功能模块的划分进行举例说明,实际应用中,能够根据需要而将上述功能分配由不同的功能模块完成,即将电子设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的分子结合位点检测装置与分子结合位点检测方法实施例属于同一构思,其具体实现过程详见分子结合位点检测方法实施例,这里不再赘述。
图8是本申请实施例提供的一种电子设备的结构示意图。请参考图8,以电子设备为终端800为例进行说明,该终端800是:智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。终端800还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,终端800包括有:处理器801和存储器802。
处理器801包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器801采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。在一些实施例中,处理器801包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器801在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。在一些实施例中,处理器801包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器802包括一个或多个计算机可读存储介质,该计算机可读存储介质是非暂态的。在一些实施例中,存储器802还包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器802中的非暂态的计算机可读存储介质用于存储至少一个指令,该至少一个指令用于被处理器801所执行以实现下述分子结合位点检测步骤:
获取待检测的目标分子中至少一个位点的三维坐标,该目标分子为待检测结合位点的化学分子;
分别确定每个位点对应的第一目标点和第二目标点,其中,任一个位点的第一目标点为目标球形空间内所包括的所有位点的中心点,该目标球形空间是以该任一个位点为球心、以目标长度为半径的球形空间,任一个位点的第二目标点为以原点为起点、指向该位点的向量的正向延长线与该目标球形空间的外表面的交点;
基于该至少一个位点、至少一个第一目标点以及至少一个第二目标点的三维坐标,提取该至少一个位点的三维坐标中具有旋转不变特性的位置特征,该位置特征用于表征该至少一个位点在该目标分子中所处的位置信息;
调用位点检测模型对提取到的该位置特征进行预测处理,以得到该至少一个位点的至少一个预测概率,其中,一个预测概率用于表征一个位点属于结合位点的可能性;
基于该至少一个预测概率,确定该目标分子中该至少一个位点内的结合位点。
在一种可能实施方式中,该基于该至少一个位点、至少一个第一目标点以及至少一个第二目标点的三维坐标,提取该至少一个位点的三维坐标中具有旋转不变特性的位置特征包括:
对于该至少一个位点中任一位点,基于该位点、该位点对应的第一目标点以及该位点对应的第二目标点的三维坐标,提取该位点的三维坐标中具有旋转不变特性的位置特征。
在一种可能实施方式中,该基于该位点、该位点对应的第一目标点以及该位点对应的第二目标点的三维坐标,提取该位点的三维坐标中具有旋转不变特性的位置特征包括:
基于该位点、该第一目标点以及该第二目标点的三维坐标,构建该位点的全局位置特征,该全局位置特征用于表征该位点在目标分子内所处的空间位置信息;
基于该位点、该第一目标点、该第二目标点以及该位点的至少一个邻域点的三维坐标,构建该位点与该至少一个邻域点之间的至少一个局部位置特征,一个局部位置特征用于表征该位点与一个邻域点之间的相对位置信息;
基于该全局位置特征和该至少一个局部位置特征,获取该位点的位置特征。
在一种可能实施方式中,该全局位置特征包括:该位点的模长、该位点与该第一目标点之间的距离、该第一目标点与该第二目标点之间的距离、第一夹角的余弦值或者第二夹角的余弦值中至少一项,其中,该第一夹角为第一线段与第二线段之间所构成的夹角,该第二夹角为该第二线段与第三线段之间所构成的夹角,该第一线段为该位点与该第一目标点之间所构成的线段,该第二线段为该第一目标点与该第二目标点之间所构成的线段,该第三线段为该位点与该第二目标点之间所构成的线段。
在一种可能实施方式中,对于该至少一个邻域点中任一邻域点,该位点与该邻域点之间的局部位置特征包括:该邻域点与该位点之间的距离、该邻域点与该第一目标点之间的距离、该邻域点与该第二目标点之间的距离、第三夹角的余弦值、第四夹角的余弦值或者第五夹角的余弦值中至少一项,其中,该第三夹角为第四线段与第五线段之间所构成的夹角,该第四夹角为该第五线段与第六线段之间所构成的夹角,该第五夹角为该第六线段与该第四线段之间所构成的夹角,该第四线段为该邻域点与该位点之间所构成的线段,该第五线段为该邻域点与该第一目标点之间所构成的线段,该第六线段为该邻域点与该第二目标点之间所构成的 线段。
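In a possible implementation, the site detection model is a graph convolutional neural network, and the graph convolutional neural network includes an input layer, at least one edge convolution layer, and an output layer;
invoking the site detection model to perform prediction on the extracted location feature to obtain the at least one prediction probability of the at least one site includes:
inputting the location feature of the at least one site into the input layer of the graph convolutional neural network, and outputting graph data of the at least one site through the input layer, the graph data representing the location feature of a site in the form of a graph;
inputting the graph data of the at least one site into the at least one edge convolution layer of the graph convolutional neural network, and performing feature extraction on the graph data of the at least one site through the at least one edge convolution layer to obtain a global biological feature of the at least one site;
fusing the global biological feature, the graph data of the at least one site, and the edge convolution features output by the at least one edge convolution layer, inputting the fused feature into the output layer of the graph convolutional neural network, and performing probability fitting on the fused feature through the output layer to obtain the at least one prediction probability.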
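The fusion step can be pictured as building one feature row per site. The sketch below assumes that fusing means concatenation and that the single global biological feature is repeated for every site; the text above does not fix either choice, and the function name `fuse` is hypothetical.

```python
def fuse(graph_data, edge_conv_outputs, global_feature):
    """Concatenate per-site features with the shared global feature.

    graph_data        -- N rows, one per site (output of the input layer)
    edge_conv_outputs -- list of layers, each with N rows (one per site)
    global_feature    -- a single vector describing the whole molecule
    """
    fused = []
    for i, node in enumerate(graph_data):
        row = list(node)
        for layer in edge_conv_outputs:
            row.extend(layer[i])      # this site's feature from each edge convolution layer
        row.extend(global_feature)    # same global feature appended to every site
        fused.append(row)
    return fused
```

Under this reading, each row handed to the output layer combines what is known about the site itself, about its neighborhood at several depths, and about the molecule as a whole.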
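In a possible implementation, inputting the location feature of the at least one site into the input layer of the graph convolutional neural network and outputting the graph data of the at least one site through the input layer includes:
inputting the location feature of the at least one site into a multilayer perceptron in the input layer, and mapping the location feature of the at least one site through the multilayer perceptron to obtain a first feature of the at least one site, the dimension of the first feature being greater than the dimension of the location feature;
inputting the first feature of the at least one site into a pooling layer in the input layer, and reducing the dimension of the first feature of the at least one site through the pooling layer to obtain the graph data of the at least one site.
In a possible implementation, performing feature extraction on the graph data of the at least one site through the at least one edge convolution layer to obtain the global biological feature of the at least one site includes:
for any edge convolution layer of the at least one edge convolution layer, performing feature extraction on the edge convolution feature output by the previous edge convolution layer, and inputting the extracted edge convolution feature into the next edge convolution layer;
concatenating the graph data of the at least one site and the at least one edge convolution feature output by the at least one edge convolution layer to obtain a second feature;
inputting the second feature into a multilayer perceptron, and mapping the second feature through the multilayer perceptron to obtain a third feature;
inputting the third feature into a pooling layer, and reducing the dimension of the third feature through the pooling layer to obtain the global biological feature.
In a possible implementation, for any edge convolution layer of the at least one edge convolution layer, performing feature extraction on the edge convolution feature output by the previous edge convolution layer and inputting the extracted edge convolution feature into the next edge convolution layer includes:
for any edge convolution layer of the at least one edge convolution layer, constructing a cluster graph based on the edge convolution feature output by the previous edge convolution layer;
inputting the cluster graph into a multilayer perceptron in the edge convolution layer, and mapping the cluster graph through the multilayer perceptron to obtain an intermediate feature of the cluster graph;
inputting the intermediate feature into a pooling layer in the edge convolution layer, reducing the dimension of the intermediate feature through the pooling layer, and inputting the dimension-reduced intermediate feature into the next edge convolution layer.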
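One edge convolution layer, as described above, builds a graph, maps it with a perceptron, and pools the result. The sketch below follows the widely used EdgeConv pattern and rests on several assumptions not confirmed by the text: the cluster graph is taken to be a k-nearest-neighbor graph in feature space, each edge is described by the center feature concatenated with the offset to the neighbor, a single linear layer with ReLU stands in for the multilayer perceptron, and the pooling is a maximum over the k edges. The names `knn` and `edge_conv` and the parameters `weight`, `bias`, and `k` are hypothetical.

```python
import math

def knn(features, k):
    """For each row, the indices of its k nearest other rows (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    result = []
    for i, fi in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        others.sort(key=lambda j: dist(fi, features[j]))
        result.append(others[:k])
    return result

def edge_conv(features, weight, bias, k):
    """One edge-convolution step.

    features -- N rows of length C (output of the previous layer)
    weight   -- D rows of length 2 * C; bias -- length D
    Returns N rows of length D.
    """
    out = []
    for i, neighbors in enumerate(knn(features, k)):
        xi = features[i]
        mapped = []
        for j in neighbors:
            # Edge feature: center feature followed by the offset to the neighbor.
            edge = list(xi) + [b - a for a, b in zip(xi, features[j])]
            # Linear map followed by ReLU (stand-in for the multilayer perceptron).
            mapped.append([max(0.0, sum(w * e for w, e in zip(row, edge)) + b0)
                           for row, b0 in zip(weight, bias)])
        # Max pooling over the edges reduces a (k, D) block to a single D-vector.
        out.append([max(column) for column in zip(*mapped)])
    return out
```

Rebuilding the neighbor graph from the previous layer's output, rather than reusing the original coordinates, lets later layers group sites that are similar in feature space even when they are far apart in the molecule.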
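In a possible implementation, inputting the fused feature into the output layer of the graph convolutional neural network and performing probability fitting on the fused feature through the output layer to obtain the at least one prediction probability includes:
inputting the fused feature into a multilayer perceptron in the output layer, and mapping the fused feature through the multilayer perceptron to obtain the at least one prediction probability.
In a possible implementation, determining, based on the at least one prediction probability, the binding site among the at least one site in the target molecule includes:
determining, from the at least one site, each site whose prediction probability is greater than a probability threshold as a binding site.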
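The final selection is a strict threshold comparison. A minimal sketch follows, in which the function name `select_binding_sites` and the default threshold of 0.5 are assumptions made only for illustration; the text does not specify a threshold value.

```python
def select_binding_sites(probabilities, threshold=0.5):
    """Return the indices of sites whose predicted probability exceeds the threshold."""
    # Strictly greater than, matching "greater than a probability threshold".
    return [i for i, prob in enumerate(probabilities) if prob > threshold]
```

For example, with probabilities `[0.1, 0.5, 0.9]` and the assumed threshold of 0.5, only the site at index 2 is selected, because a probability equal to the threshold does not qualify.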
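In some embodiments, the terminal 800 optionally further includes a peripheral device interface 803 and at least one peripheral device. The processor 801, the memory 802, and the peripheral device interface 803 are connected by a bus or signal lines. Each peripheral device is connected to the peripheral device interface 803 by a bus, a signal line, or a circuit board. Optionally, the peripheral devices include a display screen 804.
The peripheral device interface 803 can be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 801 and the memory 802.
The display screen 804 is used to display a UI (User Interface). The UI includes graphics, text, icons, video, and any combination thereof. When the display screen 804 is a touch display screen, the display screen 804 is also capable of collecting touch signals on or above its surface. In some embodiments, the touch signal is input to the processor 801 as a control signal for processing. In this case, the display screen 804 is also used to provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard.
A person skilled in the art can understand that the structure shown in FIG. 8 does not constitute a limitation on the terminal 800, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.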
In an exemplary embodiment, a computer-readable storage medium is further provided, for example a memory including at least one program code. The at least one program code can be executed by a processor in a terminal to perform the same molecular binding site detection steps, together with each of their possible implementations, as set out above for the instructions stored in the memory 802: acquiring the three-dimensional coordinates of at least one site in the target molecule to be detected; determining the first target point and the second target point corresponding to each site; extracting the rotation-invariant location feature from the three-dimensional coordinates of the at least one site; invoking the site detection model to obtain at least one prediction probability of the at least one site; and determining, based on the at least one prediction probability, the binding site among the at least one site in the target molecule.
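In some embodiments, the computer-readable storage medium is a ROM (Read-Only Memory), a RAM (Random-Access Memory), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, or the like.
A person of ordinary skill in the art can understand that all or some of the steps of the foregoing embodiments may be implemented by hardware, or by a program instructing relevant hardware. The program is stored in a computer-readable storage medium, and the storage medium mentioned above is a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely optional embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, or improvement made within the spirit and principle of this application shall fall within the protection scope of this application.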

Claims (15)

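  1. A method for detecting a molecule binding site, applied to an electronic device, the method comprising:
    acquiring three-dimensional coordinates of at least one site in a target molecule to be detected, the target molecule being a chemical molecule whose binding site is to be detected;
    determining a first target point and a second target point corresponding to each site respectively, wherein the first target point of any site is a center point of all sites included in a target spherical space, the target spherical space is a spherical space with the site as a sphere center and a target length as a radius, and the second target point of any site is an intersection of an outer surface of the target spherical space with a forward extension line of a vector that starts at an origin and points to the site;
    extracting, based on the three-dimensional coordinates of the at least one site, at least one first target point, and at least one second target point, a rotation-invariant location feature from the three-dimensional coordinates of the at least one site, the location feature representing location information of the at least one site in the target molecule;
    invoking a site detection model to perform prediction on the extracted location feature to obtain at least one prediction probability of the at least one site, wherein one prediction probability represents a likelihood that one site is a binding site; and
    determining, based on the at least one prediction probability, the binding site among the at least one site in the target molecule.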
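  2. The method according to claim 1, wherein the extracting, based on the three-dimensional coordinates of the at least one site, at least one first target point, and at least one second target point, a rotation-invariant location feature from the three-dimensional coordinates of the at least one site comprises:
    for any site of the at least one site, extracting, based on the three-dimensional coordinates of the site, the first target point corresponding to the site, and the second target point corresponding to the site, a rotation-invariant location feature from the three-dimensional coordinates of the site.
  3. The method according to claim 2, wherein the extracting, based on the three-dimensional coordinates of the site, the first target point corresponding to the site, and the second target point corresponding to the site, a rotation-invariant location feature from the three-dimensional coordinates of the site comprises:
    constructing a global location feature of the site based on the three-dimensional coordinates of the site, the first target point, and the second target point, the global location feature representing spatial location information of the site within the target molecule;
    constructing at least one local location feature between the site and at least one neighborhood point of the site based on the three-dimensional coordinates of the site, the first target point, the second target point, and the at least one neighborhood point, wherein one local location feature represents relative location information between the site and one neighborhood point; and
    obtaining the location feature of the site based on the global location feature and the at least one local location feature.
  4. The method according to claim 3, wherein the global location feature comprises at least one of: a modulus of the site, a distance between the site and the first target point, a distance between the first target point and the second target point, a cosine of a first angle, or a cosine of a second angle, wherein the first angle is an angle formed between a first line segment and a second line segment, the second angle is an angle formed between the second line segment and a third line segment, the first line segment is a segment between the site and the first target point, the second line segment is a segment between the first target point and the second target point, and the third line segment is a segment between the site and the second target point.
  5. The method according to claim 3, wherein for any neighborhood point of the at least one neighborhood point, the local location feature between the site and the neighborhood point comprises at least one of: a distance between the neighborhood point and the site, a distance between the neighborhood point and the first target point, a distance between the neighborhood point and the second target point, a cosine of a third angle, a cosine of a fourth angle, or a cosine of a fifth angle, wherein the third angle is an angle formed between a fourth line segment and a fifth line segment, the fourth angle is an angle formed between the fifth line segment and a sixth line segment, the fifth angle is an angle formed between the sixth line segment and the fourth line segment, the fourth line segment is a segment between the neighborhood point and the site, the fifth line segment is a segment between the neighborhood point and the first target point, and the sixth line segment is a segment between the neighborhood point and the second target point.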
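  6. The method according to claim 1, wherein the site detection model is a graph convolutional neural network, and the graph convolutional neural network comprises an input layer, at least one edge convolution layer, and an output layer;
    the invoking a site detection model to perform prediction on the extracted location feature to obtain at least one prediction probability of the at least one site comprises:
    inputting the location feature of the at least one site into the input layer of the graph convolutional neural network, and outputting graph data of the at least one site through the input layer, the graph data representing the location feature of a site in the form of a graph;
    inputting the graph data of the at least one site into the at least one edge convolution layer of the graph convolutional neural network, and performing feature extraction on the graph data of the at least one site through the at least one edge convolution layer to obtain a global biological feature of the at least one site; and
    fusing the global biological feature, the graph data of the at least one site, and the edge convolution features output by the at least one edge convolution layer, inputting the fused feature into the output layer of the graph convolutional neural network, and performing probability fitting on the fused feature through the output layer to obtain the at least one prediction probability.
  7. The method according to claim 6, wherein the inputting the location feature of the at least one site into the input layer of the graph convolutional neural network, and outputting graph data of the at least one site through the input layer comprises:
    inputting the location feature of the at least one site into a multilayer perceptron in the input layer, and mapping the location feature of the at least one site through the multilayer perceptron to obtain a first feature of the at least one site, a dimension of the first feature being greater than a dimension of the location feature; and
    inputting the first feature of the at least one site into a pooling layer in the input layer, and reducing the dimension of the first feature of the at least one site through the pooling layer to obtain the graph data of the at least one site.
  8. The method according to claim 6, wherein the performing feature extraction on the graph data of the at least one site through the at least one edge convolution layer to obtain a global biological feature of the at least one site comprises:
    for any edge convolution layer of the at least one edge convolution layer, performing feature extraction on the edge convolution feature output by a previous edge convolution layer, and inputting the extracted edge convolution feature into a next edge convolution layer;
    concatenating the graph data of the at least one site and the at least one edge convolution feature output by the at least one edge convolution layer to obtain a second feature;
    inputting the second feature into a multilayer perceptron, and mapping the second feature through the multilayer perceptron to obtain a third feature; and
    inputting the third feature into a pooling layer, and reducing the dimension of the third feature through the pooling layer to obtain the global biological feature.
  9. The method according to claim 8, wherein the for any edge convolution layer of the at least one edge convolution layer, performing feature extraction on the edge convolution feature output by a previous edge convolution layer, and inputting the extracted edge convolution feature into a next edge convolution layer comprises:
    for any edge convolution layer of the at least one edge convolution layer, constructing a cluster graph based on the edge convolution feature output by the previous edge convolution layer;
    inputting the cluster graph into a multilayer perceptron in the edge convolution layer, and mapping the cluster graph through the multilayer perceptron to obtain an intermediate feature of the cluster graph; and
    inputting the intermediate feature into a pooling layer in the edge convolution layer, reducing the dimension of the intermediate feature through the pooling layer, and inputting the dimension-reduced intermediate feature into the next edge convolution layer.
  10. The method according to claim 6, wherein the inputting the fused feature into the output layer of the graph convolutional neural network, and performing probability fitting on the fused feature through the output layer to obtain the at least one prediction probability comprises:
    inputting the fused feature into a multilayer perceptron in the output layer, and mapping the fused feature through the multilayer perceptron to obtain the at least one prediction probability.
  11. The method according to claim 1, wherein the determining, based on the at least one prediction probability, the binding site among the at least one site in the target molecule comprises:
    determining, from the at least one site, each site whose prediction probability is greater than a probability threshold as a binding site.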
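  12. An apparatus for detecting a molecule binding site, the apparatus comprising:
    an acquisition module, configured to acquire three-dimensional coordinates of at least one site in a target molecule to be detected, the target molecule being a chemical molecule whose binding site is to be detected;
    a first determining module, configured to determine a first target point and a second target point corresponding to each site respectively, wherein the first target point of any site is a center point of all sites included in a target spherical space, the target spherical space is a spherical space with the site as a sphere center and a target length as a radius, and the second target point of any site is an intersection of an outer surface of the target spherical space with a forward extension line of a vector that starts at an origin and points to the site;
    an extraction module, configured to extract, based on the three-dimensional coordinates of the at least one site, at least one first target point, and at least one second target point, a rotation-invariant location feature from the three-dimensional coordinates of the at least one site, the location feature representing location information of the at least one site in the target molecule;
    a prediction module, configured to invoke a site detection model to perform prediction on the extracted location feature to obtain at least one prediction probability of the at least one site, wherein one prediction probability represents a likelihood that one site is a binding site; and
    a second determining module, configured to determine, based on the at least one prediction probability, the binding site among the at least one site in the target molecule.
  13. The apparatus according to claim 12, wherein the extraction module comprises:
    an extraction unit, configured to, for any site of the at least one site, extract, based on the three-dimensional coordinates of the site, the first target point corresponding to the site, and the second target point corresponding to the site, a rotation-invariant location feature from the three-dimensional coordinates of the site.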
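  14. An electronic device, comprising one or more processors and one or more memories, the one or more memories storing at least one program code, the at least one program code being loaded and executed by the one or more processors to implement the method for detecting a molecule binding site according to any one of claims 1 to 11.
  15. A storage medium, storing at least one program code, the at least one program code being loaded and executed by a processor to implement the method for detecting a molecule binding site according to any one of claims 1 to 11.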
PCT/CN2021/078263 2020-04-09 2021-02-26 Method and apparatus for detecting molecule binding site, electronic device, and storage medium WO2021203865A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
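JP2021545445A JP7246813B2 (ja) 2020-04-09 2021-02-26 Molecular binding site detection method, apparatus, electronic device, and computer program
KR1020217028480A KR102635777B1 (ko) 2020-04-09 2021-02-26 Method and apparatus for detecting molecular binding site, electronic device, and storage medium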
EP21759220.3A EP3920188A4 (en) 2020-04-09 2021-02-26 METHOD AND DEVICE FOR DETECTING A MOLECULAR BINDING SITE, ELECTRONIC DEVICE AND STORAGE MEDIUM
US17/518,953 US20220059186A1 (en) 2020-04-09 2021-11-04 Method and apparatus for detecting molecule binding site, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010272124.0A CN111243668B (zh) 2020-04-09 2020-04-09 Method and apparatus for detecting molecule binding site, electronic device, and storage medium
CN202010272124.0 2020-04-09

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/518,953 Continuation US20220059186A1 (en) 2020-04-09 2021-11-04 Method and apparatus for detecting molecule binding site, electronic device, and storage medium

Publications (2)

Publication Number Publication Date
WO2021203865A1 (zh) 2021-10-14
WO2021203865A9 (zh) 2021-11-25

Family

ID=70864447

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078263 WO2021203865A1 (zh) 2020-04-09 2021-02-26 Method and apparatus for detecting molecule binding site, electronic device, and storage medium

Country Status (6)

Country Link
US (1) US20220059186A1 (zh)
EP (1) EP3920188A4 (zh)
JP (1) JP7246813B2 (zh)
KR (1) KR102635777B1 (zh)
CN (1) CN111243668B (zh)
WO (1) WO2021203865A1 (zh)

Also Published As

Publication number Publication date
EP3920188A4 (en) 2022-06-15
CN111243668B (zh) 2020-08-07
KR20210126646A (ko) 2021-10-20
KR102635777B1 (ko) 2024-02-08
EP3920188A1 (en) 2021-12-08
JP7246813B2 (ja) 2023-03-28
CN111243668A (zh) 2020-06-05
WO2021203865A1 (zh) 2021-10-14
JP2022532009A (ja) 2022-07-13
US20220059186A1 (en) 2022-02-24

Legal Events

Date Code Title Description
ENP Entry into the national phase. Ref document number: 2021545445; Country of ref document: JP; Kind code of ref document: A
ENP Entry into the national phase. Ref document number: 20217028480; Country of ref document: KR; Kind code of ref document: A
ENP Entry into the national phase. Ref document number: 2021759220; Country of ref document: EP; Effective date: 20210903
NENP Non-entry into the national phase. Ref country code: DE