CN117497064A

CN117497064A - Single-cell three-dimensional genome data analysis method based on semi-supervised learning

Info

Publication number: CN117497064A
Application number: CN202311644584.1A
Authority: CN
Inventors: 吕昊; 刀福英; 林昊; 丁辉
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2023-12-04
Filing date: 2023-12-04
Publication date: 2024-02-02

Abstract

The application provides a single-cell three-dimensional genome data analysis method based on semi-supervised learning, which relates to the technical field of bioinformatics and comprises the following steps: acquiring a first scHi-C data set and a second scHi-C data set; the first scHi-C data are scHi-C data labeled with cell type labels; the second scHi-C data are scHi-C data without cell type labels; training a semi-supervised generation model based on the first scHi-C data set and the second scHi-C data set to obtain a cell type prediction model; the cell type prediction model is used to predict cell types based on scHi-C data of cells. The method is suitable for predicting cell types from scHi-C data and is used to solve the underfitting problem of unsupervised learning models.

Description

Single-cell three-dimensional genome data analysis method based on semi-supervised learning

Technical Field

The application relates to the technical field of bioinformatics, in particular to a single-cell three-dimensional genome data analysis method based on semi-supervised learning.

Background

The advent of chromosome conformation capture (3C) technology revealed fine three-dimensional genomic structures. These structures play an important role in controlling basic biological functions such as transcription, replication, and deoxyribonucleic acid (DNA) repair.

The related art may use an unsupervised learning model to analyze three-dimensional genomic features at single-cell (sc) resolution in order to predict the cell types of single cells.

However, the unsupervised learning model in the related art may suffer from underfitting.

Disclosure of Invention

In view of the above technical problem, the application provides a single-cell three-dimensional genome data analysis method based on semi-supervised learning, which can learn the characteristics of the labels by using a semi-supervised generation model and thereby avoid the underfitting problem.

In a first aspect, the present application provides a single-cell three-dimensional genome data analysis method based on semi-supervised learning, the method comprising: acquiring a first scHi-C data set and a second scHi-C data set; the first scHi-C data set includes a plurality of first scHi-C data; the second scHi-C data set includes a plurality of second scHi-C data; the first scHi-C data are scHi-C data labeled with a cell type label; the second scHi-C data are scHi-C data without a cell type label; training a preset semi-supervised generation model based on the first scHi-C data set and the second scHi-C data set to obtain a cell type prediction model; the cell type prediction model is used to predict cell types from the scHi-C data of cells.

It is understood that scHi-C data can depict the three-dimensional conformation of chromatin at single-cell resolution, revealing the potential effect of genomic interactions on regulating cell identity. However, the high dimensionality and sparsity of scHi-C data often complicate analysis, and unsupervised learning models usually ignore the inherent characteristics of cell type labels, which leads to underfitting.

Optionally, training the preset semi-supervised generation model based on the first scHi-C data set and the second scHi-C data set to obtain the cell type prediction model includes: generating a first contact matrix from the first scHi-C data; generating a second contact matrix from the second scHi-C data; deleting, from the first contact matrix and the second contact matrix, the contact matrices whose number of non-zero elements is smaller than a preset threshold; the preset threshold is a preset proportion of the chromosome length of the cell; and training the semi-supervised generation model based on the first contact matrix and the second contact matrix to obtain the cell type prediction model.

Optionally, training the preset semi-supervised generation model based on the first contact matrix and the second contact matrix to obtain the cell type prediction model includes: performing band transformation on the first contact matrix and the second contact matrix respectively to obtain a band matrix; and training the semi-supervised generation model based on the band matrix to obtain the cell type prediction model.

Optionally, training the preset semi-supervised generation model based on the band matrix to obtain the cell type prediction model includes: performing BandNorm normalization on the band matrix to obtain a normalized band matrix; and training the semi-supervised generation model based on the normalized band matrix to obtain the cell type prediction model.

Optionally, the semi-supervised generation model satisfies the relationship of the following formulas:

wherein $c_t$ represents the cell type label; $u_t$ represents the within-cell-type characteristics of the cell, which are assumed to follow a normal distribution; $z_t$ represents a low-dimensional random variable; $f_z^{\mu}$ and $f_z^{\sigma}$ represent the learnable parameters of the neural network; $l_t$ represents the encoded cell-specific scaling factor; and $x_{tg}$ represents the contact matrix generated from a count-based likelihood distribution.

Optionally, the method further comprises: obtaining an external evaluation index and an internal evaluation index of the cell type prediction model; the external evaluation index includes one or more of the following: adjusted Rand index, normalized mutual information, F-measure, and purity; the internal evaluation index includes one or more of the following: silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index; and determining the clustering performance score of the cell type prediction model according to the external evaluation index and the internal evaluation index.

Optionally, the method further comprises: acquiring batch mixing evaluation indexes when the semi-supervised generation model is trained on different batches; the batch mixing evaluation index includes one or more of the following: local inverse Simpson's index, average silhouette coefficient, and cell-specific value; and determining the batch effect score of the cell type prediction model according to the batch mixing evaluation index.

In a second aspect, the present application provides a single-cell three-dimensional genomic data analysis device based on semi-supervised learning, the device may include: an acquisition module and a processing module.

The acquisition module is configured to acquire a first scHi-C data set and a second scHi-C data set; the first scHi-C data set includes a plurality of first scHi-C data; the second scHi-C data set includes a plurality of second scHi-C data; the first scHi-C data are scHi-C data labeled with a cell type label; the second scHi-C data are scHi-C data without a cell type label.

The processing module is configured to train a preset semi-supervised generation model based on the first scHi-C data set and the second scHi-C data set to obtain a cell type prediction model; the cell type prediction model is used to predict cell types from the scHi-C data of cells.

Optionally, the processing module is specifically configured to: generate a first contact matrix from the first scHi-C data; generate a second contact matrix from the second scHi-C data; delete, from the first contact matrix and the second contact matrix, the contact matrices whose number of non-zero elements is smaller than a preset threshold; the preset threshold is a preset proportion of the chromosome length of the cell; and train the semi-supervised generation model based on the first contact matrix and the second contact matrix to obtain the cell type prediction model.

Optionally, the processing module is specifically configured to perform band transformation on each of the first contact matrix and the second contact matrix to obtain a band matrix, and to train the semi-supervised generation model based on the band matrix to obtain the cell type prediction model.

Optionally, the processing module is specifically configured to perform BandNorm normalization on the band matrix to obtain a normalized band matrix, and to train the semi-supervised generation model based on the normalized band matrix to obtain the cell type prediction model.

wherein $c_t$ represents the cell type label; $u_t$ represents the within-cell-type characteristics of the cell, which are assumed to follow a normal distribution; $z_t$ represents a low-dimensional random variable; $f_z^{\mu}$ and $f_z^{\sigma}$ represent the learnable parameters of the neural network; $l_t$ represents the encoded cell-specific scaling factor; and $x_{tg}$ represents the contact matrix generated from a count-based likelihood distribution.

Optionally, the acquisition module is further configured to obtain an external evaluation index and an internal evaluation index of the cell type prediction model; the external evaluation index includes one or more of the following: adjusted Rand index, normalized mutual information, F-measure, and purity; the internal evaluation index includes one or more of the following: silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index; the processing module is further configured to determine the clustering performance score of the cell type prediction model according to the external evaluation index and the internal evaluation index.

Optionally, the acquisition module is further configured to acquire batch mixing evaluation indexes when the semi-supervised generation model is trained on different batches; the batch mixing evaluation index includes one or more of the following: local inverse Simpson's index, average silhouette coefficient, and cell-specific value; the processing module is further configured to determine the batch effect score of the cell type prediction model according to the batch mixing evaluation index.

In a third aspect, the present application provides an electronic device comprising a processor and a memory; the memory stores instructions executable by the processor; the processor is configured to execute the instructions to cause the electronic device to implement the method of the first aspect described above.

In a fourth aspect, the present application provides a computer program product which, when run on an electronic device, causes the electronic device to perform the steps of the method of the first aspect described above.

In a fifth aspect, the present application provides a readable storage medium comprising: a software instruction; the software instructions, when executed in an electronic device, cause the electronic device to implement the method according to the first aspect described above.

The advantageous effects of the second aspect to the fifth aspect described above may be described with reference to the first aspect, and will not be repeated.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of an electronic device according to an embodiment of the present application;

FIG. 2 is a schematic flow chart of a method for analyzing single-cell three-dimensional genome data based on semi-supervised learning according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of another method for analyzing single-cell three-dimensional genome data based on semi-supervised learning according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a band matrix provided in an embodiment of the present application;

FIG. 5 is a schematic representation of sequencing depth bias and genome contact attenuation effects provided by embodiments of the present application;

FIG. 6 is a schematic flow chart of a method for analyzing single-cell three-dimensional genome data based on semi-supervised learning according to an embodiment of the present application;

FIG. 7 is a schematic flow chart of a single-cell three-dimensional genome data analysis method based on semi-supervised learning according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a single-cell three-dimensional genome data analysis device based on semi-supervised learning according to an embodiment of the present application.

Detailed Description

Hereinafter, the terms "first," "second," and "third," etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", or "a third", etc., may explicitly or implicitly include one or more such feature.

Chromosome conformation capture is a technology for studying the spatial conformation of chromatin. It uses cross-linking agents such as formaldehyde to cross-link proteins with DNA, and DNA with DNA, in cells so as to preserve their interactions; uses restriction enzymes to cut the DNA into fragments with blunt or sticky ends; and uses ligase to join the cut fragments so as to obtain interaction information.

In recent years, various new technologies such as 4C, 5C, and Hi-C have been developed on the basis of this technology. Hi-C, also called high-throughput chromosome conformation capture, takes the whole cell nucleus as the research object and combines high-throughput sequencing with bioinformatic analysis to study the spatial relationships of chromatin across the whole genome, yielding a high-resolution interaction map of chromatin regulatory elements and thereby describing the three-dimensional genome structure more comprehensively.

The three-dimensional genomic structure may include, among other things, A/B compartments, sub-compartments, topologically associating domains (TADs), and chromatin loops.

In the related art, scVI-3D, an unsupervised learning model, has been proposed to analyze three-dimensional genome features at single-cell (sc) resolution and to predict the cell types of single cells. The model systematically accounts for the structural characteristics of scHi-C data, genomic distance bias, sequencing depth effects, sparsity, and batch effects, and enables biological interpretation of fine three-dimensional genome clustering results at the gene level.

However, the scVI-3D model may suffer from underfitting.

Based on the above, the embodiment of the application provides a single-cell three-dimensional genome data analysis method based on semi-supervised learning, which can learn the characteristics of the labels by using a semi-supervised generation model and thereby avoid the underfitting problem.

The execution subject of the single-cell three-dimensional genome data analysis method based on semi-supervised learning provided by the embodiment of the application may be a single-cell three-dimensional genome data analysis device based on semi-supervised learning, and the device may be an electronic device with a computing function, such as a computer or a server. The server may be a single server or a server cluster formed by a plurality of servers. In some implementations, the server cluster may also be a distributed cluster. Optionally, the server may also be implemented on a cloud platform, which may include, for example, a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, and the like, or any combination thereof. The embodiments of the present application are not limited in this regard.

Alternatively, the single-cell three-dimensional genome data analysis device based on semi-supervised learning may be a processor (e.g., a central processing unit (CPU)) in the aforementioned electronic device; alternatively, the device may be an application (APP) installed in the aforementioned electronic device for performing the single-cell three-dimensional genome data analysis method based on semi-supervised learning; still alternatively, the device may be a software system or platform deployed in the aforementioned electronic device; alternatively, the device may be a functional module or the like in the electronic device for performing the single-cell three-dimensional genome data analysis method based on semi-supervised learning. The embodiments of the present application are not limited in this regard.

For simplicity of description, the single-cell three-dimensional genome data analysis device based on semi-supervised learning will be described below by taking an electronic device as an example.

FIG. 1 is a schematic diagram of an electronic device according to an embodiment of the present application. As shown in FIG. 1, the electronic device may include a processor 10, a memory 20, a communication line 30, a communication interface 40, and an input/output interface 50.

The processor 10, the memory 20, the communication interface 40, and the input/output interface 50 may be connected by a communication line 30.

The processor 10 is configured to execute the instructions stored in the memory 20 to implement the single-cell three-dimensional genome data analysis method based on semi-supervised learning provided in the following embodiments of the present application. The processor 10 may be a CPU, a general-purpose processor, a network processor (NP), a digital signal processor (DSP), a microprocessor, a microcontroller unit (MCU, also called a single-chip microcomputer), a programmable logic device (PLD), or any combination thereof. The processor 10 may also be any other apparatus having a processing function, such as a circuit, a device, or a software module, which is not limited in this embodiment. In one example, the processor 10 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 1. As an alternative implementation, the electronic device may include multiple processors; for example, it may include a processor 60 (illustrated in phantom in FIG. 1) in addition to the processor 10.

The memory 20 is used for storing instructions. For example, the instructions may be a computer program. Optionally, the memory 20 may be a read-only memory (ROM) or another type of static storage device that can store static information and/or instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and/or instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, or other magnetic storage devices, which is not limited in the embodiments of the present application.

It should be noted that, the memory 20 may exist separately from the processor 10 or may be integrated with the processor 10. The memory 20 may be located inside the electronic device or outside the electronic device, which is not limited in this embodiment of the present application.

The communication line 30 is used for transferring information between the components of the electronic device.

The communication interface 40 is used for communicating with other devices or other communication networks. The other communication network may be an Ethernet, a radio access network (RAN), a wireless local area network (WLAN), etc. The communication interface 40 may be a module, a circuit, a transceiver, or any device capable of enabling communication.

The input/output interface 50 is used for implementing human-machine interaction between the user and the electronic device, such as action interaction or information interaction.

For example, the input/output interface 50 may be a mouse, a keyboard, a display screen, or a touch display screen, through which action interaction or information interaction between the user and the electronic device can be realized.

It should be noted that the structure shown in FIG. 1 does not constitute a limitation of the electronic device, and the electronic device may include more or fewer components than those shown in FIG. 1 (for example, only the processor 10 and the memory 20), or a combination of some components, or a different arrangement of components.

The following describes a single-cell three-dimensional genome data analysis method based on semi-supervised learning provided in the embodiments of the present application.

Fig. 2 is a flow chart of a single-cell three-dimensional genome data analysis method based on semi-supervised learning according to an embodiment of the present application. As shown in fig. 2, the method includes S101 to S102.

S101, the electronic device acquires a first scHi-C data set and a second scHi-C data set.

Wherein the first scHi-C data set includes a plurality of first scHi-C data. The second scHi-C data set includes a plurality of second scHi-C data. The first scHi-C data are scHi-C data labeled with a cell type label. The second scHi-C data are scHi-C data without a cell type label. The scHi-C data may include A/B compartments, sub-compartments, topologically associating domains (TADs), chromatin loops, and the like.

S102, the electronic device trains a preset semi-supervised generation model based on the first scHi-C data set and the second scHi-C data set to obtain a cell type prediction model.

Wherein the cell type prediction model is used to predict cell types based on the scHi-C data of cells. In a semi-supervised generation model, unlabeled data (the second scHi-C data) are used to learn the latent distribution and feature representation of the data, while labeled data (the first scHi-C data) are used to fine-tune the model so that it can better predict and classify labels. In this way, the semi-supervised generation model can use the information in the unlabeled data to infer and predict labels and add these labels to the labeled data, thereby expanding the labeled data set and improving the classification performance of the model.
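The division of labor between unlabeled and labeled data described above can be illustrated with a minimal sketch of a generic semi-supervised objective. This is not the formula of the present application: the function name `semi_supervised_loss`, the weight `alpha`, and the convention that unlabeled cells carry the label `None` are all assumptions made for illustration. Every cell contributes an unsupervised (reconstruction-style) term, and only labeled cells contribute a classification term.

```python
import math

def semi_supervised_loss(recon_losses, class_probs, labels, alpha=1.0):
    """Generic semi-supervised objective (illustrative assumption, not the patent's formula).

    recon_losses: one unsupervised loss value per cell (labeled or not).
    class_probs:  one dict {cell_type: probability} per cell.
    labels:       the true cell type per cell, or None for unlabeled cells.
    alpha:        hypothetical weight of the supervised term.
    """
    # Unsupervised term: averaged over all cells.
    unsupervised = sum(recon_losses) / len(recon_losses)

    # Supervised term: cross-entropy over labeled cells only.
    labeled = [(probs, y) for probs, y in zip(class_probs, labels) if y is not None]
    if not labeled:
        return unsupervised
    supervised = -sum(math.log(probs[y]) for probs, y in labeled) / len(labeled)
    return unsupervised + alpha * supervised

# Toy values (hypothetical): the first cell is labeled "A", the second is unlabeled.
loss = semi_supervised_loss(
    recon_losses=[1.0, 3.0],
    class_probs=[{"A": 0.5, "B": 0.5}, {"A": 0.9, "B": 0.1}],
    labels=["A", None],
)
```

In this sketch the unlabeled cell still shapes the model through the unsupervised term, while the labeled cell additionally penalizes wrong class probabilities.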

Optionally, the semi-supervised generation model satisfies the following equation (1):

In formula (1), $c_t$ represents the cell type label; $u_t$ represents the within-cell-type characteristics of the cell, which are assumed to follow a normal distribution; $z_t$ represents a low-dimensional random variable; $f_z^{\mu}$ and $f_z^{\sigma}$ represent the learnable parameters of the neural network; $l_t$ represents the encoded cell-specific scaling factor; and $x_{tg}$ represents the contact matrix generated from a count-based likelihood distribution.

In some possible embodiments, the electronic device may generate a contact matrix from the scHi-C data and train the semi-supervised generation model based on the contact matrix. In this case, FIG. 3 is another flow chart of the single-cell three-dimensional genome data analysis method based on semi-supervised learning according to the embodiment of the present application. As shown in FIG. 3, S102 may specifically include S1021 to S1024.

S1021, the electronic equipment generates a first contact matrix according to the first scHi-C data.

Wherein the first contact matrix may be a two-dimensional matrix whose rows and columns represent different chromosomal or genomic locations, and each element of the matrix represents the strength or frequency of interaction between the two corresponding locations. Typically, if there is an interaction between two chromosomal or genomic locations, the value of the corresponding element is greater than 0; otherwise it is 0. Thus, the number of non-zero chromatin interactions in the contact matrix can be used to measure the abundance of chromatin interactions within a single cell. For single-cell (sc) Hi-C data, the contact matrix differs from cell to cell, because the genomic state differs in each cell. Thus, the number of non-zero chromatin interactions in the contact matrix may reflect differences in chromatin interactions between single cells.

For example, for all first scHi-C data, the electronic device may divide the data into n = L/r non-overlapping bins at a resolution of 1 Mb and generate the first contact matrix from these bins, where L represents the length of the chromosome and r is a preset value.
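The binning step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_contact_matrix`, the representation of contacts as pairs of base-pair positions on one chromosome, and the interpretation of r as the bin size are not taken from the source, and plain Python lists are used only for clarity.

```python
import math

def build_contact_matrix(contacts, chrom_length, bin_size=1_000_000):
    """Bin intra-chromosomal contacts into a symmetric n x n count matrix.

    contacts:     iterable of (pos_a, pos_b) base-pair positions (assumed input format).
    chrom_length: chromosome length L in base pairs.
    bin_size:     bin width in base pairs (assumed to play the role of r).
    """
    n_bins = math.ceil(chrom_length / bin_size)  # n = L / r, rounded up
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in contacts:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # mirror the count to keep the matrix symmetric
    return matrix

# Toy 5 Mb chromosome with four observed contacts (hypothetical data).
contacts = [
    (100_000, 2_500_000),
    (150_000, 2_700_000),
    (3_200_000, 3_900_000),
    (4_100_000, 900_000),
]
matrix = build_contact_matrix(contacts, chrom_length=5_000_000)
```

Here the first two contacts fall into the same pair of bins, so the corresponding element accumulates a count of 2; a practical pipeline would likely store such matrices in a sparse format.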

S1022, the electronic device generates a second contact matrix according to the second scHi-C data.

For the second contact matrix, refer to the description of the first contact matrix; details are not repeated here.

S1023, the electronic device deletes, from the first contact matrix and the second contact matrix, the contact matrices whose number of non-zero elements is smaller than a preset threshold.

Wherein the preset threshold is a preset proportion of the chromosome length of the cell. For example, if the chromosome length is L and the preset proportion is 1/6, the preset threshold is L/6.

It is understood that extremely sparse cells generally refer to cells that are present in very small numbers in a tissue or sample. In biological and medical research, these cells may have important functions or represent a specific state, but they are difficult to study because they are rare. In the single-cell three-dimensional genome data analysis method based on semi-supervised learning, the electronic device can delete the contact matrices of extremely sparse cells with few non-zero interactions, which prevents such cells from degrading model performance and improves sample quality.
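The filtering rule of S1023 can be sketched as below. The function names and the dictionary keyed by cell identifier are hypothetical, and because the source does not state the unit of L, the example assumes the chromosome length is expressed in bins.

```python
def count_nonzero(matrix):
    """Number of non-zero elements in a contact matrix given as a list of rows."""
    return sum(1 for row in matrix for value in row if value != 0)

def filter_sparse_cells(cell_matrices, chrom_length, ratio=1 / 6):
    """Keep only cells whose contact matrix has at least chrom_length * ratio non-zero elements.

    cell_matrices: dict mapping a cell identifier to its contact matrix (assumed layout).
    chrom_length:  chromosome length L (unit assumed to be bins in this sketch).
    ratio:         the preset proportion, 1/6 in the example from the text.
    """
    threshold = chrom_length * ratio
    return {
        cell: matrix
        for cell, matrix in cell_matrices.items()
        if count_nonzero(matrix) >= threshold
    }

# Toy data (hypothetical): one densely covered cell and one cell with no contacts.
n_bins = 6
cells = {
    "cell_dense": [[1] * n_bins for _ in range(n_bins)],
    "cell_empty": [[0] * n_bins for _ in range(n_bins)],
}
kept = filter_sparse_cells(cells, chrom_length=n_bins)
```

With L = 6 and a proportion of 1/6, the threshold is one non-zero element, so the cell with an all-zero matrix is removed and the densely covered cell is kept.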

S1024, the electronic device trains the semi-supervised generation model based on the first contact matrix and the second contact matrix to obtain a cell type prediction model.

In one possible implementation, the electronic device may apply a band transformation to the contact matrix and train the semi-supervised generation model based on the resulting band matrix. In this case, step S1024 may specifically include the following steps:

Step 1, the electronic device performs band transformation on the first contact matrix and the second contact matrix respectively to obtain a band matrix.

The band transformation is a data processing method used for dimensionality reduction and visualization of single-cell Hi-C data.

Specifically, the band transformation converts the original Hi-C contact matrix into a series of bands, each representing interactions over a specific genomic distance range. Such a conversion can reduce the dimensionality and complexity of the data while preserving important information about genome interactions.

The basic idea of the band transformation is to decompose the contact matrix into a series of bands, each corresponding to a different genomic distance range. By calculating the interaction frequency within each band, a low-dimensional representation of the data can be obtained, which can be more easily visualized and analyzed. Furthermore, by selecting an appropriate number and width of bands, genomic interactions at different scales can be captured at different resolutions.

For example, the electronic device can first stratify the upper triangle of each cell's symmetric contact matrix into diagonal bands, each band representing a genomic distance between interacting loci; the bands at the same genomic distance can then be organized across cells into a band matrix.

For example, referring to FIG. 4, FIG. 4 is a schematic diagram of a band matrix according to an embodiment of the present application. FIG. 4 exemplarily shows a band matrix used to characterize the three-dimensional genomic structure in a low-dimensional latent space.
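The stratification into diagonal bands and the stacking across cells can be sketched as follows. The function names `extract_bands` and `build_band_matrices` are hypothetical, and the sketch assumes that all cells share the same number of bins and that each band has a width of one bin.

```python
def extract_bands(matrix):
    """Split the upper triangle of a symmetric matrix into diagonal bands.

    Band d contains the elements matrix[i][i + d], that is, the contacts
    between loci that are d bins apart.
    """
    n = len(matrix)
    return [[matrix[i][i + d] for i in range(n - d)] for d in range(n)]

def build_band_matrices(cell_matrices):
    """For each distance d, stack band d of every cell into a cells x loci matrix."""
    per_cell = [extract_bands(matrix) for matrix in cell_matrices]
    n_bands = len(per_cell[0])
    return [[bands[d] for bands in per_cell] for d in range(n_bands)]

# Two toy cells with 3 x 3 symmetric contact matrices (hypothetical data).
cell_1 = [[1, 2, 3],
          [2, 4, 5],
          [3, 5, 6]]
cell_2 = [[0, 1, 0],
          [1, 0, 2],
          [0, 2, 0]]
band_matrices = build_band_matrices([cell_1, cell_2])
# band_matrices[1] holds, for each cell, the contacts at a distance of one bin.
```

Each entry of `band_matrices` is one band matrix: its rows are cells and its columns are the loci pairs at that genomic distance, so bands farther from the main diagonal are shorter.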

Step 2, the electronic device trains the semi-supervised generation model based on the band matrix to obtain a cell type prediction model.

Optionally, step 2 may specifically include the following steps:

Step 2.1, the electronic device performs BandNorm normalization on the band matrix to obtain a normalized band matrix.

Wherein BandNorm normalization is a data normalization method generally used in the fields of deep learning and machine learning. Its main purpose is to normalize the distribution of the input data so that the neural network model can be trained better. The normalization method can effectively reduce internal covariate shift and improve the generalization capability of the model.

Optionally, the BandNorm normalization calculation process can be divided into the following steps:

First, the mean and standard deviation of the data in each of the different channels are calculated.

Second, the data in each channel are normalized using the calculated mean and standard deviation.

Finally, the normalized data are transformed by learnable scaling and shifting parameters so as to restore the expressive capacity of the data.
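The three steps listed above can be sketched as follows. The sketch follows the steps exactly as stated in this description and may differ from other descriptions of BandNorm; treating each band matrix as one channel is an assumption, the function name `normalize_band` is hypothetical, and `scale` and `shift` stand in for parameters that would be learned during training but are fixed constants here.

```python
from statistics import mean, pstdev

def normalize_band(band_matrix, scale=1.0, shift=0.0, eps=1e-8):
    """Standardize one band matrix, then apply a scale and a shift.

    band_matrix: list of rows (cells) for a single band, treated as one channel.
    scale/shift: placeholders for the learnable parameters of the final step.
    eps:         small constant that avoids division by zero for constant bands.
    """
    values = [value for row in band_matrix for value in row]
    mu = mean(values)       # step 1: channel mean
    sigma = pstdev(values)  # step 1: channel (population) standard deviation
    return [
        # step 2: standardize; step 3: rescale and shift
        [scale * (value - mu) / (sigma + eps) + shift for value in row]
        for row in band_matrix
    ]

# Toy band matrix with two cells and two loci pairs (hypothetical data).
band = [[2, 5],
        [1, 2]]
normalized = normalize_band(band)
```

With the default `scale=1.0` and `shift=0.0` the output has approximately zero mean and unit variance within the band; other parameter values move the result away from that standardized form, which is what the final step is intended to allow.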

Step 2.2, the electronic device trains the semi-supervised generation model based on the normalized band matrix to obtain a cell type prediction model.

For example, referring to FIG. 5, FIG. 5 is a schematic diagram of the sequencing depth bias and the genome contact attenuation effect provided in the embodiments of the present application. FIG. 5 takes the F-measure as the evaluation criterion for the performance of the semi-supervised generation model. Sequencing depth bias typically arises because different regions or different genomes are sequenced to inconsistent depths during the sequencing process. Such bias may lead to inaccurate sequencing results for certain regions of the genome, thereby affecting subsequent analysis and experimental results. The genome contact attenuation effect means that the probability of contact between two genomic loci gradually decreases as the distance between them increases. This is because the farther apart the two loci are on the DNA strand, the higher the energy required for them to interact, and therefore the lower the probability of contact.

In some possible embodiments, after training to obtain the cell type prediction model, the electronic device may further obtain scHi-C data to be predicted and input these data into the cell type prediction model to obtain the predicted cell type.

In some possible embodiments, FIG. 6 is a schematic flow chart of a single-cell three-dimensional genome data analysis method based on semi-supervised learning according to an embodiment of the present application. As shown in FIG. 6, after S102 described above, the method may further include S201 to S202.

S201, the electronic device acquires an external evaluation index and an internal evaluation index of the cell type prediction model.

The external evaluation index comprises one or more of the following: adjusted Rand index, normalized mutual information, F-measure, and purity. The internal evaluation index includes one or more of the following: silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index. The procedures for obtaining the adjusted Rand index, normalized mutual information, F-measure, purity, silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index are described in the related art and are not detailed here.
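As an illustration of two of the external evaluation indexes, the following sketch computes purity and the adjusted Rand index from their standard definitions. It is a minimal example, not the procedure of the present application; the function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def purity(true_labels, pred_labels):
    """Fraction of cells that belong to the majority true class of their cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(true_labels)


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index computed from pair counts of the contingency table."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # No pairs to compare; treated as perfect agreement.
    pairs = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(v, 2) for v in pairs.values())
    sum_true = sum(comb(v, 2) for v in Counter(true_labels).values())
    sum_pred = sum(comb(v, 2) for v in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. a single cluster on both sides.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


true_types = ["A", "A", "A", "B", "B", "B"]
predicted = [0, 0, 1, 1, 1, 1]
p = purity(true_types, predicted)
ari = adjusted_rand_index(true_types, predicted)
```

Both indexes compare predicted clusters with known cell type labels, which is why they are external; neither depends on how the clusters are numbered.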

S202, the electronic device determines a clustering performance score of the cell type prediction model according to the external evaluation index and the internal evaluation index.

For example, the electronic device may use the average value of the external evaluation index and the internal evaluation index as the clustering performance score, or the electronic device may perform weighted summation on the external evaluation index and the internal evaluation index by using a preset weight, and use the weighted summation result as the clustering performance score, which is not limited in the embodiment of the present application.
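The two options described above, a plain average and a weighted sum with preset weights, can be sketched as follows. The function name `clustering_score`, the index names, and all numeric values are hypothetical and serve only as an illustration.

```python
def clustering_score(metrics, weights=None):
    """Combine evaluation indexes into a single score.

    metrics: dict mapping an index name to its value.
    weights: optional dict of preset weights with the same keys; when
    omitted, the plain average of the index values is returned.
    """
    if weights is None:
        return sum(metrics.values()) / len(metrics)
    return sum(weights[name] * value for name, value in metrics.items())


metrics = {"ari": 0.8, "nmi": 0.6, "silhouette": 0.4}
average_score = clustering_score(metrics)
weighted_score = clustering_score(
    metrics, weights={"ari": 0.5, "nmi": 0.3, "silhouette": 0.2}
)
```

Note that the indexes have different ranges and directions (for example, a lower Davies-Bouldin index indicates better clustering). The text does not state how they are reconciled, so any rescaling before combination is left open here.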

In other possible embodiments, FIG. 7 is a schematic flow chart of a single-cell three-dimensional genome data analysis method based on semi-supervised learning according to an embodiment of the present application. As shown in FIG. 7, after S102 described above, the method may further include S301 to S302.

S301, the electronic device acquires a batch mixing evaluation index for the case where the semi-supervised generation model is trained on different batches.

The batch mixing evaluation index comprises one or more of the following: local inverse Simpson index, average silhouette coefficient, and cell-specific value. The specific procedures for obtaining the local inverse Simpson index, the average silhouette coefficient, and the cell-specific value are described in the related art and are not detailed here.
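To give an intuition for the local inverse Simpson index, the sketch below computes an inverse Simpson index over the batch labels of each cell's nearest neighbors. It is a deliberately simplified illustration: it uses plain, equally weighted k nearest neighbors and omits the distance-based neighbor weighting used by LISI implementations. The function names, the coordinates, and the batch labels are hypothetical.

```python
from collections import Counter
import math


def inverse_simpson(labels):
    """Inverse Simpson index: 1 / sum of squared label proportions."""
    n = len(labels)
    return 1.0 / sum((count / n) ** 2 for count in Counter(labels).values())


def neighborhood_inverse_simpson(points, batches, k):
    """Per-cell inverse Simpson index over the batches of its k nearest neighbors.

    points: list of coordinate tuples, e.g. a low-dimensional embedding.
    batches: batch label of each point.
    Simplification: equally weighted k nearest neighbors (Euclidean distance).
    """
    scores = []
    for i, p in enumerate(points):
        # Sort all other cells by their distance to cell i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        scores.append(inverse_simpson([batches[j] for j in others[:k]]))
    return scores


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
batches = ["b1", "b2", "b1", "b2"]
scores = neighborhood_inverse_simpson(points, batches, k=2)
```

A score near the number of batches means the neighborhood mixes batches evenly; a score of 1 means all neighbors come from a single batch.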

S302, the electronic device determines a batch effect score of the cell type prediction model according to the batch mixing evaluation index.

For S302, reference may be made to the description of S202 above; details are not repeated here.

The foregoing has mainly described the solution provided in the embodiments of the present application from the perspective of the method. To achieve the above functions, the electronic device includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and algorithm steps of the various examples described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is implemented as hardware or as computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Those skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.

In an exemplary embodiment, the embodiment of the application also provides a single-cell three-dimensional genome data analysis device based on semi-supervised learning. Fig. 8 is a schematic diagram of a single-cell three-dimensional genome data analysis device based on semi-supervised learning according to an embodiment of the present application. As shown in fig. 8, the apparatus may include: an acquisition module 801 and a processing module 802.

An acquisition module 801, configured to acquire a first scHi-C data set and a second scHi-C data set; the first scHi-C data set includes a plurality of first scHi-C data; the second scHi-C data set includes a plurality of second scHi-C data; the first scHi-C data are scHi-C data labeled with a cell type label; the second scHi-C data are scHi-C data not labeled with a cell type label.

A processing module 802, configured to train a preset semi-supervised generation model based on the first scHi-C data set and the second scHi-C data set, to obtain a cell type prediction model; the cell type prediction model is used to predict cell types from the scHi-C data of cells.

In some possible embodiments, the processing module 802 is specifically configured to: generate a first contact matrix from the first scHi-C data; generate a second contact matrix from the second scHi-C data; delete, from the first contact matrix and the second contact matrix, the contact matrices whose number of non-zero elements is smaller than a preset threshold, the preset threshold being in a preset proportion to the length of the chromosome of the cell; and train the semi-supervised generation model based on the first contact matrix and the second contact matrix to obtain a cell type prediction model.
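The filtering step can be sketched as below. This is an illustrative example under stated assumptions: the chromosome length is expressed in bins (the text does not fix the unit), and the names `count_nonzero`, `filter_contact_matrices`, and `ratio` are hypothetical.

```python
def count_nonzero(matrix):
    """Number of non-zero entries in a contact matrix given as a list of lists."""
    return sum(1 for row in matrix for value in row if value != 0)


def filter_contact_matrices(matrices, chromosome_length, ratio):
    """Drop matrices whose non-zero count is smaller than the threshold.

    chromosome_length: chromosome length, here in bins (an assumption).
    ratio: the preset proportion between threshold and chromosome length.
    """
    threshold = ratio * chromosome_length
    return [m for m in matrices if count_nonzero(m) >= threshold]


dense = [[1, 2], [2, 1]]   # 4 non-zero entries
sparse = [[0, 0], [0, 1]]  # 1 non-zero entry
kept = filter_contact_matrices([dense, sparse], chromosome_length=2, ratio=1.0)
```

Here the threshold is 2, so only `dense` is kept. Tying the threshold to chromosome length keeps the criterion comparable across chromosomes of different sizes.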

In other possible embodiments, the processing module 802 is specifically configured to perform band conversion on each of the first contact matrix and the second contact matrix to obtain a band matrix; and to train the semi-supervised generation model based on the band matrix to obtain a cell type prediction model.
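One possible reading of this conversion is sketched below, assuming that a band is a diagonal of the contact matrix, that is, all bin pairs at the same genomic separation. This interpretation, the function name `to_band_matrix`, and the `max_band` parameter are assumptions made for illustration and are not confirmed by the text.

```python
def to_band_matrix(matrix, max_band=None):
    """Convert a square contact matrix into a list of bands (diagonals).

    Band d holds the contacts between all bin pairs (i, i + d).
    max_band: optional largest separation to keep (assumed parameter).
    """
    n = len(matrix)
    last = n - 1 if max_band is None else min(max_band, n - 1)
    return [[matrix[i][i + d] for i in range(n - d)] for d in range(last + 1)]


contact = [
    [5, 2, 1],
    [2, 6, 3],
    [1, 3, 7],
]
bands = to_band_matrix(contact)               # [[5, 6, 7], [2, 3], [1]]
near = to_band_matrix(contact, max_band=1)    # [[5, 6, 7], [2, 3]]
```

Grouping contacts by separation lets each band be normalized on its own, so that the decay of contact frequency with distance does not dominate the signal.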

In still other possible embodiments, the processing module 802 is specifically configured to perform BandNorm normalization on the band matrix to obtain a normalized band matrix; and to train the semi-supervised generation model based on the normalized band matrix to obtain a cell type prediction model.

In still other possible embodiments, the semi-supervised generation model satisfies the relationship of the following formulas:

wherein c_t represents the cell type label; u_t represents the within-cell-type features of the cell, which are assumed to follow a normal distribution; z_t represents a low-dimensional random variable; f_z^μ and f_z^σ represent neural networks with learnable parameters; l_t represents the encoded cell-specific scaling factor; and x_tg represents the contact matrix generated from a count-based likelihood distribution.

In still other possible embodiments, the acquisition module 801 is further configured to acquire an external evaluation index and an internal evaluation index of the cell type prediction model; the external evaluation index includes one or more of the following: adjusted Rand index, normalized mutual information, F-measure, and purity; the internal evaluation index includes one or more of the following: silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index; the processing module 802 is further configured to determine a clustering performance score of the cell type prediction model according to the external evaluation index and the internal evaluation index.

In still other possible embodiments, the acquisition module 801 is further configured to acquire a batch mixing evaluation index for the case where the semi-supervised generation model is trained on different batches; the batch mixing evaluation index includes one or more of the following: local inverse Simpson index, average silhouette coefficient, and cell-specific value; the processing module 802 is further configured to determine a batch effect score of the cell type prediction model according to the batch mixing evaluation index.

It should be noted that the division of the modules in fig. 8 is schematic, and is merely a logic function division, and other division manners may be implemented in practice. For example, two or more functions may also be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional units.

In an exemplary embodiment, a readable storage medium is also provided, comprising software instructions that, when run on an electronic device, cause the electronic device to perform any of the methods provided by the above embodiments.

In an exemplary embodiment, the present application also provides a computer program product comprising computer-executable instructions, which, when run on an electronic device, cause the electronic device to perform any of the methods provided by the above embodiments.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer-executable instructions. When the computer-executable instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer-executable instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).

Although the present application has been described herein in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a review of the figures, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Although the present application has been described in connection with specific features and embodiments thereof, it will be apparent that various modifications and combinations can be made without departing from the spirit and scope of the application. Accordingly, the specification and drawings are merely exemplary illustrations of the present application as defined in the appended claims and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the present application. It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for analyzing single-cell three-dimensional genome data based on semi-supervised learning, the method comprising:

acquiring a first scHi-C data set and a second scHi-C data set; the first scHi-C data set comprises a plurality of first scHi-C data; the second scHi-C data set comprises a plurality of second scHi-C data; the first scHi-C data are scHi-C data labeled with a cell type label; the second scHi-C data are scHi-C data not labeled with a cell type label;

training a preset semi-supervised generation model based on the first scHi-C data set and the second scHi-C data set to obtain a cell type prediction model; the cell type prediction model is used to predict cell types based on the scHi-C data of cells.

2. The method of claim 1, wherein training a preset semi-supervised generation model based on the first scHi-C data set and the second scHi-C data set to obtain a cell type prediction model comprises:

generating a first contact matrix according to the first scHi-C data;

generating a second contact matrix according to the second scHi-C data;

deleting, from the first contact matrix and the second contact matrix, the contact matrices whose number of non-zero elements is smaller than a preset threshold; the preset threshold is in a preset proportion to the length of the chromosome of the cell;

and training the semi-supervised generation model based on the first contact matrix and the second contact matrix to obtain the cell type prediction model.

3. The method of claim 2, wherein training the semi-supervised generation model based on the first contact matrix and the second contact matrix to obtain the cell type prediction model comprises:

performing band conversion on the first contact matrix and the second contact matrix respectively to obtain a band matrix;

and training the semi-supervised generation model based on the band matrix to obtain the cell type prediction model.

4. The method of claim 3, wherein training the semi-supervised generation model based on the band matrix to obtain the cell type prediction model comprises:

performing BandNorm normalization on the band matrix to obtain a normalized band matrix;

and training the semi-supervised generation model based on the normalized band matrix to obtain the cell type prediction model.

5. The method of any one of claims 1-4, wherein the semi-supervised generation model satisfies the relationship of the following formulas:

wherein c_t represents the cell type label; u_t represents the within-cell-type features of the cell, which are assumed to follow a normal distribution; z_t represents a low-dimensional random variable; f_z^μ and f_z^σ represent neural networks with learnable parameters; l_t represents the encoded cell-specific scaling factor; and x_tg represents the contact matrix generated from a count-based likelihood distribution.

6. The method of claim 5, wherein the method further comprises:

obtaining an external evaluation index and an internal evaluation index of the cell type prediction model; the external evaluation index includes one or more of the following: adjusted Rand index, normalized mutual information, F-measure, and purity; the internal evaluation index includes one or more of the following: silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index;

and determining a clustering performance score of the cell type prediction model according to the external evaluation index and the internal evaluation index.

7. The method of claim 5, wherein the method further comprises:

acquiring a batch mixing evaluation index for the case where the semi-supervised generation model is trained on different batches; the batch mixing evaluation index includes one or more of the following: local inverse Simpson index, average silhouette coefficient, and cell-specific value;

and determining the batch effect score of the cell type prediction model according to the batch mixing evaluation index.

8. A single-cell three-dimensional genome data analysis device based on semi-supervised learning, the device comprising: an acquisition module and a processing module;

the acquisition module is configured to acquire a first scHi-C data set and a second scHi-C data set; the first scHi-C data set comprises a plurality of first scHi-C data; the second scHi-C data set comprises a plurality of second scHi-C data; the first scHi-C data are scHi-C data labeled with a cell type label; the second scHi-C data are scHi-C data not labeled with a cell type label;

the processing module is configured to train a preset semi-supervised generation model based on the first scHi-C data set and the second scHi-C data set to obtain a cell type prediction model; the cell type prediction model is used to predict cell types based on the scHi-C data of cells.

9. An electronic device, the electronic device comprising: a processor and a memory;

the memory stores instructions executable by the processor;

the processor is configured to, when executing the instructions, cause the electronic device to implement the method of any one of claims 1-7.

10. A readable storage medium, the readable storage medium comprising: software instructions;

when the software instructions are run in an electronic device, the electronic device is caused to implement the method of any one of claims 1-7.