CN117437976B - Disease risk screening method and system based on gene detection

Info

Publication number: CN117437976B
Application number: CN202311767529.1A
Authority: CN
Inventors: 杨骁�
Original assignee: Shenzhen Body Code Gene Technology Co ltd
Current assignee: Shenzhen Body Code Gene Technology Co ltd
Priority date: 2023-12-21
Filing date: 2023-12-21
Publication date: 2024-04-02
Anticipated expiration: 2043-12-21
Also published as: CN117437976A

Abstract

The invention relates to the field of data processing, and discloses a disease risk screening method and system based on gene detection, which are used for improving the disease risk screening accuracy of gene detection. The method comprises the following steps: acquiring a plurality of first genome sequence data of a plurality of sample users and second genome sequence data of a target user; carrying out gene mutation recognition and primer amplification target sequence design to obtain a target mutation information set; performing feature cluster analysis to obtain a plurality of variation pattern features and performing feature modeling to obtain a plurality of first variation feature models; performing variation characteristic model matching to obtain a corresponding second variation characteristic model; performing sequence circulation traversal on the second genome sequence data based on the second variation characteristic model to obtain a target circulation traversal result; and verifying the target circulation traversing result to obtain a target verification result, and optimizing the feature model of the second variation feature model according to the target verification result to obtain a target variation feature model.

Description

Disease risk screening method and system based on gene detection

Technical Field

The invention relates to the field of data processing, in particular to a disease risk screening method and system based on gene detection.

Background

In recent years, machine learning and artificial intelligence techniques have been widely used in the field of bioinformatics, particularly in genomic data analysis. With these advanced computing techniques, large-scale genomic data can be efficiently processed and analyzed to identify complex genetic patterns associated with disease.

Existing schemes are often limited by problems of slow processing speed, low accuracy, or inability to handle complex genetic patterns. Therefore, how to accurately extract risk factors associated with a particular disease from a large amount of gene sequence data remains a challenge.

Disclosure of Invention

The invention provides a disease risk screening method and system based on gene detection, which are used for improving the disease risk screening accuracy of gene detection.

The first aspect of the present invention provides a disease risk screening method based on gene detection, comprising: acquiring a plurality of first genome sequence data of a plurality of sample users, and simultaneously acquiring second genome sequence data of a target user; inputting the plurality of first genome sequence data into a preset long-short-time memory network model for gene mutation identification to obtain a gene mutation information set of each first genome sequence data, and amplifying target sequences of the plurality of sample users by designing primers according to the gene mutation information set through biology to obtain a target mutation information set of each first genome sequence data; performing feature cluster analysis on the target mutation information set of each first genome sequence data to obtain a plurality of mutation mode features, and performing feature modeling on the mutation mode features to obtain a plurality of first mutation feature models; performing mutation feature model matching on the second genome sequence data according to the plurality of first mutation feature models to obtain corresponding second mutation feature models; performing sequence circulation traversal on the second genome sequence data based on the second variation characteristic model to obtain a target circulation traversal result; and verifying the target circulation traversing result to obtain a target verification result, and optimizing the feature model of the second variation feature model according to the target verification result to obtain a target variation feature model.

With reference to the first aspect, in a first implementation manner of the first aspect of the present invention, the acquiring a plurality of first genomic sequence data of a plurality of sample users, and simultaneously acquiring second genomic sequence data of a target user includes: acquiring a plurality of first nucleotide sequence data of a plurality of sample users, and acquiring second nucleotide sequence data of a target user; creating a set of nucleotide codes, the set of nucleotide codes comprising: adenine A = [1, 0, 0, 0], cytosine C = [0, 1, 0, 0], guanine G = [0, 0, 1, 0] and thymine T = [0, 0, 0, 1]; respectively carrying out sequence data slicing on the plurality of first genome sequence data and the second genome sequence data to obtain N first slice sequence data of each first genome sequence data and N second slice sequence data of the second genome sequence data; and based on the nucleotide coding set, carrying out sequence coding and coding fusion on the N first slice sequence data to obtain a plurality of corresponding first genome sequence data, and carrying out sequence coding and coding fusion on the N second slice sequence data to obtain a corresponding second genome sequence data.

With reference to the first aspect, in a second implementation manner of the first aspect of the present invention, the inputting the plurality of first genomic sequence data into a preset long-short-time memory network model to perform genetic variation recognition to obtain a genetic variation information set of each first genomic sequence data, and amplifying target sequences of the plurality of sample users by using a primer according to the genetic variation information set through biology to obtain a target variation information set of each first genomic sequence data includes: inputting the plurality of first genome sequence data into a preset long-short time memory network model respectively, wherein the long-short time memory network model comprises a first long-short time memory layer, a second long-short time memory layer, a full connection layer and an output layer; extracting low-level features of each first genome sequence data through the first long-short time memory layer to obtain low-level sequence features of each first genome sequence data; performing high-dimensional feature conversion and feature combination on the low-level sequence features through the second long-short time memory layer to obtain target output features of each first genome sequence data; performing prediction output conversion on the target output characteristics through the full-connection layer to obtain the prediction output characteristics of each first genome sequence data; carrying out characteristic gene variation classification on the predicted output characteristics through a softmax function in the output layer to obtain a gene variation classification result of each first genome sequence data; and carrying out mutation information integration on the gene mutation classification result to obtain a gene mutation information set of each first genome sequence data, and amplifying target sequences of the plurality of sample users by designing primers according to the gene mutation information set by biology to obtain a target mutation information set of each first genome sequence data.

With reference to the first aspect, in a third implementation manner of the first aspect of the present invention, performing feature cluster analysis on the target mutation information set of each first genomic sequence data to obtain a plurality of mutation mode features, and performing feature modeling on the plurality of mutation mode features to obtain a plurality of first mutation feature models, where the feature cluster analysis includes: information extraction is carried out on the target variation information set of each first genome sequence data to obtain variation type information, variation position information and variation frequency information of each first genome sequence data; performing mutation feature association relation identification on the mutation type information, the mutation position information and the mutation frequency information to obtain a mutation feature association relation; determining a plurality of corresponding initial clustering centers according to the variation characteristic association relation, and inputting the gene variation information set into a preset characteristic clustering analysis model; carrying out feature clustering on the gene variation information set according to the plurality of initial clustering centers by the feature cluster analysis model to obtain a first feature clustering result of each initial clustering center; according to the first characteristic clustering result, carrying out clustering center correction on the plurality of initial clustering centers to obtain a plurality of target clustering centers; performing cluster analysis on the gene mutation information set according to the target cluster centers to obtain a plurality of second feature cluster results, and performing mutation mode analysis on the second feature cluster results to obtain a plurality of mutation mode features; and adopting a random forest algorithm to respectively perform feature standardization modeling on each variation mode feature to obtain a plurality of first variation feature models.
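
The clustering steps above (initial centers, a first clustering pass, center correction, then a final clustering pass) can be illustrated with a short sketch. It assumes a k-means style update in which each corrected center is the mean of its assigned members; the text does not name the clustering algorithm, so this is one plausible reading rather than the claimed method. The function names and the three-element feature layout (variation type code, normalized position, frequency) are hypothetical, and the random forest modeling step is not covered.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def assign_to_centers(points: Sequence[Point], centers: Sequence[Point]) -> List[int]:
    """Label each point with the index of its nearest center (Euclidean distance)."""
    return [
        min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))
        for point in points
    ]


def correct_centers(points: Sequence[Point], labels: Sequence[int],
                    centers: Sequence[Point]) -> List[Point]:
    """Move each center to the mean of its members (assumed correction rule)."""
    corrected: List[Point] = []
    for k, center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            corrected.append(center)  # an empty cluster keeps its center
            continue
        corrected.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return corrected


def cluster_variants(points: Sequence[Point], initial_centers: Sequence[Point],
                     max_rounds: int = 20) -> Tuple[List[Point], List[int]]:
    """Alternate assignment and center correction until the centers stop moving."""
    centers = list(initial_centers)
    for _ in range(max_rounds):
        labels = assign_to_centers(points, centers)
        new_centers = correct_centers(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign_to_centers(points, centers)


# Hypothetical features: (variation type code, normalized position, frequency).
points = [(0.0, 0.10, 0.05), (0.0, 0.12, 0.07), (1.0, 0.80, 0.40), (1.0, 0.85, 0.45)]
initial_centers = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
centers, labels = cluster_variants(points, initial_centers)
```

In this toy input the first two variants fall into one cluster and the last two into the other; each resulting cluster would then be passed on to the variation mode analysis.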

With reference to the first aspect, in a fourth implementation manner of the first aspect of the present invention, performing mutation feature model matching on the second genomic sequence data according to the plurality of first mutation feature models to obtain a corresponding second mutation feature model, where the method includes: performing characteristic preliminary detection on the second genome sequence data to obtain first characteristic identification information; respectively creating second characteristic identification information of each first variation characteristic model according to the variation mode characteristics; vector conversion is carried out on the first characteristic identification information to obtain a first characteristic identification vector, and vector conversion is carried out on the second characteristic identification information to obtain a second characteristic identification vector; performing Euclidean distance calculation on the first feature identification vector and the second feature identification vector to obtain a target Euclidean distance of each first variation feature model; and performing model optimization selection on the plurality of first variation feature models according to the target Euclidean distance to obtain corresponding second variation feature models.
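
The distance-based selection above can be sketched as follows. The sketch assumes that both kinds of feature identification information have already been converted to numeric vectors of equal length and that the model with the smallest Euclidean distance is selected; the function name, model names, and vector values are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def select_closest_model(
    target_vector: Sequence[float],
    model_vectors: Dict[str, Sequence[float]],
) -> Tuple[str, Dict[str, float]]:
    """Return the name of the nearest model and the distance to every model."""
    if not model_vectors:
        raise ValueError("at least one candidate model is required")
    distances = {
        name: math.dist(target_vector, vector)
        for name, vector in model_vectors.items()
    }
    best = min(distances, key=distances.get)
    return best, distances


# Hypothetical second feature identification vectors, one per first model.
model_vectors = {
    "model_a": [0.0, 0.11, 0.06],
    "model_b": [1.0, 0.825, 0.425],
}
# Hypothetical first feature identification vector of the target user.
target_vector = [0.9, 0.8, 0.5]

best_model, distances = select_closest_model(target_vector, model_vectors)
```

Here `best_model` is `"model_b"`, since its vector lies much closer to the target vector than that of `"model_a"`.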

With reference to the first aspect, in a fifth implementation manner of the first aspect of the present invention, the performing, based on the second variation feature model, a sequence loop traversal on the second genome sequence data to obtain a target loop traversal result includes: performing movement detection on the second genome sequence data according to a preset direction and a preset first coding length through the second variation characteristic model to obtain a movement detection result; performing next-round code length analysis through the movement detection result to obtain a second code length, and performing next-round movement detection according to the second code length through the second variation characteristic model until the second genome sequence data is traversed to obtain a plurality of movement detection results; judging whether the plurality of movement detection results meet a preset exit condition or not; if the first variation characteristic model is not met, performing sequence circulation traversal on the first genome sequence data through the first variation characteristic model to generate a target circulation traversal result; and if yes, carrying out result comprehensive analysis on the plurality of movement detection results to obtain a target circulation traversing result and outputting the target circulation traversing result.
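
One way to picture the traversal above is a window that moves in one direction over the sequence, with each round's detection result deciding the window length of the next round. The sketch below is only an illustration under strong assumptions: the detector is a plain callback, the length rule (halve after a hit, double after a miss) is invented for the example, and the exit condition is simply "at least one hit". None of these specifics come from the text, and all names are hypothetical.

```python
from typing import Callable, List, Tuple

# (start index, window length, pattern found in this window)
Detection = Tuple[int, int, bool]


def traverse_sequence(sequence: str, first_length: int,
                      detect: Callable[[str], bool],
                      min_length: int = 2, max_length: int = 8) -> List[Detection]:
    """Scan left to right, adapting the window length after every round."""
    if first_length < 1 or min_length < 1 or max_length < min_length:
        raise ValueError("window lengths must be positive and consistent")
    results: List[Detection] = []
    position, length = 0, first_length
    while position < len(sequence):
        window = sequence[position:position + length]
        found = detect(window)
        results.append((position, len(window), found))
        position += len(window)
        # Assumed rule: narrow the next window after a hit, widen it after a miss.
        if found:
            length = max(min_length, length // 2)
        else:
            length = min(max_length, length * 2)
    return results


def exit_condition_met(results: List[Detection]) -> bool:
    """Assumed exit condition: at least one window produced a hit."""
    return any(found for _, _, found in results)


def summarize(results: List[Detection]) -> List[Tuple[int, int]]:
    """Aggregate the rounds into (start, length) pairs of the hit windows."""
    return [(start, size) for start, size, found in results if found]


# Toy detector standing in for the second variation feature model.
results = traverse_sequence("AAGCTTGACCGT", 4, lambda window: "GAC" in window)
hits = summarize(results) if exit_condition_met(results) else []
```

Because the windows do not overlap, a pattern that straddles a window boundary is missed by this simple version; an overlapping stride would be needed to catch it.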

With reference to the first aspect, in a sixth implementation manner of the first aspect of the present invention, the verifying the target cycle traversal result to obtain a target verification result, and performing feature model optimization on the second variant feature model according to the target verification result to obtain a target variant feature model includes: defining a plurality of check rules, and performing set conversion on the check rules to obtain a check rule set; checking the target circulation traversing result according to the checking rule set to obtain an initial checking result corresponding to each checking rule; carrying out result aggregation on the initial verification result corresponding to each verification rule to obtain a target verification result; determining the model parameter ranges of the second variation characteristic model according to the target verification result to obtain a plurality of model parameter ranges; generating a random initial value of the second variation characteristic model through the plurality of model parameter ranges to obtain a corresponding random initial value set, and constructing a particle population of the random initial value set through a preset inverse particle propagation algorithm to obtain the particle population; performing particle fitness calculation on the particle population to obtain a particle fitness set corresponding to the particle population, and performing iterative calculation on the particle fitness set until a preset condition is met, so as to generate an optimal solution corresponding to the particle population; and carrying out feature model optimization on the second variation feature model through the optimal solution to obtain a target variation feature model.
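
The optimization part of the steps above (random initial values inside the parameter ranges, a particle population, fitness calculation, and iteration to an optimal solution) can be sketched with a standard particle swarm optimizer. This is a swapped-in textbook algorithm, not necessarily the "inverse particle propagation algorithm" named in the text. The inertia and acceleration constants, the two parameter names, and the quadratic fitness function are all assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Bounds = Sequence[Tuple[float, float]]


def particle_swarm_optimize(fitness: Callable[[List[float]], float], bounds: Bounds,
                            particles: int = 20, iterations: int = 60,
                            seed: int = 0) -> Tuple[List[float], float]:
    """Minimize `fitness` inside `bounds` with a basic particle swarm."""
    rng = random.Random(seed)
    # Random initial values drawn from the model parameter ranges.
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(particles)]
    velocities = [[0.0] * len(bounds) for _ in range(particles)]
    personal_best = [p[:] for p in positions]
    personal_score = [fitness(p) for p in positions]
    lead = min(range(particles), key=personal_score.__getitem__)
    global_best, global_score = personal_best[lead][:], personal_score[lead]

    for _ in range(iterations):
        for k in range(particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                velocities[k][d] = (
                    0.7 * velocities[k][d]
                    + 1.5 * r1 * (personal_best[k][d] - positions[k][d])
                    + 1.5 * r2 * (global_best[d] - positions[k][d])
                )
                # Keep every particle inside its parameter range.
                positions[k][d] = min(hi, max(lo, positions[k][d] + velocities[k][d]))
            score = fitness(positions[k])
            if score < personal_score[k]:
                personal_best[k], personal_score[k] = positions[k][:], score
                if score < global_score:
                    global_best, global_score = positions[k][:], score
    return global_best, global_score


def toy_fitness(params: List[float]) -> float:
    """Hypothetical fitness: squared distance from an assumed ideal setting."""
    threshold, weight = params
    return (threshold - 0.3) ** 2 + (weight - 0.6) ** 2


# Hypothetical ranges for two model parameters, derived from a verification result.
parameter_ranges = [(0.0, 1.0), (0.0, 1.0)]
best_params, best_score = particle_swarm_optimize(toy_fitness, parameter_ranges)
```

In a real system the fitness function would score the second variation feature model against the aggregated verification result rather than against a fixed target point.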

In a second aspect, the present invention provides a disease risk screening system based on gene detection, the disease risk screening system based on gene detection comprising: the acquisition module is used for acquiring a plurality of first genome sequence data of a plurality of sample users and simultaneously acquiring second genome sequence data of a target user; the identification module is used for respectively inputting the plurality of first genome sequence data into a preset long-short-time memory network model for carrying out genetic variation identification to obtain a genetic variation information set of each first genome sequence data, and amplifying target sequences of the plurality of sample users through a primer designed according to the genetic variation information set by biology to obtain a target variation information set of each first genome sequence data; the modeling module is used for carrying out feature cluster analysis on the target mutation information set of each first genome sequence data to obtain a plurality of mutation mode features, and carrying out feature modeling on the mutation mode features to obtain a plurality of first mutation feature models; the matching module is used for carrying out mutation feature model matching on the second genome sequence data according to the plurality of first mutation feature models to obtain a corresponding second mutation feature model; the traversing module is used for performing sequence circulation traversing on the second genome sequence data based on the second variation characteristic model to obtain a target circulation traversing result; and the optimization module is used for verifying the target circulation traversing result to obtain a target verification result, and optimizing the feature model of the second variation feature model according to the target verification result to obtain a target variation feature model.

With reference to the second aspect, in a first implementation manner of the second aspect of the present invention, the acquiring module is specifically configured to: acquiring a plurality of first nucleotide sequence data of a plurality of sample users, and acquiring second nucleotide sequence data of a target user; creating a set of nucleotide codes, the set of nucleotide codes comprising: adenine A = [1, 0, 0, 0], cytosine C = [0, 1, 0, 0], guanine G = [0, 0, 1, 0] and thymine T = [0, 0, 0, 1]; respectively carrying out sequence data slicing on the plurality of first genome sequence data and the second genome sequence data to obtain N first slice sequence data of each first genome sequence data and N second slice sequence data of the second genome sequence data; and based on the nucleotide coding set, carrying out sequence coding and coding fusion on the N first slice sequence data to obtain a plurality of corresponding first genome sequence data, and carrying out sequence coding and coding fusion on the N second slice sequence data to obtain a corresponding second genome sequence data.

With reference to the second aspect, in a second implementation manner of the second aspect of the present invention, the identification module is specifically configured to: inputting the plurality of first genome sequence data into a preset long-short time memory network model respectively, wherein the long-short time memory network model comprises a first long-short time memory layer, a second long-short time memory layer, a full connection layer and an output layer; extracting low-level features of each first genome sequence data through the first long-short time memory layer to obtain low-level sequence features of each first genome sequence data; performing high-dimensional feature conversion and feature combination on the low-level sequence features through the second long-short time memory layer to obtain target output features of each first genome sequence data; performing prediction output conversion on the target output characteristics through the full-connection layer to obtain the prediction output characteristics of each first genome sequence data; carrying out characteristic gene variation classification on the predicted output characteristics through a softmax function in the output layer to obtain a gene variation classification result of each first genome sequence data; and carrying out mutation information integration on the gene mutation classification result to obtain a gene mutation information set of each first genome sequence data, and amplifying target sequences of the plurality of sample users by designing primers according to the gene mutation information set by biology to obtain a target mutation information set of each first genome sequence data.

With reference to the second aspect, in a third implementation manner of the second aspect of the present invention, the modeling module is specifically configured to: information extraction is carried out on the target variation information set of each first genome sequence data to obtain variation type information, variation position information and variation frequency information of each first genome sequence data; performing mutation feature association relation identification on the mutation type information, the mutation position information and the mutation frequency information to obtain a mutation feature association relation; determining a plurality of corresponding initial clustering centers according to the variation characteristic association relation, and inputting the gene variation information set into a preset characteristic clustering analysis model; carrying out feature clustering on the gene variation information set according to the plurality of initial clustering centers by the feature cluster analysis model to obtain a first feature clustering result of each initial clustering center; according to the first characteristic clustering result, carrying out clustering center correction on the plurality of initial clustering centers to obtain a plurality of target clustering centers; performing cluster analysis on the gene mutation information set according to the target cluster centers to obtain a plurality of second feature cluster results, and performing mutation mode analysis on the second feature cluster results to obtain a plurality of mutation mode features; and adopting a random forest algorithm to respectively perform feature standardization modeling on each variation mode feature to obtain a plurality of first variation feature models.

With reference to the second aspect, in a fourth implementation manner of the second aspect of the present invention, the matching module is specifically configured to: performing characteristic preliminary detection on the second genome sequence data to obtain first characteristic identification information; respectively creating second characteristic identification information of each first variation characteristic model according to the variation mode characteristics; vector conversion is carried out on the first characteristic identification information to obtain a first characteristic identification vector, and vector conversion is carried out on the second characteristic identification information to obtain a second characteristic identification vector; performing Euclidean distance calculation on the first feature identification vector and the second feature identification vector to obtain a target Euclidean distance of each first variation feature model; and performing model optimization selection on the plurality of first variation feature models according to the target Euclidean distance to obtain corresponding second variation feature models.

With reference to the second aspect, in a fifth implementation manner of the second aspect of the present invention, the traversal module is specifically configured to: performing movement detection on the second genome sequence data according to a preset direction and a preset first coding length through the second variation characteristic model to obtain a movement detection result; performing next-round code length analysis through the movement detection result to obtain a second code length, and performing next-round movement detection according to the second code length through the second variation characteristic model until the second genome sequence data is traversed to obtain a plurality of movement detection results; judging whether the plurality of movement detection results meet a preset exit condition or not; if the first variation characteristic model is not met, performing sequence circulation traversal on the first genome sequence data through the first variation characteristic model to generate a target circulation traversal result; and if yes, carrying out result comprehensive analysis on the plurality of movement detection results to obtain a target circulation traversing result and outputting the target circulation traversing result.

With reference to the second aspect, in a sixth implementation manner of the second aspect of the present invention, the optimization module is specifically configured to: defining a plurality of check rules, and performing set conversion on the check rules to obtain a check rule set; checking the target circulation traversing result according to the checking rule set to obtain an initial checking result corresponding to each checking rule; carrying out result aggregation on the initial verification result corresponding to each verification rule to obtain a target verification result; determining the model parameter ranges of the second variation characteristic model according to the target verification result to obtain a plurality of model parameter ranges; generating a random initial value of the second variation characteristic model through the plurality of model parameter ranges to obtain a corresponding random initial value set, and constructing a particle population of the random initial value set through a preset inverse particle propagation algorithm to obtain the particle population; performing particle fitness calculation on the particle population to obtain a particle fitness set corresponding to the particle population, and performing iterative calculation on the particle fitness set until a preset condition is met, so as to generate an optimal solution corresponding to the particle population; and carrying out feature model optimization on the second variation feature model through the optimal solution to obtain a target variation feature model.

In the technical scheme provided by the invention, a plurality of first genome sequence data of a plurality of sample users and a plurality of second genome sequence data of a target user are obtained; carrying out genetic variation recognition to obtain a genetic variation information set of each first genome sequence data; performing feature cluster analysis to obtain a plurality of variation pattern features and performing feature modeling to obtain a plurality of first variation feature models; performing variation characteristic model matching to obtain a corresponding second variation characteristic model; performing sequence circulation traversal on the second genome sequence data based on the second variation characteristic model to obtain a target circulation traversal result; and verifying the target circulation traversing result to obtain a target verification result, and optimizing the characteristic model of the second variation characteristic model according to the target verification result to obtain a target variation characteristic model. The hierarchical structure of LSTM can capture long-term dependency relationship when processing sequence data, thereby improving accuracy of mutation identification. The large-scale genome data can be efficiently processed by slicing and encoding the genome sequence data, and different mutation patterns can be identified by performing feature cluster analysis on the genetic mutation information. This in-depth analysis helps to understand the complex pattern behind the variation, providing more information for disease risk prediction. By matching the second genomic sequence data of the target user with the established variant feature model, the method can provide a customized disease risk assessment, making the results more personalized and accurate. 
Through the cyclic traversal and verification of the model matching result and the optimization of the variation characteristic model according to the verification result, the accuracy and reliability of the model can be continuously improved, the timeliness and accuracy of the screening result are ensured, and the disease risk screening accuracy of the gene detection is further improved.

Drawings

FIG. 1 is a schematic diagram showing an embodiment of a disease risk screening method based on gene detection in an embodiment of the present invention;

FIG. 2 is a flow chart of the identification of genetic variation in an embodiment of the present invention;

FIG. 3 is a flow chart of feature modeling in an embodiment of the invention;

FIG. 4 is a flowchart of variant feature model matching in an embodiment of the present invention;

FIG. 5 is a schematic diagram of an embodiment of a disease risk screening system based on gene detection in an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a disease risk screening method and system based on gene detection, which are used for improving the disease risk screening accuracy of the gene detection. The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.

For ease of understanding, a specific flow of an embodiment of the present invention is described below with reference to FIG. 1. An embodiment of the disease risk screening method based on gene detection in an embodiment of the present invention includes:

S101, acquiring a plurality of first genome sequence data of a plurality of sample users, and simultaneously acquiring second genome sequence data of a target user;

It will be appreciated that the execution subject of the present invention may be a disease risk screening system based on gene detection, or may be a terminal or a server, which is not limited here. The embodiment of the invention is described by taking a server as the execution subject as an example.

Specifically, first genomic sequence data of a plurality of sample users is obtained. These data are typically obtained by high-throughput sequencing techniques, which read the genetic information of a sample user rapidly and accurately. At the same time, second genomic sequence data of the target user is acquired. These data provide the basis for the subsequent analysis. Next, a set of nucleotide codes is created. Nucleotides are the basic units constituting DNA, and include adenine (A), cytosine (C), guanine (G) and thymine (T). Each nucleotide is assigned a unique code: adenine A is encoded as [1, 0, 0, 0], cytosine C as [0, 1, 0, 0], guanine G as [0, 0, 1, 0] and thymine T as [0, 0, 0, 1]. This coding method simplifies the subsequent data processing flow, so that gene sequences can be processed efficiently by a computer. The genomic sequence data is then sliced. Genomic data is often very large, and processing it directly is both computationally intensive and inefficient. The first genomic sequence data and the second genomic sequence data are therefore each sliced into a plurality of small segments. Each segment contains a portion of the gene sequence; such segmentation preserves sequence integrity while improving processing efficiency. Finally, sequence coding and coding fusion are performed. Each slice sequence is encoded based on the previously created set of nucleotide codes, which converts the nucleotides into the numeric codes defined above. For example, a slice containing "AGCT" is converted to [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]. By fusing these codes, the corresponding genomic sequence data is reconstructed. This method retains all the information of the original sequence while converting it into a form that is easier to compute on and analyze.
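
The slicing, coding and fusion steps described above can be sketched as follows. This is a minimal illustration that assumes a four-element one-hot code per nucleotide and fixed-length slices; the names `NUCLEOTIDE_CODES`, `slice_sequence`, `encode_slice` and `encode_genome`, as well as the slice length, are hypothetical and chosen only for the example.

```python
from typing import Dict, List

# One-hot code per nucleotide (assumed four-element form).
NUCLEOTIDE_CODES: Dict[str, List[int]] = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def slice_sequence(sequence: str, slice_length: int) -> List[str]:
    """Split a sequence into consecutive slices; the last one may be shorter."""
    if slice_length <= 0:
        raise ValueError("slice_length must be positive")
    return [sequence[i:i + slice_length] for i in range(0, len(sequence), slice_length)]


def encode_slice(fragment: str) -> List[List[int]]:
    """Map each nucleotide of a slice to its code.

    Symbols outside A, C, G, T (for example N) raise KeyError in this sketch.
    """
    return [NUCLEOTIDE_CODES[base] for base in fragment.upper()]


def encode_genome(sequence: str, slice_length: int) -> List[List[int]]:
    """Slice the sequence, encode every slice, and fuse the codes in order."""
    fused: List[List[int]] = []
    for fragment in slice_sequence(sequence, slice_length):
        fused.extend(encode_slice(fragment))
    return fused


slices = slice_sequence("AGCTAG", 4)
encoded = encode_genome("AGCTAG", slice_length=4)
```

Because the slices are fused in their original order, the encoded result has one code per nucleotide of the input, so no positional information is lost by slicing.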

S102, respectively inputting a plurality of first genome sequence data into a preset long-short-time memory network model for gene mutation recognition to obtain a gene mutation information set of each first genome sequence data, and amplifying target sequences of a plurality of sample users by using a primer according to the gene mutation information set through biology to obtain a target mutation information set of each first genome sequence data;

specifically, a plurality of first genome sequence data are input into a preset long-short-time memory network model. The LSTM model is a special type of Recurrent Neural Network (RNN) that is suitable for processing and predicting important events in sequence data. Here, the LSTM model is designed to include two long and short time memory layers, one full connection layer, and one output layer. A Long Short-Term Memory network (LSTM) is a special Recurrent Neural Network (RNN) that is particularly suitable for processing and predicting important events with Long intervals and delays in a time series. LSTM solves the gradient disappearance or gradient explosion problems of the traditional RNN in long sequence learning through the unique network structure. LSTM networks generally comprise the following major components: a first long short time memory layer: this layer is responsible for receiving the input sequence and processing the time steps in the sequence. It consists of a plurality of LSTM cells, each cell including a forget gate, an input gate, a cell state, and an output gate. A second long short time memory layer: this layer is typically used for deeper feature extraction. In a multilayer LSTM structure, the second and subsequent layers receive as input the output of the previous layer LSTM. Full tie layer: after the LSTM Layer there is typically one or more fully connected layers (layers) for further learning from features extracted from the LSTM Layer and starting to prepare for final output. Output layer: depending on the requirements of a particular task, the output layer may be a regression layer (e.g., a single neuron for continuous value prediction) or a classification layer (e.g., a softmax layer for multi-class classification). 
For the calculation inside an LSTM cell, taking a standard LSTM cell as an example, the formulas are as follows.
Forget gate: \[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \] where \(f_t\) is the output of the forget gate, \(\sigma\) is the sigmoid activation function, \(W_f\) and \(b_f\) are the weight and bias of the forget gate, \(h_{t-1}\) is the hidden state of the previous time step, and \(x_t\) is the input of the current time step.
Input gate: \[ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \] \[ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \] where \(i_t\) is the output of the input gate, \(W_i\) and \(b_i\) are the weight and bias of the input gate, \(\tilde{C}_t\) is the candidate cell state, and \(W_C\) and \(b_C\) are the weight and bias of the candidate cell state.
Cell state update: \[ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \] where \(C_t\) is the cell state of the current time step, \(C_{t-1}\) is the cell state of the previous time step, and \(\odot\) denotes element-wise multiplication.
Output gate: \[ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \] \[ h_t = o_t \odot \tanh(C_t) \] where \(o_t\) is the output of the output gate, \(W_o\) and \(b_o\) are the weight and bias of the output gate, and \(h_t\) is the hidden state of the current time step.
This structure enables the model to process complex gene sequence data efficiently and to identify the key features in it. Low-level feature extraction is performed on each first genome sequence data through the first long short-term memory layer. At this stage, the model focuses on extracting basic sequence features from the original gene sequence. Since gene sequence data generally contains a large amount of information and is structurally complex, preliminary feature extraction facilitates the subsequent in-depth analysis. The purpose of this step is to convert the raw data into a format that is easier to analyze while retaining sufficient information for the subsequent steps.
High-dimensional feature conversion and feature combination are then performed on the low-level sequence features through the second long short-term memory layer to obtain target output features. The model further analyzes and transforms the extracted features to generate more complex and abstract high-dimensional features. This process helps identify genetic variations that are not readily observed but are closely related to disease risk, and it enables the model to capture deeper and more subtle patterns in the gene sequences. Next, prediction output conversion is performed on the target output features through the fully connected layer. The fully connected layer integrates and converts the features obtained from the previous layers to generate predicted output features, which supports the accuracy and reliability of the model output. The predicted output features are then classified by characteristic gene variation through a softmax function in the output layer. The softmax function is commonly used for multi-class classification; it converts the output of the model into a probability distribution, so that each genome sequence data can be assigned to a specific class of genetic variation. After classification is completed, variation information integration is performed on the gene variation classification results: all classification results output by the model are collected and sorted to obtain a gene mutation information set of each first genome sequence data. Once possible pathogenic variations are identified, the next step is to verify them under laboratory conditions. This typically involves designing specific primers to amplify the gene sequences containing these variations.
Primers are designed according to the variant sequences, and the target sequences of the samples to be tested are amplified. Laboratory-level validation of the variations predicted by the model is performed by combining biological experiments with bioinformatics analysis, thereby obtaining a target mutation information set of each first genome sequence data. This set provides an overview of the genetic variation characteristics of each sample user.
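The gate computations of a standard LSTM cell described above can be sketched as follows. This is a minimal illustration in plain Python, not the embodiment's implementation: the parameter dictionary layout, the helper names `affine` and `lstm_step`, and the all-zero example weights are assumptions made only for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Return W·v + b for a weight matrix W (list of rows) and a bias vector b."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a standard LSTM cell (hypothetical parameter layout)."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]      # forget gate
    i_t = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]      # input gate
    c_hat = [math.tanh(a) for a in affine(params["W_C"], params["b_C"], z)]  # candidate state
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]     # cell state update
    o_t = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]      # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                       # hidden state
    return h_t, c_t

# Example with hidden size 2 and input size 4 (a one-hot nucleotide); weights are all zero
# purely to keep the arithmetic easy to follow.
hidden, inputs = 2, 4
params = {f"W_{g}": [[0.0] * (hidden + inputs) for _ in range(hidden)] for g in "fiCo"}
params.update({f"b_{g}": [0.0] * hidden for g in "fiCo"})
h_t, c_t = lstm_step([1.0, 0.0, 0.0, 0.0], [0.0, 0.0], [1.0, -1.0], params)
```

With zero weights every gate outputs 0.5 and the candidate state is 0, so the new cell state is simply half of the previous one.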

S103, performing feature cluster analysis on a target mutation information set of each first genome sequence data to obtain a plurality of mutation mode features, and performing feature modeling on the mutation mode features to obtain a plurality of first mutation feature models;

It should be noted that key genetic variation information is extracted from each first genome sequence data, including variation type information, variation position information, and variation frequency information. The variation type information reveals the nature of the genetic variation, such as whether it is a single-nucleotide variation or a more complex structural variation; the variation position information indicates the specific location in the genome at which the variation occurs; and the variation frequency information reflects how often the variation occurs in the sample population. Accurate extraction of this information lays the foundation for subsequent analysis. Next, the association relationship between the variation features is identified. By analyzing the relationship among variation type, position, and frequency, existing variation patterns can be revealed, for example that a particular type of variation often occurs in certain gene regions, or that certain variations occur with abnormally high frequency in a particular population. Identifying these associations improves the server's understanding of genetic risk and facilitates the subsequent feature cluster analysis. Then, initial cluster centers are determined according to the identified variation feature association relationship, and the gene mutation information set is input into a preset feature cluster analysis model. Cluster analysis groups similar variation features by means of an algorithm, with the initial cluster centers serving as the starting points for the grouping. In this way, variations with similar characteristics are grouped together, so that potential variation patterns can be identified more clearly. Feature clustering is then carried out by the feature cluster analysis model.
The gene mutation information set is grouped using the initial cluster centers to obtain a first feature clustering result for each initial cluster center. This process may involve iterative adjustment and optimization so that the clustering result is as accurate as possible. The initial cluster centers are then corrected according to the first feature clustering result to obtain target cluster centers. This step refines the previous clustering result: by adjusting the cluster centers, the variation features can be grouped more accurately, which reveals the variation patterns more clearly. A second cluster analysis is then performed on the gene mutation information set according to the target cluster centers. This round builds on the previous one and aims to further refine the recognition of variation patterns, yielding more detailed and specific second feature clustering results that reflect the relationships and patterns among different genetic variations more clearly. Variation pattern analysis is performed on the second feature clustering results: the genetic variation features in each cluster are analyzed to understand how they act together and what their potential impact on disease risk is. Finally, feature standardization modeling is performed on each variation pattern feature using a random forest algorithm to obtain a plurality of first variation feature models. Random forest is a machine learning algorithm well suited to large, complex data sets. At this stage, it is used to construct models that identify and classify patterns of genetic variation. In this way, complex information extracted from genomic data is converted into operational models that can be used to predict an individual's risk of genetic disease.
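The two rounds of clustering described above (grouping by initial centers, correcting the centers, then clustering again) can be sketched as follows. This is only an illustration under assumptions: the source does not name the clustering algorithm, so a nearest-center assignment with mean-based center correction (as in k-means) is swapped in, and the two-dimensional feature vectors are made-up stand-ins for encoded variation type, position, and frequency.

```python
import math

def assign(points, centers):
    """Group points by the index of their nearest center (Euclidean distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        idx = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
        clusters[idx].append(p)
    return clusters

def correct_centers(clusters, centers):
    """Move each center to the mean of its cluster; keep it unchanged if the cluster is empty."""
    corrected = []
    for cluster, center in zip(clusters, centers):
        if cluster:
            corrected.append([sum(col) / len(cluster) for col in zip(*cluster)])
        else:
            corrected.append(list(center))
    return corrected

def two_pass_clustering(points, initial_centers):
    first_result = assign(points, initial_centers)                   # first feature clustering result
    target_centers = correct_centers(first_result, initial_centers)  # cluster center correction
    second_result = assign(points, target_centers)                   # second feature clustering result
    return target_centers, second_result

# Hypothetical variation feature vectors and initial centers.
points = [[0, 0], [0, 2], [10, 10], [10, 12]]
target_centers, second_result = two_pass_clustering(points, [[1, 1], [9, 9]])
```

In practice the correction step could be repeated until the centers stop moving; a single correction is shown because the text describes exactly two clustering rounds.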

S104, performing mutation feature model matching on the second genome sequence data according to the plurality of first mutation feature models to obtain corresponding second mutation feature models;

Specifically, preliminary feature detection is performed on the second genome sequence data: key genetic features, such as specific genetic variation types, positions, and frequencies, are extracted from the genome sequence of the target user to obtain first feature identification information. This information provides the underlying data for the subsequent matching process. Corresponding second feature identification information is then created for each first variation feature model according to the determined variation pattern features. This identification information reflects the variation pattern each model focuses on, such as a particular combination of genetic variations or a particular region in which variations occur. Through this step, abstract variation patterns are converted into operable feature identification information, providing a standardized reference for subsequent matching. Vector conversion is performed on the first feature identification information to obtain a first feature identification vector, and the same conversion is performed on the second feature identification information to obtain a second feature identification vector. This step converts the feature identification information into a mathematically processable form. Vectorization is common practice in machine learning and data analysis because it makes comparison and computation between different data straightforward. Euclidean distance calculation is then performed on the feature identification vectors to determine the similarity between each first variation feature model and the genome data of the target user. The Euclidean distance is a common distance metric that quantifies the difference between two vectors.
Here, by calculating the Euclidean distance between the first feature identification vector and each second feature identification vector, the degree of matching between the different variation feature models and the genome data of the target user can be evaluated. The first variation feature models are then optimally selected according to the calculated target Euclidean distances, that is, the model that best matches the genome data of the target user is selected from the plurality of models. The selection is based not only on the Euclidean distance but also on other factors such as the stability and prediction accuracy of the model.
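The distance-based matching described above can be sketched as follows. This is a minimal illustration that uses only the Euclidean distance as the selection criterion; the model names, the three-component vectors, and the function `select_model` are hypothetical, and the additional factors mentioned in the text (stability, prediction accuracy) are not modeled.

```python
import math

def select_model(target_vector, model_vectors):
    """Return the model whose feature identification vector is closest to the target vector."""
    distances = {
        name: math.dist(target_vector, vector)  # Euclidean distance
        for name, vector in model_vectors.items()
    }
    best = min(distances, key=distances.get)
    return best, distances

# Hypothetical first feature identification vector (target user) and
# second feature identification vectors (one per first variation feature model).
target = [1.0, 0.2, 0.5]
models = {
    "model_a": [1.0, 0.2, 0.1],
    "model_b": [0.0, 0.9, 0.5],
}
best, distances = select_model(target, models)
```

A smaller distance is read here as a higher degree of matching, so `model_a` would be taken as the second variation feature model in this example.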

S105, performing sequence circulation traversal on the second genome sequence data based on the second variation characteristic model to obtain a target circulation traversal result;

Specifically, movement detection is performed on the second genome sequence data through the second variation feature model. The model scans the genome sequence step by step according to a preset direction and a preset first coding length. This allows the model to analyze each part of the sequence in turn and identify regions containing risk variations. The movement detection result provides a preliminary understanding of specific regions in the sequence and lays the foundation for further analysis. Next-round coding length analysis is then performed on the movement detection result to determine a second coding length. This adjusts the detection strategy based on the previous detection result so as to improve the accuracy and efficiency of detection. Movement detection is performed again with the second variation feature model according to the newly determined coding length, and this is repeated until the whole second genome sequence has been traversed, yielding a plurality of movement detection results. This iterative detection process ensures a comprehensive and detailed analysis of the genomic data, enabling more accurate identification of relevant genetic variations. It is then judged whether the plurality of movement detection results meet a preset exit condition, that is, whether enough information has been obtained to make an accurate risk assessment. If the exit condition is not met, sequence loop traversal of the second genome sequence data through the second variation feature model is continued to generate a more detailed target loop traversal result. This involves further adjustment of the detection strategy so that all risk variations are covered. If the exit condition is met, the plurality of movement detection results are comprehensively analyzed, with the aim of extracting key information from the previous detection results to form a comprehensive understanding of the genome data of the target user.
Through this comprehensive analysis, it can be determined which variations in the genome are associated with disease risk, and these variations can be explained and classified in detail.
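The loop traversal described above can be sketched as follows. The source does not specify how the next coding length is derived or what the exit condition is, so both are assumptions here: the window shrinks after a hit and widens otherwise, and the loop exits once a pass produces at least one hit. The `detect` callable is a hypothetical stand-in for the second variation feature model.

```python
def next_length(hit, length, min_length, max_length):
    """Assumed rule: shrink the window after a hit, widen it otherwise."""
    return max(min_length, length // 2) if hit else min(max_length, length * 2)

def traverse_once(sequence, detect, first_length, min_length=2, max_length=8):
    """Scan left to right, adapting the coding length after every movement detection."""
    results, pos, length = [], 0, first_length
    while pos < len(sequence):
        window = sequence[pos:pos + length]
        hit = detect(window)
        results.append((pos, window, hit))  # one movement detection result
        pos += length
        length = next_length(hit, length, min_length, max_length)
    return results

def loop_traverse(sequence, detect, first_length, max_passes=3):
    """Repeat full traversals until the assumed exit condition holds or passes run out."""
    length, results = first_length, []
    for _ in range(max_passes):
        results = traverse_once(sequence, detect, length)
        if any(hit for _, _, hit in results):  # assumed exit condition
            break
        length = max(2, length // 2)  # adjust the strategy before the next pass
    return [(pos, window) for pos, window, hit in results if hit]

# Hypothetical detector: flags any window containing the motif "GT".
hits = loop_traverse("AACCGGTTGACA", lambda w: "GT" in w, first_length=4)
```

Here the first window `AACC` yields no hit, so the next window is widened to eight nucleotides and covers the rest of the sequence.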

S106, checking the target circulation traversing result to obtain a target checking result, and optimizing the feature model of the second variation feature model according to the target checking result to obtain a target variation feature model.

Specifically, a plurality of verification rules are defined and converted into a verification rule set. These rules relate to the type and frequency of genetic variations, known disease associations, and the like. Organizing the rules into a set allows them to be applied systematically to evaluate the accuracy and completeness of the target loop traversal result. The target loop traversal result is verified according to the verification rule set to obtain an initial verification result corresponding to each rule. The traversal result is checked in detail to ensure that it meets the established criteria, and the initial verification result of each rule provides a preliminary assessment of its accuracy. The initial verification results corresponding to the individual rules are then aggregated to obtain the target verification result, which forms a comprehensive assessment of the overall traversal result. From this aggregation, it can be determined which portions of the traversal result require further review or adjustment. A model parameter range of the second variation feature model is determined according to the target verification result. The parameters of the model, such as weights and thresholds, are adjusted according to the verification result so as to improve the accuracy and predictive capability of the model. Defining a suitable parameter range ensures that the model remains within an effective operating interval during the subsequent optimization. Then, random initial values are generated for the second variation feature model within the model parameter range, and a particle population is constructed from these initial values using a preset inverse particle propagation algorithm.
The inverse particle propagation algorithm is an optimization method that searches for the optimal solution among multiple candidate solutions. Fitness calculation is performed on the particles, and the particle fitness set is iteratively calculated until a preset condition is met, so as to generate the optimal solution corresponding to the particle population. This process is the core of model optimization: the performance of each particle (i.e., a particular combination of model parameters) is evaluated, and the particle is adjusted according to how well it performs. The purpose of the fitness calculation is to determine which particles can most accurately predict the target loop traversal result, that is, which parameter combinations best satisfy the known verification rules. Through iterative calculation, the system gradually approaches the optimal solution. Feature model optimization is then performed on the second variation feature model using the obtained optimal solution: the optimal parameter combination is applied to the model to improve its performance and accuracy. The optimized model can identify and predict genetic variations related to disease risk more accurately and provide a more accurate health risk assessment for the target user.
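The population-based parameter search described above can be sketched as follows. The source does not define the "inverse particle propagation algorithm", so a standard particle swarm optimization (PSO) is swapped in here purely as an assumed stand-in. The fitness function, the parameter bounds, and all hyperparameters are hypothetical; in the embodiment the fitness would instead measure how well a parameter combination satisfies the verification rules.

```python
import random

def pso(fitness, bounds, n_particles=20, iterations=60, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` inside `bounds` with a basic particle swarm (assumed stand-in)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial values inside the model parameter range.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=best_fit.__getitem__)
    g_pos, g_fit = best_pos[g][:], best_fit[g]
    for _ in range(iterations):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Clamp so the particle stays inside the parameter range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit < best_fit[i]:
                best_fit[i], best_pos[i] = fit, pos[i][:]
                if fit < g_fit:
                    g_fit, g_pos = fit, pos[i][:]
    return g_pos, g_fit

# Hypothetical fitness: squared distance from a made-up "ideal" (weight, threshold) pair.
best, fit = pso(lambda p: (p[0] - 0.3) ** 2 + (p[1] - 0.8) ** 2, [(0.0, 1.0), (0.0, 1.0)])
```

The returned position plays the role of the optimal solution that is applied to the second variation feature model.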

According to the embodiment of the invention, the genome sequence data is analyzed by using the long-short-term memory network model, so that the genetic variation can be more accurately identified. The hierarchical structure of LSTM can capture long-term dependency relationship when processing sequence data, thereby improving accuracy of mutation identification. The large-scale genome data can be efficiently processed by slicing and encoding the genome sequence data, and different mutation patterns can be identified by performing feature cluster analysis on the genetic mutation information. This in-depth analysis helps to understand the complex pattern behind the variation, providing more information for disease risk prediction. By matching the second genomic sequence data of the target user with the established variant feature model, the method can provide a customized disease risk assessment, making the results more personalized and accurate. Through the cyclic traversal and verification of the model matching result and the optimization of the variation characteristic model according to the verification result, the accuracy and reliability of the model can be continuously improved, the timeliness and accuracy of the screening result are ensured, and the disease risk screening accuracy of the gene detection is further improved.

In a specific embodiment, the process of executing step S101 may specifically include the following steps:

(1) Acquiring a plurality of first nucleotide sequence data of a plurality of sample users, and acquiring second nucleotide sequence data of a target user;

(2) Creating a nucleotide code set, the nucleotide code set comprising: adenine A = [1, 0, 0, 0], cytosine C = [0, 1, 0, 0], guanine G = [0, 0, 1, 0], and thymine T = [0, 0, 0, 1];

(3) Respectively slicing the sequence data of the first genome sequence data and the second genome sequence data to obtain N first slice sequence data of each first genome sequence data and N second slice sequence data of each second genome sequence data;

(4) Performing sequence coding and coding fusion on the N first slice sequence data based on the nucleotide code set to obtain the corresponding plurality of first genome sequence data, and performing sequence coding and coding fusion on the N second slice sequence data to obtain the corresponding second genome sequence data.

Specifically, the first nucleotide sequence data are obtained from the plurality of sample users and the second nucleotide sequence data are obtained from the target user. These data are typically obtained by high-throughput sequencing, which reads an individual's genetic information rapidly and accurately. At this stage, the sample users' data are used to build the reference models, while the target user's data are used for personalized risk assessment. A nucleotide code set is then created. Nucleotides are the basic units of DNA and include adenine (A), cytosine (C), guanine (G), and thymine (T). Each nucleotide is assigned a unique code, e.g., adenine A is encoded as [1, 0, 0, 0], cytosine C as [0, 1, 0, 0], and so on. This coding simplifies the subsequent data processing flow, so that gene sequences can be processed efficiently by a computer. The first and second genome sequence data are then sliced separately. Genomic data is often very large, and processing it directly is both computationally intensive and inefficient. The first genome sequence data and the second genome sequence data are therefore each divided into a plurality of small segments, each containing a portion of the gene sequence. Such segmentation preserves the sequence content while improving processing efficiency. The slice sequences are then encoded and fused based on the nucleotide code set. This converts the nucleotides into the numeric codes defined above. For example, a slice containing "AGCT" is converted to [1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]. By fusing these codes, the corresponding genome sequence data can be reconstructed. This method retains all the information of the original sequence while converting it into a form that is easier to compute and analyze.
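The slicing, encoding, and fusion steps described above can be sketched as follows. This is a minimal illustration that assumes the four-dimensional one-hot reading of the nucleotide code set, fixed-length slices, and fusion by concatenation in the original order; the slice length and the function names are hypothetical.

```python
# One-hot nucleotide code set (assumed four-dimensional reading of step (2)).
NUCLEOTIDE_CODES = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}

def slice_sequence(sequence, slice_length):
    """Split a nucleotide string into consecutive slices (the last one may be shorter)."""
    return [sequence[i:i + slice_length] for i in range(0, len(sequence), slice_length)]

def encode_slice(piece):
    """Map each nucleotide of a slice to its code."""
    return [NUCLEOTIDE_CODES[base] for base in piece]

def encode_and_fuse(sequence, slice_length):
    """Slice, encode every slice, then concatenate the encoded slices in order."""
    fused = []
    for piece in slice_sequence(sequence, slice_length):
        fused.extend(encode_slice(piece))
    return fused

slices = slice_sequence("AGCTTG", 4)
encoded = encode_and_fuse("AGCTTG", slice_length=4)
```

Because each slice is encoded independently, the slices could be processed in parallel before fusion, which is one plausible reading of why slicing improves processing efficiency.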

In a specific embodiment, as shown in fig. 2, the process of executing step S102 may specifically include the following steps:

S201, inputting the plurality of first genome sequence data into a preset long short-term memory network model, wherein the long short-term memory network model comprises a first long short-term memory layer, a second long short-term memory layer, a fully connected layer and an output layer;

S202, extracting low-level features of each first genome sequence data through the first long short-term memory layer to obtain low-level sequence features of each first genome sequence data;

S203, performing high-dimensional feature conversion and feature combination on the low-level sequence features through the second long short-term memory layer to obtain target output features of each first genome sequence data;

S204, performing prediction output conversion on the target output features through the fully connected layer to obtain predicted output features of each first genome sequence data;

S205, performing characteristic gene variation classification on the predicted output features through a softmax function in the output layer to obtain a gene variation classification result of each first genome sequence data;

S206, performing variation information integration on the gene variation classification results to obtain a gene mutation information set of each first genome sequence data, and designing primers according to the gene mutation information set to amplify target sequences of the plurality of sample users, so as to obtain a target mutation information set of each first genome sequence data.

Specifically, the plurality of first genome sequence data are respectively input into the preset long short-term memory network model. An LSTM network is a special recurrent neural network suitable for processing sequence data. These sequence data originate from the genomes of the sample users and contain abundant genetic information. The LSTM model has a multilayer structure comprising two long short-term memory layers, a fully connected layer, and an output layer, which enables the model to extract and learn complex genetic features from the gene sequence data. Low-level feature extraction is performed through the first long short-term memory layer, which extracts basic genetic features, such as specific nucleotide arrangements and patterns, from the genome sequence. The second long short-term memory layer then converts these low-level features into high-dimensional features and combines them, so that the model can reveal more complex patterns of genetic variation. For example, the model may identify the association of a particular sequence pattern with a certain genetic disease. The fully connected layer further integrates and converts these high-dimensional features to generate predicted output features in preparation for the final output. These predicted output features represent the model's overall assessment of the various genetic variations. The predicted output features are classified by the softmax function of the output layer, generating a gene variation classification result for each first genome sequence data. The softmax function converts the model output into a probability for each class, thereby enabling the model to classify genetic variations.
The classification results are then summarized and integrated to obtain a comprehensive gene mutation information set of each first genome sequence data. This set provides the information basis for subsequent disease risk analysis.
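The softmax classification performed in the output layer can be sketched as follows. This is a minimal illustration; the class labels and the example scores are hypothetical and stand in for the predicted output features produced by the fully connected layer.

```python
import math

def softmax(scores):
    """Convert raw scores into a probability distribution (numerically stable form)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, labels):
    """Return the most probable label together with the full distribution."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

# Hypothetical variation classes and predicted output features.
labels = ["no_variation", "single_nucleotide", "insertion", "deletion"]
label, probs = classify([0.1, 2.0, -1.0, 0.3], labels)
```

Subtracting the maximum score before exponentiation does not change the result but avoids overflow for large scores.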

In a specific embodiment, as shown in fig. 3, the process of executing step S103 may specifically include the following steps:

S301, extracting information from the target mutation information set of each first genome sequence data to obtain variation type information, variation position information and variation frequency information of each first genome sequence data;

S302, performing variation feature association relationship identification on the variation type information, the variation position information and the variation frequency information to obtain a variation feature association relationship;

S303, determining a plurality of corresponding initial cluster centers according to the variation feature association relationship, and inputting the gene mutation information set into a preset feature cluster analysis model;

S304, performing feature clustering on the gene mutation information set according to the plurality of initial cluster centers through the feature cluster analysis model to obtain a first feature clustering result of each initial cluster center;

S305, performing cluster center correction on the plurality of initial cluster centers according to the first feature clustering result to obtain a plurality of target cluster centers;

S306, performing cluster analysis on the gene mutation information set according to the plurality of target cluster centers to obtain a plurality of second feature clustering results, and performing variation pattern analysis on the plurality of second feature clustering results to obtain a plurality of variation pattern features;

S307, performing feature standardization modeling on each variation mode feature by adopting a random forest algorithm to obtain a plurality of first variation feature models.

Specifically, information is extracted from the target mutation information set of each first genome sequence data. The specific information of each variation is identified and recorded from a large amount of genomic data, including the type of variation (e.g., single nucleotide polymorphism, insertion, deletion), the specific position of the variation, and the frequency with which the variation occurs in the samples. This information provides the data basis for a thorough understanding of genetic variation. The extracted variation information is then analyzed in depth to identify the association relationship between the variation features. The aim is to understand whether there is a pattern or correlation between different variations, e.g., whether certain variation types tend to occur in a particular gene region or whether certain variations occur frequently in a particular population. Identifying such associations provides an important clue for understanding the complexity of genetic variation and its correlation with disease risk. Based on the association relationship, initial cluster centers are determined, and the gene mutation information set is input into the preset feature cluster analysis model, so that variations with similar characteristics are grouped together and the underlying variation patterns are revealed more clearly. The initial cluster centers, determined from the previously identified association relationship, provide the starting point for the cluster analysis. Based on the initial cluster centers, the gene variation information is grouped by the feature cluster analysis model, and the first feature clustering result of each initial cluster center provides the basis for the subsequent cluster center correction.
The plurality of initial cluster centers are corrected according to the first feature clustering result to obtain more accurate target cluster centers. This refines and optimizes the clustering process, so that the final clustering result reflects the real patterns of genetic variation more accurately. A second cluster analysis is then performed on the gene mutation information set according to the target cluster centers. This round builds on the previous result and aims to further refine the recognition of variation patterns, yielding more detailed and specific second feature clustering results that reflect the relationships and patterns among different genetic variations more clearly. Variation pattern analysis is performed on the second feature clustering results: the genetic variation features in each cluster are analyzed to understand how they act together and what their potential impact on disease risk is. Finally, feature standardization modeling is performed on each variation pattern feature using a random forest algorithm to obtain a plurality of first variation feature models. Random forest is a machine learning algorithm well suited to large, complex data sets. At this stage, it is used to construct models that identify and classify patterns of genetic variation. In this way, complex information extracted from genomic data is converted into operational models that can be used to predict an individual's risk of genetic disease.
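The information extraction of step S301 (variation type, position, and frequency across the sample users) can be sketched as follows. This is an illustration under assumptions: each sample's target mutation information set is represented as a list of hypothetical `(type, position)` tuples, and frequency is taken as the fraction of samples carrying a given variation.

```python
from collections import Counter

def summarize_variations(sample_variations):
    """Compute type, position, and population frequency for every distinct variation.

    `sample_variations` holds one list of (type, position) tuples per sample user.
    """
    n_samples = len(sample_variations)
    counts = Counter()
    for variations in sample_variations:
        counts.update(set(variations))  # count each variation at most once per sample
    return [
        {"type": vtype, "position": position, "frequency": count / n_samples}
        for (vtype, position), count in sorted(counts.items())
    ]

# Hypothetical target mutation information sets for four sample users.
samples = [
    [("SNP", 101), ("deletion", 250)],
    [("SNP", 101)],
    [("insertion", 77), ("SNP", 101)],
    [],
]
summary = summarize_variations(samples)
```

Records of this form could then be turned into numeric feature vectors for the association analysis and clustering steps.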

In a specific embodiment, as shown in fig. 4, the process of executing step S104 may specifically include the following steps:

S401, performing preliminary feature detection on the second genome sequence data to obtain first feature identification information;

S402, respectively creating second feature identification information of each first variation feature model according to the variation pattern features;

S403, performing vector conversion on the first feature identification information to obtain a first feature identification vector, and performing vector conversion on the second feature identification information to obtain a second feature identification vector;

S404, performing Euclidean distance calculation on the first feature identification vector and the second feature identification vector to obtain a target Euclidean distance of each first variation feature model;

S405, performing model optimization selection on the plurality of first variation feature models according to the target Euclidean distance to obtain the corresponding second variation feature model.

Specifically, preliminary feature detection is first performed on the second genome sequence data: key genetic features, such as specific variation types, variation positions and variation frequencies, are extracted from the genome data of the target user to form the first feature identification information. Corresponding second feature identification information is then created for each first variation feature model according to the plurality of variation pattern features already identified. This identification information reflects the specific variation pattern that each model focuses on, such as a combination of variation types associated with a certain genetic disease, or variation in a specific gene region; it is created in order to convert an abstract variation pattern into an analyzable data form. Next, the first feature identification information is converted into a first feature identification vector, and the second feature identification information is likewise converted into a second feature identification vector. Vectorization converts biological information into numerical form so that it can be used for further calculation and analysis. The Euclidean distance between the two groups of feature identification vectors is then calculated. Euclidean distance measures the straight-line distance between two points in a multidimensional space and is used here to quantify the difference between the first feature identification vector and each second feature identification vector. The target Euclidean distance calculated for each first variation feature model indicates which models are closest to the genome data of the target user. Finally, the first variation feature models are optimally selected according to the target Euclidean distance, that is, the model with the highest degree of matching with the genome data of the target user is selected from the plurality of models. The optimization selection is based not only on the Euclidean distance calculation result but also takes other factors into account, such as the stability and prediction accuracy of the model.
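
As a purely illustrative, non-limiting sketch, the distance-based part of the model matching described above may be written as follows. The function names, the three-component vector layout and the example values are hypothetical and are not taken from the embodiment; the additional selection factors (model stability, prediction accuracy) are not modeled.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_closest_model(target_vector, model_vectors):
    """Return (model_name, distance) for the model vector nearest the target."""
    distances = {name: euclidean_distance(target_vector, vec)
                 for name, vec in model_vectors.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]

# Hypothetical vectors: (variation type code, scaled position, frequency).
target = [1.0, 0.40, 0.12]           # first feature identification vector
models = {                           # second feature identification vectors
    "model_A": [1.0, 0.42, 0.10],
    "model_B": [0.0, 0.90, 0.55],
}
best_model, best_distance = select_closest_model(target, models)
```

In this sketch the model with the smallest distance is taken as the second variation feature model; a fuller implementation would combine the distance with the other selection factors mentioned above.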

In a specific embodiment, the process of executing step S105 may specifically include the following steps:

(1) Performing movement detection on the second genome sequence data according to a preset direction and a preset first coding length through a second variation characteristic model to obtain a movement detection result;

(2) Performing next-round code length analysis through the movement detection result to obtain a second code length, and performing next-round movement detection according to the second code length through a second variation characteristic model until the second genome sequence data is traversed to obtain a plurality of movement detection results;

(3) Judging whether a plurality of movement detection results meet a preset exit condition or not;

(4) If not, continuing to perform sequence circulation traversal on the second genome sequence data through the second variation characteristic model to generate a target circulation traversal result;

(5) And if the result is met, carrying out result comprehensive analysis on the plurality of movement detection results to obtain a target circulation traversing result and outputting the target circulation traversing result.

Specifically, movement detection is performed on the second genome sequence data by the second variation feature model, which scans the gene sequence step by step according to the preset direction and coding length. This allows the model to analyze each part of the sequence in turn and identify regions containing risk variations. The movement detection is dynamic and depends on preset parameters, such as the coding length, which define how the model moves through and analyzes the sequence. Based on the movement detection result, the coding length for the next round is analyzed. This adjusts the detection strategy according to previous detection results in order to improve detection accuracy and efficiency. For example, if significant variation patterns are found at a particular coding length, the coding length is adjusted to explore these patterns further. The next round of movement detection is then performed by the second variation feature model according to the newly determined coding length, until the whole genome sequence has been traversed. This iterative detection ensures a comprehensive and detailed analysis of the genome data. It is then judged whether the plurality of movement detection results meet the preset exit condition, that is, whether enough information has been obtained to make an accurate risk assessment. If the exit condition is not met, sequence loop traversal of the second genome sequence data by the second variation feature model continues, generating a more detailed target loop traversal result; this involves further adjustment of the detection strategy to ensure that all risk variations are covered. If the exit condition is met, the movement detection results are comprehensively analyzed: key information is extracted from the previous detection results to form a comprehensive understanding of the genome data of the target user. Through this comprehensive analysis, it can be determined which variations in the genome are associated with disease risk, and these variations can be explained and classified in detail.
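
For illustration only, the movement detection with a round-by-round coding length may be sketched as below. The motif test stands in for the second variation feature model, and the halving rule, the exit condition and all names are assumptions made for this sketch rather than details given in the embodiment.

```python
def detect(window, motif):
    """Stand-in for the variation feature model: is the motif inside the window?"""
    return motif in window

def next_code_length(hit, current, min_len, base_len):
    """Assumed rule: halve the window after a hit, otherwise reset to the base length."""
    return max(min_len, current // 2) if hit else base_len

def traverse(sequence, motif, base_len=8, min_len=4):
    """Scan left to right in non-overlapping windows until the sequence is exhausted."""
    results = []
    pos, length = 0, base_len
    while pos < len(sequence):
        window = sequence[pos:pos + length]
        hit = detect(window, motif)
        results.append({"start": pos, "length": len(window), "hit": hit})
        pos += length  # a motif straddling a window boundary is missed in this sketch
        length = next_code_length(hit, length, min_len, base_len)
    return results

def exit_condition_met(results):
    """Assumed exit condition: at least one window produced a hit."""
    return any(r["hit"] for r in results)

sequence = "ACGTACGT" + "CCGATTAC" + "ACGT" + "ACGTACGT"
results = traverse(sequence, motif="GATTA")
```

Here the second window contains the motif, so the following window is shortened from 8 to 4 bases before the coding length returns to its base value.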

In a specific embodiment, the process of executing step S106 may specifically include the following steps:

(1) Defining a plurality of check rules, and performing set conversion on the plurality of check rules to obtain a check rule set;

(2) Checking the target circulation traversing result according to the checking rule set to obtain an initial checking result corresponding to each checking rule;

(3) Carrying out result aggregation on the initial verification result corresponding to each verification rule to obtain a target verification result;

(4) Determining a model parameter range of the second variation characteristic model according to the target verification result to obtain a plurality of model parameter ranges;

(5) Generating random initial values of the second variation characteristic model through a plurality of model parameter ranges to obtain a corresponding random initial value set, and constructing a particle population of the random initial value set through a preset inverse particle propagation algorithm to obtain the particle population;

(6) Performing particle fitness calculation on the particle population to obtain a particle fitness set corresponding to the particle population, and performing iterative calculation on the particle fitness set until a preset condition is met, so as to generate an optimal solution corresponding to the particle population;

(7) And carrying out feature model optimization on the second variation feature model through the optimal solution to obtain a target variation feature model.
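
By way of a non-limiting sketch, steps (1) to (3) above (defining check rules, applying each rule, and aggregating the initial verification results) might look as follows. The three rules, the record fields and the pass-rate aggregation are hypothetical choices for illustration only.

```python
def rule_known_type(record):
    return record["type"] in {"SNV", "insertion", "deletion"}

def rule_frequency_in_range(record):
    return 0.0 <= record["frequency"] <= 1.0

def rule_position_valid(record):
    return record["position"] >= 0

# Check rule set: rule name -> predicate applied to one traversal record.
CHECK_RULES = {
    "known_type": rule_known_type,
    "frequency_in_range": rule_frequency_in_range,
    "position_valid": rule_position_valid,
}

def verify(records, rules):
    """Return per-rule pass rates (initial results) and an aggregated verdict."""
    initial = {}
    for name, rule in rules.items():
        passed = sum(1 for r in records if rule(r))
        initial[name] = passed / len(records) if records else 1.0
    return {"per_rule": initial,
            "all_passed": all(rate == 1.0 for rate in initial.values())}

traversal_records = [
    {"type": "SNV", "position": 1042, "frequency": 0.12},
    {"type": "deletion", "position": 77, "frequency": 1.4},  # frequency out of range
]
target_verification = verify(traversal_records, CHECK_RULES)
```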

Specifically, a plurality of check rules are defined and converted into a set. These rules relate to the type and frequency of genetic variation, known disease associations and the like. Forming the rules into a set allows them to be applied more systematically to evaluate the accuracy and integrity of the target loop traversal result. The target loop traversal result is verified according to the check rule set to obtain an initial verification result corresponding to each rule. The traversal result is checked in detail to ensure that it meets the established criteria, and the initial verification result for each rule provides a preliminary assessment of its accuracy. The initial verification results corresponding to the check rules are then aggregated to obtain the target verification result, which forms a comprehensive assessment of the overall traversal result. From this aggregation, it can be determined which portions of the traversal result require further review or adjustment. The model parameter ranges of the second variation feature model are determined according to the target verification result: the various parameters of the model, such as weights and thresholds, are adjusted according to the verification result so as to improve the accuracy and prediction capability of the model. Defining suitable parameter ranges ensures that the model remains within an effective operating interval during the subsequent optimization. Random initial values of the second variation feature model are then generated within these model parameter ranges, and a particle population is constructed from the resulting random initial value set using a preset inverse particle propagation algorithm. The inverse particle propagation algorithm is an optimization method that searches for the optimal solution among multiple candidate solutions. Fitness is calculated for the particles, and the particle fitness set is iteratively calculated until a preset condition is met, generating the optimal solution corresponding to the particle population. In each iteration, the performance of each particle (that is, a particular combination of model parameters) is evaluated and the particle is adjusted according to that performance. The purpose of the fitness calculation is to determine which particles can most accurately predict the target loop traversal result, that is, which best conform to the known check rules. Through iterative calculation, the system gradually approaches the optimal solution, namely the model parameter combination that best conforms to the check rules. Feature model optimization is then performed on the second variation feature model using the optimal solution: the optimal parameter combination obtained through the particle fitness calculation is applied to the model to improve its performance and accuracy. The optimized model can more accurately identify and predict genetic variations related to disease risk and provide a more accurate health risk assessment for the target user. The optimization process is based not only on mathematical and statistical methods but also on an understanding of biological characteristics, so that the model can achieve high accuracy and reliability when analyzing complex genetic data.
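
The embodiment does not spell out the update rule of the inverse particle propagation algorithm. As an illustrative stand-in only, the sketch below uses a standard particle swarm optimization over box-shaped parameter ranges; the quadratic fitness function, the parameter names (a weight and a threshold) and all numeric settings are assumptions for this example.

```python
import random

def pso(fitness, bounds, n_particles=20, n_iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `fitness` over the box `bounds` with a basic particle swarm."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial values drawn inside the model parameter ranges.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp so the particle stays inside its parameter range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def example_fitness(params):
    """Hypothetical fitness: squared error against an assumed ideal (0.3, 0.6)."""
    weight, threshold = params
    return (weight - 0.3) ** 2 + (threshold - 0.6) ** 2

param_ranges = [(0.0, 1.0), (0.0, 1.0)]   # assumed ranges for weight and threshold
best_params, best_value = pso(example_fitness, param_ranges)
```

In practice the fitness would be derived from how well a parameter combination satisfies the check rule set, rather than from a fixed target point as assumed here.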

The disease risk screening method based on gene detection in the embodiment of the present invention is described above; the disease risk screening system based on gene detection in the embodiment of the present invention is described below. Referring to Fig. 5, one embodiment of the disease risk screening system based on gene detection includes:

an obtaining module 501, configured to obtain a plurality of first genomic sequence data of a plurality of sample users, and simultaneously obtain second genomic sequence data of a target user;

the identification module 502 is configured to input the plurality of first genomic sequence data respectively into a preset long short-term memory network model for genetic variation identification to obtain a genetic variation information set of each first genomic sequence data, and to amplify target sequences of the plurality of sample users using primers designed biologically according to the genetic variation information set, so as to obtain a target variation information set of each first genomic sequence data;

the modeling module 503 is configured to perform feature cluster analysis on the target mutation information set of each first genomic sequence data to obtain a plurality of mutation mode features, and perform feature modeling on the plurality of mutation mode features to obtain a plurality of first mutation feature models;

the matching module 504 is configured to perform mutation feature model matching on the second genome sequence data according to the plurality of first mutation feature models, so as to obtain a corresponding second mutation feature model;

the traversing module 505 is configured to perform a sequence loop traversal on the second genome sequence data based on the second variation feature model, so as to obtain a target loop traversal result;

and the optimization module 506 is configured to verify the target cycle traversal result to obtain a target verification result, and perform feature model optimization on the second variation feature model according to the target verification result to obtain a target variation feature model.

Optionally, the obtaining module 501 is specifically configured to: acquiring a plurality of first nucleotide sequence data of a plurality of sample users, and acquiring second nucleotide sequence data of a target user; creating a set of nucleotide codes, the set of nucleotide codes comprising: adenine A = [1, 0, 0, 0], cytosine C = [0, 1, 0, 0], guanine G = [0, 0, 1, 0] and thymine T = [0, 0, 0, 1]; respectively carrying out sequence data slicing on the plurality of first genome sequence data and the second genome sequence data to obtain N first slice sequence data of each first genome sequence data and N second slice sequence data of the second genome sequence data; and based on the nucleotide coding set, carrying out sequence coding and coding fusion on the N first slice sequence data to obtain a plurality of corresponding first genome sequence data, and carrying out sequence coding and coding fusion on the N second slice sequence data to obtain a corresponding second genome sequence data.
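
As a non-limiting illustration, the slicing, encoding and fusion performed by the obtaining module may be sketched as follows, assuming the nucleotide code set is the usual four-component one-hot encoding. The near-equal slicing rule and the function names are assumptions for this sketch.

```python
# Assumed one-hot nucleotide code set (four components per base).
NUCLEOTIDE_CODES = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}

def slice_sequence(sequence, n_slices):
    """Split a sequence into n_slices contiguous pieces of near-equal length."""
    base, extra = divmod(len(sequence), n_slices)
    slices, start = [], 0
    for i in range(n_slices):
        end = start + base + (1 if i < extra else 0)
        slices.append(sequence[start:end])
        start = end
    return slices

def encode_slice(piece):
    """Map each base of one slice to its one-hot code."""
    return [NUCLEOTIDE_CODES[base] for base in piece.upper()]

def encode_and_fuse(sequence, n_slices):
    """Encode each slice, then concatenate the encoded slices in order."""
    fused = []
    for piece in slice_sequence(sequence, n_slices):
        fused.extend(encode_slice(piece))
    return fused

encoded = encode_and_fuse("ACGTAC", n_slices=4)
```

Bases outside A, C, G and T (for example N) raise a `KeyError` in this sketch; how such bases are handled is not specified in the embodiment.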

Optionally, the identifying module 502 is specifically configured to: inputting the plurality of first genome sequence data into a preset long-short time memory network model respectively, wherein the long-short time memory network model comprises a first long-short time memory layer, a second long-short time memory layer, a full connection layer and an output layer; extracting low-level features of each first genome sequence data through the first long-short time memory layer to obtain low-level sequence features of each first genome sequence data; performing high-dimensional feature conversion and feature combination on the low-level sequence features through the second long-short time memory layer to obtain target output features of each first genome sequence data; performing prediction output conversion on the target output characteristics through the full-connection layer to obtain the prediction output characteristics of each first genome sequence data; carrying out characteristic gene variation classification on the predicted output characteristics through a softmax function in the output layer to obtain a gene variation classification result of each first genome sequence data; and carrying out mutation information integration on the gene mutation classification result to obtain a gene mutation information set of each first genome sequence data, and amplifying target sequences of the plurality of sample users by designing primers according to the gene mutation information set by biology to obtain a target mutation information set of each first genome sequence data.

Optionally, the modeling module 503 is specifically configured to: information extraction is carried out on the target variation information set of each first genome sequence data to obtain variation type information, variation position information and variation frequency information of each first genome sequence data; performing mutation feature association relation identification on the mutation type information, the mutation position information and the mutation frequency information to obtain a mutation feature association relation; determining a plurality of corresponding initial clustering centers according to the variation characteristic association relation, and inputting the gene variation information set into a preset characteristic clustering analysis model; carrying out feature clustering on the gene variation information set according to the plurality of initial clustering centers by the feature cluster analysis model to obtain a first feature clustering result of each initial clustering center; according to the first characteristic clustering result, carrying out clustering center correction on the plurality of initial clustering centers to obtain a plurality of target clustering centers; performing cluster analysis on the gene mutation information set according to the target cluster centers to obtain a plurality of second feature cluster results, and performing mutation mode analysis on the second feature cluster results to obtain a plurality of mutation mode features; and adopting a random forest algorithm to respectively perform feature standardization modeling on each variation mode feature to obtain a plurality of first variation feature models.
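
For illustration only, the clustering with cluster center correction performed by the modeling module resembles a k-means style iteration, sketched below. The two-component feature points (scaled position and frequency), the initial centers and the stopping rule are hypothetical; the random forest modeling step is not shown.

```python
import math

def assign(points, centers):
    """Group points by the index of their nearest centre (Euclidean distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
        clusters[nearest].append(p)
    return clusters

def correct_centers(clusters, centers):
    """Move each centre to the mean of its cluster; keep it if the cluster is empty."""
    corrected = []
    for cluster, old in zip(clusters, centers):
        if cluster:
            corrected.append([sum(col) / len(cluster) for col in zip(*cluster)])
        else:
            corrected.append(list(old))
    return corrected

def cluster_variations(points, initial_centers, max_iters=50):
    """Alternate assignment and centre correction until the centres stop moving."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iters):
        updated = correct_centers(assign(points, centers), centers)
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)

# Hypothetical feature points: (scaled variation position, variation frequency).
points = [[0.1, 0.1], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
target_centers, second_clusters = cluster_variations(points, [[0.0, 0.0], [1.0, 1.0]])
```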

Optionally, the matching module 504 is specifically configured to: performing characteristic preliminary detection on the second genome sequence data to obtain first characteristic identification information; respectively creating second characteristic identification information of each first variation characteristic model according to the variation mode characteristics; vector conversion is carried out on the first characteristic identification information to obtain a first characteristic identification vector, and vector conversion is carried out on the second characteristic identification information to obtain a second characteristic identification vector; performing Euclidean distance calculation on the first feature identification vector and the second feature identification vector to obtain a target Euclidean distance of each first variation feature model; and performing model optimization selection on the plurality of first variation feature models according to the target Euclidean distance to obtain corresponding second variation feature models.

Optionally, the traversing module 505 is specifically configured to: performing movement detection on the second genome sequence data according to a preset direction and a preset first coding length through the second variation characteristic model to obtain a movement detection result; performing next-round code length analysis through the movement detection result to obtain a second code length, and performing next-round movement detection according to the second code length through the second variation characteristic model until the second genome sequence data is traversed to obtain a plurality of movement detection results; judging whether the plurality of movement detection results meet a preset exit condition or not; if not, continuing to perform sequence circulation traversal on the second genome sequence data through the second variation characteristic model to generate a target circulation traversal result; and if yes, carrying out result comprehensive analysis on the plurality of movement detection results to obtain a target circulation traversing result and outputting the target circulation traversing result.

Optionally, the optimizing module 506 is specifically configured to: defining a plurality of check rules, and performing set conversion on the check rules to obtain a check rule set; checking the target circulation traversing result according to the checking rule set to obtain an initial checking result corresponding to each checking rule; carrying out result aggregation on the initial verification result corresponding to each verification rule to obtain a target verification result; determining the model parameter ranges of the second variation characteristic model according to the target verification result to obtain a plurality of model parameter ranges; generating a random initial value of the second variation characteristic model through the plurality of model parameter ranges to obtain a corresponding random initial value set, and constructing a particle population of the random initial value set through a preset inverse particle propagation algorithm to obtain the particle population; performing particle fitness calculation on the particle population to obtain a particle fitness set corresponding to the particle population, and performing iterative calculation on the particle fitness set until a preset condition is met, so as to generate an optimal solution corresponding to the particle population; and carrying out feature model optimization on the second variation feature model through the optimal solution to obtain a target variation feature model.

Through the cooperation of the above components, the genome sequence data can be analyzed by a long short-term memory (LSTM) network model, so that genetic variation can be identified more accurately. The layered structure of the LSTM can capture long-term dependencies when processing sequence data, thereby improving the accuracy of variation identification. Large-scale genome data can be processed efficiently by slicing and encoding the genome sequence data, and different variation patterns can be identified by performing feature cluster analysis on the genetic variation information. This in-depth analysis helps to understand the complex patterns behind the variation and provides more information for disease risk prediction. By matching the second genome sequence data of the target user with the established variation feature models, the method can provide a customized disease risk assessment, making the results more personalized and accurate. Through loop traversal and verification of the model matching result, and optimization of the variation feature model according to the verification result, the accuracy and reliability of the model can be continuously improved, the timeliness and accuracy of the screening result are ensured, and the disease risk screening accuracy of gene detection is further improved.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied, in whole or in part, in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A disease risk screening method based on gene detection, characterized in that the disease risk screening method based on gene detection comprises:

acquiring a plurality of first genome sequence data of a plurality of sample users, and simultaneously acquiring second genome sequence data of a target user;

inputting the plurality of first genome sequence data into a preset long-short-time memory network model for gene mutation identification to obtain a gene mutation information set of each first genome sequence data, and amplifying target sequences of the plurality of sample users by designing primers according to the gene mutation information set through biology to obtain a target mutation information set of each first genome sequence data;

Performing feature cluster analysis on the target mutation information set of each first genome sequence data to obtain a plurality of mutation mode features, and performing feature modeling on the mutation mode features to obtain a plurality of first mutation feature models;

performing mutation feature model matching on the second genome sequence data according to the plurality of first mutation feature models to obtain corresponding second mutation feature models;

performing sequence circulation traversal on the second genome sequence data based on the second variation characteristic model to obtain a target circulation traversal result;

and verifying the target circulation traversing result to obtain a target verification result, and optimizing the feature model of the second variation feature model according to the target verification result to obtain a target variation feature model.

2. The method of claim 1, wherein obtaining a plurality of first genomic sequence data of a plurality of sample users and simultaneously obtaining second genomic sequence data of a target user comprises:

acquiring a plurality of first nucleotide sequence data of a plurality of sample users, and acquiring second nucleotide sequence data of a target user;

creating a set of nucleotide codes, the set of nucleotide codes comprising: adenine A = [1, 0, 0, 0], cytosine C = [0, 1, 0, 0], guanine G = [0, 0, 1, 0] and thymine T = [0, 0, 0, 1];

respectively carrying out sequence data slicing on the plurality of first genome sequence data and the second genome sequence data to obtain N first slice sequence data of each first genome sequence data and N second slice sequence data of the second genome sequence data;

and based on the nucleotide coding set, carrying out sequence coding and coding fusion on the N first slice sequence data to obtain a plurality of corresponding first genome sequence data, and carrying out sequence coding and coding fusion on the N second slice sequence data to obtain a corresponding second genome sequence data.

3. The method for screening risk of disease based on genetic testing according to claim 1, wherein the inputting the plurality of first genomic sequence data into a preset long-short-term memory network model for genetic variation identification to obtain a set of genetic variation information of each first genomic sequence data, and amplifying target sequences of the plurality of sample users by designing primers according to the set of genetic variation information to obtain a set of target variation information of each first genomic sequence data, comprises:

Inputting the plurality of first genome sequence data into a preset long-short time memory network model respectively, wherein the long-short time memory network model comprises a first long-short time memory layer, a second long-short time memory layer, a full connection layer and an output layer;

extracting low-level features of each first genome sequence data through the first long-short time memory layer to obtain low-level sequence features of each first genome sequence data;

performing high-dimensional feature conversion and feature combination on the low-level sequence features through the second long-short time memory layer to obtain target output features of each first genome sequence data;

performing prediction output conversion on the target output characteristics through the full-connection layer to obtain the prediction output characteristics of each first genome sequence data;

carrying out characteristic gene variation classification on the predicted output characteristics through a softmax function in the output layer to obtain a gene variation classification result of each first genome sequence data;

and carrying out mutation information integration on the gene mutation classification result to obtain a gene mutation information set of each first genome sequence data, and amplifying target sequences of the plurality of sample users by designing primers according to the gene mutation information set by biology to obtain a target mutation information set of each first genome sequence data.

4. The method for screening risk of disease based on genetic testing according to claim 1, wherein performing feature cluster analysis on the target mutation information set of each first genomic sequence data to obtain a plurality of mutation pattern features, and performing feature modeling on the plurality of mutation pattern features to obtain a plurality of first mutation feature models, comprises:

information extraction is carried out on the target variation information set of each first genome sequence data to obtain variation type information, variation position information and variation frequency information of each first genome sequence data;

performing mutation feature association relation identification on the mutation type information, the mutation position information and the mutation frequency information to obtain a mutation feature association relation;

determining a plurality of corresponding initial clustering centers according to the variation characteristic association relation, and inputting the gene variation information set into a preset characteristic clustering analysis model;

carrying out feature clustering on the gene variation information set according to the plurality of initial clustering centers by the feature cluster analysis model to obtain a first feature clustering result of each initial clustering center;

According to the first characteristic clustering result, carrying out clustering center correction on the plurality of initial clustering centers to obtain a plurality of target clustering centers;

performing cluster analysis on the gene mutation information set according to the target cluster centers to obtain a plurality of second feature cluster results, and performing mutation mode analysis on the second feature cluster results to obtain a plurality of mutation mode features;

and adopting a random forest algorithm to respectively perform feature standardization modeling on each variation mode feature to obtain a plurality of first variation feature models.

5. The method according to claim 4, wherein the performing mutation feature model matching on the second genomic sequence data according to the plurality of first mutation feature models to obtain a corresponding second mutation feature model comprises:

performing characteristic preliminary detection on the second genome sequence data to obtain first characteristic identification information;

respectively creating second characteristic identification information of each first variation characteristic model according to the variation mode characteristics;

vector conversion is carried out on the first characteristic identification information to obtain a first characteristic identification vector, and vector conversion is carried out on the second characteristic identification information to obtain a second characteristic identification vector;

Performing Euclidean distance calculation on the first feature identification vector and the second feature identification vector to obtain a target Euclidean distance of each first variation feature model;

and performing model optimization selection on the plurality of first variation feature models according to the target Euclidean distance to obtain corresponding second variation feature models.

6. The method according to claim 1, wherein the performing a sequence loop traversal on the second genomic sequence data based on the second mutation feature model to obtain a target loop traversal result comprises:

performing movement detection on the second genome sequence data according to a preset direction and a preset first coding length through the second variation characteristic model to obtain a movement detection result;

performing next-round code length analysis through the movement detection result to obtain a second code length, and performing next-round movement detection according to the second code length through the second variation characteristic model until the second genome sequence data is traversed to obtain a plurality of movement detection results;

judging whether the plurality of movement detection results meet a preset exit condition or not;

if not, continuing to perform sequence circulation traversal on the second genome sequence data through the second variation characteristic model to generate a target circulation traversal result;

and if yes, carrying out result comprehensive analysis on the plurality of movement detection results to obtain a target circulation traversing result and outputting the target circulation traversing result.

7. The method for screening risk of disease based on genetic testing according to claim 6, wherein the verifying the target cycle traversal result to obtain a target verification result, and optimizing the feature model of the second variation feature model according to the target verification result to obtain a target variation feature model, comprises:

defining a plurality of check rules, and performing set conversion on the check rules to obtain a check rule set;

checking the target circulation traversal result against the check rule set to obtain an initial check result corresponding to each check rule;

aggregating the initial check results corresponding to the check rules to obtain a target verification result;

determining model parameter ranges of the second variation feature model according to the target verification result to obtain a plurality of model parameter ranges;

generating random initial values of the second variation feature model within the plurality of model parameter ranges to obtain a corresponding random initial value set, and constructing a particle population from the random initial value set through a preset inverse particle propagation algorithm;

performing particle fitness calculation on the particle population to obtain a particle fitness set corresponding to the particle population, and performing iterative calculation on the particle fitness set until a preset condition is met, so as to generate an optimal solution corresponding to the particle population;

and carrying out feature model optimization on the second variation feature model through the optimal solution to obtain a target variation feature model.
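
The verification and optimization steps above can be sketched as follows. The claim names a "preset inverse particle propagation algorithm" without defining it, so this sketch substitutes a standard particle swarm optimization as a stand-in. The check rules, the mapping from pass rate to parameter ranges, and the fitness function are all hypothetical.

```python
import random

def run_checks(result, rules):
    """Apply each named check rule; return per-rule results and the pass rate."""
    rule_set = dict(rules)  # "set conversion": one entry per rule name
    initial = {name: bool(rule(result)) for name, rule in rule_set.items()}
    return initial, sum(initial.values()) / len(initial)

def parameter_ranges(pass_rate, centers, base_half_width=4.0):
    """Hypothetical mapping: the more rules pass, the narrower the search range."""
    half = base_half_width * (1.0 - 0.5 * pass_rate)
    return [(c - half, c + half) for c in centers]

def particle_swarm(fitness, bounds, particles=20, iterations=100,
                   w=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard PSO (minimization) with random initial values inside `bounds`."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iterations):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))  # stay in range
            fit = fitness(pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

# Hypothetical traversal result and check rules.
traversal_result = [4, 16]
rules = [
    ("non_empty", lambda r: len(r) > 0),
    ("sorted", lambda r: r == sorted(r)),
    ("at_most_one_hit", lambda r: len(r) <= 1),
]
initial_checks, pass_rate = run_checks(traversal_result, rules)
bounds = parameter_ranges(pass_rate, centers=[0.0, 0.0])

# Hypothetical fitness: squared distance from an "ideal" parameter pair (1, -2).
best, best_fit = particle_swarm(lambda p: (p[0] - 1) ** 2 + (p[1] + 2) ** 2, bounds)
```

In a real system the fitness would score how well the second variation feature model, with the candidate parameters, reproduces verified traversal results; the quadratic above only shows the mechanics.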

8. A disease risk screening system based on gene detection, the disease risk screening system based on gene detection comprising:

the acquisition module is used for acquiring a plurality of first genome sequence data of a plurality of sample users and simultaneously acquiring second genome sequence data of a target user;

the identification module is used for respectively inputting the plurality of first genome sequence data into a preset long short-term memory (LSTM) network model for genetic variation identification to obtain a genetic variation information set of each first genome sequence data, and amplifying target sequences of the plurality of sample users using primers designed according to the genetic variation information set to obtain a target variation information set of each first genome sequence data;

the modeling module is used for carrying out feature cluster analysis on the target variation information set of each first genome sequence data to obtain a plurality of variation pattern features, and carrying out feature modeling on the variation pattern features to obtain a plurality of first variation feature models;

the matching module is used for carrying out variation feature model matching on the second genome sequence data according to the plurality of first variation feature models to obtain a corresponding second variation feature model;

the traversal module is used for performing sequence circulation traversal on the second genome sequence data based on the second variation feature model to obtain a target circulation traversal result;

and the optimization module is used for verifying the target circulation traversal result to obtain a target verification result, and performing feature model optimization on the second variation feature model according to the target verification result to obtain a target variation feature model.
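
The module arrangement of claim 8 can be pictured as a pipeline in which each module's output feeds the next. The sketch below only shows that wiring: the class name, the callable signatures, and the toy stand-in stages are hypothetical, and the acquisition module is represented simply by the two arguments passed to `run`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class ScreeningSystem:
    """Hypothetical wiring of the identification, modeling, matching,
    traversal and optimization modules as pluggable callables."""
    identify: Callable[[Sequence[str]], Any]
    model: Callable[[Any], Any]
    match: Callable[[Any, str], Any]
    traverse: Callable[[Any, str], Any]
    optimize: Callable[[Any, Any], Any]

    def run(self, sample_sequences: Sequence[str], target_sequence: str) -> Any:
        variation_sets = self.identify(sample_sequences)
        first_models = self.model(variation_sets)
        second_model = self.match(first_models, target_sequence)
        traversal_result = self.traverse(second_model, target_sequence)
        return self.optimize(second_model, traversal_result)

# Toy stand-ins: each "model" is just a motif count.
system = ScreeningSystem(
    identify=lambda seqs: [s.count("GAT") for s in seqs],
    model=lambda counts: sorted(set(counts)),
    match=lambda models, target: min(models, key=lambda m: abs(m - target.count("GAT"))),
    traverse=lambda model, target: target.count("GAT"),
    optimize=lambda model, result: {"model": model, "hits": result},
)
output = system.run(["GATGAT", "AAAA", "GATC"], "CCGATCC")
```

Keeping each stage behind a plain callable makes it possible to swap in a real LSTM-based identifier or a different optimizer without touching the pipeline itself.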