CN115910217A - Base determination method, base determination device, computer equipment and storage medium

Info

Publication number: CN115910217A
Application number: CN202211667721.9A
Authority: CN
Inventors: 王玉垚; 王丹阳; 陈懂懂; 袁静贤
Original assignee: Zhengzhou Sikun Biological Engineering Co ltd
Current assignee: Zhengzhou Sikun Biological Engineering Co ltd
Priority date: 2022-12-23
Filing date: 2022-12-23
Publication date: 2023-04-04
Anticipated expiration: 2042-12-23
Also published as: CN115910217B

Abstract

The present disclosure provides a base determination method, apparatus, computer device and storage medium. The method comprises: obtaining a sample digital signal and a base type of the sample digital signal, wherein the base type of the sample digital signal is determined by a first model obtained through unsupervised training; training a second model to be trained based on the sample digital signal, the base type of the sample digital signal, and the quality score that the first model outputs for that base type, to obtain a target classification model; and determining, through the target classification model, the base type corresponding to a digital signal to be recognized.

Description

Base determination method, base determination device, computer equipment and storage medium

Technical Field

The present disclosure relates to the field of signal recognition technologies, and in particular, to a method and an apparatus for determining a base, a computer device, and a storage medium.

Background

With the development of sequencing technology, research on base identification methods for double-stranded DNA molecules has been increasing. Base identification is the process of converting captured information into a base sequence through an algorithm, and its precision directly affects the identification result, so how to identify bases accurately is particularly important.

Disclosure of Invention

The embodiment of the disclosure at least provides a base determination method, a base determination device, computer equipment and a storage medium.

In a first aspect, embodiments of the present disclosure provide a method for base determination, including:

obtaining a sample digital signal and a base type of the sample digital signal, wherein the base type of the sample digital signal is determined based on a first model of unsupervised training;

training a second model to be trained based on the sample digital signal, the base type of the sample digital signal and the quality score corresponding to the base type output by the unsupervised training first model to obtain a target classification model, and determining the base type corresponding to the digital signal to be recognized through the target classification model.

In one possible embodiment, after obtaining the sample digital signal, and the base type of the sample digital signal, the method further comprises:

preliminarily screening the sample digital signal based on the quality score corresponding to the base type of the sample digital signal to obtain a first sample digital signal;

combining the base types of the first sample digital signals to obtain a plurality of nucleic acid sequences, and determining the quality score of each nucleic acid sequence based on the quality scores corresponding to the base types of the first sample digital signals;

screening the plurality of nucleic acid sequences based on the quality scores of the nucleic acid sequences to obtain a first nucleic acid sequence;

the training of the second model to be trained based on the sample digital signal, the base type of the sample digital signal and the quality score corresponding to the base type output by the unsupervised training first model comprises:

and training the second model to be trained based on a second sample digital signal corresponding to the base type contained in the first nucleic acid sequence, the base type of the second sample digital signal and the quality score corresponding to the base type.

In a possible embodiment, the training the second model to be trained based on the second sample digital signal corresponding to the base type included in the first nucleic acid sequence, the base type of the second sample digital signal, and the quality score corresponding to the base type includes:

comparing the first nucleic acid sequence with each template nucleic acid sequence in a pre-constructed reference gene template library and/or with a genome sequence in a genome database;

removing bases that are not successfully matched from the first nucleic acid sequence whose comparison result meets the preset condition to obtain a second nucleic acid sequence;

and training the second model to be trained based on a third sample digital signal corresponding to the base type contained in the second nucleic acid sequence, the base type of the third sample digital signal and the quality score corresponding to the base type.

In one possible embodiment, the method further comprises constructing the library of reference gene templates according to the following method:

obtaining a genome sequence in the genome database;

intercepting the template nucleic acid sequence from the genome sequence according to a preset step length and a preset length, wherein the preset length is the number of bases contained in the template nucleic acid sequence.

In one possible implementation, after acquiring the sample digital signal, the method further comprises:

scaling the sample digital signal;

and training a second model to be trained based on the scaled sample digital signal, the base type of the scaled sample digital signal and the quality score corresponding to the base type output by the unsupervised training first model.

In one possible embodiment, after obtaining the sample digital signal, the method further comprises:

determining a signal interval corresponding to each sample digital signal based on the intensity value of the sample digital signal;

sampling the sample digital signals in any signal interval according to a sampling proportion;

and training a second model to be trained based on the sampled digital sample signal, the base type of the sampled digital sample signal and the quality score corresponding to the base type output by the unsupervised training first model.

In one possible embodiment, the process of training the second model to be trained includes N rounds of iterative training, each round of iterative training includes training M classification trees, each classification tree corresponds to one base type, and the target classification model includes M × N target classification trees, where N is a preset positive integer and M is the number of base types;

for any round of iterative training, determining an initial prediction result corresponding to the sample digital signal output by the previous round of iterative training, wherein the initial prediction result of the first round of iterative training is a preset initial value;

and determining, based on the base type corresponding to the sample digital signal, the initial prediction result and the quality score corresponding to the base type, the optimal splitting node of each classification tree in the current round of iterative training and the weight corresponding to each leaf node of each classification tree, to obtain the target classification trees of the current round, wherein the optimal splitting node is the classification condition with the highest splitting gain after splitting, and the weight of a leaf node of any classification tree represents the probability that a sample digital signal falling into that leaf node belongs to the base type corresponding to the classification tree.

In one possible embodiment, the determining, based on the base type corresponding to the sample digital signal, the initial prediction result, and the quality score corresponding to the base type, the best split node of each classification tree in the current iteration training, and the weight corresponding to the leaf node of each classification tree includes:

traversing a plurality of splitting nodes corresponding to the sample digital signal, and determining splitting gains after splitting according to the splitting nodes and weights corresponding to leaf nodes of each classification tree for any splitting node based on the base type corresponding to the sample digital signal, the initial prediction result and the quality score corresponding to the base type;

and determining the optimal splitting node based on the splitting gain corresponding to each splitting node.

In a possible embodiment, after the training of the second model to be trained is completed, the method further includes:

obtaining test sample data;

respectively inputting the test sample data into the target classification trees obtained by each iteration training to obtain a first intermediate prediction result of each target classification tree;

fusing the first intermediate prediction results of the target classification trees corresponding to the same base type, and determining the target prediction result corresponding to the test sample data;

determining a test result of the second model based on the target prediction result.

In one possible embodiment, the first intermediate prediction result of the target classification tree comprises a probability that the test sample data belongs to a base type corresponding to the target classification tree;

the fusing the first intermediate prediction results of the target classification trees corresponding to the same base type to determine the target prediction result corresponding to the test sample data comprises:

adding the first intermediate prediction results of the target classification trees corresponding to the same base type to obtain second intermediate prediction results corresponding to each base type;

and taking the base type with the highest probability in the second intermediate prediction results as the target prediction results.
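The fusion step described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `fuse_tree_outputs` and the data layout (one list of per-tree outputs for each base type) are assumptions made for the example.

```python
from typing import Dict, List


def fuse_tree_outputs(tree_outputs: Dict[str, List[float]]) -> str:
    """Sum per-tree outputs for each base type and return the top base type.

    tree_outputs maps a base type (e.g. "A") to the outputs of the target
    classification trees trained for that base type, one per training round
    (hypothetical layout chosen for this sketch).
    """
    # Second intermediate prediction result: one summed score per base type.
    summed = {base: sum(outputs) for base, outputs in tree_outputs.items()}
    # Target prediction result: the base type with the highest summed score.
    return max(summed, key=summed.get)
```

With two training rounds, for example, each base type contributes two tree outputs, and the base type whose outputs add up to the largest value is reported.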

In a possible embodiment, the determining the test result of the second model based on the target prediction result includes:

constructing a confusion matrix based on a target prediction result corresponding to the test sample data and a base type corresponding to the test sample data, and determining a test result corresponding to the target prediction result based on the confusion matrix; and/or,

and combining the target prediction results of the test sample data to obtain a plurality of predicted nucleic acid sequences, comparing the predicted nucleic acid sequences with each template nucleic acid sequence in a reference gene template library constructed in advance or genome sequences in a genome database, and determining a test result corresponding to the target prediction result based on the comparison result.
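The confusion-matrix option can be illustrated with a short sketch. The helper names `confusion_matrix` and `accuracy`, the fixed label set `BASES`, and the use of overall accuracy as the test result are assumptions for illustration; the source does not state which statistic is derived from the matrix.

```python
from collections import Counter
from typing import Dict, List, Tuple

BASES = ("A", "C", "G", "T")  # assumed label set for this sketch


def confusion_matrix(true_bases: List[str],
                     predicted_bases: List[str]) -> Dict[Tuple[str, str], int]:
    """Count (true base, predicted base) pairs over the test samples."""
    if len(true_bases) != len(predicted_bases):
        raise ValueError("label lists must have the same length")
    counts = Counter(zip(true_bases, predicted_bases))
    # Fill in zero cells so every (true, predicted) pair is present.
    return {(t, p): counts.get((t, p), 0) for t in BASES for p in BASES}


def accuracy(matrix: Dict[Tuple[str, str], int]) -> float:
    """Fraction of samples on the diagonal (true base equals predicted base)."""
    total = sum(matrix.values())
    correct = sum(n for (t, p), n in matrix.items() if t == p)
    return correct / total if total else 0.0
```

Off-diagonal cells show which base types are confused with one another, which can be more informative than a single accuracy figure.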

In a second aspect, embodiments of the present disclosure provide a base determination apparatus, including:

an acquisition module, configured to acquire a sample digital signal and a base type of the sample digital signal, where the base type of the sample digital signal is determined based on a first model of unsupervised training;

and the training module is used for training a second model to be trained on the basis of the sample digital signal, the base type of the sample digital signal and the quality score corresponding to the base type output by the unsupervised training first model to obtain a target classification model, so that the base type corresponding to the digital signal to be recognized is determined through the target classification model.

In a possible embodiment, after acquiring the sample digital signal and the base type of the sample digital signal, the acquiring module is further configured to:

the training module is configured to, when training a second model to be trained based on the sample digital signal, the base type of the sample digital signal, and the quality score corresponding to the base type output by the unsupervised training first model, be configured to:

and training the second model to be trained based on the second sample digital signal corresponding to the base type contained in the first nucleic acid sequence, the base type of the second sample digital signal and the quality score corresponding to the base type.

In a possible embodiment, the training module, when training the second model to be trained based on the second sample digital signal corresponding to the base type included in the first nucleic acid sequence, the base type of the second sample digital signal, and the quality score corresponding to the base type, is configured to:

and training the second model to be trained based on a third sample digital signal corresponding to the base type contained in the second nucleic acid sequence, the base type of the third sample digital signal and the quality score corresponding to the base type.

In a possible embodiment, the obtaining module is further configured to construct the reference gene template library according to the following method:

obtaining a genome sequence in the genome database;

In a possible implementation, after acquiring the sample digital signal, the acquiring module is further configured to:

scaling the sample digital signal;

the training module is used for training a second model to be trained based on the sample digital signal, the base type of the sample digital signal and the quality score corresponding to the base type output by the unsupervised training first model, and is used for:

In one possible embodiment, the training module, when determining the best splitting node of each classification tree in the iterative training round and the weight corresponding to the leaf node of each classification tree based on the base type corresponding to the sample digital signal, the initial prediction result and the quality score corresponding to the base type, is configured to:

In a possible implementation, after the training of the second model to be trained is completed, the apparatus further includes a testing module configured to:

obtaining test sample data;

the test module is used for fusing the first intermediate prediction results of the target classification trees corresponding to the same base type and determining the target prediction results corresponding to the test sample data:

In one possible embodiment, the test module, when determining the test result of the second model based on the target prediction result, is configured to:

constructing a confusion matrix based on a target prediction result corresponding to the test sample data and a base type corresponding to the test sample data, and determining a test result corresponding to the target prediction result based on the confusion matrix; and/or,

combining the target prediction results of the test sample data to obtain a plurality of predicted nucleic acid sequences, comparing the predicted nucleic acid sequences with template nucleic acid sequences in a pre-constructed reference gene template library or genome sequences in a genome database, and determining the test result corresponding to the target prediction result based on the comparison result.

In a third aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect or any one of the possible implementations of the first aspect.

In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the steps in the first aspect or any one of the possible implementations of the first aspect.

According to the base determination method, apparatus, computer device and storage medium, the base type of the sample digital signal determined by the first model obtained through unsupervised training can be used as supervision data to train the second model to be trained. Because the supervision data is produced by an unsupervised model, its accuracy is imperfect to some degree; the quality score that the first model outputs for each base type is therefore incorporated into the training process, which improves the accuracy of the target classification model. The trained target classification model is suitable for base detection in a variety of scenarios and is highly general.

In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required in the embodiments are briefly described below. The drawings are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive other related drawings from them without inventive effort.

FIG. 1 is a flow chart illustrating a method of base determination provided by an embodiment of the disclosure;

FIG. 2 is a diagram schematically illustrating the determination of a nucleic acid sequence in the method for determining a base provided in the embodiment of the present disclosure;

FIG. 3 shows a schematic diagram of a constructed template nucleic acid sequence provided by an embodiment of the disclosure;

FIG. 4 illustrates a schematic diagram of sampling a sample digital signal provided by an embodiment of the present disclosure;

FIG. 5 is a schematic diagram illustrating base recognition based on a target classification model provided by embodiments of the present disclosure;

FIG. 6 is a schematic diagram illustrating an architecture of a base determination apparatus provided in an embodiment of the present disclosure;

FIG. 7 shows a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.

Based on the above research, the present disclosure provides a base determination method, an apparatus, a computer device, and a storage medium, in which the base type of a sample digital signal determined by a first model obtained through unsupervised training is used as supervision data to train a second model to be trained. Because the supervision data is produced by an unsupervised model, its accuracy is imperfect to some degree; the quality score that the first model outputs for each base type is therefore incorporated into the training process, which improves the accuracy of the target classification model. The trained target classification model is suitable for base detection in a variety of scenarios and is highly general.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.

The term "and/or" herein merely describes an associative relationship and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the group consisting of A, B, and C.

To facilitate understanding of this embodiment, a base determination method disclosed in the embodiments of the present disclosure is first described in detail. Referring to FIG. 1, which is a flow chart of a base determination method provided in the embodiments of the present disclosure, the method includes steps 101 to 102, where:

Step 101, obtaining a sample digital signal and a base type of the sample digital signal, wherein the base type of the sample digital signal is determined by a first model obtained through unsupervised training.

Step 102, training a second model to be trained based on the sample digital signal, the base type of the sample digital signal and the quality score that the first model outputs for the base type, to obtain a target classification model, and determining the base type corresponding to the digital signal to be recognized through the target classification model.

The following is a detailed description of the above steps.

The sample digital signal may include signals from various detection scenarios, such as an electrical signal or a fluorescence signal. The base type of the sample digital signal refers to the base type to which the sample digital signal belongs, and may include, for example, the four types adenine (A), guanine (G), cytosine (C), and thymine (T).

In one possible embodiment, the sample digital signals may be obtained from a plurality of batches including, but not limited to, a plurality of sequencers, a plurality of species, and the like.

In one possible embodiment, when determining the base type of the sample digital signal, the following steps can be performed:

Step a1, acquiring the sample digital signal.

Step a2, generating a classifier corresponding to the nucleic acid sequence to be detected based on the sample digital signal and a neural model.

Step a3, detecting the sample digital signal by using the classifier, and determining the base type corresponding to the sample digital signal.

In step a2, the neural model may be a self-organizing map neural network. The network structure of the constructed neural model may include an input layer composed of input nodes each for accepting an input value and an output layer composed of weight nodes each for accepting a decision and update of the input layer.

Training the neural model is a process of continuously adjusting the network parameters of the neural network until a convergence condition is reached, which yields a trained neural model containing the target weight parameters (that is, the target network parameters); this trained model is the classifier. The convergence condition may be set as needed. For example, the convergence condition may include: the number of training iterations is greater than a set threshold; every signal intensity value in the training data set has been processed; the quality value returned by the trained neural model is greater than a set quality threshold; and the like.

As can be seen from the above, the classifier is the first model obtained through unsupervised training. Because the classifier is trained on the sample digital signal itself, it is not suitable for other digital signals and has poor generalization capability.

After the sample digital signal is acquired, since the quality of a part of the acquired sample digital signal may be low, in order to avoid the influence of the part of the sample digital signal on the model accuracy, the sample digital signal may be screened first.

Specifically, the output of the unsupervised training first model includes the quality score of the base type of the sample digital signal in addition to the base type of the sample digital signal, so that the sample digital signal can be preliminarily screened based on the quality score corresponding to the base type of the sample digital signal, for example, the sample digital signal corresponding to the base type with the quality score smaller than a preset score can be removed to obtain the first sample digital signal.

Furthermore, since combining the base types of the sample digital signals forms nucleic acid sequences, the sample digital signals can be further screened according to the quality score of each nucleic acid sequence.

Illustratively, this may be performed through the following steps:

Step b1, combining the base types of the first sample digital signals to obtain a plurality of nucleic acid sequences, and determining the quality score of each nucleic acid sequence based on the quality scores corresponding to the base types of the first sample digital signals.

Illustratively, as shown in FIG. 2, taking the fluorescence signal as an example, the base type at each base position can be detected in a single fluorescence image, and the order in which the base type at each position changes can be determined from the order in which the fluorescence images were captured. Splicing the base types in that order yields the nucleic acid sequence corresponding to each base position. In FIG. 2, the left half contains a base positions, so a nucleic acid sequences are obtained after combination; if there are b fluorescence images, each nucleic acid sequence contains b bases.
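The splicing described above amounts to regrouping per-image base calls by position. The following is a minimal sketch under assumed names: `build_sequences` and the nested-list layout `calls_per_image[i][j]` are hypothetical and not taken from the disclosure.

```python
from typing import List


def build_sequences(calls_per_image: List[List[str]]) -> List[str]:
    """Join per-image base calls into one sequence per base position.

    calls_per_image[i][j] is the base type called at position j in the
    i-th image, with images ordered by capture time (hypothetical layout).
    """
    # zip(*...) regroups the calls by position instead of by image.
    return ["".join(position_calls) for position_calls in zip(*calls_per_image)]
```

With b images of a positions each, the function returns a sequences of b bases, matching the counts given for FIG. 2.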

In one possible embodiment, when determining the quality score of a nucleic acid sequence based on the quality scores corresponding to the base types of the first sample digital signals, either the sum or the average of the quality scores corresponding to the base types of the sample digital signals contained in the nucleic acid sequence may be used as the quality score of the nucleic acid sequence.

Step b2, screening the plurality of nucleic acid sequences based on the quality scores of the nucleic acid sequences to obtain a first nucleic acid sequence.

Illustratively, nucleic acid sequences whose quality score is smaller than a preset quality score can be removed to obtain the first nucleic acid sequence; the base types of the sample digital signals contained in the first nucleic acid sequence are the base types that remain after quality screening.

Correspondingly, when the second model to be trained is trained, it may be trained based on the second sample digital signal corresponding to the base type contained in the first nucleic acid sequence, the base type of the second sample digital signal, and the quality score corresponding to the base type.

By this method, a high-quality sample digital signal and supervised data of the sample digital signal can be obtained, whereby the accuracy of the trained model can be improved.
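Steps b1 and b2 can be sketched as below. The names `Call`, `sequence_quality` and `screen_sequences`, and the representation of a sequence as a list of (base type, quality score) pairs, are assumptions for this example; the threshold value is a free parameter, as in the text.

```python
from statistics import mean
from typing import List, Tuple

Call = Tuple[str, float]  # (base type, quality score) for one sample signal


def sequence_quality(calls: List[Call], use_mean: bool = True) -> float:
    """Sequence-level quality: mean (or sum) of the per-base quality scores."""
    scores = [quality for _, quality in calls]
    return mean(scores) if use_mean else sum(scores)


def screen_sequences(sequences: List[List[Call]],
                     min_seq_quality: float) -> List[List[Call]]:
    """Step b2: drop sequences whose quality is below the preset threshold."""
    return [s for s in sequences if s and sequence_quality(s) >= min_seq_quality]
```

Whether the sum or the mean is used changes the meaning of the threshold: a sum grows with sequence length, while a mean is comparable across sequences of different lengths.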

In another possible embodiment, the quality score of the first nucleic acid sequence is derived from the quality scores output by the first model obtained through unsupervised training, so even when that score is high, the first nucleic acid sequence may still contain wrong bases. Further base screening may therefore be performed on the first nucleic acid sequence.

Illustratively, the first nucleic acid sequence may be compared with each template nucleic acid sequence in a pre-constructed reference gene template library, and/or compared with a genome sequence in a genome database, and then a base which is not successfully matched in the first nucleic acid sequence whose comparison result meets a preset condition is removed to obtain a second nucleic acid sequence; and training the second model to be trained based on a third sample digital signal corresponding to the base type contained in the second nucleic acid sequence, the base type of the third sample digital signal and the quality score corresponding to the base type.

The comparison result may include a comparison success rate, and the comparison result meeting the preset condition may indicate that the comparison success rate exceeds a preset ratio. Since the first nucleic acid sequence whose alignment result meets the predetermined condition is not always 100% successful, the first nucleic acid sequence whose alignment result meets the predetermined condition may also include unmatched bases.
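The preset condition and the removal of unmatched bases can be sketched as follows. This example assumes a gap-free, position-by-position comparison against a single template of equal length, which is a simplification chosen for illustration; the function names and the `min_rate` parameter are hypothetical, and a real aligner would also handle insertions and deletions.

```python
from typing import List, Optional


def matched_positions(sequence: str, template: str) -> List[int]:
    """Indices at which the sequence agrees with an equal-length template."""
    if len(sequence) != len(template):
        raise ValueError("sequence and template must have the same length")
    return [i for i, (s, t) in enumerate(zip(sequence, template)) if s == t]


def keep_matched_bases(sequence: str, template: str,
                       min_rate: float) -> Optional[List[int]]:
    """Return indices of matched bases if the success rate exceeds min_rate.

    Returns None when the comparison fails the preset condition, in which
    case the whole sequence would be discarded in this sketch.
    """
    matched = matched_positions(sequence, template)
    rate = len(matched) / len(sequence) if sequence else 0.0
    return matched if rate > min_rate else None
```

The returned indices identify which sample digital signals would be retained as the third sample digital signal; the signals at unmatched positions are dropped.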

1. Aligning the first nucleic acid sequence with a genomic sequence in a genome database.

The genomic sequences are known long sequences of a plurality of species. When aligning the first nucleic acid sequence with a genomic sequence, open-source software such as the Burrows-Wheeler Aligner may be used.

2. Aligning the first nucleic acid sequence with each template nucleic acid sequence in a pre-constructed reference gene template library.

The reference gene template library may include a plurality of template nucleic acid sequences constructed based on the genomic sequence, the template nucleic acid sequences having a sequence length less than the genomic sequence.

In one possible embodiment, the number of bases included in each template nucleic acid sequence in the reference gene template library may be the same as the number of bases included in the first nucleic acid sequence, or the number of bases included in each template nucleic acid sequence in the reference gene template library may be larger than the number of bases included in the first nucleic acid sequence.

Illustratively, when aligning the first nucleic acid sequence to each template nucleic acid sequence, a string exact match algorithm may be employed for the alignment.

The template nucleic acid sequences may be constructed by a sliding window method. Illustratively, as shown in FIG. 3, after the genomic sequence in the genome database is obtained, template nucleic acid sequences may be cut from the genomic sequence according to a preset step length and a preset length, where the preset length is the number of bases contained in a template nucleic acid sequence.

The preset step length is the number of bases moved when the window moves, and the preset length can be understood as the length of the window, that is, the number of bases that can be included in the window.
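The sliding window construction can be sketched in a few lines. The function name `build_template_library` and its parameter names are assumptions for illustration; `step` corresponds to the preset step length and `length` to the preset length.

```python
from typing import List


def build_template_library(genome: str, step: int, length: int) -> List[str]:
    """Cut fixed-length template sequences from a genome with a sliding window."""
    if step <= 0 or length <= 0:
        raise ValueError("step and length must be positive")
    # The last window starts at len(genome) - length, so every template
    # contains exactly `length` bases.
    return [genome[i:i + length]
            for i in range(0, len(genome) - length + 1, step)]
```

A step length smaller than the window length produces overlapping templates, which increases the size of the library but also the chance that a read matches some template exactly.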

Since the first nucleic acid sequence is a certain sequence in the known sequences, the wrong bases in the first nucleic acid sequence can be screened out by the comparison method, and the quality of the sample number is improved.

In another possible embodiment, in order to improve the training speed of the second model, the sample digital signal may be scaled after it is acquired, and the second model to be trained is then trained based on the scaled sample digital signal, the base type of the scaled sample digital signal, and the quality score corresponding to the base type output by the unsupervised-trained first model.

In practice, the scaling (compression) process may include, but is not limited to: a quantile-based compression method, a linear function-based compression method, a maximum-minimum-based compression method, a Z-score-based compression method, an S-curve-based compression method, and the like. For example, the compression method based on the maximum and minimum values may determine the maximum value (Xmax) and the minimum value (Xmin) in the initial intensity value data set, calculate the range (R = Xmax - Xmin), subtract the minimum value (Xmin) from the intensity value (X) of the sample digital signal, and divide the result by the range (R) to obtain the compressed intensity value X', i.e., X' = (X - Xmin)/(Xmax - Xmin).

Here, the intensity value of the sample digital signal may be determined according to a signal type of the sample digital signal, for example, if the sample digital signal is a fluorescence signal, the intensity value of the sample digital signal may refer to a pixel value.
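The maximum-minimum compression X' = (X - Xmin)/(Xmax - Xmin) can be illustrated with a short sketch. The function name `min_max_scale` is hypothetical, and returning zeros for a constant data set is an assumption made here only to avoid division by zero.

```python
def min_max_scale(intensities):
    """Scale intensity values into [0, 1] using X' = (X - Xmin) / (Xmax - Xmin)."""
    x_min, x_max = min(intensities), max(intensities)
    value_range = x_max - x_min
    if value_range == 0:
        # Assumption: a constant data set is mapped to all zeros.
        return [0.0 for _ in intensities]
    return [(x - x_min) / value_range for x in intensities]


scaled = min_max_scale([20, 70, 120])
print(scaled)
```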

In another possible implementation, the number of the acquired sample digital signals may be large, and the sample digital signals may also be sampled in order to improve the training speed of the model and avoid overfitting the model.

For example, a signal interval corresponding to each sample digital signal may be determined based on the intensity value of the sample digital signal. For any signal interval, the sample digital signals in that interval may be sampled according to a sampling ratio, which is an adjustable preset parameter. The second model to be trained is then trained based on the sampled sample digital signals, the base types of the sampled sample digital signals, and the quality scores corresponding to the base types output by the unsupervised-trained first model.

Specifically, when determining the signal interval corresponding to each sample digital signal, the dimensionality of the sample digital signal may, for example, be taken into account.

Taking the sample digital signal as a one-dimensional signal as an example, each signal interval may be determined according to a preset interval width, for example, the signal intervals may be 0 to 50, 51 to 100, ..., and then the signal interval corresponding to each sample digital signal may be determined according to the intensity value of each sample digital signal.

Taking the sample digital signal as a two-dimensional signal as an example, as shown in Fig. 4, a coordinate system may be constructed by taking each dimension as a coordinate axis, and each coordinate axis is divided according to its signal intervals so that a two-dimensional grid is formed, each grid cell being one signal interval. The grid cell, i.e., the signal interval, corresponding to each sample digital signal is then determined according to the two-dimensional intensity value of that sample digital signal.
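The interval-based sampling can be sketched as below for signals of any dimensionality. This is an illustration under stated assumptions: intervals are half-open with a common width on every axis (so 0 up to but excluding 50, 50 up to but excluding 100, and so on), at least one signal is kept per non-empty cell, and the names `interval_key` and `sample_by_interval` are hypothetical.

```python
import random
from collections import defaultdict


def interval_key(signal, interval_width):
    """Map an intensity tuple (1-D, 2-D, ...) to the index of its grid cell."""
    return tuple(int(value // interval_width) for value in signal)


def sample_by_interval(signals, interval_width, sampling_ratio, seed=0):
    """Sample the same ratio of signals from every occupied signal interval."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for signal in signals:
        cells[interval_key(signal, interval_width)].append(signal)
    sampled = []
    for members in cells.values():
        # Assumption: keep at least one signal from each non-empty cell.
        keep = min(len(members), max(1, round(len(members) * sampling_ratio)))
        sampled.extend(rng.sample(members, keep))
    return sampled


signals = [(5.0, 5.0), (10.0, 20.0), (30.0, 40.0), (45.0, 45.0),
           (60.0, 10.0), (70.0, 20.0)]
sampled = sample_by_interval(signals, interval_width=50, sampling_ratio=0.5)
print(len(sampled), sampled)
```

Sampling per interval rather than over the whole data set keeps sparsely populated intensity regions represented in the training samples.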

Here, the process of screening bases (such as screening by quality score or by alignment), the process of sampling, and the process of compressing may be performed in any order.

The above describes an exemplary process of constructing the training samples used for training the second model; the training process of the second model is described below.

The process of training the second model to be trained may include N rounds of iterative training, each round of iterative training includes training M classification trees, each classification tree corresponds to one base type, the target model includes M × N target classification trees, N is an adjustable preset positive integer, and M is the number of base types.

In practical application, the base types to be detected generally include four types, so that four classification trees can be trained in each iteration training process, each classification tree can be a binary classification tree, and each binary classification tree is used for detecting one base type.

Illustratively, when training the second model to be trained, the following steps may be performed:

Step c1: for any round of iterative training, determining the initial prediction result corresponding to the sample digital signal output by the previous round of iterative training, where the initial prediction result of the first round of iterative training is a preset initial value.

Step c2: determining the optimal splitting node of each classification tree in the current round of iterative training and the weight corresponding to the leaf node of each classification tree based on the base type corresponding to the sample digital signal, the initial prediction result and the quality score corresponding to the base type, and obtaining the target classification trees of the current round of iterative training.

The optimal splitting node is a classification condition with the highest splitting gain after splitting, and the weight of the leaf node of any classification tree is used for representing the probability that the sample digital signal falling into the leaf node is the base type corresponding to the classification tree.

Here, before training on the sample digital signal, the base type of the sample digital signal may be converted into a supervised sequence. That is, different sequence positions may be assigned to different base types; for any sample digital signal, the value at the sequence position corresponding to its base type is 1, and the value at every other sequence position is 0.

Illustratively, if there are 4 base types, i.e., A, T, C, and G, whose corresponding sequence positions are 1, 2, 3, and 4 respectively, then if the base type of a sample digital signal is A, the supervised sequence of the sample digital signal is (1, 0, 0, 0), and if the base type of a sample digital signal is C, the supervised sequence of the sample digital signal is (0, 0, 1, 0).
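The conversion of a base type into such a sequence is a one-hot encoding. A minimal sketch, using the position order A, T, C, G from the example and a hypothetical helper name `to_supervised_sequence`:

```python
# Sequence positions 1, 2, 3, 4 correspond to A, T, C, G as in the example above.
BASE_ORDER = ("A", "T", "C", "G")


def to_supervised_sequence(base):
    """Return a tuple with 1 at the position of the base and 0 elsewhere."""
    if base not in BASE_ORDER:
        raise ValueError("unknown base type: " + repr(base))
    return tuple(1 if base == candidate else 0 for candidate in BASE_ORDER)


print(to_supervised_sequence("A"))
print(to_supervised_sequence("C"))
```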

Taking the first round of iterative training as an example, the initial prediction result is used to characterize the probability that the sample digital signal is of each base type. If there are 4 base types, the corresponding initial prediction result is (0.25, 0.25, 0.25, 0.25), that is, the probability that the sample digital signal is of each base type is 0.25.

When determining the optimal splitting node of each classification tree in the current round of iterative training based on the base type corresponding to the sample digital signal, the initial prediction result and the quality score corresponding to the base type, a plurality of candidate splitting nodes corresponding to the sample digital signal may be traversed. For any splitting node, the splitting gain after splitting at that node and the weights corresponding to the leaf nodes of each classification tree are determined based on the base type corresponding to the sample digital signal, the initial prediction result and the quality score corresponding to the base type. The optimal splitting node is then determined based on the splitting gain corresponding to each splitting node.

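Illustratively, for a single base, take the first split of each tree as an example, where the predicted value for each tree is 0.25. Based on the formulas for the first-order and second-order partial derivatives of the loss function $l$,

$$g_i = w_i \frac{\partial\, l(y_i, \hat{y}_i)}{\partial \hat{y}_i} \quad \text{and} \quad h_i = w_i \frac{\partial^2 l(y_i, \hat{y}_i)}{\partial \hat{y}_i^2},$$

the first-order and second-order partial derivative losses of each sample can be obtained, where $w_i$ represents the quality score of the $i$-th base, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value.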
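To make the role of the quality score concrete, the sketch below computes quality-weighted first-order and second-order terms for one sample. The text does not name the loss function, so a cross-entropy loss is assumed here, for which the first derivative is (predicted - true) and the diagonal second derivative is predicted × (1 - predicted); the function name `weighted_grad_hess` is hypothetical.

```python
def weighted_grad_hess(y_true, y_pred, quality_weight):
    """Quality-weighted first- and second-order terms for one sample.

    Assumption: cross-entropy loss, so for each base type
    grad = w * (p - y) and hess = w * p * (1 - p).
    """
    grads = [quality_weight * (p - y) for y, p in zip(y_true, y_pred)]
    hess = [quality_weight * p * (1.0 - p) for p in y_pred]
    return grads, hess


# A sample labelled A, first round (all predictions 0.25), quality score 2.0.
g, h = weighted_grad_hess((1, 0, 0, 0), (0.25, 0.25, 0.25, 0.25), quality_weight=2.0)
print(g)
print(h)
```

A base with a higher quality score contributes proportionally larger terms, so it has more influence on where each tree splits.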

Further, the calculation formula of the objective function,

$$Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda},$$

can be used to obtain the objective function value before the leaf node of each tree is split for the first time, where $T$ is the number of leaf nodes; the splitting gain is the difference between the objective function values before and after splitting.

All sample digital signals can then be split at a certain node, and the sum of the sample weights of all samples at the leaf node after splitting can be calculated.

The weight calculation formula of the leaf node may be:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$

where $G_j$ is the cumulative sum of the first-order partial derivative losses of the samples contained in leaf node $j$, $H_j$ is the cumulative sum of the second-order partial derivative losses of the samples contained in leaf node $j$, and $\lambda$ is a constant taken from the regularization term.

If the sum of the sample weights of all samples of a certain leaf node after splitting is smaller than the minimum sample weight threshold, the splitting node can be discarded, and if the sum of the sample weights is larger than the minimum sample weight threshold, the splitting gain of the splitting node can be calculated.

The splitting gain can be used to characterize the variation of the loss values before and after splitting, and can be calculated by the following formula:

$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right]$$

where $Gain$ denotes the splitting gain, $G_L$ and $G_R$ respectively represent the sums of the first-order partial derivative losses of the left node and the right node after splitting, $H_L$ and $H_R$ respectively represent the sums of the second-order partial derivative losses of the left node and the right node after splitting, and $\lambda$ is the regularization term.

The splitting gain of any retained splitting node is required to be larger than the gain threshold γ; splitting nodes whose splitting gain is smaller than the gain threshold are discarded.
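The leaf-node weight, the splitting gain, and the gain threshold check can be sketched numerically as follows. The closed forms used here, w = -G/(H + λ) and Gain = ½[G_L²/(H_L + λ) + G_R²/(H_R + λ) - (G_L + G_R)²/(H_L + H_R + λ)], are the standard XGBoost-style expressions and are an assumption as far as this disclosure is concerned; the function names are hypothetical.

```python
def leaf_weight(G, H, lam):
    """Leaf weight from summed first-order (G) and second-order (H) terms."""
    return -G / (H + lam)


def split_gain(G_L, H_L, G_R, H_R, lam):
    """Gain of splitting one node into a left and a right child."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R))


def accept_split(gain, gamma):
    """A split is retained only if its gain exceeds the gain threshold gamma."""
    return gain > gamma


gain = split_gain(G_L=-2.0, H_L=1.0, G_R=2.0, H_R=1.0, lam=1.0)
print(gain, accept_split(gain, gamma=0.5))
print(leaf_weight(G=-1.5, H=0.5, lam=1.0))
```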

After the splitting gain of each splitting node is calculated, the splitting node with the maximum splitting gain is retained and the next layer of splitting is performed on that basis. Each time a split is completed, the depth of the classification tree increases by 1, and the process is repeated until the depth of the tree reaches K, where K is an adjustable preset parameter.
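The traversal of candidate splitting nodes for a single layer can be sketched as below for a one-dimensional signal. Several points are assumptions: candidate thresholds are midpoints between adjacent distinct intensity values, the "sum of sample weights" of a child is taken to be the sum of its second-order terms (the XGBoost convention), and the gain uses the standard XGBoost-style expression. The name `best_split` and its parameters are hypothetical.

```python
def best_split(values, grads, hess, lam=1.0, gamma=0.0, min_child_weight=0.0):
    """Return (threshold, gain) of the highest-gain split, or None if none qualifies."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    G_total, H_total = sum(grads), sum(hess)
    parent_score = G_total ** 2 / (H_total + lam)
    best = None
    G_L = H_L = 0.0
    for rank in range(len(order) - 1):
        i, nxt = order[rank], order[rank + 1]
        G_L += grads[i]
        H_L += hess[i]
        if values[i] == values[nxt]:
            continue  # no threshold separates two equal values
        G_R, H_R = G_total - G_L, H_total - H_L
        if H_L < min_child_weight or H_R < min_child_weight:
            continue  # a child falls below the minimum sample weight threshold
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam) - parent_score)
        if gain > gamma and (best is None or gain > best[1]):
            best = ((values[i] + values[nxt]) / 2, gain)
    return best


result = best_split(values=[1, 2, 10, 11], grads=[-1, -1, 1, 1], hess=[1, 1, 1, 1])
print(result)
```

Growing a tree to depth K would apply the same search recursively to the samples falling on each side of the chosen threshold.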

After the splitting is finished, the weights of the final leaf nodes can be calculated according to the leaf-node weight formula above, and the initial prediction result of the sample digital signal is updated. Specifically, the sample digital signal can be input into the four classification trees of the current round of iterative training, the leaf node into which the sample digital signal finally falls is determined according to the splitting nodes of each classification tree, and the initial prediction result is updated according to the weight of that leaf node. Illustratively, this may be represented by the following formula:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta f_t(x_i)$$

where $\hat{y}_i^{(t-1)}$ is the predicted value obtained in the previous iteration, which is 0.25 in the first iteration; $f_t(x_i)$ is the weight of the leaf node into which the sample falls in the current iteration; and $\eta$ is the learning rate.

In a possible embodiment, after the initial prediction result is updated, the updated initial prediction result may be further subjected to normalization conversion, for example, the updated initial prediction result may be subjected to softmax conversion.
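The update of the prediction result followed by softmax normalization can be sketched as follows. The additive update with a learning rate follows the description above; the function names and the example leaf weights are hypothetical.

```python
import math


def update_prediction(previous, leaf_weights, eta):
    """Add the learning-rate-scaled leaf weight of each tree to the previous prediction."""
    return [p + eta * w for p, w in zip(previous, leaf_weights)]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# First round: previous prediction is 0.25 for each of A, T, C, G.
updated = update_prediction([0.25] * 4, [1.0, -0.5, -0.5, -0.5], eta=0.1)
probs = softmax(updated)
print(updated)
print(probs)
```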

After one round of iterative training is finished, four classification trees can be obtained, wherein each classification tree corresponds to one base; and obtaining a target classification model after completing N rounds of iterative training, wherein N is an adjustable preset parameter.

It should be noted that four new classification trees are initialized in each round of iterative training; the optimal splitting nodes of different classification trees may be different, and the optimal splitting nodes of classification trees corresponding to the same base type in different rounds may also be different.

In a possible embodiment, after the training of the second model is completed, the second model may be tested, and the adjustable preset parameter may be adjusted if the test result does not satisfy the preset condition.

Illustratively, the testing process may include the following steps:

Step d1: obtaining test sample data.

Here, the type of the test sample data may be the same as the type of the sample digital signal. The base type of the test sample data may be obtained together with the test sample data and is used to determine the test result of the second model; the specific test process is described below. The method for determining the base type of the test sample data may be the same as the method for determining the base type of the sample digital signal, which will not be described herein again.

Step d2: respectively inputting the test sample data into the target classification trees obtained by each round of iterative training to obtain a first intermediate prediction result of each target classification tree.

In practical application, after the test sample data is input into the target classification tree of each iteration training, the output of the test sample data can be a prediction vector, and vector values of the prediction vector respectively correspond to output values of each target classification tree of the iteration training in the current round, namely the weight of a leaf node where the test sample data finally falls.

Exemplarily, as shown in Fig. 5, suppose the target classification trees A1, A2, A3, and A4 obtained by the first round of iterative training correspond to the four bases A, T, C, and G respectively, the target classification trees B1, B2, B3, and B4 obtained by the second round of iterative training correspond to the four bases A, T, C, and G respectively, ..., and the target classification trees N1, N2, N3, and N4 are obtained by the N-th round of iterative training. The test sample data is input to the target classification trees A1, A2, A3, and A4, each of which outputs the weight of the leaf node into which the test sample data falls, giving four weights; the test sample data is likewise input to the target classification trees B1, B2, B3, and B4, each of which outputs the weight of the leaf node into which the test sample data falls; and so on, so that four weight values are output for each round of iterative training.

Step d3: fusing the first intermediate prediction results of the target classification trees corresponding to the same base type, and determining the target prediction result corresponding to the test sample data.

Illustratively, the first intermediate prediction results of the target classification trees corresponding to the same base type may be added to obtain second intermediate prediction results corresponding to each base type; and then taking the base type with the highest probability in the second intermediate prediction results as the target prediction results.

It should be noted here that four weight values (i.e. the first intermediate prediction result) of each iteration training need to be added to the predicted value obtained in the previous iteration, and in order to ensure the uniformity of the data, normalization conversion may be performed, for example, by softmax processing.
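Steps d2 and d3 can be sketched as follows: each round contributes one leaf weight per base type, the weights for the same base type are accumulated onto the initial value, and the result is normalized by softmax. Representing each tree as a callable, scaling by the learning rate at prediction time, and the helper names `stump` and `predict_base` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stump(threshold, left, right):
    """Hypothetical depth-1 tree: returns a leaf weight for a 1-D intensity."""
    return lambda x: left if x < threshold else right


def predict_base(signal, rounds, eta, initial=0.25, bases=("A", "T", "C", "G")):
    """Fuse the outputs of the M trees of every round and pick the most likely base."""
    scores = [initial] * len(bases)
    for trees in rounds:  # one list of M trees per round of iterative training
        for k, tree in enumerate(trees):
            scores[k] += eta * tree(signal)  # trees for the same base are summed
    probs = softmax(scores)
    best = max(range(len(bases)), key=lambda k: probs[k])
    return bases[best], probs


one_round = [stump(50, 1.0, -1.0), stump(50, -1.0, 1.0),
             stump(50, -1.0, -1.0), stump(50, -1.0, -1.0)]
rounds = [one_round, one_round]  # N = 2 rounds, M = 4 trees per round
print(predict_base(10, rounds, eta=0.5))
print(predict_base(90, rounds, eta=0.5))
```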

Step d4: determining the test result of the second model based on the target prediction result.

For example, when determining the test result of the second model based on the target prediction result, any one of the following methods may be used:

Method A: constructing a confusion matrix based on the target prediction result corresponding to the test sample data and the base type corresponding to the test sample data, and determining the test result corresponding to the target prediction result based on the confusion matrix.

Illustratively, the constructed confusion matrix may be as shown in table 1 below:

TABLE 1

When determining the test result corresponding to the target prediction result based on the confusion matrix, for example, the proportion of the test sample data with the same true value and predicted value in the confusion matrix in the whole test sample data may be calculated, and if the calculated proportion exceeds a preset proportion, it may be determined that the second model passes the test, and the second model that passes the test is the target classification model.
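A minimal sketch of this check: the confusion matrix counts (true base, predicted base) pairs, and the proportion on the diagonal is compared with the preset proportion. The function names and the example threshold values are hypothetical.

```python
def confusion_matrix(true_bases, predicted_bases, bases=("A", "T", "C", "G")):
    """Rows are true base types, columns are predicted base types."""
    index = {base: i for i, base in enumerate(bases)}
    matrix = [[0] * len(bases) for _ in bases]
    for truth, prediction in zip(true_bases, predicted_bases):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def proportion_correct(matrix):
    """Share of test samples whose true value equals the predicted value."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def passes_test(matrix, preset_proportion):
    return proportion_correct(matrix) > preset_proportion


matrix = confusion_matrix(list("AATCG"), list("ATTCG"))
print(matrix)
print(proportion_correct(matrix), passes_test(matrix, preset_proportion=0.75))
```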

Method B: combining the target prediction results of the test sample data to obtain a plurality of predicted nucleic acid sequences, comparing the predicted nucleic acid sequences with the template nucleic acid sequences in a pre-constructed reference gene template library or with the genome sequences in a genome database, and determining the test result corresponding to the target prediction result based on the comparison result.

Here, the method for obtaining a plurality of predicted nucleic acid sequences by combining the target prediction results of the test sample data may be the same as the method for obtaining a plurality of nucleic acid sequences by combining the base types of the first sample digital signals, and will not be described herein again.

Theoretically, the nucleic acid sequence obtained by combining the target prediction results of the test sample data should also be a template nucleic acid sequence in a reference gene template library or a genome sequence in a genome database, so that the wrong bases contained in the predicted nucleic acid sequence can be determined by comparing the predicted nucleic acid sequence with each template nucleic acid sequence in a pre-constructed reference gene template library or the genome sequence in the genome database, and if the comparison rate is higher than the preset comparison rate, the second model can be determined to pass the test.
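The alignment-rate check can be sketched as the fraction of predicted nucleic acid sequences that occur exactly in some reference sequence. Using exact substring matching as the alignment criterion is an assumption here (a real aligner may tolerate mismatches), and the function names are hypothetical.

```python
def alignment_rate(predicted_sequences, references):
    """Fraction of predicted sequences found verbatim in at least one reference."""
    if not predicted_sequences:
        return 0.0
    aligned = sum(
        1 for seq in predicted_sequences
        if any(seq in reference for reference in references)
    )
    return aligned / len(predicted_sequences)


def passes_alignment_test(predicted_sequences, references, preset_rate):
    return alignment_rate(predicted_sequences, references) > preset_rate


references = ["ACGTACGT"]
predicted = ["ACGT", "CGTA", "ACGG", "GTAC"]
rate = alignment_rate(predicted, references)
print(rate, passes_alignment_test(predicted, references, preset_rate=0.7))
```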

In another possible embodiment, when the second model is tested, only a second intermediate prediction result of a single base may be combined, the second intermediate prediction result may be a prediction vector, and each value in the prediction vector is a probability that the test sample data output by the second model is of each base type.

When determining the test result based on the second intermediate prediction result of a single base, theoretically, the probability that the test sample data is of the correct base type should be much higher than the probabilities of the other, wrong base types, and the larger the gap between the two probabilities, the more likely it is that the test sample data is of the correct base type.

Based on this, exemplarily, the calculation can be performed by the following formula:

where Score represents the confidence score, which is computed from the highest probability value among the second intermediate prediction results and the probability values other than the maximum probability value.

If the second model does not pass the test, this indicates that the parameter settings used in training the model are unreasonable, so the adjustable preset parameters may be adjusted, which may specifically include:

the learning rate η, the maximum depth K of the tree, the minimum threshold for the sum of leaf-node sample weights, the regularization term λ, the splitting gain threshold γ, the number of iterations N, and the like.

The parameter adjustment method includes, but is not limited to: traditional manual search, grid search, random search, and Bayesian search.
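Of the listed methods, grid search is the simplest to sketch: every combination of candidate parameter values is evaluated and the best-scoring one is kept. The `evaluate` callable below is a stand-in for training the second model with the given parameters and returning its test score; it, the parameter names `eta` and `max_depth`, and the candidate values are all hypothetical.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination and return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    # Placeholder score that peaks at eta = 0.1 and max_depth = 4.
    return -((params["eta"] - 0.1) ** 2) - (params["max_depth"] - 4) ** 2


grid = {"eta": [0.05, 0.1, 0.3], "max_depth": [2, 4, 6]}
best_params, best_score = grid_search(grid, toy_evaluate)
print(best_params, best_score)
```

Random search and Bayesian search follow the same interface but choose which combinations to evaluate differently, which matters when each evaluation requires a full training run.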

It should be noted that the above description takes the XGBoost model as an example of the target classification model; the target classification model may also be another type of model, such as a random forest model or a gradient boosting decision tree model.

The base determination method may be executed at a training end when the target classification model is trained. After training at the training end is finished, the target classification model can be deployed at any application end; the application end can obtain data to be recognized and determine the recognition result of the data to be recognized through the deployed target classification model, so the method is highly general.

The method provided by the present disclosure is to perform base detection on signal strength by using a machine learning method, and the method may also be used in other scenarios of classification according to signal strength, that is, the detection result may be of other types besides the base type, and the present disclosure is not limited thereto.

It will be understood by those of skill in the art that in the above method of the present embodiment, the order of writing the steps does not imply a strict order of execution and does not impose any limitations on the implementation, as the order of execution of the steps should be determined by their function and possibly inherent logic.

Based on the same inventive concept, the embodiment of the present disclosure further provides a base determination apparatus corresponding to the base determination method, and since the principle of the apparatus in the embodiment of the present disclosure for solving the problem is similar to the base determination method described above in the embodiment of the present disclosure, the implementation of the apparatus can refer to the implementation of the method, and the repeated parts are not described again.

Referring to Fig. 6, a schematic diagram of an architecture of a base determination apparatus provided in the embodiment of the present disclosure is shown. The apparatus includes: an obtaining module 601, a training module 602, and a testing module 603; wherein:

an obtaining module 601, configured to obtain a sample digital signal and a base type of the sample digital signal, where the base type of the sample digital signal is determined based on a first model of unsupervised training;

the training module 602 is configured to train a second model to be trained based on the sample digital signal, the base type of the sample digital signal, and the quality score corresponding to the base type output by the unsupervised training first model, to obtain a target classification model, and determine the base type corresponding to the digital signal to be recognized through the target classification model.

In one possible embodiment, after acquiring the sample digital signal and the base type of the sample digital signal, the acquiring module 601 is further configured to:

the training module 602, when training a second model to be trained based on the sample digital signal, the base type of the sample digital signal, and the quality score corresponding to the base type output by the unsupervised training first model, is configured to:

In a possible embodiment, the training module 602, when training the second model to be trained based on the second sample digital signal corresponding to the base type included in the first nucleic acid sequence, the base type of the second sample digital signal, and the quality score corresponding to the base type, is configured to:

In a possible embodiment, the obtaining module 601 is further configured to construct the reference gene template library according to the following method:

obtaining a genome sequence in the genome database;

In a possible implementation, after acquiring the sample digital signal, the acquiring module 601 is further configured to:

scaling the sample digital signal;

and training a second model to be trained based on the sample digital signal obtained after sampling, the base type of the sample digital signal obtained after sampling and the quality score corresponding to the base type output by the unsupervised training first model.

In a possible implementation manner, the process of training the second model to be trained includes N rounds of iterative training, each round of iterative training includes training M classification trees, each classification tree corresponds to one base type, the target model includes M × N target classification trees, N is a preset positive integer, and M is the number of base types;

In a possible implementation, the training module 602, when determining the best splitting node of each classification tree in the iterative training round and the weight corresponding to the leaf node of each classification tree based on the base type corresponding to the sample digital signal, the initial prediction result and the quality score corresponding to the base type, is configured to:

In a possible implementation, after the training of the second model to be trained is completed, the apparatus further includes a testing module 603 configured to:

obtaining test sample data;

the testing module 603, when fusing the first intermediate prediction results of the target classification trees corresponding to the same base type and determining the target prediction result corresponding to the test sample data, is configured to:

In a possible implementation, the testing module 603, when determining the testing result of the second model based on the target prediction result, is configured to:

The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.

Based on the same technical concept, the embodiment of the disclosure also provides computer equipment. Referring to Fig. 7, a schematic structural diagram of a computer device 700 provided in the embodiment of the present disclosure includes a processor 701, a memory 702, and a bus 703. The memory 702 is used for storing execution instructions and includes an internal memory 7021 and an external memory 7022. The internal memory 7021 is used to temporarily store operation data in the processor 701 and data exchanged with the external memory 7022, such as a hard disk; the processor 701 exchanges data with the external memory 7022 through the internal memory 7021. When the computer device 700 operates, the processor 701 and the memory 702 communicate with each other through the bus 703, so that the processor 701 executes the following instructions:

In one possible embodiment, the processor 701 executes instructions that, after obtaining the sample digital signal and the base type of the sample digital signal, the method further includes:

In a possible embodiment, the instructions executed by the processor 701, the training the second model to be trained based on the second sample digital signal corresponding to the base type included in the first nucleic acid sequence, the base type of the second sample digital signal, and the quality score corresponding to the base type, includes:

In one possible embodiment, the processor 701 executes instructions that further include constructing the library of reference gene templates according to the following method:

obtaining a genome sequence in the genome database;

In a possible implementation, in the instructions executed by the processor 701, after the obtaining the sample digital signal, the method further includes:

scaling the sample digital signal;

In a possible implementation manner, in the instructions executed by the processor 701, a process of training the second model to be trained includes N rounds of iterative training, where each round of iterative training includes training M classification trees, each classification tree corresponds to one base type, the target model includes M × N target classification trees, N is a preset positive integer, and M is the number of base types;

In a possible embodiment, the instructions executed by the processor 701 for determining the best splitting node of each classification tree in the current iteration training and the corresponding weight of the leaf node of each classification tree based on the base type corresponding to the sample digital signal, the initial prediction result and the quality score corresponding to the base type include:

traversing a plurality of splitting nodes corresponding to the sample digital signal, and determining splitting gain after splitting according to the splitting node and weight corresponding to leaf nodes of each classification tree for any splitting node based on the base type corresponding to the sample digital signal, the initial prediction result and the quality score corresponding to the base type;

In a possible implementation manner, in the instructions executed by the processor 701, after the training of the second model to be trained is completed, the method further includes:

obtaining test sample data;

In one possible embodiment, the processor 701 executes instructions in which the first intermediate prediction result of the target classification tree includes a probability that the test sample data belongs to a base type corresponding to the target classification tree;

In a possible implementation, the determining, by the processor 701, the test result of the second model based on the target prediction result includes:

Embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the base determination method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.

The embodiments of the present disclosure also provide a computer program product, where the computer program product carries a program code, and instructions included in the program code may be used to execute the steps of the base determination method in the foregoing method embodiments, which may be referred to in detail in the foregoing method embodiments, and are not described herein again.

The computer program product may be implemented by hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK), or the like.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Finally, it should be noted that the above-mentioned embodiments are merely specific embodiments of the present disclosure, used to illustrate the technical solutions of the present disclosure rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can, within the technical scope of the present disclosure, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present disclosure, and should be construed as being included within its protection scope. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims

1. A method of base determination, comprising:

acquiring a sample digital signal and a base type of the sample digital signal, wherein the base type of the sample digital signal is determined based on a first model of unsupervised training;

2. The method of claim 1, wherein after obtaining the sample digital signal and the base type of the sample digital signal, the method further comprises:

3. The method according to claim 2, wherein the training of the second model to be trained based on the second sample digital signal corresponding to the base type included in the first nucleic acid sequence, the base type of the second sample digital signal, and the quality score corresponding to the base type comprises:

4. The method of claim 3, further comprising constructing the library of reference gene templates according to the following method:

obtaining a genome sequence in the genome database;

5. The method of any of claims 1-4, wherein after acquiring the sample digital signal, the method further comprises:

scaling the sample digital signal;

6. The method of any of claims 1-4, wherein after acquiring the sample digital signal, the method further comprises:

sampling the sample digital signals in any signal interval according to the sampling proportion;

7. The method according to claim 1, wherein the process of training the second model to be trained comprises N rounds of iterative training, each round of iterative training comprises training M classification trees, each classification tree corresponds to one base type, the target model comprises M × N target classification trees, N is a preset positive integer, and M is the number of base types;

8. A base determination apparatus comprising:

an acquisition module for acquiring a sample digital signal and a base type of the sample digital signal, the base type of the sample digital signal being determined based on a first model of unsupervised training;

9. A computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the computer device is running, and the machine-readable instructions, when executed by the processor, performing the steps of the method of base determination of any one of claims 1 to 7.

10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, performs the steps of the method for base determination according to any one of claims 1 to 7.
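
By way of non-limiting illustration, the training structure recited in claim 7 (N rounds of iterative training, each round training M trees, one per base type, for M × N trees in total) can be sketched as follows. This sketch assumes a gradient-boosting-style scheme with a softmax over per-class scores, which the claim text shown here does not state. The depth-1 trees, the one-dimensional toy signals, the learning rate, and all function and variable names (`fit_stump`, `train`, `predict`, `BASES`, `N_ROUNDS`) are hypothetical choices made for the example and are not taken from the disclosure.

```python
import math

BASES = "ACGT"          # assumed base types, so M = 4
M = len(BASES)
N_ROUNDS = 20           # N, a preset positive integer (value chosen arbitrarily)

def softmax(scores):
    """Convert per-class scores into probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fit_stump(xs, residuals):
    """Fit a depth-1 tree on a 1-D feature by least squares.

    Returns (threshold, left_value, right_value).
    """
    best_err, best_stump = None, None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right) if right else 0.0
        err = sum((r - (lv if x <= t else rv)) ** 2
                  for x, r in zip(xs, residuals))
        if best_err is None or err < best_err:
            best_err, best_stump = err, (t, lv, rv)
    return best_stump

def stump_value(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def train(xs, labels, n_classes, n_rounds, lr=0.5):
    """Run n_rounds rounds; each round fits one tree per class."""
    scores = [[0.0] * n_classes for _ in xs]
    rounds = []
    for _ in range(n_rounds):
        probs = [softmax(s) for s in scores]
        trees = []
        for k in range(n_classes):
            # Residual of class k: one-hot label minus predicted probability.
            residuals = [(1.0 if y == k else 0.0) - p[k]
                         for y, p in zip(labels, probs)]
            t, lv, rv = fit_stump(xs, residuals)
            trees.append((t, lr * lv, lr * rv))  # learning rate folded in
        for i, x in enumerate(xs):
            for k in range(n_classes):
                scores[i][k] += stump_value(trees[k], x)
        rounds.append(trees)
    return rounds

def predict(rounds, x):
    """Sum the per-class tree outputs over all rounds; return the top class index."""
    n_classes = len(rounds[0])
    scores = [sum(stump_value(trees[k], x) for trees in rounds)
              for k in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)

# Toy 1-D "signals" with labels indexing into BASES (fabricated for illustration).
signals = [0.1, 0.2, 1.1, 1.2, 2.1, 2.2, 3.1, 3.2]
labels = [0, 0, 1, 1, 2, 2, 3, 3]
model = train(signals, labels, M, N_ROUNDS)
total_trees = sum(len(trees) for trees in model)  # M * N trees
```

Here `model` is a list of N rounds, each holding M trees, so the trained model contains M × N trees as recited in claim 7. A call such as `BASES[predict(model, 0.15)]` returns the base letter whose summed tree score is highest for a given signal value.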