CN117831630B

CN117831630B - Method and device for constructing training data set for base recognition model and electronic equipment

Info

Publication number: CN117831630B
Application number: CN202410245961.2A
Authority: CN
Inventors: 孙琛; 王大千
Original assignee: Beijing Puyi Biotechnology Co ltd
Current assignee: Beijing Puyi Biotechnology Co ltd
Priority date: 2024-03-05
Filing date: 2024-03-05
Publication date: 2024-05-17
Anticipated expiration: 2044-03-05
Also published as: CN117831630A

Abstract

The invention relates to the field of biological gene sequencing, and discloses a method, a device and electronic equipment for constructing a training data set for a base recognition model, wherein the method comprises the following steps: determining a target base sequence according to the initial electric signal corresponding to the target nucleic acid sequence; determining an expected electric signal corresponding to the target base sequence based on the nanopore sequencing signal simulation tool; determining the base position distribution corresponding to the initial electric signal according to the expected electric signal and the initial electric signal; segmenting the initial electric signals according to the preset electric signal length and base position distribution to obtain a plurality of training electric signals; according to the plurality of training electric signals and the training base sequence corresponding to each training electric signal, determining a training data set corresponding to a base recognition model, wherein the base recognition model is used for carrying out base recognition on the electric signal corresponding to the nucleic acid sequence to be recognized. Through the training data set determined by the embodiment of the disclosure, the training efficiency of the base recognition model can be improved, and the performance requirement of hardware equipment used for training is reduced.

Description

Method and device for constructing training data set for base recognition model and electronic equipment

Technical Field

The disclosure relates to the field of biological gene sequencing, and in particular relates to a method, a device and electronic equipment for constructing a training data set for a base recognition model.

Background

Neural networks achieve high accuracy in nanopore sequencing base recognition. Training a neural network requires a large training data set in which accurate electrical signals are paired one-to-one with base sequences. If the electrical signals in the training data set are long, the hardware used for training may lack sufficient processing performance, so that neural network training cannot proceed normally. It is therefore necessary to segment the longer electrical signals to construct a reasonable training data set.

Disclosure of Invention

In view of this, the present disclosure proposes a technical solution comprising a method, an apparatus, and an electronic device for constructing a training data set for a base recognition model.

According to an aspect of the present disclosure, there is provided a method of constructing a training dataset for a base recognition model, comprising: determining a target base sequence according to the initial electric signal corresponding to the target nucleic acid sequence; determining an expected electrical signal corresponding to the target base sequence based on a nanopore sequencing signal simulation tool; determining a base position distribution corresponding to the initial electrical signal according to the expected electrical signal and the initial electrical signal, wherein the base position distribution is used for indicating the position of each base in the target base sequence in the initial electrical signal; segmenting the initial electric signals according to the preset electric signal length and the base position distribution to obtain a plurality of training electric signals; and determining a training data set corresponding to a base recognition model according to the plurality of training electric signals and training base sequences corresponding to the training electric signals, wherein the base recognition model is used for carrying out base recognition on the electric signals corresponding to the nucleic acid sequences to be recognized.

In one possible implementation manner, the determining the target base sequence according to the initial electrical signal corresponding to the target nucleic acid sequence includes: performing base recognition on the initial electric signal corresponding to the target nucleic acid sequence, and determining a plurality of base recognition results; respectively determining the recognition accuracy corresponding to each base recognition result according to the base labeling information corresponding to the target nucleic acid sequence; and, for any base recognition result, determining the base recognition result as the target base sequence when the recognition accuracy corresponding to the base recognition result is greater than a preset threshold value.

In one possible implementation manner, the determining, according to the expected electrical signal and the initial electrical signal, a base position distribution corresponding to the initial electrical signal includes: segmenting the initial electric signal based on a t-test method, and determining a first signal segment sequence, wherein the first signal segment sequence comprises a plurality of first signal segments; segmenting the expected electric signal based on a t-test method, and determining a second signal segment sequence, wherein the second signal segment sequence comprises a plurality of second signal segments; and determining the base position distribution corresponding to the initial electric signal according to the first signal fragment sequence and the second signal fragment sequence.
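The t-test segmentation step can be illustrated with a small sketch. The disclosure does not specify the window size, the threshold, or the exact form of the test, so the sliding-window Welch t-statistic, the function names `t_statistic` and `segment_signal`, and the parameter values below are all assumptions made for illustration only.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(left, right):
    """Absolute Welch t-statistic between two windows of samples."""
    se = sqrt(variance(left) / len(left) + variance(right) / len(right))
    diff = abs(mean(left) - mean(right))
    if se == 0.0:
        return float("inf") if diff > 0 else 0.0
    return diff / se


def segment_signal(signal, window=3, threshold=4.0):
    """Split a signal into (start, end) half-open segments.

    A boundary is placed at position p when the t-statistic between
    signal[p-window:p] and signal[p:p+window] exceeds the threshold and is a
    local maximum. window and threshold are hypothetical values.
    """
    if window < 2:
        raise ValueError("window must contain at least two samples")
    n = len(signal)
    scores = [0.0] * n
    candidates = range(window, n - window + 1)
    for p in candidates:
        scores[p] = t_statistic(signal[p - window:p], signal[p:p + window])
    boundaries = [0]
    for p in candidates:
        is_peak = scores[p] >= scores[p - 1] and scores[p] > scores[p + 1]
        far_enough = p - boundaries[-1] >= window
        if scores[p] > threshold and is_peak and far_enough:
            boundaries.append(p)
    boundaries.append(n)
    return list(zip(boundaries[:-1], boundaries[1:]))


# A toy signal with one level change between sample 4 and sample 5.
toy = [1.0, 1.1, 0.9, 1.0, 1.05, 5.0, 5.1, 4.9, 5.0, 5.05]
segments = segment_signal(toy)
```

The same function would be applied to both the initial and the expected electrical signal to obtain the first and second signal segment sequences.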

In one possible implementation manner, the determining the base position distribution corresponding to the initial electrical signal according to the first signal segment sequence and the second signal segment sequence includes: determining a distance matrix according to the first signal segment sequence and the second signal segment sequence, wherein the distance matrix comprises m rows and n columns, and the element in the ith row and the jth column of the distance matrix represents the Euclidean distance between the ith first signal segment in the first signal segment sequence and the jth second signal segment in the second signal segment sequence, wherein m, n, i and j are positive integers, m is greater than or equal to i, and n is greater than or equal to j; determining a regular path (warping path) according to the distance matrix based on a dynamic time warping method, wherein the regular path represents the path with the smallest sum of Euclidean distances from the element in the first row and the first column to the element in the mth row and the nth column of the distance matrix; and determining the base position distribution corresponding to the initial electric signal according to the regular path.
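As a minimal sketch of the distance matrix and the dynamic time warping step, the example below summarizes each signal segment by a single mean level, so the Euclidean distance reduces to an absolute difference. That representation, the function names, and the 0-based indices (the text above counts rows and columns from 1) are assumptions for illustration.

```python
def distance_matrix(first_levels, second_levels):
    """m x n matrix of distances between segment mean levels."""
    return [[abs(a - b) for b in second_levels] for a in first_levels]


def dtw_path(dist):
    """Return (path, total_cost) of the minimum-cost warping path.

    The path runs from element (0, 0) to element (m-1, n-1) and moves one
    step down, right, or diagonally at a time.
    """
    m, n = len(dist), len(dist[0])
    inf = float("inf")
    cost = [[inf] * n for _ in range(m)]
    cost[0][0] = dist[0][0]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = dist[i][j] + best_prev
    # Backtrack from the last element to the first one.
    i, j = m - 1, n - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        options = []
        if i > 0 and j > 0:
            options.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            options.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            options.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(options)
        path.append((i, j))
    path.reverse()
    return path, cost[m - 1][n - 1]


# Four measured segments aligned against three simulated segments.
dist = distance_matrix([1.0, 1.1, 5.0, 3.0], [1.0, 5.0, 3.0])
path, total = dtw_path(dist)
```

In this toy case the first two measured segments are both matched to the first simulated segment, which is the kind of many-to-one matching that dynamic time warping allows.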

In one possible implementation manner, the determining, according to the regular path, a base position distribution corresponding to the initial electrical signal includes: determining the base sequence corresponding to each second signal fragment; and determining the base position distribution corresponding to the initial electric signal according to the regular path and the base sequence corresponding to each second signal fragment.
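One way to read this step is sketched below. It assumes that each second signal segment corresponds to exactly one base of the target base sequence and that the regular path is a list of (first segment index, second segment index) pairs; both are simplifications, and `base_positions` and its arguments are hypothetical names. The sketch also does not handle the case where one first signal segment is matched to several second signal segments.

```python
def base_positions(path, first_segments, second_segment_base, target_bases):
    """Map a warping path to (base, start, duration) entries.

    path: list of (i, j) pairs, i indexing first segments, j second segments.
    first_segments: (start, end) half-open sample ranges in the initial signal.
    second_segment_base: for each second segment, the index of its base.
    target_bases: the target base sequence as a string.
    """
    spans = {}
    for i, j in path:
        k = second_segment_base[j]
        start, end = first_segments[i]
        if k in spans:
            old_start, old_end = spans[k]
            spans[k] = (min(old_start, start), max(old_end, end))
        else:
            spans[k] = (start, end)
    return [(target_bases[k], s, e - s) for k, (s, e) in sorted(spans.items())]


positions = base_positions(
    path=[(0, 0), (1, 0), (2, 1), (3, 2)],
    first_segments=[(0, 3), (3, 5), (5, 13), (13, 20)],
    second_segment_base=[0, 1, 2],
    target_bases="GCA",
)
```

With these illustrative inputs the first base starts at sampling point 0 and lasts 5 sampling points, and the second base starts at sampling point 5 and lasts 8 sampling points.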

According to another aspect of the present disclosure, there is provided an apparatus for constructing a training data set for a base recognition model, comprising: the base recognition module is used for determining a target base sequence according to the initial electric signal corresponding to the target nucleic acid sequence; the expected electric signal determining module is used for determining an expected electric signal corresponding to the target base sequence based on the nanopore sequencing signal simulation tool; a base position determining module, configured to determine a base position distribution corresponding to the initial electrical signal according to the expected electrical signal and the initial electrical signal, where the base position distribution is used to indicate a position of each base in the target base sequence in the initial electrical signal; the electric signal segmentation module is used for segmenting the initial electric signal according to the preset electric signal length and the base position distribution to obtain a plurality of training electric signals; the training data set determining module is used for determining a training data set corresponding to a base recognition model according to the plurality of training electric signals and training base sequences corresponding to the training electric signals, wherein the base recognition model is used for carrying out base recognition on the electric signals corresponding to the nucleic acid sequences to be recognized.

According to another aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above-described method.

In the embodiment of the disclosure, the target base sequence can be determined according to the initial electric signal corresponding to the target nucleic acid sequence; based on the nanopore sequencing signal simulation tool, an expected electrical signal corresponding to the target base sequence can be determined; according to the expected electric signal and the initial electric signal, the base position distribution corresponding to the initial electric signal can be determined, and the position of each base in the target base sequence in the initial electric signal can be indicated through the base position distribution corresponding to the initial electric signal; according to the preset electric signal length and base position distribution, the initial electric signal can be segmented to obtain a plurality of training electric signals, so that the initial electric signal with longer length is segmented into a plurality of training electric signals with shorter length; determining a training data set corresponding to a base recognition model according to a plurality of training electric signals and training base sequences corresponding to the training electric signals, wherein the base recognition model is used for carrying out base recognition on the electric signals corresponding to the nucleic acid sequences to be recognized; the base recognition model is trained through the training data set, so that training efficiency can be improved, and performance requirements on hardware equipment used for training are reduced.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 illustrates a flow chart of a method of constructing a training dataset for a base recognition model, according to an embodiment of the present disclosure.

FIG. 2 shows a schematic diagram of a loss curve for training a base recognition model based on a training data set, according to an embodiment of the present disclosure.

FIG. 3 shows a schematic diagram of an accuracy profile of training a base recognition model based on a training data set, according to an embodiment of the present disclosure.

FIG. 4 shows a schematic diagram of a first signal segment according to an embodiment of the present disclosure.

FIG. 5 illustrates a block diagram of an apparatus for constructing a training dataset for a base recognition model, according to an embodiment of the present disclosure.

Fig. 6 shows a block diagram of an electronic device, according to an embodiment of the disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

The term "and/or" herein merely describes an association relationship between associated objects and means that three relationships may exist; for example, "A and/or B" may represent: A exists alone, A and B exist together, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the group consisting of A, B, and C.

In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.

Nanopore sequencing is a novel nucleic acid detection technology that can serve as an alternative to methods based on the polymerase chain reaction. A nanopore (a protein pore or a solid-state pore) is embedded in an insulating artificial membrane to form an ion channel; electrolyte solution is provided on both sides of the membrane, and an electrode is arranged on each side. The potential difference between the electrodes on the two sides of the membrane drives a through-pore current in the channel of the nanopore. When a polymer chain (e.g., single-stranded deoxyribonucleic acid, single-stranded ribonucleic acid, or a protein) passes through the nanopore, different electrical signals are generated because the different monomers on the chain, such as adenine (A), guanine (G), cytosine (C), thymine (T), uracil (U), polypeptides, and amino acids, correspond to different impedances. By detecting the electrical signals (e.g., current signals and voltage signals) generated as the polymer chain passes through the nanopore, the constituent sequence of the polymer chain can be deduced. Because nanopore sequencing offers long read lengths, is simple to use, and can sequence RNA directly, it has attracted wide attention in the field of biological gene sequencing in recent years.

In the prior art, a neural network model is generally used to perform base recognition on the electrical signal obtained by nanopore sequencing, which achieves relatively high accuracy. Training a neural network model requires a large training data set in which accurate electrical signals are paired one-to-one with base sequences. If the electrical signals in the training data set are long, the hardware used for training must have higher processing performance, and the training efficiency of the neural network model is reduced. Therefore, to improve training efficiency and reduce the performance requirements on hardware, the longer electrical signals obtained by nanopore sequencing need to be segmented to construct a reasonable training data set. However, electrical signals obtained by nanopore sequencing are noisy and highly complex, so they are difficult to segment by manual identification.

With the method for constructing a training data set for a base recognition model provided by the embodiments of the present disclosure, a longer electrical signal obtained by nanopore sequencing can be segmented, and the segmented electrical signals can be paired one-to-one with base sequences to form a training data set. Training the neural network model with this training data set can reduce the performance requirements on the hardware used for training and improve the training efficiency of the neural network model. The method is described in detail below.

FIG. 1 illustrates a flow chart of a method of constructing a training dataset for a base recognition model, according to an embodiment of the present disclosure. The method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. The method may be implemented by a processor invoking computer readable instructions stored in a memory. Alternatively, the method may be performed by a server. As shown in FIG. 1, the method for constructing a training data set for a base recognition model includes:

In step S11, a target base sequence is determined based on the initial electrical signal corresponding to the target nucleic acid sequence.

The target nucleic acid sequence may be a deoxyribonucleic acid (DNA) sequence, a ribonucleic acid (RNA) sequence, or the like, and the present disclosure is not limited thereto depending on the actual use requirements.

In one possible implementation, the target nucleic acid sequence may be a nucleic acid sequence of an organism of a new species.

For an organism of any new species, its genes need to be sequenced so that its biological classification can be determined accurately and biological studies can be carried out on it. Specifically, after the nucleic acid sequence of the organism is obtained, an electrical signal corresponding to the nucleic acid sequence is determined by nanopore sequencing; an existing base recognition model is then used to perform base recognition on that electrical signal and determine the base recognition result for the organism.

However, existing base recognition models are usually trained on public training data sets. Because a public training data set does not include sample data for the organism of the new species, an existing base recognition model may have difficulty accurately extracting the characteristic information in the electrical signal corresponding to the organism's nucleic acid sequence, which results in lower base recognition accuracy for that sequence.

In this regard, the nucleic acid sequence of the organism of the new species may be determined as the target nucleic acid sequence, and a training data set for the organism may be constructed based on the target nucleic acid sequence. A base recognition model trained with this training data set can accurately perform base recognition on the nucleic acid sequence of the organism.

The specific form of the initial electrical signal corresponding to the target nucleic acid sequence can be flexibly set according to actual use requirements, for example, can be a current signal or a voltage signal, and the disclosure is not limited in particular.

The target base sequence can be determined by performing base recognition on the initial electrical signal corresponding to the target nucleic acid sequence. The target base sequence may indicate the base types and their arrangement order in the target nucleic acid sequence. For the specific manner of base recognition, reference may be made to implementations in the related art, for example recognition with the base recognition (basecall) tool Guppy, which is not particularly limited in the present disclosure.

In step S12, an expected electrical signal corresponding to the target base sequence is determined based on the nanopore sequencing signal simulation tool.

The nanopore sequencing signal simulation tool (DeepSimulator) is a nanopore sequencing simulation tool based on deep learning. DeepSimulator can simulate the real sequencing process of various sequencing technologies and can generate the expected electrical signal corresponding to the target base sequence from the target base sequence.

Specifically, the simulation parameters of DeepSimulator may be set according to actual use requirements; for example, the simulated sequencing technology, the sequencing read length, and the type of electrical signal may be set. The target base sequence is used as the template for simulated sequencing, and DeepSimulator simulates the sequencing process of the nucleic acid sequence corresponding to the target base sequence to determine the expected electrical signal corresponding to the target base sequence. The expected electrical signal has a distribution similar to that of the initial electrical signal, but has a higher signal-to-noise ratio and contains less noise.

In step S13, a base position distribution corresponding to the initial electrical signal is determined from the expected electrical signal and the initial electrical signal, wherein the base position distribution is used to indicate the position of each base in the target base sequence in the initial electrical signal.

By comparing the expected electrical signal with the initial electrical signal, the distribution of base positions corresponding to the initial electrical signal can be determined. The distribution of base positions corresponding to the initial electrical signal can be used to indicate the position of each base in the target base sequence in the initial electrical signal. The specific content of the base position distribution may be flexibly set according to actual use requirements, for example, may include a start position, a duration length, an end position, and the like of each base in the initial electric signal, which is not particularly limited in the present disclosure.

Referring to Table 1, the target nucleic acid sequence may be a DNA sequence, the bases within the DNA sequence comprising: adenine (A), guanine (G), cytosine (C), thymine (T). The distribution of base positions corresponding to the initial electrical signal may include a starting position and a duration of each base in the initial electrical signal.

Specifically, the first base in the target nucleic acid sequence is guanine (G); its start position is the 0th sampling point of the initial electrical signal, and its duration is 5 sampling points. The second base in the target nucleic acid sequence is cytosine (C); its start position is the 5th sampling point of the initial electrical signal, and its duration is 8 sampling points; and so on.

TABLE 1
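The two entries described above (G at sampling point 0 for 5 sampling points, C at sampling point 5 for 8 sampling points) can be held in a small record type. The sketch below is only one possible representation; the names `BasePosition` and `is_contiguous` are hypothetical, and only the two entries stated in the text are included.

```python
from typing import NamedTuple


class BasePosition(NamedTuple):
    """Position of one base in the initial electrical signal."""
    base: str
    start: int      # index of the first sampling point of the base
    duration: int   # number of sampling points covered by the base

    @property
    def end(self):
        """Index one past the last sampling point of the base."""
        return self.start + self.duration


def is_contiguous(positions):
    """True when every base starts where the previous base ends."""
    return all(a.end == b.start for a, b in zip(positions, positions[1:]))


# Only the two entries given in the description.
positions = [BasePosition("G", 0, 5), BasePosition("C", 5, 8)]
```

Storing the start position and the duration is enough to derive the end position, so the end of G (sampling point 5) coincides with the start of C.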

The process of determining the base position distribution corresponding to the initial electrical signal according to the expected electrical signal and the initial electrical signal will be described in detail below in connection with possible implementation manners of the present disclosure, and will not be described herein.

In step S14, the initial electrical signal is segmented according to the preset electrical signal length and base position distribution, so as to obtain a plurality of training electrical signals.

According to the preset electrical signal length, the longer initial electrical signal can be segmented into a plurality of shorter training electrical signals of uniform length. The specific value of the preset electrical signal length can be flexibly set according to actual use requirements, for example, 10000 sampling points, which is not particularly limited in the disclosure.

Based on the base position distribution, the boundary position of each base in the initial electrical signal can be determined and used as a segmentation point for segmenting the initial electrical signal. This avoids mistakenly dividing the electrical signal corresponding to one base into two parts during segmentation, which would affect the accuracy with which the trained base recognition model recognizes base sequences from electrical signals. Furthermore, the training base sequence corresponding to each training electrical signal obtained by segmentation can be determined based on the base position distribution.
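Segmentation at base boundaries can be sketched as a greedy pass over the base position distribution. The example assumes the distribution is a contiguous, ordered list of (base, start, duration) tuples and produces chunks that are no longer than the preset length; padding or trimming the chunks to a strictly uniform length is not shown. The function name `split_at_base_boundaries` is hypothetical.

```python
def split_at_base_boundaries(signal, positions, max_len):
    """Cut a signal into (chunk, base_sequence) pairs at base boundaries.

    positions: ordered list of (base, start, duration) tuples.
    max_len: preset electrical signal length in sampling points.
    No base is ever split across two chunks.
    """
    chunks = []
    bases = []
    chunk_start = chunk_end = 0
    for base, start, duration in positions:
        if duration > max_len:
            raise ValueError("a single base is longer than max_len")
        # Close the current chunk if this base would make it too long.
        if bases and (start + duration) - chunk_start > max_len:
            chunks.append((signal[chunk_start:chunk_end], "".join(bases)))
            bases = []
        if not bases:
            chunk_start = start
        bases.append(base)
        chunk_end = start + duration
    if bases:
        chunks.append((signal[chunk_start:chunk_end], "".join(bases)))
    return chunks


# Toy signal of 20 sampling points; the third base is an illustrative value.
signal = list(range(20))
positions = [("G", 0, 5), ("C", 5, 8), ("A", 13, 7)]
pairs = split_at_base_boundaries(signal, positions, max_len=13)
```

Each returned pair already couples a training electrical signal with its training base sequence, because both are cut at the same base boundaries.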

In step S15, a training data set corresponding to a base recognition model is determined according to the plurality of training electric signals and the training base sequence corresponding to each training electric signal, wherein the base recognition model is used for base recognition of the electric signal corresponding to the nucleic acid sequence to be recognized.

By pairing each training electric signal with its corresponding training base sequence, a training data set for training a base recognition model can be constructed. The base recognition model is used for carrying out base recognition on the electric signals corresponding to the nucleic acid sequence to be recognized, and determining the base types and the base arrangement sequence included in the nucleic acid sequence to be recognized. The specific form of the base recognition model can be flexibly set according to actual use requirements, and the present disclosure is not particularly limited.

In one example, the base recognition model is a long short-term memory (LSTM) model built on the PyTorch neural network framework, and uses the connectionist temporal classification loss (CTC loss) as the loss function. The LSTM model may be trained based on the training data set.
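As background on why a CTC loss suits this pairing of signals and base sequences: CTC scores every per-time-step labeling that collapses to the target sequence, where collapsing merges repeated labels and then removes a special blank label. The sketch below shows only that collapse rule in plain Python; it is not the loss itself and not the model of this example, and the blank symbol "-" is an arbitrary choice.

```python
from itertools import groupby

BLANK = "-"  # arbitrary symbol standing for the CTC blank label


def ctc_collapse(frame_labels):
    """Merge consecutive repeated labels, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# Twelve per-time-step labels collapse to a four-base sequence.
decoded = ctc_collapse("GG--CC-C-AAA")
```

Because many frame-level labelings collapse to the same base sequence, each training pair only needs the electrical signal and its base sequence, not a label for every sampling point.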

FIG. 2 shows a schematic diagram of a loss curve for training a base recognition model based on a training data set, according to an embodiment of the present disclosure. As shown in FIG. 2, the dark curve may represent the loss curve obtained when the base recognition model is trained with a training data set constructed by an embodiment of the present disclosure, and the light curve may represent the loss curve obtained when the base recognition model is trained with a training data set directly generated according to the training method in the related art.

Specifically, sample data in a training data set constructed by the embodiment of the present disclosure may be divided into a plurality of batch data (batch), where each batch data includes a preset number of training electric signals, and a training base sequence corresponding to each training electric signal, for example, each batch data includes 32 training electric signals, and a training base sequence corresponding to each training electric signal.
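Pairing the training electrical signals with their training base sequences and dividing them into batch data can be sketched as follows. The batch size of 32 comes from the example above; the function names and the toy data are hypothetical.

```python
def build_dataset(signals, sequences):
    """Pair each training electrical signal with its training base sequence."""
    if len(signals) != len(sequences):
        raise ValueError("signals and sequences must be paired one-to-one")
    return list(zip(signals, sequences))


def make_batches(dataset, batch_size=32):
    """Divide the data set into consecutive batches of at most batch_size."""
    return [dataset[k:k + batch_size] for k in range(0, len(dataset), batch_size)]


# Toy data: 70 short signals, each paired with a placeholder base sequence.
signals = [[float(i)] * 4 for i in range(70)]
sequences = ["GC"] * 70
batches = make_batches(build_dataset(signals, sequences))
```

The last batch may hold fewer than 32 samples when the data set size is not a multiple of the batch size; whether such a batch is kept or dropped is a training choice not fixed by the text.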

Likewise, a training data set directly generated according to the training method in the related art may be divided into a plurality of batch data in the above-described manner. The base recognition model is trained a preset number of times on the batch data determined from the training data set constructed by the embodiment of the present disclosure and on the batch data determined from the published training data set in the related art, respectively. During training, the same training electrical signal and its corresponding training base sequence may be reused.

As shown in FIG. 2, after about 100 training rounds (epochs), the loss value obtained with the training data set directly generated according to the training method in the prior art still oscillates considerably, which indicates that the model has not fully converged; the loss value obtained with the training data set constructed by the embodiment of the present disclosure is stable and clearly lower. Therefore, training the base recognition model with the training data set constructed by the embodiment of the present disclosure lets the model converge faster and improves the efficiency of training the base recognition model.

FIG. 3 shows a schematic diagram of an accuracy profile of training a base recognition model based on a training data set, according to an embodiment of the present disclosure. As shown in fig. 3, a dark curve may represent an accuracy rate variation curve of training a base recognition model by a training data set constructed by an embodiment of the present disclosure, and a light curve may represent an accuracy rate variation curve of training a base recognition model by a training data set directly generated according to a training method in the related art.

As shown in fig. 3, after training for about 100 training rounds (epochs), the prediction accuracy of training the base recognition model according to the training data set directly generated by the training method in the prior art is significantly lower than the prediction accuracy of training the base recognition model by the training set constructed by the embodiments of the present disclosure.

Therefore, the base recognition model is trained by the training set constructed by the embodiment of the disclosure, so that the base recognition model can reach higher accuracy at a higher speed, the efficiency of training the base recognition model can be improved, and the base recognition result determined based on the base recognition model after training has higher accuracy.

In one possible implementation, determining the target base sequence from the initial electrical signal corresponding to the target nucleic acid sequence includes: performing base recognition on the initial electrical signal corresponding to the target nucleic acid sequence, and determining a plurality of base recognition results; determining the recognition accuracy corresponding to each base recognition result according to the base labeling information corresponding to the target nucleic acid sequence; and, for any base recognition result, determining the base recognition result as the target base sequence when the recognition accuracy corresponding to the base recognition result is greater than a preset threshold.

After the target nucleic acid sequence is determined, base recognition may be performed on the electrical signal corresponding to the target nucleic acid sequence, and a plurality of base recognition results may be determined. For a specific method of base recognition of the electrical signal corresponding to the target nucleic acid sequence, reference may be made to embodiments in the related art, for example, recognition by a base recognition (basecall) tool, and the present disclosure is not limited thereto. Each base recognition result indicates the base types in the target nucleic acid sequence and their arrangement order.

The recognition result determined by base recognition of the electric signal corresponding to the target nucleic acid sequence may have a recognition error, so that the base recognition result may not completely coincide with the actual base arrangement order in the target nucleic acid sequence. Therefore, it is necessary to control the quality of the base recognition results, and to determine the recognition accuracy corresponding to each base recognition result based on the base sequence labeling information corresponding to the target nucleic acid sequence. The base sequence labeling information corresponding to the target nucleic acid sequence may indicate the type of bases actually included in the target nucleic acid sequence and the order of the bases.

For any base recognition result, the recognition accuracy corresponding to the base recognition result can represent the similarity between the base recognition result and the base sequence labeling information corresponding to the target nucleic acid sequence. The specific form of the recognition accuracy corresponding to any one base recognition result can be flexibly set according to actual use requirements, for example, the specific form can be a percentage form, and the like, so the disclosure is not particularly limited.

For any base recognition result, when the recognition accuracy corresponding to the base recognition result is greater than the preset threshold, the base recognition result is determined to be a high-quality base sequence with high readability and accuracy. In this case, the base recognition result can be determined as the target base sequence. The specific value of the preset threshold can be set flexibly according to actual use requirements; for example, when the recognition accuracy is expressed in percentage form, the preset threshold can be set to 95%, and the present disclosure is not particularly limited in this respect.

Through the process, the base recognition result with higher accuracy is screened as the target base sequence, so that the target base sequence can be ensured to have higher readability and accuracy, and the accuracy of the expected electric signal determined based on the target base sequence is improved.
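The screening step above can be sketched as follows. This is a minimal illustration, not the method of the disclosure: the text leaves the form of the recognition accuracy open, so the sketch assumes that the similarity ratio from Python's `difflib.SequenceMatcher` can stand in for it. The function names `recognition_accuracy` and `select_target_sequences` and the default threshold of 0.95 are hypothetical.

```python
from difflib import SequenceMatcher


def recognition_accuracy(called: str, reference: str) -> float:
    """Similarity in [0, 1] between a base recognition result and the labeled sequence.

    Assumption: difflib's matching-block ratio is used as a stand-in for the
    accuracy metric, which the text does not fix.
    """
    if not called and not reference:
        return 1.0
    return SequenceMatcher(None, called, reference, autojunk=False).ratio()


def select_target_sequences(results, reference, threshold=0.95):
    """Keep only the recognition results whose accuracy exceeds the threshold."""
    return [r for r in results if recognition_accuracy(r, reference) > threshold]


if __name__ == "__main__":
    labeled = "ACGTACGTACGTACGTACGT"          # base labeling information (toy data)
    calls = [labeled, "ACGTACGTACGTACGTACGA", "TTTTTTTT"]
    for call in calls:
        print(call, round(recognition_accuracy(call, labeled), 3))
    print(select_target_sequences(calls, labeled))
```

In practice an alignment-based identity (for example, computed from an aligner's output) would probably be a better proxy for accuracy than a generic string similarity; the sketch only shows the thresholding logic.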

In one possible implementation, determining a base position distribution corresponding to the initial electrical signal from the desired electrical signal and the initial electrical signal includes: segmenting an initial electric signal based on a t-test method, and determining a first signal segment sequence, wherein the first signal segment sequence comprises a plurality of first signal segments; segmenting the expected electric signal based on a t-test method, and determining a second signal segment sequence, wherein the second signal segment sequence comprises a plurality of second signal segments; and determining the base position distribution corresponding to the initial electric signal according to the first signal fragment sequence and the second signal fragment sequence.

By the t-test method, whether a significant difference exists between two samples can be judged according to the statistical principle. Based on the t-test results, it may be determined whether the difference between the means of the two samples is within a preset error range or whether the two samples are from the same population.

Specifically, for the initial electrical signal, the mean value of each of any two adjacent electrical signals of a preset length can be obtained, and the two mean values can be compared based on the t-test method. When the t-test result shows that the difference between the mean values of the two adjacent electrical signals is within a preset error range, the two adjacent electrical signals correspond to the same base region, and no segmentation is needed between them; when the t-test result shows that the difference between the mean values exceeds the preset error range, the two adjacent electrical signals correspond to different base regions, and the signal needs to be segmented between them. The specific value of the preset length can be set flexibly according to actual use requirements and is not limited in the present disclosure.

Fig. 4 shows a schematic diagram of a first segment signal according to an embodiment of the present disclosure. As shown in fig. 4, the light solid line represents the first segment signal; the portion of the dark solid line where the mean value of any one of the electrical signals remains unchanged represents a base region in the first segment signal.

Through the above procedure, the initial electrical signal may be divided into a plurality of first signal segments and a corresponding sequence of first signal segments determined.
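A minimal sketch of such t-test-based segmentation is given below. It is an illustration under stated assumptions rather than the implementation of the disclosure: it uses Welch's t statistic on two adjacent windows, treats a fixed cut-off on the absolute t value as the "preset error range", and keeps only local peaks so that one level change yields one boundary. The names `welch_t` and `segment_signal` and the parameters `window` and `t_threshold` are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples, each with at least two points."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if se == 0.0:
        # Both windows are perfectly flat: identical means give 0, otherwise infinity.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def segment_signal(signal, window=4, t_threshold=4.0):
    """Split a signal into (start, end) sample ranges at significant mean shifts.

    Assumption: a boundary is placed where |t| between the `window` samples
    before and after a position exceeds `t_threshold` and is a local maximum.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to estimate a variance")
    n = len(signal)
    scores = [0.0] * n
    for pos in range(window, n - window + 1):
        scores[pos] = abs(welch_t(signal[pos - window:pos], signal[pos:pos + window]))

    boundaries = [0]
    for pos in range(window, n - window + 1):
        is_peak = scores[pos] >= max(scores[pos - window + 1:pos + window])
        far_enough = pos - boundaries[-1] >= window
        if scores[pos] > t_threshold and is_peak and far_enough:
            boundaries.append(pos)
    boundaries.append(n)
    return [(boundaries[k], boundaries[k + 1]) for k in range(len(boundaries) - 1)]


if __name__ == "__main__":
    toy = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 5.0, 5.1, 4.9, 5.0, 5.05, 4.95]
    print(segment_signal(toy, window=3, t_threshold=4.0))
```

The same function could be applied unchanged to a simulated (expected) signal, since it only looks at the sample values.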

Likewise, for the desired electrical signal, it is also possible to divide the desired electrical signal into a plurality of second signal segments based on the t-test method and to determine the corresponding sequence of second signal segments.

Since the base type and the base arrangement order are determined in the base sequence corresponding to the desired electric signal, the base position distribution corresponding to the initial electric signal can be determined from the first signal fragment sequence and the second signal fragment sequence.

In one possible implementation, determining the base position distribution corresponding to the initial electrical signal from the first signal segment sequence and the second signal segment sequence includes: determining a distance matrix according to the first signal segment sequence and the second signal segment sequence, wherein the distance matrix comprises m rows and n columns, the element of the ith row and the jth column in the distance matrix represents the Euclidean distance between the ith first signal segment in the first signal segment sequence and the jth second signal segment in the second signal segment sequence, m, n, i and j are positive integers, m is greater than or equal to i, and n is greater than or equal to j; determining a regular path according to the distance matrix based on a dynamic time warping method, wherein the regular path represents the path with the smallest Euclidean distance sum from the element in the 1st row and the mth column to the element in the nth row and the 1st column in the distance matrix; and determining the base position distribution corresponding to the initial electrical signal according to the regular path.

The first signal segment sequence may include m first signal segments, and the second signal segment sequence may include n second signal segments. From the first signal segment sequence and the second signal segment sequence, a distance matrix comprising m rows and n columns may be constructed. The element of the ith row and the jth column in the distance matrix may represent the euclidean distance between the ith first signal segment in the first signal segment sequence and the jth second signal segment in the second signal segment sequence, where m, n, i and j are positive integers, and m is greater than or equal to i, n is greater than or equal to j.

Based on the dynamic time warping method, a regular path (warping path) corresponding to the distance matrix can be determined from the distance matrix. Specifically, any path in the distance matrix includes an element from every row of the distance matrix and an element from every column of the distance matrix. That is, for a path between the element in the 1st row and the mth column and the element in the nth row and the 1st column, the path passes through the nth row, the (n-1)th row, ..., and the 1st row (the path is monotonically decreasing in the row direction and does not skip any row), and through the 1st column, the 2nd column, ..., and the mth column (the path is monotonically increasing in the column direction and does not skip any column). Thus, the path traverses each first signal segment in the first signal segment sequence and each second signal segment in the second signal segment sequence.

The element of the ith row and the jth column in the distance matrix may represent the Euclidean distance between the ith first signal segment in the first signal segment sequence and the jth second signal segment in the second signal segment sequence. Thus, for any path in the distance matrix from the element in the 1st row and the mth column to the element in the nth row and the 1st column, the sum of the Euclidean distances along the path can be determined.

By comparing the Euclidean distance sums corresponding to different paths in the distance matrix, the path with the smallest Euclidean distance sum from the element in the 1st row and the mth column to the element in the nth row and the 1st column, that is, the regular path, can be determined.

Through the regular path, the corresponding relation between the different first signal fragments and the different second signal fragments can be determined, so that the alignment between the initial electric signal and the bases corresponding to each second signal fragment can be realized, and the position of each base in the initial electric signal is determined, namely, the base position distribution corresponding to the initial electric signal is determined.
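The distance matrix and the minimum-cost path can be sketched with a textbook dynamic time warping recursion. The sketch makes two assumptions that are not in the text: each signal segment is summarized by its mean current, so the Euclidean distance between two segments reduces to an absolute difference, and the path is indexed in the conventional way, from the first pair of segments at (0, 0) to the last pair at (m-1, n-1), rather than with the corner labels used above. The names `distance_matrix` and `dtw_path` are hypothetical.

```python
import math


def distance_matrix(first_means, second_means):
    """m x n matrix; entry [i][j] is the distance between first segment i and second segment j.

    Assumption: each segment is represented by its mean value only.
    """
    return [[abs(a - b) for b in second_means] for a in first_means]


def dtw_path(dist):
    """Return (path, total_cost) of the minimum-sum monotonic path through dist."""
    m, n = len(dist), len(dist[0])
    cost = [[math.inf] * n for _ in range(m)]
    cost[0][0] = dist[0][0]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                cost[i - 1][j] if i > 0 else math.inf,                 # advance first sequence
                cost[i][j - 1] if j > 0 else math.inf,                 # advance second sequence
                cost[i - 1][j - 1] if i > 0 and j > 0 else math.inf,   # advance both
            )
            cost[i][j] = dist[i][j] + best_prev

    # Backtrack from the last cell to the first, always taking the cheapest predecessor.
    i, j = m - 1, n - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path, cost[m - 1][n - 1]


if __name__ == "__main__":
    measured = [1.0, 1.1, 5.0, 3.0]   # means of the first signal segments (toy data)
    expected = [1.0, 5.0, 3.0]        # means of the second signal segments (toy data)
    print(dtw_path(distance_matrix(measured, expected)))
```

In the toy data the first two measured segments both map to the first expected segment, which is how the path absorbs over-segmentation of the measured signal.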

Through the above process, the base sequence corresponding to the target nucleic acid sequence and the base position distribution corresponding to the initial electrical signal can both be determined accurately. In the prior art, after a base recognition result is determined by performing base recognition on the target nucleic acid sequence, the base position distribution corresponding to the initial electrical signal is determined directly from that base recognition result. Compared with that approach, the above process avoids the problem that an inaccurate base recognition result lowers the accuracy of the base position distribution corresponding to the initial electrical signal, and it is more applicable in scenarios where the target nucleic acid sequence is a nucleic acid sequence of a new species.

In one possible implementation, determining the base position distribution corresponding to the initial electrical signal according to the regular path includes: determining the base sequence corresponding to each second signal fragment; and determining the base position distribution corresponding to the initial electric signal according to the regular path and the base sequence corresponding to each second signal fragment.

Through the regular path, the correspondence between the different first signal segments and the different second signal segments can be determined, and the base sequence corresponding to each first signal segment can then be determined according to the base sequence corresponding to each second signal segment, so that the initial electrical signal is aligned with the bases. According to parameters such as the sampling frequency of the initial electrical signal, the start position/time, duration, and end position/time of each base in the initial electrical signal can be expressed in terms of the sampling points, sampling time, and the like of the initial electrical signal, so that the base position distribution corresponding to the initial electrical signal is obtained.
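The conversion from an alignment to a base position distribution can be sketched as follows. The data layout is an assumption made for illustration: first signal segments are given as (start, end) sample ranges, the regular path as (i, j) index pairs, and each second signal segment carries one base label. The function name `base_positions` and the 4000 Hz sampling frequency in the example are hypothetical.

```python
def base_positions(first_segments, path, second_bases, sample_rate_hz):
    """Map each second-segment base to a sample span and a time span in the initial signal.

    first_segments: list of (start_sample, end_sample) for the first signal segments.
    path:           list of (i, j) pairs pairing first segment i with second segment j.
    second_bases:   base label of each second signal segment (assumed one label each).
    """
    spans = {}
    for i, j in path:
        start, end = first_segments[i]
        if j in spans:
            # Several first segments paired with the same base: merge their ranges.
            spans[j] = (min(spans[j][0], start), max(spans[j][1], end))
        else:
            spans[j] = (start, end)

    # Note: if one first segment is paired with several second segments,
    # those bases share the same span in this simplified sketch.
    return [
        {
            "base": second_bases[j],
            "start_sample": start,
            "end_sample": end,
            "start_s": start / sample_rate_hz,
            "duration_s": (end - start) / sample_rate_hz,
        }
        for j, (start, end) in sorted(spans.items())
    ]


if __name__ == "__main__":
    segments = [(0, 6), (6, 10), (10, 18), (18, 24)]
    alignment = [(0, 0), (1, 0), (2, 1), (3, 2)]
    for row in base_positions(segments, alignment, ["A", "C", "G"], sample_rate_hz=4000):
        print(row)
```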

In the embodiment of the disclosure, the target base sequence can be determined according to the initial electric signal corresponding to the target nucleic acid sequence; based on the nanopore sequencing signal simulation tool, an expected electrical signal corresponding to the target base sequence can be determined; based on the dynamic time warping method, the base position distribution corresponding to the initial electrical signal can be determined according to the expected electrical signal and the initial electrical signal. The corresponding relation between each base in the target base sequence and the initial electric signal can be indicated through the base position distribution corresponding to the initial electric signal; according to the preset electric signal length and base position distribution, the initial electric signal can be segmented to obtain a plurality of training electric signals, so that the initial electric signal with longer length is segmented into a plurality of training electric signals with shorter length; determining a training data set corresponding to a base recognition model according to a plurality of training electric signals and training base sequences corresponding to the training electric signals, wherein the base recognition model is used for carrying out base recognition on the electric signals corresponding to the nucleic acid sequences to be recognized; the base recognition model is trained through the training data set, so that training efficiency can be improved, and performance requirements on hardware equipment used for training are reduced.
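The final segmentation step can be sketched as below. The cutting rule is an assumption: the sketch greedily accumulates whole bases until the next base would push the chunk past the preset length, then cuts at the base boundary, so that every training electrical signal is paired with exactly the bases it contains. The name `build_training_set` and the tuple layout of `positions` are hypothetical.

```python
def build_training_set(signal, positions, max_len):
    """Cut a long signal into (training_signal, training_base_sequence) pairs.

    positions: ordered list of (base, start_sample, end_sample) for each base.
    max_len:   preset electrical signal length, in samples.
    Assumption: cuts fall only on base boundaries; a single base longer than
    max_len still forms its own, oversized chunk.
    """
    dataset = []
    chunk_bases = []
    chunk_start = chunk_end = 0
    for base, start, end in positions:
        if chunk_bases and end - chunk_start > max_len:
            # Adding this base would exceed the preset length: close the chunk.
            dataset.append((signal[chunk_start:chunk_end], "".join(chunk_bases)))
            chunk_bases = []
        if not chunk_bases:
            chunk_start = start
        chunk_bases.append(base)
        chunk_end = end
    if chunk_bases:
        dataset.append((signal[chunk_start:chunk_end], "".join(chunk_bases)))
    return dataset


if __name__ == "__main__":
    toy_signal = list(range(24))
    toy_positions = [("A", 0, 10), ("C", 10, 18), ("G", 18, 24)]
    for chunk, bases in build_training_set(toy_signal, toy_positions, max_len=18):
        print(len(chunk), bases)
```

Shorter, fixed-budget chunks of this kind are what allow training batches to fit in less memory, which is the hardware benefit the paragraph above describes.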

It will be appreciated that the above method embodiments of the present disclosure may be combined with one another to form combined embodiments without departing from their principles and logic; owing to space limitations, the combinations are not described in detail in the present disclosure. It will be appreciated by those skilled in the art that, in the above methods of the embodiments, the particular order of execution of the steps should be determined by their function and possible inherent logic.

In addition, the present disclosure further provides an apparatus, an electronic device, a computer readable storage medium and a program for constructing a training data set for a base recognition model, any of which may be used to implement any one of the methods for constructing a training data set for a base recognition model provided by the present disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section; they are not repeated here.

FIG. 5 illustrates a block diagram of an apparatus for constructing a training dataset for a base recognition model, according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus 500 includes:

a base recognition module 501, configured to determine a target base sequence according to an initial electrical signal corresponding to a target nucleic acid sequence;

an expected electrical signal determining module 502, configured to determine an expected electrical signal corresponding to the target base sequence based on a nanopore sequencing signal simulation tool;

a base position determining module 503, configured to determine a base position distribution corresponding to the initial electrical signal according to the expected electrical signal and the initial electrical signal, wherein the base position distribution is used for indicating the position of each base in the target base sequence in the initial electrical signal;

an electrical signal segmentation module 504, configured to segment the initial electrical signal according to a preset electrical signal length and the base position distribution, so as to obtain a plurality of training electrical signals;

a training data set determining module 505, configured to determine a training data set corresponding to a base recognition model according to the plurality of training electrical signals and the training base sequence corresponding to each training electrical signal, wherein the base recognition model is configured to perform base recognition on an electrical signal corresponding to a nucleic acid sequence to be recognized.

In one possible implementation, the base recognition module 501 is configured to:

Performing base recognition on the initial electrical signal corresponding to the target nucleic acid sequence, and determining a plurality of base recognition results; determining the recognition accuracy corresponding to each base recognition result according to the base labeling information corresponding to the target nucleic acid sequence; and, for any base recognition result, determining the base recognition result as the target base sequence when the recognition accuracy corresponding to the base recognition result is greater than a preset threshold.

In one possible implementation, the base position determination module 503 is configured to:

Segmenting an initial electric signal based on a t-test method, and determining a first signal segment sequence, wherein the first signal segment sequence comprises a plurality of first signal segments; segmenting the expected electric signal based on a t-test method, and determining a second signal segment sequence, wherein the second signal segment sequence comprises a plurality of second signal segments; and determining the base position distribution corresponding to the initial electric signal according to the first signal fragment sequence and the second signal fragment sequence.

Determining a distance matrix according to the first signal segment sequence and the second signal segment sequence, wherein the distance matrix comprises m rows and n columns, the element of the ith row and the jth column in the distance matrix represents the Euclidean distance between the ith first signal segment in the first signal segment sequence and the jth second signal segment in the second signal segment sequence, m, n, i and j are positive integers, m is greater than or equal to i, and n is greater than or equal to j; determining a regular path according to the distance matrix based on a dynamic time warping method, wherein the regular path represents the path with the smallest Euclidean distance sum from the element in the 1st row and the mth column to the element in the nth row and the 1st column in the distance matrix; and determining the base position distribution corresponding to the initial electrical signal according to the regular path.

Determining the base sequence corresponding to each second signal fragment; and determining the base position distribution corresponding to the initial electric signal according to the regular path and the base sequence corresponding to each second signal fragment.

In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.

The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.

The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.

Fig. 6 shows a block diagram of an electronic device, according to an embodiment of the disclosure. For example, the apparatus 1900 may be provided as a server or terminal device. Referring to fig. 6, the apparatus 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.

The apparatus 1900 may further comprise a power component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of apparatus 1900 to perform the above-described methods.

The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.

The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized with state information of the computer readable program instructions, and the electronic circuitry can execute the computer readable program instructions so as to implement aspects of the present disclosure.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method of constructing a training data set for a base recognition model, comprising:

Determining a target base sequence according to the initial electric signal corresponding to the target nucleic acid sequence;

Determining an expected electrical signal corresponding to the target base sequence based on a nanopore sequencing signal simulation tool;

Determining a base position distribution corresponding to the initial electrical signal according to the expected electrical signal and the initial electrical signal, wherein the base position distribution is used for indicating the position of each base in the target base sequence in the initial electrical signal;

Segmenting the initial electric signals according to the preset electric signal length and the base position distribution to obtain a plurality of training electric signals;

Determining a training data set corresponding to a base recognition model according to the plurality of training electric signals and training base sequences corresponding to the training electric signals, wherein the base recognition model is used for carrying out base recognition on the electric signals corresponding to the nucleic acid sequences to be recognized;

Wherein, the determining the base position distribution corresponding to the initial electric signal according to the expected electric signal and the initial electric signal comprises:

segmenting the initial electric signal based on a t-test method, and determining a first signal segment sequence, wherein the first signal segment sequence comprises a plurality of first signal segments;

Segmenting the expected electric signal based on a t-test method, and determining a second signal segment sequence, wherein the second signal segment sequence comprises a plurality of second signal segments;

Determining the base position distribution corresponding to the initial electric signal according to the first signal fragment sequence and the second signal fragment sequence;

The determining the base position distribution corresponding to the initial electric signal according to the first signal fragment sequence and the second signal fragment sequence comprises the following steps:

Determining a distance matrix according to the first signal segment sequence and the second signal segment sequence, wherein the distance matrix comprises m rows and n columns, and the element of the ith row and the jth column in the distance matrix represents the Euclidean distance between the ith first signal segment in the first signal segment sequence and the jth second signal segment in the second signal segment sequence, wherein m, n, i and j are positive integers, and m is larger than or equal to i, and n is larger than or equal to j;

determining a regular path according to the distance matrix based on a dynamic time warping method, wherein the regular path represents the path with the smallest sum of Euclidean distances from the element in the 1st row and 1st column to the element in the mth row and nth column of the distance matrix;

and determining the base position distribution corresponding to the initial electric signal according to the regular path.
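The t-test segmentation, distance matrix and dynamic time warping steps of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`t_scores`, `split_by_t_test`, `distance_matrix`, `dtw_path`), the window size `w` and the `threshold` are hypothetical; the t-test is assumed to be a sliding-window Welch statistic; each signal segment is assumed to be summarised by its mean level, so the Euclidean distance reduces to an absolute difference; and indices are 0-based, whereas the claim counts rows and columns from 1.

```python
from math import inf, sqrt
from statistics import fmean, variance


def t_scores(signal, w):
    """Welch t-statistic between the w samples before and after each index k (w >= 2)."""
    scores = {}
    for k in range(w, len(signal) - w + 1):
        left, right = signal[k - w:k], signal[k:k + w]
        diff = abs(fmean(left) - fmean(right))
        spread = sqrt((variance(left) + variance(right)) / w)
        # Noise-free windows have zero spread; treat any level change as a certain cut.
        scores[k] = diff / spread if spread > 0 else (inf if diff > 0 else 0.0)
    return scores


def split_by_t_test(signal, w=2, threshold=3.0):
    """Cut the signal where the t-statistic exceeds the (assumed) threshold."""
    scores = t_scores(signal, w)
    cuts = []
    # Strongest candidates first; keep accepted cuts at least w samples apart.
    for k in sorted(scores, key=lambda k: (-scores[k], k)):
        if scores[k] > threshold and all(abs(k - c) >= w for c in cuts):
            cuts.append(k)
    bounds = [0] + sorted(cuts) + [len(signal)]
    return [signal[a:b] for a, b in zip(bounds, bounds[1:])]


def distance_matrix(first_segments, second_segments):
    """m x n matrix of distances between segment means (assumed segment summary)."""
    a = [fmean(s) for s in first_segments]
    b = [fmean(s) for s in second_segments]
    return [[abs(x - y) for y in b] for x in a]


def dtw_path(dist):
    """Minimum-cost path from element (0, 0) to element (m-1, n-1), plus its cost."""
    m, n = len(dist), len(dist[0])
    cost = [[inf] * n for _ in range(m)]
    cost[0][0] = dist[0][0]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = dist[i][j] + best_prev
    # Backtrack from the last element to the first along the cheapest predecessors.
    i, j = m - 1, n - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path, cost[m - 1][n - 1]
```

Each pair `(i, j)` in the returned path pairs the ith segment of the initial electric signal with the jth segment of the expected electric signal. Real nanopore signals are noisy, so the window size and threshold would need tuning, and a richer segment summary than the mean may be appropriate.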

2. The method according to claim 1, wherein determining the target base sequence based on the initial electrical signal corresponding to the target nucleic acid sequence comprises:

performing base recognition on the initial electric signal corresponding to the target nucleic acid sequence, and determining a plurality of base recognition results;

respectively determining the recognition accuracy corresponding to each base recognition result according to the base labeling information corresponding to the target nucleic acid sequence;

and, for any base recognition result, determining that base recognition result as the target base sequence when the recognition accuracy corresponding to that base recognition result is greater than a preset threshold value.
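The accuracy filter of claim 2 can be sketched as follows. The claim does not define how recognition accuracy is computed, so this sketch assumes an edit-distance-based identity against the labeled reference sequence; the names `edit_distance`, `accuracy` and `select_target_sequences` and the default threshold are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two base strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def accuracy(called, reference):
    """Assumed metric: 1 - edit distance / length of the longer sequence."""
    longest = max(len(called), len(reference))
    return 1.0 if longest == 0 else 1.0 - edit_distance(called, reference) / longest


def select_target_sequences(calls, reference, threshold=0.9):
    """Keep only the base recognition results whose accuracy exceeds the threshold."""
    return [c for c in calls if accuracy(c, reference) > threshold]
```

Only the results that pass this filter would be carried forward as target base sequences, which keeps poorly recognized reads out of the training data set.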

3. The method of claim 1, wherein determining a base position distribution corresponding to the initial electrical signal based on the regular path comprises:

determining the base sequence corresponding to each second signal segment;

and determining the base position distribution corresponding to the initial electric signal according to the regular path and the base sequence corresponding to each second signal segment.
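One way to picture how the regular path yields the base position distribution, and how that distribution then drives the segmentation step of claim 1, is sketched below. It is an illustration under assumptions rather than the claimed implementation: each second signal segment is assumed to carry exactly one base, a base is assigned to the chunk that contains its first sample, and a trailing partial chunk is discarded. The names `base_positions` and `cut_training_chunks` are hypothetical.

```python
def base_positions(path, first_lengths, second_bases):
    """Map each base to the [start, end) sample span it covers in the initial signal.

    path          -- (i, j) pairs: ith initial segment aligned to jth expected segment
    first_lengths -- number of samples in each initial-signal segment
    second_bases  -- one base per expected-signal segment (assumption)
    """
    starts = [0]
    for length in first_lengths:
        starts.append(starts[-1] + length)
    spans = {}
    for i, j in path:
        lo, hi = starts[i], starts[i + 1]
        if j in spans:
            # Several initial segments aligned to the same base: merge their spans.
            spans[j] = (min(spans[j][0], lo), max(spans[j][1], hi))
        else:
            spans[j] = (lo, hi)
    return [(second_bases[j], *spans[j]) for j in sorted(spans)]


def cut_training_chunks(signal, positions, chunk_len):
    """Cut the signal into fixed-length chunks, each paired with its base labels."""
    chunks = []
    for start in range(0, len(signal) - chunk_len + 1, chunk_len):
        end = start + chunk_len
        bases = "".join(b for b, lo, _hi in positions if start <= lo < end)
        chunks.append((signal[start:end], bases))
    return chunks
```

Each resulting pair of a fixed-length training electric signal and its training base sequence would form one entry of the training data set. If several bases align to the same initial segment, their spans coincide in this sketch.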

4. An apparatus for constructing a training data set for a base recognition model, comprising:

a base recognition module, configured to determine a target base sequence according to the initial electric signal corresponding to the target nucleic acid sequence;

an expected electric signal determining module, configured to determine an expected electric signal corresponding to the target base sequence based on the nanopore sequencing signal simulation tool;

a base position determining module, configured to determine a base position distribution corresponding to the initial electric signal according to the expected electric signal and the initial electric signal, wherein the base position distribution is used to indicate the position of each base of the target base sequence in the initial electric signal;

an electric signal segmentation module, configured to segment the initial electric signal according to the preset electric signal length and the base position distribution to obtain a plurality of training electric signals;

a training data set determining module, configured to determine a training data set corresponding to a base recognition model according to the plurality of training electric signals and the training base sequence corresponding to each training electric signal, wherein the base recognition model is used for carrying out base recognition on the electric signal corresponding to a nucleic acid sequence to be recognized;

Wherein the base position determination module is further configured to:

the determining the base position distribution corresponding to the initial electric signal according to the expected electric signal and the initial electric signal comprises the following steps:

the base position determination module is further configured to:

5. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to implement the method of any one of claims 1 to 3 when executing the instructions stored by the memory.

6. A non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any one of claims 1 to 3.