CN113744804A - Method and device for storing data by using DNA and storage equipment

Info

Publication number: CN113744804A
Application number: CN202110688074.9A
Authority: CN
Inventors: 戴俊彪; 黄小罗
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2021-06-21
Filing date: 2021-06-21
Publication date: 2021-12-03
Anticipated expiration: 2041-06-21
Also published as: CN113744804B

Abstract

The application belongs to the field of data storage and provides a method for storing data by using DNA, which comprises the following steps: extracting a binary sequence corresponding to data to be stored; converting the binary sequence into a base sequence according to a preset mapping relation; dividing the base sequence into a plurality of core sequences of a preset first length; splicing each core sequence with a linker sequence that marks the sequence direction to obtain a sequence block; searching a pre-synthesized DNA molecule library for a DNA sequence matching the sequence block, and amplifying a predetermined number of the synthesized DNA molecules corresponding to the extracted DNA sequence to obtain a DNA product; and storing the amplified DNA product and the corresponding key information. The method can reduce the cost of storing data in DNA and helps promote the wide application of storage devices based on DNA media.

Description

Method and device for storing data by using DNA and storage equipment

Technical Field

The present application belongs to the field of data storage, and in particular, relates to a method, an apparatus and a storage device for data storage by using DNA.

Background

With the development of scientific technologies such as internet technology, big data technology and artificial intelligence technology, global data shows exponential growth. The traditional storage devices such as hard disks, magnetic tapes or optical disks cannot meet the ever-increasing requirement for mass data storage due to high maintenance cost, large occupied space, short storage life and the like.

DNA used for data storage offers high storage density, long storage time and low maintenance cost. Moreover, as the carrier of genetic information in living organisms, DNA can be inserted into the cells of microorganisms, animals and plants, so that permanent, generation-to-generation preservation can be achieved through replication of the living organism.

In conventional DNA-based data storage methods, different DNA base sequences generally have to be synthesized for different stored data. The correspondence between stored computer bits and synthesized bases is typically 1 base per 1 to 2 bits of data. That is, storing a given amount of binary data requires synthesizing a number of bases equal to at least one half of the number of bits. In addition, DNA molecules synthesized in vitro in one run can be used only once, to store one specific piece of data. The synthesis cost is therefore high, which hinders the wide application of DNA-based storage technology.

Disclosure of Invention

In view of this, embodiments of the present application provide a method, an apparatus, and a device for data storage using DNA, so as to solve the problem in the prior art that when data storage using DNA is performed, synthesis cost is high, and it is not favorable for wide application of storage technology of DNA media.

A first aspect of embodiments of the present application provides a method for data storage using DNA, the method including:

extracting a binary sequence corresponding to data to be stored;

converting the binary sequence into a base sequence according to a preset mapping relation;

dividing the base sequence into a plurality of core sequences with a preset first length, wherein an overlapping region with a preset second length is included between two related core sequences;

splicing the core sequence with a linker sequence for marking the sequence direction to obtain a sequence block;

searching a pre-synthesized DNA molecule library for a DNA sequence matching the sequence block, and amplifying a predetermined number of the synthesized DNA molecules corresponding to the extracted DNA sequence to obtain a DNA product;

storing the amplified DNA product and corresponding key information, wherein the key information comprises one or more core sequences, or the key information comprises partial bases of one or more core sequences and the position information of those core sequences in the base sequence.

With reference to the first aspect, in a first possible implementation manner of the first aspect, the linker sequence includes one or both of a left linker sequence and a right linker sequence, and when the linker sequence includes the left linker sequence and the right linker sequence, the left linker sequence is different from the right linker sequence.

With reference to the first aspect, in a second possible implementation manner of the first aspect, converting the binary sequence into a base sequence includes:

splitting the file corresponding to the data to be stored, splitting the base sequence according to the result of file splitting, and allocating a corresponding index sequence for the split base sequence.

With reference to the first aspect, in a third possible implementation manner of the first aspect, the inclusion of an overlapping region with a preset second length between two associated core sequences includes:

an overlapping region with a preset second length is included between every two adjacent core sequences;

or, an overlapping region with a preset second length is included between adjacent odd-numbered core sequences, and an overlapping region with a preset second length is included between adjacent even-numbered core sequences;

or, an overlapping region with a preset second length is included between the (M+i)-th core sequence and the (M+N+i)-th core sequence, wherein M and N are preset integers, and i is an integer variable greater than or equal to 0.

With reference to the first aspect, in a fourth possible implementation manner of the first aspect, the dividing the base sequence into a plurality of core sequences with preset first lengths includes:

when the length of the last core sequence obtained by the division is smaller than the first length, padding the last core sequence with a preset repeated base.

With reference to the first aspect, in a fifth possible implementation manner of the first aspect, the storing the amplified DNA product and the corresponding key information includes:

converting the position information in the key information into a base unit according to a mapping relation between a preset base unit and the position information, and storing a core sequence in the key information and the base unit in a DNA form according to a preset combination mode;

or, the core sequence in the key information and the position information of the included core sequence are stored by a computer readable storage medium;

or, the key information is stored in a mixed manner in a DNA form and a computer-readable storage medium form.

With reference to the first aspect, the first possible implementation manner of the first aspect, the second possible implementation manner of the first aspect, the third possible implementation manner of the first aspect, the fourth possible implementation manner of the first aspect, or the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, the key information includes a core sequence at a start position and/or a core sequence at an end position.

With reference to the first aspect, in a seventh possible implementation manner of the first aspect, the key information further includes one or both of the linker sequence and an index sequence corresponding to a sub-base sequence obtained by splitting the base sequence.

A second aspect of embodiments of the present application provides an apparatus for data storage using DNA, the apparatus comprising:

the binary sequence extraction unit is used for extracting a binary sequence corresponding to the data to be stored;

a first sequence conversion unit for converting the binary sequence into a base sequence according to a preset mapping relationship;

the base segmentation unit is used for segmenting the base sequence into a plurality of core sequences with preset first length, and an overlapping region with preset second length is included between two related core sequences;

the sequence splicing unit is used for splicing the core sequence with a linker sequence for marking the sequence direction to obtain a sequence block;

a DNA molecule extraction unit, which is used for searching a pre-synthesized DNA molecule library for a DNA sequence matching the sequence block, and amplifying a predetermined number of the synthesized DNA molecules corresponding to the extracted DNA sequence to obtain a DNA product;

a DNA storage unit for storing the amplified DNA product and corresponding key information, wherein the key information comprises one or more core sequences, or the key information comprises partial bases of one or more core sequences and the position information of those core sequences in the base sequence.

In a third aspect, an embodiment of the present application provides a method for decoding stored data of a DNA medium, where the method includes:

obtaining a DNA sequence to be decoded and key information thereof;

extracting a core sequence included in the DNA sequence according to a preset linker sequence;

combining the core sequences to generate a base sequence according to the key information in combination with an overlapping region between the core sequences;

converting the base sequence into a binary sequence according to a preset mapping relation;

generating a data file from the converted binary data.

With reference to the third aspect, in a first possible implementation manner of the third aspect, extracting a core sequence included in the DNA sequence according to a preset linker sequence includes:

cutting the DNA sequence according to a preset linker sequence to obtain a core sequence;

and determining the direction of the obtained core sequence according to the position of the linker sequence.

With reference to the third aspect, in a second possible implementation manner of the third aspect, the combining the core sequences to generate a base sequence according to the key information and with reference to an overlapping region between the core sequences includes:

determining the position of more than one core sequence in the base sequence according to the key information;

determining the relative positional relationship between the core sequences based on the overlapping regions between the core sequences, and generating the base sequence based on the core sequences whose relative positional relationship is determined.

With reference to the third aspect, the first possible implementation manner of the third aspect, or the second possible implementation manner of the third aspect, in a third possible implementation manner of the third aspect, the extracting a core sequence included in the DNA sequence according to a preset linker sequence includes:

when the end of the extracted core sequence includes a predetermined repeated base, the repeated base included at the end of the core sequence is removed.

A fourth aspect of an embodiment of the present application provides an apparatus for decoding stored data of a DNA medium, the apparatus comprising:

a DNA sequence obtaining unit for obtaining the DNA sequence to be decoded and the key information thereof;

a core sequence extraction unit, configured to extract a core sequence included in the DNA sequence according to a preset linker sequence;

a sequence combining unit configured to combine the core sequences to generate a base sequence by combining overlapping regions between the core sequences based on the key information;

a second sequence conversion unit for converting the base sequence into a binary sequence according to a preset mapping relationship;

a data file generating unit for generating a data file from the converted binary data.

A fifth aspect of embodiments of the present application provides a method for generating a DNA molecule library, the method comprising:

generating a core sequence from the base combination, the core sequence comprising a base fragment for storing data;

splicing a preset linker sequence onto the core sequence to obtain a DNA sequence corresponding to the core sequence, wherein the linker sequence is used for identifying the direction of the core sequence;

synthesizing corresponding DNA molecules according to the DNA sequences, and obtaining a DNA molecule library according to the synthesized DNA molecules.

A sixth aspect of the embodiments of the present application provides a storage device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for storing data by using DNA according to any one of the first aspect, or the steps of the method for decoding stored data of a DNA medium according to any one of the third aspect, or the steps of the method for generating a DNA molecule library according to the fifth aspect.

Compared with the prior art, the embodiments of the present application have the following advantages. A binary sequence corresponding to the data to be stored is converted into a base sequence, and the base sequence is divided according to a preset first length so that an overlapping region with a second length is formed between adjacent core sequences; together with the position information of core sequences in the base sequence that is included in the key information, the overlapping regions allow the core sequences to be combined during decoding. Splicing the core sequence with the linker sequence to obtain a sequence block makes it convenient to determine the direction of the core sequence during decoding. For a single data storage operation, only a small, predetermined number of synthesized DNA molecules need to be taken from the DNA molecule library, which greatly reduces the number of synthesized DNA molecules used per storage operation, ensures that DNA molecules synthesized in vitro in one run can be called many times, and effectively reduces the DNA synthesis cost of data storage. Meanwhile, because the pre-synthesized DNA molecule library can be called repeatedly, it is no longer necessary to synthesize, for each different piece of binary data, a number of bases equal to at least one half of the number of bits; this reduces the total number of bases synthesized, further reduces the DNA synthesis cost of storing data, and helps promote the wide application of storage devices based on DNA media.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of an implementation of a method for storing data by using DNA provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of the generation of a core sequence from a base sequence according to an embodiment of the present application;

FIG. 3 is a further schematic diagram of the generation of core sequences from a base sequence according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a core sequence obtained by segmentation according to an embodiment of the present application;

FIG. 5 is a diagram of a spliced sequence block according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a combination rule provided by an embodiment of the present application;

FIG. 7 is a schematic flow chart of an implementation of a method for decoding stored data of a DNA medium according to an embodiment of the present application;

FIG. 8 is a schematic diagram of an apparatus for data storage using DNA according to an embodiment of the present application;

FIG. 9 is a schematic diagram of an apparatus for decoding stored data of a DNA medium according to an embodiment of the present application;

FIG. 10 is a schematic diagram of a storage device provided in an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.

In vitro DNA synthesis generally starts from modified A/T/C/G base monomers and, without depending on a template, builds a macromolecular DNA polymer whose A/T/C/G order matches an arbitrarily specified sequence. The synthesis proceeds through multiple rounds of chemical or enzymatic reactions, with each round adding one or more bases after the last base or bases already in place. Each round consumes reagents and materials and incurs labor and equipment wear, so the number of bases in the DNA sequence to be synthesized determines most of the synthesis cost of the macromolecular DNA polymer. Conventional DNA synthesis services on the market likewise typically quote a price equal to the unit price per base multiplied by the number of bases. For example, for a 50-base sequence, a commercial company may quote a unit price of 0.3 to 0.6 monetary units per base, so that the 50 bases are sold for between 15 and 30 monetary units. In current DNA data storage methods, a number of bases equal to at least one half of the number of bits of binary data has to be synthesized, which is expensive. Furthermore, because each chemical or enzymatic reaction requires a fixed reaction volume, current technology produces at least a certain number of DNA molecules in a single synthesis run. In the above example, a conventional single-stranded DNA synthesis method usually yields at least 0.5 nmol (3 × 10¹⁴ molecules) of the 50-base sequence. In DNA data storage applications, however, only a very small number of these DNA molecules (e.g., 1 molecule, 10² molecules, 10³ molecules, 10⁵ molecules, etc.) is needed to represent the data to be stored, so storing data directly in DNA synthesized for one-time use wastes a great deal of the raw materials used to synthesize the DNA molecules.
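The figures in the preceding paragraph can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function names are hypothetical, the price unit is left abstract, and Avogadro's constant is the one value taken from outside this description.

```python
# Worked arithmetic for the synthesis example above (illustrative only).
AVOGADRO = 6.02214076e23  # molecules per mole


def synthesis_price(n_bases: int, unit_price: float) -> float:
    """Price of one synthesis order: unit price per base times the base count."""
    return n_bases * unit_price


def molecules_in(nanomoles: float) -> float:
    """Number of molecules contained in the given amount of substance."""
    return nanomoles * 1e-9 * AVOGADRO


low = synthesis_price(50, 0.3)
high = synthesis_price(50, 0.6)
yield_molecules = molecules_in(0.5)

print(low, high)                 # price range for a 50-base sequence
print(f"{yield_molecules:.1e}")  # on the order of 3e14 molecules in 0.5 nmol
```

Compared with the 1 to 10⁵ molecules that suffice to represent the data, a single synthesis run therefore yields many orders of magnitude more molecules than one storage operation needs.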

The present application proposes a method for data storage by means of pre-synthesized DNA sequences. The base sequence of the data to be stored is divided into core sequence fragments with overlapping regions, and the data is stored by using a general-purpose DNA molecule library that is synthesized in advance and can be called many times. The library is highly versatile: the DNA sequences corresponding to the data to be stored can be extracted from it as required. This greatly reduces both the total number of bases that must be synthesized for DNA products storing different data and the number of synthesized DNA molecules consumed per use, which lowers the data storage cost and facilitates the wide application of DNA data storage. The following description is made in detail with reference to the accompanying drawings.

Fig. 1 is a schematic flow chart of an implementation of data storage using DNA according to an embodiment of the present application, which is detailed as follows:

In S101, a binary sequence corresponding to the data to be stored is extracted.

The data to be stored may include one or more items of data information that may exist in a computer, such as pictures, texts, programs, audio, video, and the like.

To obtain the binary sequence corresponding to the data to be stored, the encoded form of the data can be obtained and converted into binary, which yields the corresponding binary sequence. For example, the characters in a text message may be converted into the corresponding ASCII (American Standard Code for Information Interchange) codes or Unicode (Universal Character Set) codes, and the codes may then be converted into a binary sequence.

For example, the binary sequence corresponding to the extracted text "sculpture a beautiful winter work as a piece of art" is "111001011011000010000110111001111011111010001110111001001011100010111101111001111001101010000100111001011000011010101100111001011010110110100011111010011001101110010101111001011010000110010001111001101000100010010000111001001011100010000000111001001011101110110110111010001000100110111010111001101001110010101111111001011001001110000001".
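As a minimal sketch of S101, the conversion from encoded bytes to a binary sequence can be written as follows. The function name `to_bit_string` is hypothetical. The byte values 0xE5 0xB0 0x86 are simply the first 24 bits of the example sequence above regrouped into bytes; reading them as one UTF-8 encoded character is an assumption, since the encoding of the example is not stated.

```python
def to_bit_string(data: bytes) -> str:
    """Return the bytes as one string of '0'/'1' characters, 8 bits per byte."""
    return "".join(f"{byte:08b}" for byte in data)


# First three bytes of the example sequence (assumed to be one UTF-8 character).
first_bytes = b"\xe5\xb0\x86"
print(to_bit_string(first_bytes))           # 111001011011000010000110

# An ASCII character encodes to a single byte.
print(to_bit_string("A".encode("ascii")))   # 01000001
```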

In S102, the binary sequence is converted into a base sequence according to a predetermined mapping relationship.

Since the bases in the DNA include A, T, C, G four types of bases, the predetermined mapping relationship may be a binary to quaternary mapping relationship. In one possible implementation, the mapping relationship may be as shown in table 1:

Binary value	00	01	11	10
Base	A	T	C	G

TABLE 1

The mapping relationship in the above table is arbitrarily defined, and the mapping relationship between the binary number and the base can be defined according to actual usage habits or requirements.

According to the mapping relationship in the above table, the binary sequence in S101 can be converted into base sequence: "CGTTGCAAGATGCGTCGCCGGACGCGTAGCGAGCCTCGTCGTGGGATACGTTGATGGGCACGTTGGCTGGACCGGTGTGCGTTTCGTTGGATGTATCGTGGAGAGTAACGTAGCGAGAAACGTAGCGCGCTGCGGAGAGTGCGGCGTGGTCAGGCCCGTTGTACGAAT".
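The conversion in S102 can be sketched as below, using the mapping of Table 1 (00 to A, 01 to T, 11 to C, 10 to G). The function names are hypothetical, and the inverse function is included only to illustrate that the mapping is reversible for decoding.

```python
# Mapping of Table 1: two bits per base.
BITS_TO_BASE = {"00": "A", "01": "T", "11": "C", "10": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bits_to_bases(bits: str) -> str:
    """Convert a binary string into a base sequence, two bits per base."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bases_to_bits(bases: str) -> str:
    """Inverse conversion, as needed when decoding."""
    return "".join(BASE_TO_BITS[base] for base in bases)


# The first 24 bits of the binary sequence from S101.
bits = "111001011011000010000110"
bases = bits_to_bases(bits)
print(bases)  # CGTTGCAAGATG, the first 12 bases of the base sequence above
```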

In a possible implementation manner, if the data to be stored is large, the file containing the data can be split into a plurality of subfiles, and an index sequence can be used to record both the subfiles obtained by the split and the order of the sub-base sequences corresponding to them. The index sequence may also be incorporated into the primers that correspond to the left or right linker of the core sequence blocks, so that polymerase-based amplification such as PCR adds it to the amplified DNA molecules of every core sequence belonging to the subfile. Giving the core sequence blocks of the same subfile a uniform index identifier allows their sequences to be clustered together for splicing, which makes the subfile's data easier to read. Splitting and amplifying in this way allows a large data file to be converted segment by segment, which improves the accuracy of data storage.

The index sequence is a sequence composed of bases and can be used to indicate the position information of a subfile. For example, the index sequences may be defined as AAAA = 1, AAAG = 2, AAAT = 3, AAAC = 4, AATA = 5, AACA = 6, and so on.
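One way to picture the use of index sequences is as a lookup table that restores the order of the sub-base sequences. The sketch below uses only the six example assignments listed above; the function name and the sample sub-base sequences are hypothetical.

```python
# Example index assignments taken from the text; further entries would be
# defined by whatever rule the implementer chooses.
INDEX_TO_POSITION = {
    "AAAA": 1, "AAAG": 2, "AAAT": 3,
    "AAAC": 4, "AATA": 5, "AACA": 6,
}


def order_subsequences(tagged):
    """Sort (index_sequence, sub_base_sequence) pairs by the indexed position."""
    ordered = sorted(tagged, key=lambda pair: INDEX_TO_POSITION[pair[0]])
    return [sub_sequence for _, sub_sequence in ordered]


# Sub-base sequences arriving out of order (made-up sample data).
tagged = [("AAAT", "GGCA"), ("AAAA", "CGTT"), ("AAAG", "GATG")]
print("".join(order_subsequences(tagged)))  # CGTTGATGGGCA
```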

In S103, the base sequence is divided into a plurality of core sequences with a predetermined first length, and an overlap region with a predetermined second length is included between two related core sequences.

In order to facilitate direct access to a pre-synthesized DNA sequence, it is necessary to divide the converted base sequence into core sequences of a predetermined first length. The first length may be 4 bases, 5 bases, 6 bases, 7 bases, or 8 bases, among others.

In order to combine the divided core sequences correctly, two associated core sequences obtained by the division share an overlapping region of a preset second length; that is, the bases of the two core sequences are identical within the overlapping region. Based on the bases of the overlapping region, two associated core sequences can be found when the core sequences are combined.

The two associated core sequences, as determined by a preset association rule, may be core sequences that are directly adjacent in the base sequence, adjacent odd-numbered core sequences, or adjacent even-numbered core sequences.

When the two associated core sequences are directly adjacent, the generation of core sequences from the base sequence is illustrated in FIG. 2, which shows overlapping regions of 3 bases (a), 4 bases (b), 5 bases (c), 6 bases (d) and 7 bases (e). In these examples the second length of the overlapping region is half of the first length of the core sequence. The first length and the second length are not limited to this ratio and may be in other proportional relationships. Likewise, the number of bases in the overlapping region is not limited to those shown in FIG. 2.

When the length of the last core sequence obtained by base sequence segmentation is smaller than the preset first length, the filling can be performed by the preset base type or the preset repeated base. For example, in the segmentation process of the 5-base overlap (c) in FIG. 2, the last core sequence is filled with the base A. In the segmentation process of the 7-base staggered overlapping (e), the tail core sequence is supplemented with the base AAAA, so that the supplemented core sequence has the same length as other core sequences, and the splicing and storing operations of the core sequences are facilitated.
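The division into directly adjacent, overlapping core sequences with padding of the last one can be sketched as follows. This is a minimal illustration rather than the procedure of the figures: the function name is hypothetical, and padding with the base A is taken from the example above.

```python
def split_into_cores(bases: str, first_len: int, overlap: int, pad: str = "A") -> list:
    """Slide a window of first_len bases in steps of (first_len - overlap)."""
    if not 0 <= overlap < first_len:
        raise ValueError("overlap must be non-negative and shorter than first_len")
    if not bases:
        return []
    step = first_len - overlap
    cores = []
    start = 0
    while True:
        core = bases[start:start + first_len]
        # Pad a short final core so that all cores have the same length.
        cores.append(core.ljust(first_len, pad))
        if start + first_len >= len(bases):
            break
        start += step
    return cores


print(split_into_cores("CGTTGCAAGATG", first_len=8, overlap=4))
# ['CGTTGCAA', 'GCAAGATG']
print(split_into_cores("CGTTGCAAGA", first_len=8, overlap=4))
# ['CGTTGCAA', 'GCAAGAAA']  (last core padded with A)
```

Consecutive cores share their last and first `overlap` bases, which is what allows them to be rejoined during decoding.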

The associated core sequences may also be adjacent odd-numbered or adjacent even-numbered core sequences. FIG. 3, a further schematic diagram of the generation of core sequences from a base sequence, shows a 3-base parity-staggered overlap (f), a 4-base parity-staggered overlap (g), and a staggered overlap of 3 consecutive 4-base units (h). In the staggered overlap of 3 consecutive 4-base units, the overlapping region is 8 consecutive bases shared by two adjacent core sequences, and the length of the core sequence is 12 bases.

In a possible implementation manner, an overlapping region with a preset second length may also be included between the (M+i)-th core sequence and the (M+N+i)-th core sequence, where M and N are preset integers and i is an integer variable greater than or equal to 0. By changing the parameters M and N, different forms of encrypted encoding may be implemented.
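The pairing rule with parameters M and N can be made concrete by listing which positions are associated. The sketch below assumes 1-based position numbers and a hypothetical function name; it only enumerates the pairs and does not build the core sequences themselves.

```python
def associated_pairs(total: int, m: int, n: int) -> list:
    """List the 1-based (M+i, M+N+i) position pairs that share an overlap region."""
    pairs = []
    i = 0
    while m + n + i <= total:
        pairs.append((m + i, m + n + i))
        i += 1
    return pairs


# N = 1 pairs each core sequence with its direct neighbour.
print(associated_pairs(total=6, m=1, n=1))  # [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]

# N = 2 pairs odd positions with odd positions and even with even.
print(associated_pairs(total=6, m=1, n=2))  # [(1, 3), (2, 4), (3, 5), (4, 6)]
```

Under this reading, N = 1 reproduces the directly adjacent case and N = 2 reproduces the odd/even staggered case described above.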

When the base sequence obtained in S102 is divided into core sequences in the manner shown in FIG. 2 (b), the table of core sequences shown in FIG. 4 is obtained, in which any two adjacent core sequences share a 4-base overlapping region. When a DNA sequence is decoded, the core sequences it contains can be spliced together pairwise according to the bases of their overlapping regions.

In S104, the core sequence and a linker sequence used for indicating the sequence direction are spliced to obtain a sequence block.

The linker sequence in the embodiments of the present application may include both a front linker sequence and a rear linker sequence, or only one of the two. The sequence blocks obtained by splicing on the linker sequences mark the front and rear directions of the core sequence. For example, the front (left) direction of the core sequence is indicated by the front linker sequence, and the rear (right) direction by the rear linker sequence. Therefore, when the DNA sequence is decoded, the direction of each core sequence can be determined from the linker sequence, and the core sequences can be combined correctly.

FIG. 4 is a schematic diagram of the core sequences obtained by segmentation. The 26 core sequences can each be spliced with a preset front linker sequence "CGCCAGGGTTTTCCCAGTCACGAC" and a preset rear linker sequence "TCCTGTGTGAAATTGTTATCCGCT" to obtain the spliced sequence blocks shown in FIG. 5.
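Splicing in S104, and the matching extraction step used in decoding, can be sketched with the front and rear sequences quoted above. The function names are hypothetical, and the sketch handles only blocks read in the forward orientation; a reverse-strand read would need additional handling that is not shown here.

```python
FRONT_LINKER = "CGCCAGGGTTTTCCCAGTCACGAC"
REAR_LINKER = "TCCTGTGTGAAATTGTTATCCGCT"


def make_block(core: str) -> str:
    """Splice the front and rear sequences onto a core sequence."""
    return FRONT_LINKER + core + REAR_LINKER


def extract_core(block: str) -> str:
    """Recover the core sequence from a block read in the forward orientation."""
    if not (block.startswith(FRONT_LINKER) and block.endswith(REAR_LINKER)):
        raise ValueError("block is not flanked by the expected sequences")
    return block[len(FRONT_LINKER):len(block) - len(REAR_LINKER)]


block = make_block("CGTTGCAA")
print(block)
print(extract_core(block))  # CGTTGCAA
```

Because the two flanking sequences differ, finding one at the start and the other at the end of a read tells the decoder which end of the core sequence is the front.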

In S105, a DNA sequence matching the sequence block is searched in a DNA molecule library synthesized in advance, and a predetermined number of synthesized DNA molecules corresponding to the extracted DNA sequence are amplified to obtain a DNA product.

In the present application, a library of synthesized DNA molecules is set up in advance, and the correspondence between the synthesized DNA molecules and their DNA sequences is stored with the library. For each sequence block obtained by the splicing in S104, the corresponding DNA sequence can be looked up in the preset DNA molecule library, and a small number of the matching synthesized DNA molecules (e.g., 1 molecule, 10² molecules, 10³ molecules, 10⁵ molecules, etc.) can be taken out and amplified. This ensures that the DNA molecules of a library synthesized in vitro in a single run can be called repeatedly, which greatly reduces the number of DNA molecules consumed by one data storage operation and effectively reduces the DNA synthesis cost of data storage.

The pre-synthesized DNA molecule library may include DNA molecule libraries whose core sequences have different base lengths. For example, the core sequences may include arbitrary sequences of 2, 3, 4, 5, 6, 7 or 8 bases in length, and so on. The number of arbitrary sequences of 2 bases in length is 4 × 4 = 16, the number of arbitrary sequences of 3 bases in length is 4 × 4 × 4 = 64, and the rest can be calculated by analogy.
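The size of such a library follows directly from enumerating every base combination of a given length, which gives 4 to the power of that length. A short sketch, with a hypothetical function name:

```python
from itertools import product


def all_core_sequences(length: int) -> list:
    """Enumerate every base sequence of the given length (4 ** length in total)."""
    return ["".join(bases) for bases in product("ATCG", repeat=length)]


for length in (2, 3, 4, 8):
    print(length, len(all_core_sequences(length)))
# 2 16
# 3 64
# 4 256
# 8 65536
```

A library for 8-base core sequences would therefore contain 65,536 distinct core sequences, each synthesized once and then reused across storage operations.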

The core sequences so defined are spliced with the preset linker sequences; for example, the left linker sequence and the right linker sequence are spliced to the left and right sides of the core sequence, respectively. A DNA molecule library is obtained by synthesizing a large number of DNA molecules carrying the linker sequences. When the library is used, only a certain amount of the DNA molecules corresponding to the sequence blocks needs to be taken from it, and the data is stored with the selected DNA molecules; the remaining DNA molecules in the library can continue to be used, which further reduces the data storage cost of the DNA medium.

When the DNA molecule is amplified, the primer amplification mode adopted can comprise isothermal amplification, PCR (polymerase chain reaction) amplification and other amplification modes.

In S106, the amplified DNA product sequence and the corresponding key information are stored, wherein the key information includes one or more core sequences, or the key information includes one or more partial bases in the core sequences and the position information of the included core sequences in the base sequences.

The key information comprises more than one core sequence, or more than one partial base in the core sequence and corresponding position information of the core sequence in the base sequence, so that the accurate position of the core sequence can be quickly determined.

In a possible implementation, the key information includes the core sequence at the start position and the core sequence at the end position. Splicing can then proceed from the core sequence at the start position and the core sequence at the end position, with the core sequences at intermediate positions placed according to the bases in the overlapping regions. In a possible implementation, a number of intermediate core sequences may also be included in the key information. When the overlapping region contains fewer bases, the number of pieces of position information of core sequences can be increased.

When storing the key information, the position number in the base sequence may be converted into a base unit, and the base unit obtained from the position information and the core sequence may be stored in DNA form in a predetermined combination. For example, the combination can follow the rules shown in FIG. 6, in which position number 1 corresponds to the base unit AAAA, position number 18 to the base unit ACAA, and position number 100 to the base unit CTGA. The correspondence between the position numbers and the base units can be determined by a predetermined mapping relationship.

When the key information is stored, the positional information and the base unit sequence in the key information may be stored in a computer-readable storage medium. Or, the key information is stored in a mixed manner in a DNA form and a computer-readable storage medium form.

When the key information and the DNA product are stored separately, the safety and the confidentiality of data storage are further improved.

The amplified DNA product may be stored in lyophilized form or in liquid form. The storage temperature may be -20 degrees, -80 degrees, etc. The amplified DNA product may be stored in a centrifuge tube or a freezing tube, or may be stored by means of wax droplets.

In the case where the overlapping region is 4 bases and the core sequence is 8 bases, the key information can be determined by calculation simulation. For example, the position information of the core sequence recorded by the key information includes the core sequence of the start position, the core sequence of the end position, and the core sequence of the center position. For example, the base sequence determined in S102 is:

“CGTTGCAAGATGCGTCGCCGGACGCGTAGCGAGCCTCGTCGTGGGATACGTTGATGGGCACGTTGGCTGGACCGGTGTGCGTTTCGTTGGATGTATCGTGGAGAGTAACGTAGCGAGAAACGTAGCGCGCTGCGGAGAGTGCGGCGTGGTCAGGCCCGTTGTACGAAT”.

The key information corresponding to this base sequence may include the 4 bases CGTT starting from position 1 of the base sequence, the 4 bases TTCG starting from position 83, and the 4 bases GAAT starting from position 165, recorded as "1 = CGTT; 83 = TTCG; 165 = GAAT".
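The recorded positions can be checked against the 168-base example sequence (entry 1 of the sequence listing) with a short Python sketch. Positions are 1-based as in the text; the helper names `key_entries` and `format_key` are illustrative only.

```python
# Entry 1 of the sequence listing, written in its 10-base groups.
BASE_SEQUENCE = "".join("""
cgttgcaaga tgcgtcgccg gacgcgtagc gagcctcgtc gtgggatacg ttgatgggca
cgttggctgg accggtgtgc gtttcgttgg atgtatcgtg gagagtaacg tagcgagaaa
cgtagcgcgc tgcggagagt gcggcgtggt caggcccgtt gtacgaat
""".split()).upper()

def key_entries(seq, positions, width=4):
    """Map each 1-based position to the `width` bases starting there."""
    return {p: seq[p - 1:p - 1 + width] for p in positions}

def format_key(entries):
    """Render the entries in the 'position = bases' notation of the text."""
    return "; ".join(f"{p} = {bases}" for p, bases in sorted(entries.items()))

key = key_entries(BASE_SEQUENCE, [1, 83, 165])
print(format_key(key))
```

With 4-base entries, the last recordable start position in a 168-base sequence is 165.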

In a possible implementation, the key information may further include one or both of the linker sequence and the index sequence corresponding to a sub-base sequence obtained by splitting the base sequence.

Fig. 7 is a schematic flow chart of an implementation of a method for decoding stored data of a DNA medium according to an embodiment of the present application, where the method includes:

in S701, a DNA sequence to be decoded and key information thereof are acquired.

The DNA product may be read by any sequencing mode capable of reading it, such as second-generation sequencing or third-generation sequencing, to obtain the DNA sequence to be decoded corresponding to the DNA product.

The DNA sequence to be decoded and the key information thereof are the amplified DNA sequence and the key information obtained by the data storage method shown in fig. 1.

Wherein, the key information may include a core sequence of the start position and/or a core sequence of the end position. By determining the core sequence at the start position and/or the core sequence at the end position, a fast combination of core sequences can be performed depending on the determined core sequences.

In a possible implementation, the key information may further include one or both of a linker sequence and an index sequence corresponding to a sub-base sequence obtained by splitting the base sequence. The DNA sequence can be segmented at the linker sequences to obtain the core sequences it contains, and the direction of each core sequence can be distinguished from the linker sequences, so that the core sequences can be accurately combined.

The order of the combined base subsequences can be easily determined by the index sequence, so that an accurate base sequence can be obtained according to the determined order.

In S702, a core sequence included in the DNA sequence is extracted according to a preset linker sequence.

The linker sequence may be set in advance and called directly during decoding. Alternatively, different linker sequences may be selected for different data storage operations and stored in the key information in DNA form. During decoding, the linker sequence that was used can then be extracted by analyzing the key information, which improves the security of DNA data storage.

The linker sequence may include a left linker sequence and/or a right linker sequence, and one or two linker sequences may be used to indicate the direction of the core sequence, so as to facilitate decoding to obtain the core sequence with the correct direction.

For example, the core sequences shown in FIG. 4 can be obtained by cutting the 26 sequence blocks corresponding to the DNA sequence shown in FIG. 5 at the predetermined front linker sequence "CGCCAGGGTTTTCCCAGTCACGAC" and the predetermined rear linker sequence "TCCTGTGTGAAATTGTTATCCGCT".
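A minimal Python sketch of this extraction follows. It assumes each read is exactly one sequence block flanked by the two linkers quoted above. Retrying on the reverse complement is an illustrative interpretation of how distinct left and right linkers can indicate direction, not a step stated in the application, and the function names are hypothetical.

```python
LEFT_LINKER = "CGCCAGGGTTTTCCCAGTCACGAC"
RIGHT_LINKER = "TCCTGTGTGAAATTGTTATCCGCT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def extract_core(read, left=LEFT_LINKER, right=RIGHT_LINKER):
    """Strip the linkers from a sequence block and return the core sequence.

    The read is tried as given and then as its reverse complement, so a
    block read from the opposite strand is returned in forward direction.
    """
    read = read.upper()
    for candidate in (read, reverse_complement(read)):
        if candidate.startswith(left) and candidate.endswith(right):
            return candidate[len(left):len(candidate) - len(right)]
    raise ValueError("read is not flanked by the expected linker sequences")

# Second 56-base row of entry 6 of the sequence listing.
block = LEFT_LINKER + "GCAAGATG" + RIGHT_LINKER
print(extract_core(block))
print(extract_core(reverse_complement(block)))
```

Because the two linkers differ, a read that matches only after reverse complementation can be recognized as coming from the opposite strand.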

In S703, the core sequences are combined to generate a base sequence based on the key information in conjunction with the overlapping region between the core sequences.

The decryption process using the key information in the embodiment of the application corresponds to the encoding and storing process using the key information.

For example, corresponding to the data storage method shown in FIG. 1, the initial position for combining the core sequences is accurately determined from the position information of the one or more core sequences included in the key information. The other core sequences are then combined starting from that initial position, which determines their positions in the base sequence.

The core sequences are combined according to the overlapping regions between them to obtain the spliced base sequence. These may be the overlapping regions between adjacent core sequences, or the overlapping regions between adjacent odd-numbered core sequences and between adjacent even-numbered core sequences.

For example, for the core sequences shown in FIG. 4, combining the predetermined key information with the overlapping regions between the core sequences yields the following base sequence for the DNA sequence to be decrypted: "CGTTGCAAGATGCGTCGCCGGACGCGTAGCGAGCCTCGTCGTGGGATACGTTGATGGGCACGTTGGCTGGACCGGTGTGCGTTTCGTTGGATGTATCGTGGAGAGTAACGTAGCGAGAAACGTAGCGCGCTGCGGAGAGTGCGGCGTGGTCAGGCCCGTTGTACGAAT".
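A minimal Python sketch of the combining step follows. It assumes 8-base core sequences, 4-base overlaps between adjacent core sequences, and key information that names the core sequences at the start and end positions. The function name `assemble` and the backtracking search are illustrative choices rather than the procedure prescribed by the application. When several orderings satisfy the overlaps, the sketch returns the first one found, which is one reason additional position information in the key may be needed.

```python
from collections import Counter

def assemble(cores, start, end, overlap=4):
    """Order the cores so that each overlaps the next, then merge them.

    `start` and `end` are the core sequences at the first and last
    positions, as supplied by the key information.
    """
    remaining = Counter(cores)
    remaining[start] -= 1          # the start core is used first
    path = [start]

    def extend():
        if sum(remaining.values()) == 0:
            return path[-1] == end
        tail = path[-1][-overlap:]
        candidates = [c for c, n in remaining.items()
                      if n > 0 and c.startswith(tail)]
        for core in candidates:
            remaining[core] -= 1
            path.append(core)
            if extend():
                return True
            path.pop()             # backtrack and try another candidate
            remaining[core] += 1
        return False

    if not extend():
        raise ValueError("no ordering is consistent with the key information")
    # Keep the first core whole, then append the non-overlapping tail of each.
    return path[0] + "".join(core[overlap:] for core in path[1:])

# The first four core sequences of the example, in shuffled order.
cores = ["GATGCGTC", "CGTTGCAA", "CGTCGCCG", "GCAAGATG"]
print(assemble(cores, start="CGTTGCAA", end="CGTCGCCG"))
```

The usage example reconstructs the first 20 bases of the example base sequence from its first four core sequences.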

The core sequences and position information in the key information may include the core sequence at the first position and the core sequence at the end position. In a possible implementation scenario, if the DNA sequence corresponding to the DNA product contains only a few types of DNA molecules, the base sequence can be obtained by directly splicing the DNA sequences according to the overlapping regions.

In S704, the base sequence is converted into a binary sequence according to a predetermined mapping relationship.

The base sequence may be converted into a binary sequence using the same predetermined mapping relationship as in S102. For example, the base sequence obtained in S703 may be converted into the following binary sequence: "111001011011000010000110111001111011111010001110111001001011100010111101111001111001101010000100111001011000011010101100111001011010110110100011111010011001101110010101111001011010000110010001111001101000100010010000111001001011100010000000111001001011101110110110111010001000100110111010111001101001110010101111111001011001001110000001".
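The conversion can be sketched in Python as follows. The two-bits-per-base table used here (A = 00, T = 01, G = 10, C = 11) is inferred by comparing the example base sequence with the example binary sequence; it is an assumption for illustration, and the preset mapping relationship of S102 governs in practice. The function names are hypothetical.

```python
# Assumed mapping, inferred from the worked example (not a stated rule).
BASE_TO_BITS = {"A": "00", "T": "01", "G": "10", "C": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

def bases_to_bits(seq):
    """Convert a base sequence into a binary string, two bits per base."""
    return "".join(BASE_TO_BITS[base] for base in seq.upper())

def bits_to_bases(bits):
    """Inverse conversion, as used on the storage side."""
    if len(bits) % 2:
        raise ValueError("binary sequence length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

# The first 12 bases of the example give the first 24 bits of the example.
print(bases_to_bits("CGTTGCAAGATG"))
```

Under this mapping a 168-base sequence yields 336 bits, which agrees with the lengths of the two example sequences.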

In S705, a data file is generated from the converted binary data.

Using a preset coding rule, the generated binary data can be converted into the corresponding data file, such as a picture, a text, a program, audio, video and the like.
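As a minimal Python sketch of this last step, the binary sequence can be packed into bytes before being interpreted according to the coding rule. Grouping into 8-bit bytes with the most significant bit first is an assumption, since the preset coding rule is not specified here, and the function name `bits_to_bytes` is illustrative.

```python
def bits_to_bytes(bits):
    """Pack a binary string into bytes, 8 bits per byte, MSB first."""
    if len(bits) % 8:
        raise ValueError("binary sequence length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

# The first 24 bits of the example binary sequence.
payload = bits_to_bytes("111001011011000010000110")
print(payload.hex())
```

The resulting bytes can then be written to a file or decoded according to whichever file format the coding rule specifies.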

It should be understood that the steps in the above embodiments may be carried out manually or by a computer program. The sequence numbers of the steps do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.

In summary, the method for storing data by using DNA in the embodiment of the present application has the following advantages:

1. When a single data storage operation is carried out, a predetermined small number of synthesized DNA molecules can be taken from a DNA molecule library synthesized in vitro in advance. This greatly reduces the number of synthesized DNA molecules used in one data storage operation, allows DNA molecules synthesized in vitro once to be called multiple times, and effectively reduces the DNA synthesis cost of data storage.

2. The pre-synthesized DNA molecule library can be called repeatedly, which avoids having to synthesize, for each different set of binary data, a number of bases equal to at least half the number of bits of that data. This saves on the total number of bases synthesized and further reduces the synthesis cost of stored data.

3. Because the data stored in the DNA product can only be read with the key information, the security of data storage can be effectively improved.

Fig. 8 is a schematic diagram of an apparatus for data storage using DNA according to an embodiment of the present application, the apparatus including:

a binary sequence extraction unit 801, configured to extract a binary sequence corresponding to data to be stored;

a first sequence conversion unit 802, configured to convert the binary sequence into a base sequence according to a preset mapping relationship;

a base dividing unit 803, configured to divide the base sequence into a plurality of core sequences with a preset first length, wherein an overlapping region with a preset second length is included between two associated core sequences;

a sequence splicing unit 804, configured to splice the core sequence and a linker sequence used for indicating a sequence direction to obtain a sequence block;

a DNA molecule extracting unit 805, configured to search a DNA sequence matching the sequence block in a pre-synthesized DNA molecule library, and amplify a predetermined number of synthesized DNA molecules corresponding to the extracted DNA sequence to obtain a DNA product;

a DNA storage unit 806, configured to store the amplified DNA product sequence and corresponding key information, where the key information includes one or more core sequences, or the key information includes one or more partial bases in the core sequences and position information of the included core sequences in the base sequences.

The apparatus for data storage using DNA corresponds to the method for data storage using DNA shown in fig. 1.

Fig. 9 is a schematic diagram of an apparatus for decoding stored data of a DNA medium according to an embodiment of the present application, the apparatus including:

a DNA sequence obtaining unit 901, configured to obtain a DNA sequence to be decoded and key information thereof;

a core sequence extraction unit 902, configured to extract a core sequence included in the DNA sequence according to a preset adaptor sequence;

a sequence combining unit 903 for combining the core sequences to generate a base sequence by combining overlapping regions between the core sequences based on the key information;

a second sequence conversion unit 904, configured to convert the base sequence into a binary sequence according to a preset mapping relationship;

a data file generating unit 905 for generating a data file from the converted binary data.

The apparatus for decoding the stored data of the DNA medium shown in fig. 9 corresponds to the method for decoding the stored data of the DNA medium shown in fig. 7.

Fig. 10 is a schematic diagram of a storage device according to an embodiment of the present application. As shown in fig. 10, the storage device 10 of this embodiment includes: a processor 100, a memory 101 and a computer program 102 stored in said memory 101 and executable on said processor 100, for example a data storage or decoding program using DNA. The processor 100, when executing the computer program 102, implements the steps of the above-described embodiments of the method for storing or decoding data using DNA. Alternatively, the processor 100 implements the functions of the modules/units in the above device embodiments when executing the computer program 102.

Illustratively, the computer program 102 may be partitioned into one or more modules/units that are stored in the memory 101 and executed by the processor 100 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 102 in the storage device 10.

The storage device may include, but is not limited to, the processor 100 and the memory 101. Those skilled in the art will appreciate that fig. 10 is merely an example of the storage device 10 and does not limit it; the storage device 10 may include more or fewer components than those shown, may combine some components, or may use different components. For example, the storage device may also include input/output devices, network access devices, buses, etc.

The processor 100 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor or the like.

The memory 101 may be an internal storage unit of the storage device 10, such as a hard disk or a memory of the storage device 10. The memory 101 may also be an external storage device of the storage device 10, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and the like, provided on the storage device 10. Further, the memory 101 may include both an internal storage unit and an external storage device of the storage device 10. The memory 101 is used for storing the computer program and other programs and data required by the storage device. The memory 101 may also be used to temporarily store data that has been output or is to be output.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and which realizes the steps of the method embodiments described above when executed by a processor. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be suitably increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Sequence listing

<110> Shenzhen advanced technology research institute

<120> method, apparatus and storage device for storing data using DNA

<140>202110688074.9

<160> 6

<210> 1

<211> 168

<212> DNA

<213> Artificial Sequence (Artificial Sequence)

<400> 1

cgttgcaaga tgcgtcgccg gacgcgtagc gagcctcgtc gtgggatacg ttgatgggca 60

cgttggctgg accggtgtgc gtttcgttgg atgtatcgtg gagagtaacg tagcgagaaa 120

cgtagcgcgc tgcggagagt gcggcgtggt caggcccgtt gtacgaat 168

<210> 2

<211> 24

<212> DNA

<213> Artificial Sequence (Artificial Sequence)

<400> 2

cgccagggtt ttcccagtca cgac 24

<210> 3

<211> 24

<212> DNA

<213> Artificial Sequence (Artificial Sequence)

<400> 3

tcctgtgtga aattgttatc cgct 24

<210> 4

<211> 24

<212> DNA

<213> Artificial Sequence (Artificial Sequence)

<400> 4

atcggtgcgt acgttacgtg gcag 24

<210> 5

<211> 328

<212> DNA

<213> Artificial Sequence (Artificial Sequence)

<400> 5

cgttgcaagc aagatggatg cgtccgtcgc cggccggacg gacgcgtacg tagcgagcga 60

gcctgcctcg tccgtcgtgg gtgggataga tacgttcgtt gatggatggg caggcacgtt 120

cgttggctgg ctggacggac cggtcggtgt gcgtgcgttt gtttcgttcg ttggatggat 180

gtatgtatcg tgcgtggaga gagagtaagt aacgtacgta gcgagcgaga aagaaacgta 240

cgtagcgcgc gcgctggctg cggacggaga gtgagtgcgg gcggcgtgcg tggtcagtca 300

ggccggcccg ttcgttgtac gtacgaat 328

<210> 6

<211> 2296

<212> DNA

<213> Artificial Sequence (Artificial Sequence)

<400> 6

cgccagggtt ttcccagtca cgaccgttgc aatcctgtgt gaaattgtta tccgct 56

cgccagggtt ttcccagtca cgacgcaaga tgtcctgtgt gaaattgtta tccgct 112

cgccagggtt ttcccagtca cgacgatgcg tctcctgtgt gaaattgtta tccgct 168

cgccagggtt ttcccagtca cgaccgtcgc cgtcctgtgt gaaattgtta tccgct 224

cgccagggtt ttcccagtca cgacgccgga cgtcctgtgt gaaattgtta tccgct 280

cgccagggtt ttcccagtca cgacgacgcg tatcctgtgt gaaattgtta tccgct 336

cgccagggtt ttcccagtca cgaccgtagc gatcctgtgt gaaattgtta tccgct 392

cgccagggtt ttcccagtca cgacgcgagc cttcctgtgt gaaattgtta tccgct 448

cgccagggtt ttcccagtca cgacgcctcg tctcctgtgt gaaattgtta tccgct 504

cgccagggtt ttcccagtca cgaccgtcgt ggtcctgtgt gaaattgtta tccgct 560

cgccagggtt ttcccagtca cgacgtggga tatcctgtgt gaaattgtta tccgct 616

cgccagggtt ttcccagtca cgacgatacg tttcctgtgt gaaattgtta tccgct 672

cgccagggtt ttcccagtca cgaccgttga tgtcctgtgt gaaattgtta tccgct 728

cgccagggtt ttcccagtca cgacgatggg catcctgtgt gaaattgtta tccgct 784

cgccagggtt ttcccagtca cgacggcacg tttcctgtgt gaaattgtta tccgct 840

cgccagggtt ttcccagtca cgaccgttgg cttcctgtgt gaaattgtta tccgct 896

cgccagggtt ttcccagtca cgacggctgg actcctgtgt gaaattgtta tccgct 952

cgccagggtt ttcccagtca cgacggaccg gttcctgtgt gaaattgtta tccgct 1008

cgccagggtt ttcccagtca cgaccggtgt gctcctgtgt gaaattgtta tccgct 1064

cgccagggtt ttcccagtca cgacgtgcgt tttcctgtgt gaaattgtta tccgct 1120

cgccagggtt ttcccagtca cgacgtttcg tttcctgtgt gaaattgtta tccgct 1176

cgccagggtt ttcccagtca cgaccgttgg attcctgtgt gaaattgtta tccgct 1232

cgccagggtt ttcccagtca cgacggatgt attcctgtgt gaaattgtta tccgct 1288

cgccagggtt ttcccagtca cgacgtatcg tgtcctgtgt gaaattgtta tccgct 1344

cgccagggtt ttcccagtca cgaccgtgga gatcctgtgt gaaattgtta tccgct 1400

cgccagggtt ttcccagtca cgacgagagt aatcctgtgt gaaattgtta tccgct 1456

cgccagggtt ttcccagtca cgacgtaacg tatcctgtgt gaaattgtta tccgct 1512

cgccagggtt ttcccagtca cgaccgtagc gatcctgtgt gaaattgtta tccgct 1568

cgccagggtt ttcccagtca cgacgcgaga aatcctgtgt gaaattgtta tccgct 1624

cgccagggtt ttcccagtca cgacgaaacg tatcctgtgt gaaattgtta tccgct 1680

cgccagggtt ttcccagtca cgaccgtagc gctcctgtgt gaaattgtta tccgct 1736

cgccagggtt ttcccagtca cgacgcgcgc tgtcctgtgt gaaattgtta tccgct 1792

cgccagggtt ttcccagtca cgacgctgcg gatcctgtgt gaaattgtta tccgct 1848

cgccagggtt ttcccagtca cgaccggaga gttcctgtgt gaaattgtta tccgct 1904

cgccagggtt ttcccagtca cgacgagtgc ggtcctgtgt gaaattgtta tccgct 1960

cgccagggtt ttcccagtca cgacgcggcg tgtcctgtgt gaaattgtta tccgct 2016

cgccagggtt ttcccagtca cgaccgtggt catcctgtgt gaaattgtta tccgct 2072

cgccagggtt ttcccagtca cgacgtcagg cctcctgtgt gaaattgtta tccgct 2128

cgccagggtt ttcccagtca cgacggcccg tttcctgtgt gaaattgtta tccgct 2184

cgccagggtt ttcccagtca cgaccgttgt actcctgtgt gaaattgtta tccgct 2240

cgccagggtt ttcccagtca cgacgtacga attcctgtgt gaaattgtta tccgct 2296

Claims

1. A method for data storage using DNA, the method comprising:

extracting a binary sequence corresponding to data to be stored;

2. The method of claim 1, wherein the linker sequence comprises one or both of a left linker sequence and a right linker sequence, and wherein the left linker sequence is different from the right linker sequence when the linker sequence comprises a left linker sequence and a right linker sequence.

3. The method of claim 1, wherein converting the binary sequence to a base sequence comprises:

4. The method of claim 1, wherein the overlapping region with the preset second length between two associated core sequences comprises:

5. The method of claim 1, wherein the step of dividing the base sequence into a plurality of core sequences of a predetermined first length comprises:

6. The method of claim 1, wherein storing the amplified DNA product and corresponding key information comprises:

7. The method according to any of claims 1-6, wherein the key information comprises a core sequence at a start position and/or a core sequence at an end position.

8. The method according to claim 7, wherein the key information further includes one or both of the adaptor sequence and an index sequence corresponding to a sub-base sequence obtained by splitting the base sequence.

9. An apparatus for data storage using DNA, the apparatus comprising:

a DNA storage unit for storing the amplified DNA product and corresponding key information, wherein the key information comprises more than one core sequence, or the key information comprises partial bases in more than one core sequence and position information of the included core sequence in the base sequence.

10. A method for decoding stored data of a DNA medium, the method comprising:

obtaining a DNA sequence to be decoded and key information thereof;

a data file is generated from the converted binary data.

11. The method of claim 10, wherein extracting a core sequence included in the DNA sequence based on a predetermined linker sequence comprises:

12. The method according to claim 10, wherein combining the core sequences to generate a base sequence based on the key information in combination with an overlapping region between the core sequences comprises:

13. The method according to any one of claims 10 to 12, wherein extracting a core sequence included in the DNA sequence based on a predetermined linker sequence comprises:

14. An apparatus for decoding stored data of a DNA medium, the apparatus comprising:

15. A method of generating a library of DNA molecules, the method comprising:

16. A storage device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method for storing data with DNA according to any one of claims 1 to 8 when executing the computer program, or implements the method for decoding the stored data of the DNA medium according to any one of claims 10 to 13 when executing the computer program, or implements the method for generating the DNA molecule library according to claim 15.