CN113782102B - Method, device and equipment for storing DNA data and readable storage medium - Google Patents

Info

Publication number
CN113782102B
CN113782102B CN202110929436.9A CN202110929436A CN113782102B CN 113782102 B CN113782102 B CN 113782102B CN 202110929436 A CN202110929436 A CN 202110929436A CN 113782102 B CN113782102 B CN 113782102B
Authority
CN
China
Prior art keywords
sequence
dna
units
base
fragments
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110929436.9A
Other languages
Chinese (zh)
Other versions
CN113782102A (en)
Inventor
戴俊彪
黄小罗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Carbon Yuan Shenzhen Biotechnology Co ltd
Original Assignee
Zhongke Carbon Yuan Shenzhen Biotechnology Co ltd

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Abstract

The application relates to the technical field of data storage and provides a method, a device and equipment for storing DNA data, and a readable storage medium. The method comprises the following steps: acquiring a base sequence corresponding to a binary sequence of target data; dividing the base sequence to obtain S sequence units, wherein each sequence unit comprises a plurality of sequence fragments, the S sequence units contain K sequence fragments in total, and the length of each sequence fragment is n; and labeling the K sequence fragments and the S sequence units with preset index information to obtain K labeled sequence fragments and S labeled sequence units, wherein the index information comprises a first retrieval sequence representing the order of the S sequence units in the base sequence and a second retrieval sequence representing the order, within a sequence unit, of the sequence fragments belonging to that unit. The method provided by the application enables large-scale data to be stored in DNA.

Description

Method, device and equipment for storing DNA data and readable storage medium
Technical Field
The application belongs to the technical field of data storage, and particularly relates to a method, a device and equipment for storing DNA data and a readable storage medium.
Background
The era of artificial intelligence and big data places ever higher demands on data storage, and new storage media with high storage density, long storage time and low maintenance cost are urgently needed. Deoxyribonucleic acid (DNA), which has been developed as an information storage medium in recent years, is considered one of the most promising media for future information storage.
A DNA molecule has four bases: adenine (A), cytosine (C), guanine (G) and thymine (T). DNA-based data storage represents a data series composed of binary "0" and "1" as a sequence of these four bases. Compared with traditional storage media, DNA data storage offers high storage density, long storage time, low maintenance cost and good biocompatibility. For example, 1 g of DNA can store millions of high-definition movies, and the data storage density of DNA is more than 7 orders of magnitude higher than that of silicon-based storage media such as conventional hard disks. DNA can also store data stably for more than a thousand years, more than one hundred times the storage time of existing storage media. In addition, the maintenance cost of DNA is low: over hundreds of years it is only one ten-thousandth of that of current media.
The DNA data storage process generally comprises the following steps: (1) extracting binary information from computer information such as pictures, videos and texts; (2) converting the binary sequence, according to a preset correspondence between binary values and the bases A, T, C and G, into an A/T/C/G sequence (namely a DNA sequence) that stores the data; (3) converting the encoded A/T/C/G sequence into DNA polymer molecules by DNA synthesis or other techniques and storing them in a suitable environment. Thereafter, when the stored data needs to be retrieved, the following steps may be performed: (4) reading the stored DNA polymer molecules back into A/T/C/G sequences by DNA sequencing; (5) converting the A/T/C/G sequences into binary information by a suitable decoding method; (6) converting the binary information into computer information such as pictures, videos and texts.
Of these steps, data encoding is a core problem in current DNA data storage methods.
Disclosure of Invention
One purpose of the embodiments of the application is to provide a method, a device, equipment and a readable storage medium for storing DNA data, so as to solve the data encoding problem in DNA data storage technology.
The technical solutions adopted by the embodiments of the application are as follows:
In a first aspect, a method for storing DNA data is provided, which includes:
acquiring a base sequence corresponding to a binary sequence of target data;
dividing the base sequence to obtain S sequence units, wherein each sequence unit comprises a plurality of divided sequence fragments, the S sequence units contain K sequence fragments in total, the length of each sequence fragment is n, and n, S and K are integers greater than or equal to 2;
labeling the K sequence segments and the S sequence units by using preset index information to obtain K labeled sequence segments and S labeled sequence units, wherein the index information comprises a first retrieval sequence used for representing the arrangement sequence of the S sequence units in the base sequence and a second retrieval sequence used for representing the arrangement sequence of a plurality of sequence segments belonging to the same sequence unit in the sequence unit, and the K labeled sequence segments are used for synthesizing K first DNA molecules storing the target data.
In one embodiment, labeling the plurality of sequence fragments belonging to the same sequence unit with the second search sequence comprises:
splicing the second search sequence onto one side of the sequence fragment, or
splicing retrieval base groups onto both sides of the sequence fragment, wherein the retrieval base groups on the two sides together form the second search sequence.
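The two labeling options can be sketched as follows. This is a minimal illustration, not the patent's own encoding: it assumes the second search sequence is simply the fragment's position written as a q-base number, with the hypothetical digit assignment A=0, T=1, C=2, G=3, and it assumes the two-sided variant splits the tag in half. The helper names are invented for the example.

```python
BASES = "ATCG"  # assumed digit order for illustration: A=0, T=1, C=2, G=3

def index_to_bases(index: int, q: int) -> str:
    """Write a non-negative integer as a q-base string (a base-4 number)."""
    if not 0 <= index < 4 ** q:
        raise ValueError(f"index {index} does not fit in {q} bases")
    digits = []
    for _ in range(q):
        index, digit = divmod(index, 4)
        digits.append(BASES[digit])
    return "".join(reversed(digits))

def tag_one_side(fragment: str, index: int, q: int) -> str:
    """Splice the whole second search sequence onto the left of the fragment."""
    return index_to_bases(index, q) + fragment

def tag_both_sides(fragment: str, index: int, q: int) -> str:
    """Split the tag into two retrieval base groups, one on each side."""
    tag = index_to_bases(index, q)
    half = q // 2
    return tag[:half] + fragment + tag[half:]

fragments = ["ATCT", "TCTG", "TTCC"]
one_side = [tag_one_side(f, k, q=4) for k, f in enumerate(fragments)]
both_sides = [tag_both_sides(f, k, q=4) for k, f in enumerate(fragments)]
print(one_side)    # ['AAAAATCT', 'AAATTCTG', 'AAACTTCC']
print(both_sides)  # ['AAATCTAA', 'AATCTGAT', 'AATTCCAC']
```

With q = 4 and 4-base fragments, each labeled fragment is 8 bases long in both variants; only the placement of the index bases differs.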
In one embodiment, the first search sequence comprises i DNA sequence segments, i is an integer greater than or equal to 1, and each of the DNA sequence segments comprises a first base sequence used as an index marker and a second base sequence used to indicate the number of the sequence units.
In one embodiment, the DNA sequence segments corresponding to the first search sequence and the second search sequence are obtained using DNA synthesis techniques. Exemplary DNA synthesis techniques include, but are not limited to, enzymatic synthesis, phosphoramidite synthesis, and the like.
In one embodiment, the DNA sequence segments corresponding to the first search sequence and the second search sequence can be obtained by amplification, for example by PCR, from a pre-synthesized universal DNA molecule library.
In one embodiment, the storage method further comprises:
synthesizing the K labeled sequence fragments into K first DNA molecules storing the target data, and then storing the K first DNA molecules in S first physical spaces, wherein the first DNA molecules corresponding to labeled sequence fragments belonging to the same sequence unit are stored in the same first physical space, and the first DNA molecules corresponding to labeled sequence fragments belonging to different sequence units are stored in different first physical spaces.
In one embodiment, the S first physical spaces are integrated in one DNA hard disk.
In one embodiment, the storage method further comprises:
storing a second DNA molecule in a second physical space corresponding to the first physical space, wherein the second DNA molecule stores the index information.
In one embodiment, a method of decoding the K first DNA molecules comprises:
sequencing the first DNA molecules stored in each of the first physical spaces to obtain a plurality of labeled sequence fragments; splicing, according to the second retrieval sequence, the sequence fragments corresponding to the labeled sequence fragments that belong to the same labeled sequence unit to obtain the sequence unit;
splicing the obtained S sequence units according to the first retrieval sequence to obtain the base sequence;
converting the base sequence into the target data.
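The decoding steps above can be sketched in a few lines. The sketch makes several assumptions that are not in the text: each first physical space is modeled as a dictionary entry keyed by an already-parsed unit number (standing in for the first retrieval sequence), the second retrieval sequence is a q-base prefix read as a base-4 number (A=0, T=1, C=2, G=3), and the base-to-binary rule is A=00, T=01, C=10, G=11. All function names are hypothetical.

```python
import random

BASES = "ATCG"  # assumed index digits: A=0, T=1, C=2, G=3
BASE_TO_BITS = {"A": "00", "T": "01", "C": "10", "G": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

def index_to_bases(index, q):
    """Write an integer as a q-base tag (most significant digit first)."""
    return "".join(BASES[(index // 4 ** k) % 4] for k in reversed(range(q)))

def bases_to_index(tag):
    """Read a tag back as an integer."""
    value = 0
    for base in tag:
        value = value * 4 + BASES.index(base)
    return value

def rebuild_unit(reads, q):
    """Order the reads of one physical space by their tag, then strip the tags."""
    ordered = sorted(reads, key=lambda read: bases_to_index(read[:q]))
    return "".join(read[q:] for read in ordered)

def decode(pools, q):
    """Splice units in unit order, then map bases back to bytes."""
    sequence = "".join(rebuild_unit(pools[unit], q) for unit in sorted(pools))
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return int(bits, 2).to_bytes(len(bits) // 8, "big")

# Build a toy set of "physical spaces": 4-base fragments, 2 fragments per unit.
data = b"DNA!"
bits = "".join(f"{byte:08b}" for byte in data)
sequence = "".join(BITS_TO_BASE[bits[k:k + 2]] for k in range(0, len(bits), 2))
fragments = [sequence[k:k + 4] for k in range(0, len(sequence), 4)]
q = 2
pools = {}
for k, fragment in enumerate(fragments):
    unit, position = divmod(k, 2)
    pools.setdefault(unit, []).append(index_to_bases(position, q) + fragment)
for reads in pools.values():
    random.shuffle(reads)  # sequencing returns reads in no particular order

recovered = decode(pools, q)
print(recovered)  # b'DNA!'
```

Because the order is carried by the tags, shuffling the reads inside each pool does not affect the result.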
In a second aspect, there is provided a DNA data storage device comprising a data processing module,
the data processing module is used for acquiring a base sequence corresponding to the binary sequence of the target data; dividing the base sequence to obtain K sequence segments with the length of n, wherein the K sequence segments are divided into S sequence units, and S and K are integers which are more than or equal to 2; labeling the K sequence fragments and the S sequence units by using preset index information to obtain K labeled sequence fragments and S labeled sequence units, wherein the index information comprises a first retrieval sequence for representing the arrangement sequence of the S sequence units in the base sequence and a second retrieval sequence for representing the arrangement sequence of a plurality of sequence fragments belonging to the same sequence unit in the sequence units, and the K labeled sequence fragments are used for synthesizing K first DNA molecules storing the target data.
In one embodiment, the apparatus further comprises a DNA synthesis module for synthesizing the K labeled sequence fragments into K first DNA molecules storing the target data.
In one embodiment, the apparatus further comprises a DNA molecule storage module for storing the K first DNA molecules in S first physical spaces, wherein the first DNA molecules corresponding to labeled sequence fragments belonging to the same sequence unit are stored in the same first physical space, and the first DNA molecules corresponding to labeled sequence fragments belonging to different sequence units are stored in different first physical spaces.
In one embodiment, the DNA molecule storage module is further configured to store a second DNA molecule in a second physical space.
In one embodiment, the apparatus further comprises a DNA sequencing module configured to sequence the first DNA molecules stored in each of the first physical spaces to obtain a plurality of labeled sequence fragments; the data processing module is further configured to splice, according to the second search sequence, the sequence fragments corresponding to the labeled sequence fragments belonging to the same labeled sequence unit to obtain the sequence unit, to splice the obtained S sequence units according to the first retrieval sequence to obtain the base sequence, and to convert the base sequence into the target data.
In a third aspect, there is provided a DNA data storage device comprising a terminal device, the terminal device comprising a memory, a processor and a computer program stored in the memory and operable on the processor, wherein the processor implements the method for storing DNA data according to the first aspect when executing the computer program.
In a fourth aspect, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of storing DNA data as in the first aspect.
In a fifth aspect, a DNA hard disk is provided, which comprises a plurality of physical spaces, wherein the physical spaces are made of physical materials, and each physical space is used for storing DNA molecules.
In one embodiment, the physical material is at least one of SiO2, a metal oxide and a polymer material. These physical materials form a physical space that encapsulates the DNA molecules and, at the same time, isolates different DNA molecules from one another.
The method, the device, the equipment and the readable storage medium for storing DNA data have the following advantages: the base sequence corresponding to the binary sequence of the target data is divided into S sequence units, each comprising a plurality of sequence fragments; the S sequence units contain K sequence fragments in total, each of length n, where S and K are integers greater than or equal to 2; the positions of the sequence fragments and of the sequence units are labeled with preset index information; and the labeled sequence fragments are synthesized into DNA molecules and then stored separately. In this way, the storage capacity can be increased and large-scale storage of data in DNA can be realized.
In one implementation, the length of the first search sequence differs from the length of the labeled sequence fragment (the sequence fragment carrying the second search sequence), so that, within a sequence unit, the first search sequence and the labeled sequence fragments are distinguished by their difference in length. Specifically, let m denote the number of bases in a labeled sequence fragment, q the number of bases in the second search sequence, i the number of DNA sequence segments in the first search sequence, and p the number of bases in the second base sequence of each such segment. The method provided by the present application can then store DNA data containing D bases, that is, 4^q indexable fragments per sequence unit, each carrying m - q data bases, across 4^(i×p) indexable sequence units:
$$D = 4^{q} \times (m - q) \times 4^{i \times p}$$
In a specific example with m = 8, q = 4, i = 10 and p = 4, $D = 256 \times 4 \times 4^{40} \approx 1.23 \times 10^{27}$. When 1 base stores 2 bits of information, the amount of information that can be stored is $L = 2.46 \times 10^{27}$ bits $= 3.075 \times 10^{26}$ bytes $= 3.075 \times 10^{5}$ ZB, far larger than current data storage volumes.
In another embodiment, the length of the first search sequence is the same as the length of the labeled sequence fragment (the sequence fragment carrying the second search sequence); the first base sequence used as an index marker in the first search sequence may be a part of the second search sequence, and the number of bases in the first base sequence may be the same as the number of bases in the second search sequence, so that i of the 4^q possible second search sequences are used up as index markers. In this case, with m denoting the number of bases in a labeled sequence fragment, q the number of bases in the second search sequence, i the number of DNA sequence segments in the first search sequence, and p the number of bases in the second base sequence of each such segment, the method provided by the present application can store DNA data containing D bases, where D is calculated as:
$$D = (4^{q} - i) \times (m - q) \times 4^{i \times p}$$
In a specific example with m = 8, q = 4, i = 10 and p = 4, $D = (256 - 10) \times 4 \times 4^{40} \approx 1.18 \times 10^{27}$. When 1 base stores 2 bits of information, the amount of information that can be stored is $L = 2.36 \times 10^{27}$ bits $= 2.95 \times 10^{26}$ bytes $= 2.95 \times 10^{5}$ ZB, far larger than current data storage volumes.
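The two capacity formulas can be checked numerically with a short script. The function names are invented for this sketch; the parameters m, q, i and p are the symbols defined above, and 1 ZB is taken as 10^21 bytes.

```python
def capacity_distinct_lengths(m: int, q: int, i: int, p: int) -> int:
    """D = 4^q * (m - q) * 4^(i*p): tag and index marker differ in length."""
    return 4 ** q * (m - q) * 4 ** (i * p)

def capacity_same_length(m: int, q: int, i: int, p: int) -> int:
    """D = (4^q - i) * (m - q) * 4^(i*p): i tags are reserved as index markers."""
    return (4 ** q - i) * (m - q) * 4 ** (i * p)

params = dict(m=8, q=4, i=10, p=4)
d_distinct = capacity_distinct_lengths(**params)
d_same = capacity_same_length(**params)

for label, d in (("distinct lengths", d_distinct), ("same length", d_same)):
    bits = 2 * d                 # 2 bits per base
    zettabytes = bits / 8 / 1e21
    print(f"{label}: D = {d:.3e} bases, L = {bits:.3e} bits = {zettabytes:.3e} ZB")
```

Both results are on the order of 10^27 bases and 10^5 ZB, in line with the figures quoted in the two examples; the last digits differ slightly depending on whether intermediate values are rounded or truncated.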
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments or the exemplary technology are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram showing the composition of a DNA storage apparatus according to an embodiment of the present application;
FIG. 2 is a process flow diagram of the storage writing of DNA data provided by the embodiments of the present application;
FIG. 3 is a schematic diagram, provided in an embodiment of the present application, of the sequence units, each comprising a plurality of sequence fragments, formed by dividing the base sequence of S101;
FIG. 4 is a schematic diagram, provided in an embodiment of the present application, of the information sequence units, each comprising a plurality of information sequence fragments, obtained in S103 after each sequence unit is labeled with the first search sequence and each sequence fragment in the sequence unit is labeled with the second search sequence;
FIG. 5 is a schematic diagram of DNA storage after K first DNA molecules are stored in S different first physical spaces provided by an embodiment of the present application;
FIG. 6 is a flowchart of a process for reading DNA data provided in the examples of the present application;
FIG. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the present application.
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
It should also be appreciated that reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless otherwise specifically stated.
At present, data encoding is a core technical problem in DNA data storage technology; for large-scale data in particular, no effective data encoding method is available. In view of this, the present application provides a method for storing DNA data. During data encoding, the binary sequence of the data to be stored is converted into a corresponding base sequence, and the base sequence is then divided at two levels into S sequence units, each comprising a plurality of sequence fragments, with the S sequence units containing K sequence fragments in total. The K sequence fragments and the S sequence units are then labeled with preset index information. Finally, the K labeled sequence fragments are synthesized into DNA molecules and stored. The data encoding scheme provided by the application makes DNA data storage effective.
Furthermore, the data coding mode provided by the application has obvious advantages in data storage capacity.
The storage method is implemented by the DNA data storage device shown in FIG. 1. The DNA data storage device comprises a data processing module, a DNA molecule synthesizing module, a DNA molecule storage module and a DNA molecule sequencing module.
The data processing module is used for realizing data coding and decoding. For example, data to be stored is converted into binary information, and the binary information is converted into a base sequence according to a preset correspondence between the binary data and the base. And then coding the base sequence according to preset index information to obtain the base sequence finally used for generating the DNA molecule.
The DNA molecule synthesizing module is used for synthesizing DNA molecules according to the encoded base sequences. The DNA molecule storage module stores DNA molecules. The DNA molecule sequencing module is used for reading DNA molecules back into base sequences. Correspondingly, the data processing module can also decode, according to the index information, the base sequences obtained by sequencing the DNA molecules, and recover the data stored in the DNA molecules through data conversion.
In the embodiment of the present application, the DNA data storage apparatus may be a complete DNA data storage device, i.e. a device integrated by a plurality of functional modules. The device can realize the complete flow of data encoding, DNA molecule synthesis, DNA molecule storage, DNA molecule sequencing and data decoding.
In another embodiment, the DNA data storage device may be a system composed of individual devices.
For example, the data processing module may be a computer, a server, a robot or another computing device, and is used for data encoding and decoding. The DNA molecule synthesizing module may be a DNA synthesizer, which synthesizes DNA molecules according to the encoded base sequence. The DNA molecule storage module may be a DNA hard disk, which stores DNA molecules. The DNA molecule sequencing module may be a DNA sequencer, which sequences DNA molecules.
Fig. 2 is a schematic view of an implementation flow of a DNA data storage method provided in an embodiment of the present application, which specifically includes:
S101, acquiring a base sequence corresponding to a binary sequence of data to be stored.
In this step, acquiring the base sequence corresponding to the binary sequence of the data to be stored means converting the binary sequence of the data to be stored into a base sequence of A, T, C and G that stores the data.
In some embodiments, obtaining the base sequence corresponding to the binary sequence of the data to be stored comprises:
S111, extracting a binary sequence corresponding to the data to be stored.
The data to be stored is any data information that can exist in the terminal device, and may include information such as text, pictures, sound, video, software, programs, and the like, but is not limited thereto.
When extracting the binary sequence corresponding to the data to be stored, the encoding of the data to be stored can be obtained and converted into binary, thereby obtaining the corresponding binary sequence. For example, the characters in a text may be converted into the corresponding ASCII (American Standard Code for Information Interchange) or Unicode codes, and the codes may then be converted into a binary sequence.
Illustratively, for the text "春，已不再是想象之外的那只蝴蝶。" ("Spring is no longer that butterfly beyond imagination."), the corresponding binary sequence (its UTF-8 bytes) is "11100110 10011000 10100101 11101111 10111100 10001100 11100101 10110111 10110010 11100100 10111000 10001101 11100101 10000110 10001101 11100110 10011000 10101111 11100110 10000011 10110011 11101000 10110001 10100001 11100100 10111001 10001011 11100101 10100100 10010110 11100111 10011010 10000100 11101001 10000010 10100011 11100101 10001111 10101010 11101000 10011101 10110100 11101000 10011101 10110110 11100011 10000000 10000010".
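As a minimal sketch of S111, the snippet below turns a text into a space-separated binary sequence. It assumes UTF-8 encoding, which is what the three-byte groups in the binary sequence listed above imply; the Chinese sentence used as input is the one those bytes decode to. The function name is hypothetical.

```python
def text_to_bits(text: str, encoding: str = "utf-8") -> str:
    """Encode text to bytes and write each byte as 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))

sample = "春，已不再是想象之外的那只蝴蝶。"
bits = text_to_bits(sample)
byte_groups = bits.split()

print(len(byte_groups))           # 48 bytes: 16 characters, 3 bytes each
print(" ".join(byte_groups[:3]))  # 11100110 10011000 10100101
```

Other character encodings would yield a different binary sequence for the same text, so the encoding has to be fixed as part of the storage scheme.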
S121, converting the binary sequence into a base sequence according to a preset mapping rule.
In the embodiment of the present application, the preset mapping rule refers to a preset mapping rule between binary values and bases. According to this rule, the binary sequence consisting of 0s and 1s is converted into a base sequence of A, T, C and G that stores the data. Illustratively, the preset correspondence between the binary sequence and the bases A, T, C and G is: one base A represents 00, one base T represents 01, one base C represents 10, and one base G represents 11. When the binary sequence is 0011011010110010101101101100001100001001, the binary data is converted according to this mapping rule into a DNA sequence with the base sequence AGTCCGACCGTCGAAGAACT. Of course, the preset correspondence is not limited to the above example; it may be, for example: one base T represents 00, one base A represents 01, one base G represents 10, and one base C represents 11. Any preset mapping rule may be used as long as the binary sequence can be converted into a base sequence according to it.
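The mapping step can be sketched as a table lookup over 2-bit groups. The rule below is the first example rule from the text (A=00, T=01, C=10, G=11); the function names and the requirement of an even number of bits are assumptions of this sketch.

```python
RULE = {"00": "A", "01": "T", "10": "C", "11": "G"}

def bits_to_bases(bits: str, rule: dict = RULE) -> str:
    """Map each pair of bits to one base according to the rule."""
    if len(bits) % 2:
        raise ValueError("the binary sequence must contain an even number of bits")
    return "".join(rule[bits[k:k + 2]] for k in range(0, len(bits), 2))

def bases_to_bits(bases: str, rule: dict = RULE) -> str:
    """Invert the rule to recover the binary sequence from a base sequence."""
    inverse = {base: pair for pair, base in rule.items()}
    return "".join(inverse[base] for base in bases)

bases = "AGTCCGACCGTCGAAGAACT"
bits = bases_to_bits(bases)
print(bits)                 # 0011011010110010101101101100001100001001
print(bits_to_bases(bits))  # AGTCCGACCGTCGAAGAACT
```

Swapping in a different rule dictionary, such as T=00, A=01, G=10, C=11, changes the resulting base sequence but not the procedure.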
Illustratively, the mapping rule between binary and base is: one base A represents 11, one base T represents 10, one base C represents 01, and one base G represents 00. According to this mapping rule, the binary sequence of the text "春，已不再是想象之外的那只蝴蝶。" ("Spring is no longer that butterfly beyond imagination."), namely "11100110 10011000 10100101 11101111 10111100 10001100 11100101 10110111 10110010 11100100 10111000 10001101 11100101 10000110 10001101 11100110 10011000 10101111 11100110 10000011 10110011 11101000 10110001 10100001 11100100 10111001 10001011 11100101 10100100 10010110 11100111 10011010 10000100 11101001 10000010 10100011 11100101 10001111 10101010 11101000 10011101 10110100 11101000 10011101 10110110 11100011 10000000 10000010", is converted into the base sequence "ATCT TCTG TTCC ATAA TAAG TGAG ATCC TACA TAGT ATCG TATG TGAC ATCC TGCT TGAC ATCT TCTG TTAA ATCT TGGA TAGA ATTG TAGC TTGC ATCG TATC TGTA ATCC TTCG TCCT ATCA TCTT TGCG ATTC TGGT TTGA ATCC TGAA TTTT ATTG TCAC TACG ATTG TCAC TACT ATGA TGGG TGGT".
S102, segmenting the base sequence to obtain S sequence units, wherein each sequence unit comprises a plurality of segmented sequence segments, the S sequence units contain K sequence segments in total, the length of each sequence segment is n, and n, S and K are integers greater than or equal to 2.
In the embodiment of the application, the base sequence is divided into S sequence units, each comprising a plurality of sequence fragments, so that the sequence length is reduced and the sequence units can be stored separately in subsequent steps. It should be understood that length in the embodiments of the present application refers to the number of bases.
Each sequence unit obtained by dividing the base sequence includes a plurality of sequence fragments of length n, so the length of a sequence unit is an integral multiple of n. In some embodiments, the sequence units are of the same length, i.e., the base sequence is divided into S sequence units of the same length, each containing the same number of sequence fragments of length n. In some embodiments, the sequence units differ in length. For example, the base sequence is divided successively according to a preset sequence unit length to obtain S-1 sequence units, and the base sequence remaining after the last division is shorter than the preset length; in this case, the remaining base sequence is used as one sequence unit that is shorter than the other sequence units. In some embodiments, the base sequence can be divided according to other preset division rules to obtain S sequence units of different lengths.
In one possible implementation, dividing the base sequence into S sequence units, each comprising a plurality of sequence fragments, comprises:
dividing the base sequence into a plurality of sequence units, wherein the length of each sequence unit is integral multiple of n;
each sequence unit is divided into a plurality of sequence segments with the length of n.
In some embodiments, when the base sequence is divided into a plurality of sequence units, the base sequence is divided once every preset sequence unit length, starting from one end of the base sequence, to obtain S sequence units, wherein the preset length is an integral multiple of n. When the base sequence remaining after the last division is shorter than the preset length, the remaining base sequence is taken as one sequence unit. In other embodiments, the division may start from other positions in the base sequence. For example, the base sequence is divided successively from its midpoint toward both ends to obtain sequence units whose lengths are integral multiples of n.
In some embodiments, partitioning each sequence unit into a plurality of sequence segments of length n comprises:
the sequence unit is divided once every length n from one end of the sequence unit to obtain a plurality of sequence fragments. In other embodiments, the sequence units may also be partitioned starting from other positions in the sequence units. For example, sequence fragments of length n are obtained by sequentially dividing the sequence unit from its middle toward both ends.
Illustratively, as shown in FIG. 3, the base sequence in S101 is sequentially divided from one end according to a preset sequence unit length of 60 bases to obtain three sequence units of 60 bases and one sequence unit of 12 bases; each sequence unit is then sequentially divided from one end according to a sequence fragment length of 4 bases, so that each of the three 60-base sequence units is divided into 15 sequence fragments and the 12-base sequence unit is divided into 3 sequence fragments.
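The unit-then-fragment division described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `split_into_units` and its parameters are hypothetical, division always starts from the left end, and the 192-base input is a placeholder rather than the base sequence of S101.

```python
def split_into_units(base_sequence, unit_length, n):
    """Divide a base sequence into sequence units, then each unit into fragments.

    unit_length must be an integral multiple of the fragment length n.
    A shorter remainder at the end is kept as its own sequence unit.
    """
    if n < 2 or unit_length % n != 0:
        raise ValueError("unit_length must be an integral multiple of n (n >= 2)")
    if len(base_sequence) % n != 0:
        # The source text does not say how a trailing partial fragment is
        # handled, so this sketch simply rejects such input.
        raise ValueError("sequence length must be a multiple of n")

    # First cut: one sequence unit every unit_length bases.
    units = [base_sequence[i:i + unit_length]
             for i in range(0, len(base_sequence), unit_length)]
    # Second cut: one sequence fragment every n bases inside each unit.
    return [[unit[j:j + n] for j in range(0, len(unit), n)] for unit in units]


# Placeholder 192-base sequence: three 60-base units plus one 12-base unit.
example = "ACGT" * 48
fragment_counts = [len(unit) for unit in split_into_units(example, 60, 4)]
```

With a 192-base input, a unit length of 60 and n = 4, the fragment counts per unit are 15, 15, 15 and 3, which matches the FIG. 3 example.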
In another possible implementation, the base sequence is segmented to obtain s sequence units, each sequence unit comprising a plurality of sequence segments of length n, comprising:
dividing the base sequence into a plurality of sequence segments with the length of n;
and combining the sequence fragments according to a preset combination rule to obtain s sequence units.
In some embodiments, the base sequence is divided into a plurality of sequence fragments of length n, and the base sequence is divided every length n from one end of the base sequence to obtain a plurality of sequence fragments.
In some embodiments, when the sequence fragments are combined according to a preset combination rule, the preset combination rule refers to a rule for assigning a plurality of sequence fragments to one sequence unit, including the positions in the base sequence of the sequence fragments assigned to one sequence unit, the number of those sequence fragments, and the arrangement order of the sequence fragments when they are combined into the sequence unit. The lengths of the combined sequence units can be the same or different, but each is an integral multiple of n. Illustratively, six sequence fragments of 20 bases each are combined in the order in which they appear in the base sequence to form a sequence unit.
Illustratively, the sequence segments are 4 in length and the sequence unit comprises 15 sequence segments. At this time, the base sequence "ATCT TCTG TTCC ATAA TAAG TGAG ATCC TACA TAGT ATCG TATG TGAC ATCC TGCT TGAC ATCT TCTG TTAA ATCT TGGA TAGA ATTG TTGC ATCG TATC TGTA ATCC TTCG TCCT ATCA TCTT TGCG ATTC TGGT ATGA ATTA TTTT ATTG TCAC TACT ATTG ATGA TGGA TGGG GT" of step S101 is divided into 48 sequence segments, which are: ATCT, TCTG, TTCC, ATAA, TAAG, TGAG, ATCC, TACA, TAGT, ATCG, TATG, TGAC, ATCC, TGCT, TGAC, ATCT, TCTG, TTAA, ATCT, TGGA, TAGA, ATTG, TAGC, TTGC, ATCG, TATC, TGTA, ATCC, TTCG, TCCT, ATCA, TCTT, TGCG, ATTC, TGGT, TTGA, ATCC, TGAA, TTTT, ATTG, TCAC, TACG, ATTG, TCAC, TACT, ATGA, TGGG, TGGT; and sequentially combining the sequence fragments according to a preset combination rule and the length of the sequence unit, and combining 15 sequence fragments into one sequence unit to obtain four sequence units which respectively comprise the following sequence fragments. The first sequence unit comprises the following 15 sequence fragments: ATCT, TCTG, TTCC, ATAA, TAAG, TGAG, ATCC, TACA, TAGT, ATCG, TATG, TGAC, ATCC, TGCT, TGAC; the second sequence unit comprises the following 15 sequence fragments: ATCT, TCTG, TTAA, ATCT, TGGA, TAGA, ATTG, TAGC, TTGC, ATCG, TATC, TGTA, ATCC, TTCG, TCCT; the third sequence unit comprises the following 15 sequence segments: ATCA, TCTT, TGCG, ATTC, TGGT, TTGA, ATCC, TGAA, TTTT, ATTG, TCAC, TACG, ATTG, TCAC, TACT; the fourth sequence unit comprises the following 3 sequence fragments: ATGA, TGGG, TGGT.
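The fragment-first implementation can be sketched as follows. The representation of the preset combination rule as a list of fragment positions per sequence unit is an assumption made for illustration, and the names `split_into_fragments`, `sequential_rule` and `combine` are hypothetical; the source only requires that the rule fix which fragments go into a unit, how many, and in what order.

```python
def split_into_fragments(base_sequence, n):
    """Cut the base sequence every n bases, starting from the left end."""
    return [base_sequence[i:i + n] for i in range(0, len(base_sequence), n)]


def sequential_rule(fragment_count, fragments_per_unit):
    """Build a combination rule that groups fragments in their original order.

    The rule is a list of units; each unit lists the positions (0-based)
    of the fragments it contains, in the order they are to be arranged.
    """
    return [list(range(start, min(start + fragments_per_unit, fragment_count)))
            for start in range(0, fragment_count, fragments_per_unit)]


def combine(fragments, rule):
    """Assemble sequence units from fragments according to a combination rule."""
    return [[fragments[position] for position in unit] for unit in rule]


# First 20 bases of the example sequence, n = 4, two fragments per unit.
fragments = split_into_fragments("ATCTTCTGTTCCATAATAAG", 4)
units = combine(fragments, sequential_rule(len(fragments), 2))
```

Because the rule is plain data, a different preset rule (for example, one that interleaves fragments) could be substituted without changing `combine`.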
In the examples of the present application, the base sequence was divided into S sequence units, and the S sequence units contained K sequence fragments of length n in total.
S103, labeling the K sequence fragments and the S sequence units by using preset index information to obtain K labeled sequence fragments and S labeled sequence units, wherein the index information comprises a first retrieval sequence for representing the arrangement sequence of the S sequence units in the base sequence and a second retrieval sequence for representing the arrangement sequence of a plurality of sequence fragments belonging to the same sequence unit in the sequence units, and the K labeled sequence fragments are used for synthesizing K first DNA molecules storing target data.
In the embodiment of the application, the sequence unit is marked by adopting the preset index information so as to facilitate the decoding of the subsequent DNA storage data. Marking the sequence unit by adopting preset index information, comprising the following steps: labeling the arrangement sequence of S sequence units in the base sequence by using a first search sequence, and labeling the arrangement sequence of a plurality of sequence fragments belonging to the same sequence unit in the sequence units by using a second search sequence, to obtain K labeled sequence fragments and S labeled sequence units. In this case, the order of S sequence units in the base sequence is recorded, and the order of a plurality of sequence fragments belonging to the same sequence unit is also recorded.
In the present example, the second search sequence is a sequence of bases, and the base sequence of the second search sequence can be determined by a predetermined rule. Illustratively, when the number of sequence fragments in a sequence unit is less than or equal to 4, a single base may be used to represent the second search sequence. For example, sequence number 1 corresponds to the single base A, sequence number 2 to the single base C, sequence number 3 to the single base G, and sequence number 4 to the single base T, but the sequence numbers and bases are not limited to this correspondence. When the number of sequence fragments in a sequence unit is less than or equal to 16, a double base can be used to represent the second search sequence. For example, No. 1 corresponds to the double base AA, No. 2 to AC, No. 3 to AG, No. 4 to AT, No. 5 to CA, No. 6 to CC, No. 7 to CG, No. 8 to CT, No. 9 to GA, No. 10 to GC, No. 11 to GG, No. 12 to GT, No. 13 to TA, No. 14 to TC, No. 15 to TG, and No. 16 to TT. Of course, the number of bases used is not limited to 2; when the number of sequence fragments in a sequence unit increases, the number of bases in the second search sequence correspondingly increases, e.g., when the number of sequence fragments in a sequence unit is 64 or less, the second search sequence can be represented by three bases.
The rule for the second search sequence is that 4 raised to the power of the number of bases in the second search sequence is greater than or equal to the number of sequence fragments in a sequence unit, and so on for larger numbers of sequence fragments.
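The sizing rule and the example numbering above can be expressed as a short Python sketch. The function names `index_width` and `encode_index` are hypothetical, and the digit order A, C, G, T is taken from the illustrative correspondence in the text; as the text notes, other correspondences are equally possible.

```python
BASES = "ACGT"  # digit order of the illustrative mapping: A=0, C=1, G=2, T=3


def index_width(fragment_count):
    """Smallest number of bases q such that 4**q >= fragment_count."""
    if fragment_count < 1:
        raise ValueError("fragment_count must be at least 1")
    q = 1
    while 4 ** q < fragment_count:
        q += 1
    return q


def encode_index(number, width):
    """Encode a 1-based sequence number as a fixed-width string of bases."""
    if not 1 <= number <= 4 ** width:
        raise ValueError("number does not fit in the given width")
    value = number - 1
    digits = []
    for _ in range(width):
        digits.append(BASES[value % 4])  # least significant digit first
        value //= 4
    return "".join(reversed(digits))


# A unit of 15 fragments needs a two-base second search sequence.
width = index_width(15)
first_index = encode_index(1, width)   # "AA"
last_index = encode_index(15, width)   # "TG"
```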
In this embodiment, when the sequence segment is marked by the second search sequence, the second search sequence may be spliced at a specific position of the sequence segment. In some embodiments, the manner in which multiple sequence fragments belonging to the same sequence unit are tagged with the second search sequence comprises: the second search sequence was spliced on either side of the sequence fragment. Illustratively, the second search sequence is spliced at the beginning, i.e., left end, of the sequence segment; or, the second search sequence is spliced at the terminal end, i.e., the right end, of the sequence fragment. In one embodiment, the means for tagging a plurality of sequence fragments belonging to the same sequence unit with a second search sequence comprises: and simultaneously splicing base groups on two sides of the sequence fragment, wherein the base groups on the two sides form a second retrieval sequence.
In the embodiments of the present application, the K sequence segments are labeled with the second search sequence to form K labeled sequence segments, which are also referred to as information sequence segments. Illustratively, the labeled sequence segment of AAATGC is formed after labeling the second search sequence AA before sequence segment ATGC.
In the examples of the present application, the first search sequence was used to indicate the positions of S sequence units in the base sequence. In some embodiments, the first search sequence includes i DNA sequence segments, i is an integer greater than or equal to 1, i.e., the first search sequence may be one DNA sequence segment or a plurality of DNA sequence segments. Wherein each DNA sequence fragment comprises a first base sequence used as an index marker and a second base sequence used for marking sequence unit numbers. Wherein the first base sequence can be placed in a specific position of the first search sequence according to a preset setting, and exemplarily, the first base sequence is located at the beginning (left end) of the first search sequence; illustratively, the first base sequence is located at the terminal end (right end) of the first search sequence; illustratively, the first base sequence is located at a specific position in the first search sequence, such as the third and fourth bases of the first search sequence, but not limited thereto.
The first base sequence may be set in advance. Illustratively, TT is placed as the first base sequence at the beginning of the first search sequence and serves as an index marker, indicating that a DNA sequence fragment beginning with TT in a sequence unit belongs to the first search sequence. In this example, when the first base sequence is located at the beginning of the first search sequence, the first base sequence is different from the beginning of every second search sequence, so as to avoid misidentifying a marker sequence fragment as the first search sequence during identification.
Similarly, the second base sequence may be set in advance. Illustratively, the sequence number 1 corresponds to four bases AAAA, the sequence number 2 corresponds to four bases AAAC, the sequence number 3 corresponds to four bases AAAG, and the sequence number 4 corresponds to four bases AAAT, but the sequence number and the bases are not limited to this correspondence, and the number of bases in the reference base group is not limited to 4. In the embodiment of the present application, the sequence unit is labeled by the first search sequence, and each sequence fragment in the sequence unit is labeled by the second search sequence to obtain a labeled sequence unit, which is also referred to as an information sequence unit. Referring to fig. 4, after subjecting the sequence unit (shown on the left side of the arrow in fig. 4) in step S102 to the first search sequence tag and each sequence fragment in the sequence unit to the second search sequence tag, four sets of tagged sequence units (shown on the right side of the arrow in fig. 4) each comprising a plurality of information sequence fragments are obtained.
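The labeling of step S103 can be sketched as follows, using the illustrative values from the text: the index marker TT, a four-base unit number (AAAA for unit 1, AAAC for unit 2) and a two-base second search sequence spliced at the left end of each fragment. The function names are hypothetical, and the sketch assumes a first search sequence made of a single DNA sequence fragment (i = 1).

```python
BASES = "ACGT"  # assumed digit order: A=0, C=1, G=2, T=3


def to_bases(value, width):
    """Write a non-negative integer as a fixed-width base-4 string of bases."""
    if not 0 <= value < 4 ** width:
        raise ValueError("value does not fit in the given width")
    digits = []
    for _ in range(width):
        digits.append(BASES[value % 4])
        value //= 4
    return "".join(reversed(digits))


def label_units(units, marker="TT", p=4, q=2):
    """Return (first search sequence, labeled fragments) for each sequence unit.

    marker: first base sequence used as the index marker.
    p: number of bases in the second base sequence (unit number).
    q: number of bases in the second search sequence (fragment number).
    """
    labeled = []
    for unit_no, fragments in enumerate(units):
        first_search = marker + to_bases(unit_no, p)
        tagged = []
        for fragment_no, fragment in enumerate(fragments):
            index = to_bases(fragment_no, q)
            if index.startswith(marker):
                # The marker must differ from the start of every second
                # search sequence, otherwise a fragment could be misread.
                raise ValueError("fragment index collides with index marker")
            tagged.append(index + fragment)  # splice index at the left end
        labeled.append((first_search, tagged))
    return labeled


labeled_units = label_units([["ATGC", "TCTG"], ["ATCA"]])
```

With q = 2 and the marker TT, at most 15 fragments per unit can be labeled without a collision, which is consistent with the 15-fragment units of the example.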
In some embodiments, the DNA sequence fragments corresponding to the first search sequence and the second search sequence are obtained using DNA synthesis techniques. Exemplary DNA synthesis techniques include, but are not limited to, enzymatic synthesis, phosphoramidite synthesis, and the like.
In some embodiments, the DNA sequence fragments corresponding to the first search sequence and the second search sequence can be obtained by amplification from a pre-synthesized DNA universal molecule library, such as PCR technology.
The above steps S101 to S103 are realized by processing modules in the apparatus shown in fig. 1.
In some embodiments, the storage method further comprises:
S104, synthesizing the K marker sequence fragments into K first DNA molecules storing the target data, and then storing the K first DNA molecules in S first physical spaces, wherein the first DNA molecules corresponding to marker sequence fragments belonging to the same sequence unit are stored in the same first physical space, and the first DNA molecules corresponding to marker sequence fragments not belonging to the same sequence unit are stored in different first physical spaces.
In this step, K first DNA molecules storing the target data are synthesized from K marker sequence fragments, respectively, and the synthesis is implemented by a DNA synthesis module in the apparatus shown in fig. 1. The embodiment of the application can synthesize K marking sequence segments respectively through the existing synthesis technology to obtain K first DNA molecules.
Storing K first DNA molecules in S different first physical spaces, and storing first DNA molecules corresponding to the marker sequence fragments belonging to the same sequence unit in the same first physical space, and storing first DNA molecules corresponding to the marker sequence fragments not belonging to the same sequence unit in different first physical spaces, wherein the step is realized by a DNA storage module in the device shown in FIG. 1. A schematic diagram of DNA storage after storing K first DNA molecules in S different first physical spaces is shown in FIG. 5, where each box represents a first physical space.
In the step, the synthesized K first DNA molecules are respectively stored in different first physical spaces, so that the respective storage of each information sequence unit is realized. Further, S first physical spaces are integrated in a DNA hard disk, and the storage of the base sequence stored with the target data is realized. Through integration, K first DNA molecules in S first physical spaces form a complete whole to be stored, and omission and information loss are not easy to occur in the storage process, so that the integrity of data storage is improved. Correspondingly, during the decoding release process, the integrated first DNA molecule is decoded and released, so that DNA base can be completely restored, and the integrity of data can be maintained.
In some embodiments, the storage method further comprises: and storing the second DNA molecule in a second physical space corresponding to the first physical space, wherein the second DNA molecule stores index information. And storing the second DNA molecule with the index information in a second physical space different from the first physical space to save the index information.
In the method for storing DNA data provided in the embodiment of the present application, after a binary sequence corresponding to data to be stored is converted into a base sequence, the base sequence corresponding to the binary sequence of target data is divided into S sequences, each sequence unit includes a plurality of divided sequence segments, the S sequence units collectively include K sequence segments, the length of each sequence segment is n, and n, S, and K are integers greater than or equal to 2, the sequence segments and the position information of the sequence units are labeled by using preset index information, and the labeled sequence segments are synthesized into DNA molecules and then stored respectively. By the method, the storage capacity of the data information can be improved, and the large-scale storage of the data information in the DNA can be realized.
In certain embodiments, the length of the first search sequence is different from the length of the marker sequence segment (the segment of sequence with the second search sequence), and the first search sequence and the marker sequence segment are distinguished from the sequence unit by length differentiation. Specifically, m represents the number of bases of the tag sequence fragments, q represents the number of bases of the second search sequence, i represents the number of DNA sequence fragments in the first search sequence, and p represents the number of bases of the second base sequence in the first search sequence, and by the method provided by the application, the storage of DNA data containing D numbers of bases can be realized, wherein the calculation formula of D is as follows:
$$D = 4^{q} \times (m - q) \times 4^{i \times p}$$
In a specific example, where the tag sequence fragment is 8 bases in length (m = 8), q = 4, i = 10, and p = 4, $D = 256 \times 4 \times 4^{40} \approx 1.23 \times 10^{27}$. When 1 base stores 2 bits of information, the amount of information that can be stored is $L = 2.46 \times 10^{27}$ bits $= 3.075 \times 10^{26}$ bytes $= 3.075 \times 10^{5}$ ZB, much larger than the current data storage size. In certain embodiments, the length of the first search sequence is the same as the length of the marker sequence segment (the segment of the sequence with the second search sequence); the first base sequence used as an index marker in the first search sequence may be a part of the second search sequence and may have the same number of bases as the second search sequence. In this case, the number of bases of the tag sequence fragment is represented by m, the number of bases of the second search sequence is represented by q, the number of DNA sequence fragments in the first search sequence is represented by i, and the number of bases of the second base sequence in the first search sequence is represented by p, and by the method of the present application, storage of DNA data containing D bases can be achieved, where D is calculated as follows:
$$D = (4^{q} - i) \times (m - q) \times 4^{i \times p}$$
In a specific example, where m = 8, q = 4, i = 10, and p = 4, $D = (256 - 10) \times 4 \times 4^{40} \approx 1.18 \times 10^{27}$. When 1 base stores 2 bits of information, the amount of information that can be stored is $L = 2.36 \times 10^{27}$ bits $= 2.95 \times 10^{26}$ bytes $= 2.95 \times 10^{5}$ ZB, much larger than the current data storage size.
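The two capacity formulas can be checked numerically with a few lines of Python. This is only an arithmetic sketch: the function names are hypothetical, and the conversion assumes 2 bits per base, 8 bits per byte, and 1 ZB = 10^21 bytes.

```python
def capacity_different_length(m, q, i, p):
    """D when the first search sequence and marker fragments differ in length."""
    return 4 ** q * (m - q) * 4 ** (i * p)


def capacity_same_length(m, q, i, p):
    """D when the first search sequence has the same length as a marker fragment.

    i of the 4**q possible second search sequences are reserved as markers.
    """
    return (4 ** q - i) * (m - q) * 4 ** (i * p)


def bases_to_zettabytes(base_count, bits_per_base=2):
    """Convert a number of stored bases into zettabytes (1 ZB = 1e21 bytes)."""
    return base_count * bits_per_base / 8 / 1e21


# Parameters of the specific example: m = 8, q = 4, i = 10, p = 4.
d_different = capacity_different_length(8, 4, 10, 4)
d_same = capacity_same_length(8, 4, 10, 4)
zb_different = bases_to_zettabytes(d_different)
zb_same = bases_to_zettabytes(d_same)
```

Both results are on the order of 10^27 bases, i.e. a few hundred thousand ZB, in line with the figures quoted in the text.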
Even if the sequence fragment contains 8 bases, the data storage which is far larger than the current storage capacity can be realized. In addition, since each sequence unit and sequence fragment is labeled with a search sequence, the method can successfully achieve data recovery from sequence fragments to sequence units, and from sequence units to base sequences.
In addition, the first DNA molecules are stored separately and amplified for storage according to actual needs in the examples of the present application. In this case, after the first DNA molecule is incorporated into the DNA molecule library, backup information of the first DNA molecule can be extracted according to actual needs, so that the trouble of re-synthesizing from the beginning every time is avoided, and the storage cost can be greatly reduced.
In one embodiment, as shown in fig. 6, a method for decoding K of the first DNA molecules comprises:
S201, sequencing a plurality of first DNA molecules stored in each first physical space to obtain a plurality of marker sequence fragments; and splicing the sequence fragments corresponding to each marker sequence fragment belonging to the same marker sequence unit according to the index information to obtain the sequence unit.
The means for sequencing the K first DNA molecules includes any means that can read the DNA product, such as second generation sequencing, third generation sequencing, etc., to obtain a plurality of information sequence units to be decoded.
In one embodiment, the K first DNA molecules are sequenced to obtain K marker sequence fragments, respectively. In some embodiments, when the second DNA molecule is stored in a second physical space corresponding to the first physical space, the second DNA molecule storing the index information, the sequencing further comprises: and sequencing the second DNA molecule to obtain index information comprising the first retrieval sequence and the second retrieval sequence.
The step S201 can be implemented by a DNA sequencing module in the storage device shown in fig. 1.
S202, splicing sequence segments corresponding to all the marked sequence segments belonging to the same marked sequence unit according to a second retrieval sequence to obtain S sequence units; and splicing the obtained S sequence units according to the first search sequence to obtain a base sequence.
This step splices the K marker sequence fragments belonging to the S sequence units into a base sequence by means of the index information. The position information of each marker sequence unit in the base sequence is acquired according to the first search sequence, and the position information or number information of the plurality of marker sequence fragments belonging to the same sequence unit is acquired according to the second search sequence. The sequence fragments are spliced into sequence units according to the obtained position information or number information of the sequence fragments; then, based on the obtained position information or number information of the sequence units, the S sequence units are spliced into the base sequence.
In some embodiments, splicing the sequence segments corresponding to each tagged sequence segment belonging to the same tagged sequence unit according to the second search sequence to obtain S sequence units comprises: acquiring a sequence fragment corresponding to each marker sequence fragment belonging to the same marker sequence unit and the position of the sequence fragment in the sequence unit according to the second retrieval sequence; and splicing the sequence fragments into sequence units according to the positions of the sequence fragments in the sequence units.
In one embodiment, splicing the obtained S sequence units according to the first search sequence to obtain a base sequence comprises:
acquiring the position of a sequence unit in a base sequence according to a first search sequence;
and splicing the S sequence units into a base sequence according to the positions of the sequence units in the base sequence.
Illustratively, the complete DNA sequence "ATCT TCTG TTCC ATAA TAAG TGAG ATCC TACA TAGT ATGG TATG TGAC ATCC TGCT TGAC ATCT TCTG TTAA ATCT TGGA TAGA ATTG TTGC ATGC ATCG TATC TGTA ATCC TTCG TCCT ATCA TCTT TGCG ATTC TGGT TTGA ATCC TTTT ATTG TCAC TACG ATTG ATCT TCAC TACT ATGA TGGGG TGGT" is read based on the first search sequence and the second search sequence.
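The reassembly of step S202 can be sketched as follows. The sketch assumes that each first physical space yields an unordered list of reads, that the read beginning with the index marker TT is the first search sequence, and that every other read is a marker sequence fragment with a two-base second search sequence at its left end. The names `from_bases` and `assemble` are hypothetical, and sequencing errors are not modeled.

```python
BASES = "ACGT"  # assumed digit order: A=0, C=1, G=2, T=3


def from_bases(text):
    """Read a string of bases as a base-4 integer."""
    value = 0
    for base in text:
        value = value * 4 + BASES.index(base)
    return value


def assemble(pools, marker="TT", q=2):
    """Rebuild the base sequence from the reads of each first physical space."""
    units = []
    for reads in pools:
        unit_no = None
        fragments = []
        for read in reads:
            if read.startswith(marker):
                # First search sequence: marker followed by the unit number.
                unit_no = from_bases(read[len(marker):])
            else:
                # Marker fragment: q-base index followed by the payload.
                fragments.append((from_bases(read[:q]), read[q:]))
        if unit_no is None:
            raise ValueError("pool contains no first search sequence")
        fragments.sort()  # order fragments by second search sequence
        units.append((unit_no, "".join(payload for _, payload in fragments)))
    units.sort()  # order units by first search sequence
    return "".join(sequence for _, sequence in units)


# Two pools, deliberately shuffled, holding units 1 and 2 of a short example.
pools = [
    ["ACTAAG", "TTAAAC", "AAATAA"],
    ["AGTTCC", "AAATCT", "TTAAAA", "ACTCTG"],
]
restored = assemble(pools)
```

Here the restored sequence is ATCTTCTGTTCCATAATAAG, i.e. the first five four-base fragments of the example base sequence in their original order.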
S203, converting the base sequence into target data.
The base sequence can be converted into a binary sequence by the same predetermined mapping relationship used in S101. For example, with 11 corresponding to A, 10 corresponding to T, 01 corresponding to C, and 00 corresponding to G, the base sequence obtained in S202 can be converted into the binary sequence: "11100110 10011000 10100101 11101111 10111100 10001100 11100101 10110111 10110010 11100100 10111000 10001101 11100101 10000110 10001101 11100110 10011000 10101111 11100110 10000011 10110011 11101000 10110001 10100001 11100100 10111001 10001011 11100101 10100100 10010110 11100111 10011010 10000100 11101001 10000010 10100011 11100101 10001111 10101010 11101000 10011101 10110100 11101000 10011101 10110110 11100011 10000000 10000010".
A sequence of computer information is then generated from the binary sequence.
According to the generated binary sequence and in combination with preset coding rules, the binary sequence can be converted into corresponding data files, including files such as pictures, texts, programs, audio and video.
For example, a computer program restores the binary sequence obtained in step S203 to the text message "Spring, the butterfly is no longer the butterfly out of imagination."
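The conversion of step S203 can be sketched in Python using the mapping given above (A = 11, T = 10, C = 01, G = 00). The function names are hypothetical, and decoding the bytes as UTF-8 text is an assumption: the source only refers to preset coding rules, although the first three bytes of the example binary sequence do form a valid UTF-8 character.

```python
BASE_TO_BITS = {"A": "11", "T": "10", "C": "01", "G": "00"}


def bases_to_bits(base_sequence):
    """Map each base to its two-bit code and concatenate the result."""
    return "".join(BASE_TO_BITS[base] for base in base_sequence)


def bases_to_bytes(base_sequence):
    """Convert a base sequence into bytes, eight bits (four bases) at a time."""
    bits = bases_to_bits(base_sequence)
    if len(bits) % 8 != 0:
        raise ValueError("sequence does not encode a whole number of bytes")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


# The first three fragments of the example sequence: ATCT TCTG TTCC.
first_bits = bases_to_bits("ATCTTCTGTTCC")
# Assumed text encoding: UTF-8.
first_character = bases_to_bytes("ATCTTCTGTTCC").decode("utf-8")
```

The twelve bases ATCT TCTG TTCC yield the bits 11100110 10011000 10100101, which match the first three bytes of the binary sequence above and decode under UTF-8 to the Chinese character for "spring".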
It should be understood that the implementation of the steps in the above embodiments may be implemented by human computing or by a computer program. The sequence number of each step does not mean the execution sequence, and the execution sequence of each process should be determined by the function and the internal logic of the process, and should not constitute any limitation to the implementation process of the embodiment of the present application.
In a second aspect, with reference to fig. 1, an embodiment of the present application provides a DNA data storage device, including a data processing module, configured to obtain a base sequence corresponding to a binary sequence of target data; dividing the base sequence to obtain S sequence units, wherein each sequence unit comprises a plurality of divided sequence fragments, the S sequence units contain K sequence fragments, the length of each sequence fragment is n, and n, S and K are integers more than or equal to 2; labeling the K sequence fragments and the S sequence units by using preset index information to obtain K labeled sequence fragments and S labeled sequence units, wherein the index information comprises a first retrieval sequence for expressing the arrangement sequence of the S sequence units in a base sequence and a second retrieval sequence for expressing the arrangement sequence of a plurality of sequence fragments belonging to the same sequence unit in the sequence units, and the K labeled sequence fragments are used for synthesizing K first DNA molecules storing target data.
In some embodiments, the data processing module is further configured to: splice, according to the second search sequence, the sequence fragments corresponding to each of the tagged sequence fragments belonging to the same tagged sequence unit to obtain a sequence unit; splice the obtained S sequence units according to the first search sequence to obtain a base sequence; and convert the base sequence into the target data.
In some embodiments, the storage device further comprises a DNA synthesis module for synthesizing the K marker sequence segments into K first DNA molecules storing the target data. In some embodiments, the storage device further comprises a DNA molecule storage module for storing K first DNA molecules in S first physical spaces, wherein first DNA molecules corresponding to marker sequence fragments belonging to one sequence unit are stored in the same first physical space and first DNA molecules corresponding to marker sequence fragments not belonging to the same sequence unit are stored in different first physical spaces.
In some embodiments, the DNA molecule storage module is further configured to store a second DNA molecule in a second physical space.
In some embodiments, the storage device further comprises a DNA molecule sequencing module for sequencing the plurality of first DNA molecules stored in each first physical space to obtain a plurality of tag sequence fragments.
The apparatus of FIG. 1 containing the above modules can perform data storage writing using DNA, corresponding to the method of DNA data storage writing shown in FIG. 2.
The embodiment of the application also provides a DNA data storage device. As shown in fig. 7, the terminal device 70 provided in the present embodiment includes: a processor 710, a memory 720, and a computer program 721 stored in the memory 720 and operable on the processor 710. The processor 710, when executing the computer program 721, implements the steps in the various embodiments of the method of storing DNA data described above, such as steps S101 to S103 shown in fig. 2.
Illustratively, the computer program 721 may be divided into one or more modules/units, which are stored in the memory 720 and executed by the processor 710 to accomplish the present application. One or more of the modules/units may be a series of computer program instruction segments capable of performing certain functions, which may be used to describe the execution of the computer program 721 in the terminal device. For example, the computer program 721 may be divided into data processing modules. In some embodiments, the computer program 721 may also be partitioned into a data processing module, a DNA molecule synthesis module, a DNA molecule storage module, and a DNA molecule sequencing module, each module functioning specifically as described herein. For economy of disclosure, further description is omitted here.
Those skilled in the art will appreciate that fig. 7 is merely an example of terminal device 70, and does not constitute a limitation of terminal device 70, and may include more or fewer components than shown, or some components may be combined, or different components.
The processor 710 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 720 may be an internal storage unit of the terminal device 70, such as a hard disk or a memory of the terminal device 70. The memory 720 may also be an external storage device of the terminal device 70, such as a plug-in hard disk provided on the terminal device 70, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and so on. Further, the memory 720 may also include both an internal storage unit of the terminal device 70 and an external storage device. The memory 720 is used for storing the computer program 721 and other programs and data required by the terminal device 70. The memory 720 may also be used to temporarily store data that has been output or is to be output.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the processing method of the foregoing embodiments is implemented.
The embodiment of the present application further provides a computer program product, which, when running on a terminal device, enables the terminal device to execute the method for storing DNA data of the foregoing embodiments.
The embodiment of the application provides a DNA hard disk, referring to fig. 5, comprising a plurality of physical spaces, wherein the physical spaces are made of physical materials, and each physical space is used for storing DNA molecules.
In the embodiment of the application, the physical space made of the physical material wraps the DNA molecules, and the DNA molecules are isolated by the physical material. The shape of the physical space is not strictly limited, and may be circular, square, or any other shape.
In some embodiments, the DNA molecules stored in the physical space comprise the first DNA molecule above, in which case the physical space is the first physical space. Correspondingly, the first DNA molecule comprises the second search sequence and the sequence fragment. Each first physical space also stores the second retrieval sequence described above.
In some embodiments, the DNA molecules stored in the physical space comprise a second DNA molecule as described above, in which case the physical space is a second physical space.
In some embodiments, the physical material is selected from at least one of SiO2, a metal oxide, and a high-molecular polymer material. The polymer material includes, but is not limited to, resin.
The above are merely alternative embodiments of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims (14)

  1. A method for storing DNA data, comprising:
    acquiring a base sequence corresponding to a binary sequence of target data;
    dividing the base sequence to obtain S sequence units, wherein each sequence unit comprises a plurality of sequence fragments obtained by the division, the S sequence units contain K sequence fragments in total, the length of each sequence fragment is n, and n, S and K are integers greater than or equal to 2;
    labeling the K sequence fragments and the S sequence units by using preset index information to obtain K labeled sequence fragments and S labeled sequence units, wherein the index information comprises a first retrieval sequence representing the arrangement order of the S sequence units in the base sequence and a second retrieval sequence representing the arrangement order, within a sequence unit, of the plurality of sequence fragments belonging to that sequence unit, and the K labeled sequence fragments are used for synthesizing K first DNA molecules storing the target data;
    the storage method further comprising:
    after synthesizing the K labeled sequence fragments into the K first DNA molecules storing the target data, storing the K first DNA molecules in S first physical spaces, wherein the first DNA molecules corresponding to labeled sequence fragments belonging to the same sequence unit are stored in the same first physical space, and the first DNA molecules corresponding to labeled sequence fragments not belonging to the same sequence unit are stored in different first physical spaces.
  2. The method for storing DNA data according to claim 1, wherein labeling the plurality of sequence fragments belonging to the same sequence unit with the second retrieval sequence comprises:
    splicing the second retrieval sequence onto either side of the sequence fragment, or
    splicing retrieval base groups onto both sides of the sequence fragment simultaneously, wherein the retrieval base groups on the two sides together form the second retrieval sequence.
  3. The method for storing DNA data according to claim 1, wherein the first retrieval sequence includes i DNA sequence fragments, i is an integer greater than or equal to 1, and each of the DNA sequence fragments includes a first base sequence serving as an index marker and a second base sequence for identifying the sequence unit number.
  4. The method according to claim 1, wherein the S first physical spaces are integrated in one DNA hard disk.
  5. The method for storing DNA data according to claim 1, further comprising:
    storing a second DNA molecule in a second physical space corresponding to the first physical space, the second DNA molecule storing the index information.
  6. The method for storing DNA data according to any one of claims 1 to 5, wherein a method for decoding the K first DNA molecules comprises:
    sequencing the plurality of first DNA molecules stored in each of the first physical spaces to obtain a plurality of the labeled sequence fragments;
    splicing, according to the second retrieval sequence, the sequence fragments corresponding to the labeled sequence fragments belonging to the same labeled sequence unit to obtain the sequence unit; splicing the obtained S sequence units according to the first retrieval sequence to obtain the base sequence;
    converting the base sequence into the target data.
  7. A DNA data storage device, characterized by comprising a data processing module, wherein
    the data processing module is configured to: acquire a base sequence corresponding to a binary sequence of target data; divide the base sequence to obtain S sequence units, wherein each sequence unit comprises a plurality of sequence fragments obtained by the division, the S sequence units contain K sequence fragments in total, the length of each sequence fragment is n, and n, S and K are integers greater than or equal to 2; and label the K sequence fragments and the S sequence units by using preset index information to obtain K labeled sequence fragments and S labeled sequence units, wherein the index information comprises a first retrieval sequence representing the arrangement order of the S sequence units in the base sequence and a second retrieval sequence representing the arrangement order, within a sequence unit, of the plurality of sequence fragments belonging to that sequence unit, and the K labeled sequence fragments are used for synthesizing K first DNA molecules storing the target data;
    the device further comprises a DNA molecule storage module configured to store the K first DNA molecules in S first physical spaces, wherein the first DNA molecules corresponding to labeled sequence fragments belonging to the same sequence unit are stored in the same first physical space, and the first DNA molecules corresponding to labeled sequence fragments not belonging to the same sequence unit are stored in different first physical spaces.
  8. The DNA data storage device of claim 7, further comprising a DNA molecule synthesis module configured to synthesize the K labeled sequence fragments into the K first DNA molecules storing the target data.
  9. The DNA data storage device of claim 7, wherein the DNA molecule storage module is further configured to store a second DNA molecule in a second physical space.
  10. The DNA data storage device of claim 7, further comprising a DNA molecule sequencing module configured to sequence the plurality of first DNA molecules stored in each of the first physical spaces to obtain a plurality of the labeled sequence fragments;
    the data processing module is further configured to splice, according to the second retrieval sequence, the sequence fragments corresponding to the labeled sequence fragments belonging to the same labeled sequence unit to obtain the sequence unit; splice the obtained S sequence units according to the first retrieval sequence to obtain the base sequence; and convert the base sequence into the target data.
  11. A DNA data storage device comprising a terminal device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method for storing DNA data according to any one of claims 1 to 6 when executing the computer program.
  12. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for storing DNA data according to any one of claims 1 to 6.
  13. A DNA hard disk, characterized by comprising a plurality of physical spaces made of a physical material, wherein each physical space is used for storing DNA molecules, and the DNA molecules are stored according to the storage method of any one of claims 1 to 6.
  14. The DNA hard disk of claim 13, characterized in that the physical material is selected from at least one of SiO₂, metal oxide, and polymer material.
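
The storing steps of claim 1 and the decoding steps of claim 6 can be sketched as a round trip in Python. This is a minimal illustration under stated assumptions, not the claimed implementation: the 2-bit-per-base mapping, the fixed-width base-4 index format, the function names, and the use of a dictionary key as the first retrieval sequence of each first physical space are all hypothetical choices made here. The sketch also assumes the base sequence length is an exact multiple of n and leaves padding out.

```python
# Assumed 2-bit-per-base mapping; the claims do not fix a particular code.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_bases(data: bytes) -> str:
    """Binary sequence of the target data -> base sequence."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bases_to_bytes(bases: str) -> bytes:
    """Base sequence -> target data (inverse of bytes_to_bases)."""
    bits = "".join(BASE_TO_BITS[base] for base in bases)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def int_to_index(value: int, width: int) -> str:
    """Hypothetical retrieval-sequence format: fixed-width base-4 number over ACGT."""
    if not 0 <= value < 4 ** width:
        raise ValueError("index does not fit in the given width")
    digits = []
    for _ in range(width):
        digits.append("ACGT"[value % 4])
        value //= 4
    return "".join(reversed(digits))


def index_to_int(index: str) -> int:
    value = 0
    for base in index:
        value = value * 4 + "ACGT".index(base)
    return value


def encode(data: bytes, n: int, fragments_per_unit: int, index_width: int = 4) -> dict:
    """Divide into fragments of length n, group them into units, and label both.

    Returns {first retrieval sequence: [second retrieval sequence + fragment, ...]},
    i.e. one dictionary entry per first physical space.
    """
    bases = bytes_to_bases(data)
    if len(bases) % n:
        raise ValueError("this sketch requires the base sequence length to be a multiple of n")
    fragments = [bases[i:i + n] for i in range(0, len(bases), n)]
    units = [fragments[i:i + fragments_per_unit]
             for i in range(0, len(fragments), fragments_per_unit)]
    spaces = {}
    for unit_no, unit in enumerate(units):
        first_index = int_to_index(unit_no, index_width)
        # The second retrieval sequence is spliced onto one side of each fragment.
        spaces[first_index] = [int_to_index(pos, index_width) + fragment
                               for pos, fragment in enumerate(unit)]
    return spaces


def decode(spaces: dict, index_width: int = 4) -> bytes:
    """Reassemble units by the second index, then the base sequence by the first."""
    units = []
    for first_index, molecules in spaces.items():
        ordered = sorted(molecules, key=lambda m: index_to_int(m[:index_width]))
        unit = "".join(m[index_width:] for m in ordered)
        units.append((index_to_int(first_index), unit))
    bases = "".join(unit for _, unit in sorted(units))
    return bases_to_bytes(bases)
```

Because every molecule carries its own second retrieval sequence and every space is keyed by a first retrieval sequence, `decode` does not depend on the order in which the spaces or the reads within a space are presented, which mirrors the unordered output of sequencing.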
CN202110929436.9A 2021-08-13 2021-08-13 Method, device and equipment for storing DNA data and readable storage medium Active CN113782102B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110929436.9A CN113782102B (en) 2021-08-13 2021-08-13 Method, device and equipment for storing DNA data and readable storage medium

Publications (2)

Publication Number Publication Date
CN113782102A CN113782102A (en) 2021-12-10
CN113782102B true CN113782102B (en) 2022-12-13

Family

ID=78837721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110929436.9A Active CN113782102B (en) 2021-08-13 2021-08-13 Method, device and equipment for storing DNA data and readable storage medium

Country Status (1)

Country Link
CN (1) CN113782102B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758703B (en) * 2022-06-14 2022-09-13 深圳先进技术研究院 Data information storage method based on recombinant plasmid DNA molecules
CN114958828B (en) * 2022-06-14 2024-04-19 深圳先进技术研究院 Data information storage method based on DNA molecular medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005247900A (en) * 2004-03-01 2005-09-15 Ltt Bio-Pharma Co Ltd Method for judging seal or sign put by using dna-containing ink
CN101702240A (en) * 2009-11-26 2010-05-05 大连大学 Image encryption method based on DNA sub-sequence operation
CN106845158A (en) * 2017-02-17 2017-06-13 苏州泓迅生物科技股份有限公司 A kind of method that information Store is carried out using DNA
CN111091876A (en) * 2019-12-16 2020-05-01 中国科学院深圳先进技术研究院 DNA storage method, system and electronic equipment
CN111095423A (en) * 2017-08-25 2020-05-01 深圳华大生命科学研究院 Encoding/decoding method, apparatus and data processing apparatus
CN111858510A (en) * 2020-07-16 2020-10-30 中国科学院北京基因组研究所(国家生物信息中心) DNA type storage system and method
CN112288090A (en) * 2020-10-22 2021-01-29 中国科学院深圳先进技术研究院 Method and device for processing DNA sequence with data information
CN112673428A (en) * 2019-05-31 2021-04-16 伊鲁米那股份有限公司 Storage device, system and method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11093547B2 (en) * 2018-06-19 2021-08-17 Intel Corporation Data storage based on encoded DNA sequences
CN112749247B (en) * 2019-10-31 2023-08-18 中国科学院深圳先进技术研究院 Text information storage and reading method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
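Addressing Information Using Data Hiding for DNA-based Storage Systems; Takahiro Ota et al.; 2020 International Symposium on Information Theory and Its Applications (ISITA); 2021-08-02; pp. 509-513 *
DNA数据存储技术原理及其研究进展 [Principles of DNA data storage technology and its research progress]; 滕越 (Teng Yue) et al.; 《生物化学与生物物理进展》 (Progress in Biochemistry and Biophysics); 2021-05-31; Vol. 48, No. 5; pp. 494-504 *
Mendel: A Distributed Storage Framework for Similarity Searching over Sequencing Data; Cameron Tolooee et al.; 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS); 2016-07-21; pp. 790-799 *
人工DNA合成技术:DNA数据存储的基石 [Artificial DNA synthesis technology: the cornerstone of DNA data storage]; 黄小罗 (Huang Xiaoluo) et al.; 《合成生物学》 (Synthetic Biology Journal); 2021-02-28; Vol. 2, No. 3; pp. 335-353 *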

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220517

Address after: 518000 4th floor, Zhuohong building, Zhenmei community, Xinhu street, Guangming District, Shenzhen, Guangdong

Applicant after: Zhongke carbon yuan (Shenzhen) Biotechnology Co.,Ltd.

Address before: 1068 No. 518055 Guangdong city in Shenzhen Province, Nanshan District City Xili University School Avenue

Applicant before: SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY

GR01 Patent grant