CN111858139A - Method and device for detecting silent data errors - Google Patents
- Publication number
- CN111858139A
- Authority
- CN
- China
- Prior art keywords
- data
- checksum
- read
- silent
- write
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1004—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1076—Parity data used in redundant arrays of independent storages, e.g. in RAID systems
- G06F11/1084—Degraded mode, e.g. caused by single or multiple storage removals or disk failures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- Techniques For Improving Reliability Of Storages (AREA)
Abstract
The invention discloses a method and a device for detecting silent data errors. When data is written to a target location, a checksum is computed over the written data and stored as the write data checksum. When data is read from the target location, a checksum is computed over the read data, and the resulting read data checksum is compared with the write data checksum stored when that data was written. If the two are the same, the data is correct; if they differ, a silent data error has occurred. This end-to-end check effectively detects whether silent data errors have occurred, avoids data inconsistency, and improves the availability and data safety of the storage system.
Description
Technical Field
The invention relates to the field of data detection, and in particular to a method and a device for detecting silent data errors.
Background
Silent data errors are errors that the computer components themselves cannot detect. Often the data is found to be erroneous or corrupted only when it is needed, by which point the loss may be irreparable. Silent data errors are therefore highly damaging: in a distributed storage system they can cause data consistency problems, which in turn affect whether the storage system is commercially viable.
Although the probability of a silent data error in a single component is extremely low (generally on the order of 10^-7 according to a research report by the European Organization for Nuclear Research, CERN), every component can produce such errors and a distributed system handles massive volumes of data over long periods, so the occurrence of silent data errors is almost inevitable. Conventional measures such as replication, backup and disaster recovery do not handle silent data errors well.
Disclosure of Invention
To solve the above problems, the present invention provides a method and a device for detecting silent data errors, so that silent data errors can be detected effectively.
The technical scheme of the invention is as follows: a method of detecting silent data errors, comprising the steps of:
when data is written to the target location, computing a checksum over the written data and storing it as the write data checksum;
when data is read from the target location, computing a checksum over the read data, and comparing the resulting read data checksum with the write data checksum stored when that data was written;
if the two are the same, the data is correct; if the two are different, a silent data error has occurred.
Further, storing the write data checksum specifically comprises:
recording the write data checksum in a data structure;
and persisting the data structure to a database.
Further, the target location of the method is a physical disk;
the data structure also records: the location of the corresponding data on the physical disk, the length of the corresponding data, the check algorithm used, and the configured check block size.
Further, the data structure is persisted to a rocksdb database.
Further, before the read data checksum is compared with the write data checksum stored when that data was written, the method further comprises the steps of:
acquiring the data structure of the corresponding data from the database;
and acquiring the required write data checksum from the acquired data structure.
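The record-persist-fetch cycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `BlobMeta`, `encode`, `decode`, `save_meta` and `load_meta` are hypothetical names, the 18-byte little-endian layout is invented for the example, and a `std::map` stands in for the rocksdb database so that the sketch stays self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical metadata record: the fields mirror what the text says is recorded.
struct BlobMeta {
    std::uint64_t disk_offset = 0;       // location of the data on the physical disk
    std::uint32_t length = 0;            // length of the data
    std::uint8_t csum_type = 0;          // check algorithm used
    std::uint8_t csum_chunk_order = 0;   // configured check block size
    std::uint32_t write_checksum = 0;    // checksum computed at write time
};

constexpr std::size_t kEncodedSize = 8 + 4 + 1 + 1 + 4;  // assumed fixed layout

// Append the low `bytes` bytes of `value` in little-endian order.
inline void put_le(std::string& out, std::uint64_t value, int bytes) {
    for (int i = 0; i < bytes; ++i)
        out.push_back(static_cast<char>((value >> (8 * i)) & 0xFFu));
}

// Read `bytes` little-endian bytes starting at `pos`.
inline std::uint64_t get_le(const std::string& in, std::size_t pos, int bytes) {
    std::uint64_t value = 0;
    for (int i = 0; i < bytes; ++i)
        value |= static_cast<std::uint64_t>(static_cast<unsigned char>(in[pos + i])) << (8 * i);
    return value;
}

inline std::string encode(const BlobMeta& m) {
    std::string out;
    put_le(out, m.disk_offset, 8);
    put_le(out, m.length, 4);
    put_le(out, m.csum_type, 1);
    put_le(out, m.csum_chunk_order, 1);
    put_le(out, m.write_checksum, 4);
    return out;
}

inline std::optional<BlobMeta> decode(const std::string& bytes) {
    if (bytes.size() != kEncodedSize) return std::nullopt;  // reject malformed records
    BlobMeta m;
    m.disk_offset = get_le(bytes, 0, 8);
    m.length = static_cast<std::uint32_t>(get_le(bytes, 8, 4));
    m.csum_type = static_cast<std::uint8_t>(get_le(bytes, 12, 1));
    m.csum_chunk_order = static_cast<std::uint8_t>(get_le(bytes, 13, 1));
    m.write_checksum = static_cast<std::uint32_t>(get_le(bytes, 14, 4));
    return m;
}

// In-memory stand-in for the key-value database (an assumption, not rocksdb itself).
using KvStore = std::map<std::string, std::string>;

inline void save_meta(KvStore& db, const std::string& key, const BlobMeta& m) {
    db[key] = encode(m);
}

inline std::optional<BlobMeta> load_meta(const KvStore& db, const std::string& key) {
    auto it = db.find(key);
    if (it == db.end()) return std::nullopt;
    return decode(it->second);
}
```

On the read path, a caller would fetch the record with `load_meta` and take `write_checksum` from it before comparing. In a real deployment the map would be replaced by calls to the actual database client.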
Further, the checksums of the written data and the read data are computed with a CRC algorithm or an XXHASH algorithm.
Further, when a silent data error is determined to have occurred, the erroneous data is recovered according to a replica or erasure-code redundancy strategy.
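As an illustration of the CRC option, a table-driven CRC-32 can be sketched as below. The source does not say which CRC-32 variant is intended; this sketch assumes the common IEEE 802.3 variant (reflected polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF), and the function names are hypothetical.

```cpp
#include <array>
#include <cstdint>
#include <string_view>

// Build the 256-entry lookup table for the reflected polynomial 0xEDB88320.
constexpr std::array<std::uint32_t, 256> make_crc32_table() {
    std::array<std::uint32_t, 256> table{};
    for (std::uint32_t i = 0; i < 256; ++i) {
        std::uint32_t c = i;
        for (int bit = 0; bit < 8; ++bit) {
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        }
        table[i] = c;
    }
    return table;
}

// CRC-32 over a byte sequence (assumed IEEE 802.3 variant).
inline std::uint32_t crc32(std::string_view data) {
    static constexpr auto table = make_crc32_table();
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc = table[(crc ^ byte) & 0xFFu] ^ (crc >> 8);
    }
    return crc ^ 0xFFFFFFFFu;
}
```

With this variant, `crc32("123456789")` yields the standard check value `0xCBF43926`. Because a CRC detects every single-bit error, flipping one bit of the input always changes the result, which is the property the write/read comparison relies on.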
The technical scheme of the invention also includes a device for detecting silent data errors, comprising:
a checking module: computes a checksum over the data written to the target location to obtain a write data checksum, and computes a checksum over the data read from the target location to obtain a read data checksum;
a write data checksum saving module: stores the write data checksum;
a judging module: compares the read data checksum with the write data checksum stored when the read data was written; if the two are the same, the data is correct, and if they are different, a silent data error has occurred.
Further, the write data checksum saving module stores the write data checksum by recording it in a data structure and persisting the data structure to a rocksdb database;
the device further comprises:
a write data checksum acquisition module: acquires the data structure of the corresponding data from the rocksdb database, and acquires the required write data checksum from the acquired data structure.
Further, the device also comprises:
a data recovery module: when a silent data error occurs, recovers the erroneous data according to a replica or erasure-code storage redundancy strategy.
The method and the device for detecting silent data errors provided by the invention generate a checksum before data is written and again after data is read, and compare the two checksums to discover silent data errors; preferably, the erroneous data is recovered when a silent data error is found. This end-to-end check effectively detects whether silent data errors have occurred, avoids data inconsistency, and improves the availability and data safety of the storage system.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a flowchart of a specific implementation of the method provided in the first embodiment of the present invention.
Fig. 3 is a schematic block diagram of the structure of the device provided in the second embodiment of the present invention.
Detailed Description
The present invention is described in detail below with reference to the accompanying drawings by way of specific examples. The following embodiments illustrate the present invention, and the present invention is not limited to them.
The English terms used in the invention are explained below:
(1) CRC: cyclic redundancy check.
(2) XXHASH: an extremely fast hash algorithm that is not intended for cryptographic use.
(3) Rocksdb: a high-performance embedded database for key-value data.
(4) ACID: the four properties that a database management system (DBMS) must have to ensure that a transaction is correct and reliable when data is written or updated: atomicity, consistency, isolation and durability.
Example one
The present embodiment provides a method for detecting silent data errors, which generates a checksum before data is written and again after data is read, and compares the two checksums to discover silent data errors.
As shown in fig. 1, the method comprises the following steps:
SS1, when data is written to the target location, compute a checksum over the written data and store it as the write data checksum;
SS2, when data is read from the target location, compute a checksum over the read data, and compare the resulting read data checksum with the write data checksum stored when that data was written;
SS3, if the two are the same, the data is correct; if the two are different, a silent data error has occurred.
Preferably, after a silent data error is found, the erroneous data can be recovered through a replica or erasure-code redundancy strategy. Through this end-to-end check, the method effectively detects and repairs silent data errors, avoids data inconsistency, and improves the availability and data safety of the storage system.
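Steps SS1 to SS3, together with recovery from a replica, can be sketched as follows. This is a minimal illustration and not the patented implementation: `CheckedStore`, `ReadStatus` and the two in-memory maps standing in for a primary disk and a replica are all hypothetical, and a bitwise CRC-32 (assumed IEEE variant) is used as the checksum.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Bitwise CRC-32 (assumed IEEE 802.3 variant), kept short for the sketch.
inline std::uint32_t crc32_bitwise(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc ^ 0xFFFFFFFFu;
}

enum class ReadStatus { Ok, RepairedFromReplica, Unrecoverable };

struct ReadResult {
    ReadStatus status;
    std::string data;
};

class CheckedStore {
public:
    // SS1: checksum the data on write and keep the checksum separately from the data.
    void write(const std::string& key, const std::string& data) {
        checksums_[key] = crc32_bitwise(data);
        primary_[key] = data;
        replica_[key] = data;
    }

    // SS2 and SS3: re-checksum on read, compare, and fall back to the replica on mismatch.
    // Throws std::out_of_range for an unknown key.
    ReadResult read(const std::string& key) {
        const std::uint32_t expected = checksums_.at(key);
        std::string& stored = primary_.at(key);
        if (crc32_bitwise(stored) == expected) return {ReadStatus::Ok, stored};
        const std::string& copy = replica_.at(key);
        if (crc32_bitwise(copy) == expected) {
            stored = copy;  // repair the primary copy in place
            return {ReadStatus::RepairedFromReplica, stored};
        }
        return {ReadStatus::Unrecoverable, {}};
    }

    // Test hook: flip one bit of a non-empty primary copy to simulate a silent error.
    void corrupt_primary(const std::string& key) { primary_.at(key)[0] ^= 0x01; }

private:
    std::map<std::string, std::uint32_t> checksums_;
    std::map<std::string, std::string> primary_;
    std::map<std::string, std::string> replica_;
};
```

The design point worth noting is that the checksum is stored apart from the data it protects, so corruption of the data cannot also corrupt the value it is compared against.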
Specifically, in a concrete implementation, a data structure (named blob_t) manages a fixed-size segment of physical disk space. When data is written to the disk space corresponding to a blob_t data structure, the checksum of the data to be written is calculated with a CRC or XXHASH check algorithm and recorded in the data structure. CRC32 or XXHASH64 is used in particular, which balances the storage system's own requirements on collision probability and execution efficiency. The blob_t data structure is persisted to a rocksdb database, and the ACID properties of the rocksdb database are relied on to ensure that the checksum read back from it is always reliable. When the data is read, the read data is checked again with the check algorithm, and the resulting checksum is compared with the one recorded in the blob_t data structure. If they are inconsistent, a silent data error is confirmed, and the redundancy strategy configured in the distributed storage system can then be used to recover the disk location where the silent data error occurred, thereby repairing the error.
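One way a configured check block size could be used is to checksum the data block by block, so that a mismatch also points at the damaged block. The sketch below assumes, purely for illustration, that the block size is a power of two given as an exponent (block size = 1 << order); the source does not state this, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Bitwise CRC-32 (assumed IEEE 802.3 variant).
inline std::uint32_t crc32_bitwise(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc ^ 0xFFFFFFFFu;
}

// Write path: one checksum per check block of (1 << chunk_order) bytes.
inline std::vector<std::uint32_t> chunk_checksums(const std::string& data, unsigned chunk_order) {
    const std::size_t chunk = std::size_t{1} << chunk_order;
    std::vector<std::uint32_t> sums;
    for (std::size_t pos = 0; pos < data.size(); pos += chunk) {
        sums.push_back(crc32_bitwise(data.substr(pos, chunk)));
    }
    return sums;
}

// Read path: index of the first block whose checksum differs from the recorded one,
// or std::nullopt when every block matches.
inline std::optional<std::size_t> first_bad_chunk(const std::string& data,
                                                  unsigned chunk_order,
                                                  const std::vector<std::uint32_t>& expected) {
    const std::vector<std::uint32_t> actual = chunk_checksums(data, chunk_order);
    const std::size_t n = std::min(actual.size(), expected.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (actual[i] != expected[i]) return i;
    }
    if (actual.size() != expected.size()) return n;  // length changed: treat as a mismatch
    return std::nullopt;
}
```

A smaller block size localizes errors more precisely at the cost of more checksum metadata per blob, which is the trade-off a configurable block size exposes.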
Referring to fig. 2, a specific implementation that combines the above steps is described below to further explain the principle of the present invention.
S101, receive a read or write request from the client;
S102, determine the type of the request; if it is a write request, go to step S103, and if it is a read request, go to step S105;
S103, calculate the checksum of the data written to each blob_t data structure to obtain the write data checksum, and record the write data checksum in the blob_t data structure;
The data structure is named blob_t. In addition to the write data checksum, the blob_t data structure records the location of the corresponding data on the physical disk, the length of the data, the check algorithm used, and the configured check block size. The corresponding data is the data written to the blob_t structure.
The relevant part of the blob_t structure is defined as follows:
struct blob_t {
  PExtentVector extents;            // location on the physical disk of the data written to the structure
  uint32_t local_length = 0;        // length of the data written to the structure
  uint8_t csum_type = CSUM_CRC32;   // check algorithm used
  uint8_t csum_chunk_order = 0;     // configured check block size
  buffer::ptr csum_data;            // recorded checksum data
};
S104, persist the blob_t data structure to the rocksdb database;
S105, fetch the data on the disk corresponding to the blob_t data structure to be read;
S106, compute a checksum over the read data to obtain the read data checksum, and compare it with the write data checksum stored when that data was written;
It should be noted that the corresponding write data checksum must be obtained before the checksums are compared. Because the write data checksum is recorded in the blob_t data structure and persisted in the rocksdb database, after the read data checksum is obtained, the blob_t data structure corresponding to the read data is fetched from the rocksdb database and the recorded write data checksum is taken from it.
In addition, this embodiment uses the rocksdb database and relies on its ACID properties to ensure that the checksum read from rocksdb is always reliable.
S108, determine whether the read data checksum and the write data checksum are consistent;
S109, if they are consistent, return the requested data to the client;
S110, if they are inconsistent, trigger the recovery process and recover the erroneous data according to the replica or erasure-code redundancy strategy.
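For the erasure-code branch of step S110, the source does not name a particular code. As a hedged illustration, the sketch below uses single-parity XOR, the simplest erasure code, and lets the write-time checksums identify which shard is damaged. `Stripe`, `make_stripe` and `verify_and_repair` are hypothetical names, and all shards are assumed to have equal length.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Bitwise CRC-32 (assumed IEEE 802.3 variant).
inline std::uint32_t crc32_bitwise(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc ^ 0xFFFFFFFFu;
}

// XOR equally sized shards together, optionally skipping one index.
inline std::string xor_shards(const std::vector<std::string>& shards,
                              std::optional<std::size_t> skip = std::nullopt) {
    std::string out(shards.empty() ? 0 : shards.front().size(), '\0');
    for (std::size_t i = 0; i < shards.size(); ++i) {
        if (skip && *skip == i) continue;
        for (std::size_t b = 0; b < out.size(); ++b) out[b] ^= shards[i][b];
    }
    return out;
}

struct Stripe {
    std::vector<std::string> shards;       // data shards followed by one parity shard
    std::vector<std::uint32_t> checksums;  // write-time checksum of every shard
};

// Write path: append an XOR parity shard and record a checksum per shard.
inline Stripe make_stripe(std::vector<std::string> data_shards) {
    Stripe s;
    std::string parity = xor_shards(data_shards);
    s.shards = std::move(data_shards);
    s.shards.push_back(std::move(parity));
    for (const std::string& shard : s.shards) s.checksums.push_back(crc32_bitwise(shard));
    return s;
}

// Read path: returns how many shards were repaired (0 or 1), or std::nullopt when
// more than one shard is damaged, which single parity cannot repair.
inline std::optional<std::size_t> verify_and_repair(Stripe& s) {
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i < s.shards.size(); ++i) {
        if (crc32_bitwise(s.shards[i]) != s.checksums[i]) bad.push_back(i);
    }
    if (bad.empty()) return std::size_t{0};
    if (bad.size() > 1) return std::nullopt;
    std::string rebuilt = xor_shards(s.shards, bad[0]);  // XOR of all other shards
    if (crc32_bitwise(rebuilt) != s.checksums[bad[0]]) return std::nullopt;
    s.shards[bad[0]] = std::move(rebuilt);
    return std::size_t{1};
}
```

The checksum does the detection and the redundancy does the repair: without the per-shard checksums, parity alone could reveal that a stripe is inconsistent but not which shard to rebuild.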
Example two
As shown in fig. 3, based on the first embodiment, the present embodiment provides a device for detecting silent data errors, which includes the following functional modules.
The checking module 101: computes a checksum over the data written to the target location to obtain a write data checksum, and computes a checksum over the data read from the target location to obtain a read data checksum;
the write data checksum saving module 102: stores the write data checksum;
the judging module 103: compares the read data checksum with the write data checksum stored when the read data was written; if the two are the same, the data is correct, and if they are different, a silent data error has occurred.
In this embodiment, the checking module 101 computes the checksum of the read and written data with a CRC or XXHASH check algorithm.
The write data checksum saving module 102 stores the write data checksum by recording it in a data structure and persisting the data structure to the rocksdb database. It should be noted that, in addition to the write data checksum, the data structure also records the location of the corresponding data on the physical disk (the client sends read and write requests to the physical disk), the length of the corresponding data, the check algorithm used, and the configured check block size.
When the judging module 103 compares checksums, it needs the corresponding write data checksum. The device therefore further includes a write data checksum acquisition module 104, which acquires the data structure of the corresponding data from the rocksdb database and acquires the required write data checksum from the acquired data structure. The ACID properties of the rocksdb database are relied on to ensure that the checksum read from the rocksdb database is always reliable.
In addition, when a silent data error occurs, the device triggers data recovery, for which a data recovery module 105 is provided: when a silent data error occurs, it recovers the erroneous data according to a replica or erasure-code storage redundancy strategy.
The above discloses only preferred embodiments of the present invention, but the present invention is not limited thereto. Any non-inventive changes that those skilled in the art can conceive of, and any modifications and refinements made without departing from the principle of the present invention, shall fall within the protection scope of the present invention.
Claims (10)
1. A method of detecting silent data errors, comprising the steps of:
when data is written to the target location, computing a checksum over the written data and storing it as the write data checksum;
when data is read from the target location, computing a checksum over the read data, and comparing the resulting read data checksum with the write data checksum stored when that data was written;
if the two are the same, the data is correct; if the two are different, a silent data error has occurred.
2. The method of detecting silent data errors according to claim 1, wherein storing the write data checksum specifically comprises:
recording the write data checksum in a data structure;
and persisting the data structure to a database.
3. The method of detecting silent data errors according to claim 2, wherein the target location is a physical disk;
the data structure also records: the location of the corresponding data on the physical disk, the length of the corresponding data, the check algorithm used, and the configured check block size.
4. The method of detecting silent data errors according to claim 2 or 3, wherein the data structure is persisted to a rocksdb database.
5. The method of detecting silent data errors according to claim 4, wherein before the read data checksum is compared with the write data checksum stored when that data was written, the method further comprises the steps of:
acquiring the data structure of the corresponding data from the database;
and acquiring the required write data checksum from the acquired data structure.
6. The method of detecting silent data errors according to claim 1, 2, 3 or 5, wherein the checksums of the written data and the read data are computed with a CRC algorithm or an XXHASH algorithm.
7. The method of detecting silent data errors according to claim 1, 2, 3 or 5, wherein when a silent data error is determined to have occurred, the erroneous data is recovered according to a replica or erasure-code redundancy strategy.
8. A device for detecting silent data errors, comprising:
a checking module: computes a checksum over the data written to the target location to obtain a write data checksum, and computes a checksum over the data read from the target location to obtain a read data checksum;
a write data checksum saving module: stores the write data checksum;
a judging module: compares the read data checksum with the write data checksum stored when the read data was written; if the two are the same, the data is correct, and if they are different, a silent data error has occurred.
9. The device for detecting silent data errors according to claim 8, wherein the write data checksum saving module stores the write data checksum by recording it in a data structure and persisting the data structure to a rocksdb database;
the device further comprises:
a write data checksum acquisition module: acquires the data structure of the corresponding data from the rocksdb database, and acquires the required write data checksum from the acquired data structure.
10. The device for detecting silent data errors according to claim 9, further comprising:
a data recovery module: when a silent data error occurs, recovers the erroneous data according to a replica or erasure-code storage redundancy strategy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010664124.5A CN111858139A (en) | 2020-07-10 | 2020-07-10 | Method and device for detecting silent data errors |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010664124.5A CN111858139A (en) | 2020-07-10 | 2020-07-10 | Method and device for detecting silent data errors |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111858139A (en) | 2020-10-30 |
Family
ID=72982945
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010664124.5A Withdrawn CN111858139A (en) | 2020-07-10 | 2020-07-10 | Method and device for detecting silent data errors |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111858139A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113377757A (en) * | 2021-06-24 | 2021-09-10 | 杭州数梦工场科技有限公司 | Data reconciliation method and device, electronic equipment and machine-readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040123202A1 (en) * | 2002-12-23 | 2004-06-24 | Talagala Nisha D. | Mechanisms for detecting silent errors in streaming media devices |
CN107807792A (en) * | 2017-10-27 | 2018-03-16 | 郑州云海信息技术有限公司 | A kind of data processing method and relevant apparatus based on copy storage system |
CN109918226A (en) * | 2019-02-26 | 2019-06-21 | 平安科技(深圳)有限公司 | A kind of silence error-detecting method, device and storage medium |
-
2020
- 2020-07-10 CN CN202010664124.5A patent/CN111858139A/en not_active Withdrawn
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113377757A (en) * | 2021-06-24 | 2021-09-10 | 杭州数梦工场科技有限公司 | Data reconciliation method and device, electronic equipment and machine-readable storage medium |
CN113377757B (en) * | 2021-06-24 | 2023-08-25 | 杭州数梦工场科技有限公司 | Data checking method and device, electronic equipment and machine-readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6629198B2 (en) | Data storage system and method employing a write-ahead hash log | |
US7640412B2 (en) | Techniques for improving the reliability of file systems | |
US7103811B2 (en) | Mechanisms for detecting silent errors in streaming media devices | |
US6535994B1 (en) | Method and apparatus for identifying and repairing mismatched data | |
US7908512B2 (en) | Method and system for cache-based dropped write protection in data storage systems | |
US10643668B1 (en) | Power loss data block marking | |
US6233696B1 (en) | Data verification and repair in redundant storage systems | |
US8572331B2 (en) | Method for reliably updating a data group in a read-before-write data replication environment using a comparison file | |
US7020805B2 (en) | Efficient mechanisms for detecting phantom write errors | |
US9727411B2 (en) | Method and processor for writing and error tracking in a log subsystem of a file system | |
CN112463724B (en) | Data processing method and system for lightweight file system | |
KR20140018393A (en) | Apparatus and methods for providing data integrity | |
KR20140013095A (en) | Apparatus and methods for providing data integrity | |
WO2021135280A1 (en) | Data check method for distributed storage system, and related apparatus | |
US6167485A (en) | On-line data verification and repair in redundant storage systems | |
US9971645B2 (en) | Auto-recovery of media cache master table data | |
CN111858139A (en) | Method and device for detecting silent data errors | |
US7577804B2 (en) | Detecting data integrity | |
CN110222035A (en) | A kind of efficient fault-tolerance approach of database page based on exclusive or check and journal recovery | |
US20160170842A1 (en) | Writing to files and file meta-data | |
CN111428280B (en) | SoC (System on chip) security chip key information integrity storage and error self-repairing method | |
CN112445432B (en) | Method and device for maintaining redundant VPD (virtual private device) in double-control system | |
US20050138526A1 (en) | Recovering track format information mismatch errors using data reconstruction | |
CN109144409B (en) | Data processing method and device, storage medium and data system | |
US10642816B2 (en) | Protection sector and database used to validate version information of user data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20201030 |