CN111858139A - Method and device for detecting silent data errors - Google Patents

Info

Publication number
CN111858139A
Authority
CN
China
Prior art keywords
data
checksum
read
silent
write
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010664124.5A
Other languages
Chinese (zh)
Inventor
张洪鑫
孟祥瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010664124.5A
Publication of CN111858139A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1004 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
    • G06F11/1076 Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • G06F11/1084 Degraded mode, e.g. caused by single or multiple storage removals or disk failures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The invention discloses a method and a device for detecting silent data errors. When data is written to a target location, a checksum of the written data is computed and saved as the write data checksum. When data is read from the target location, a checksum of the read data is computed and compared with the write data checksum saved when that data was written. If the two are the same, the data is correct; if they differ, a silent data error has occurred. This end-to-end check effectively detects whether a silent data error has occurred, avoids data inconsistency, and improves the availability and data safety of the storage system.

Description

Method and device for detecting silent data errors
Technical Field
The invention relates to the field of data verification, and in particular to a method and a device for detecting silent data errors.
Background
Silent data errors are errors that the computer components themselves cannot detect. They are often discovered only when the data is needed, by which point the data is already erroneous or corrupted and the loss may be irreparable. Their potential for harm is therefore very high: in a distributed storage system they can cause data consistency problems, which affects whether the storage system is commercially viable.
Although the probability of a silent data error in a single component is extremely low (according to a research report by the European Organization for Nuclear Research (CERN), generally on the order of 10^-7), every component can produce such errors, and a distributed system handles massive amounts of data over long periods, so the occurrence of silent data errors is almost inevitable. Conventional methods such as replication, backup and disaster recovery do not handle silent data errors well.
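The "almost inevitable" conclusion can be checked with a short calculation. This is only a sketch under stated assumptions: the figure of about 10^-7 is read here as an independent per-operation error probability p, and the operation count N = 10^8 is a hypothetical value chosen for illustration; neither reading is given by the source.

```latex
P(\text{at least one silent error in } N \text{ operations})
  = 1 - (1 - p)^{N} \approx 1 - e^{-pN}

p = 10^{-7},\quad N = 10^{8}
  \;\Rightarrow\; pN = 10,\quad P \approx 1 - e^{-10} \approx 0.99995
```

Under these assumptions, an error rate that is negligible for a single operation becomes a near certainty at the scale of a long-running distributed system.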
Disclosure of Invention
In order to solve the above problems, the present invention provides a method and a device for detecting silent data errors, so that silent data errors can be detected effectively.
The technical solution of the invention is as follows: a method for detecting silent data errors, comprising the following steps:
when data is written to a target location, computing a checksum of the written data and saving this write data checksum;
when data is read from the target location, computing a checksum of the read data, and comparing the resulting read data checksum with the write data checksum saved when that data was written;
if the two are the same, the data is correct; if they differ, a silent data error has occurred.
Further, saving the write data checksum specifically comprises:
recording the write data checksum in a data structure;
and persisting the data structure to a database.
Further, the target location of the method is a physical disk;
the data structure also records: the location of the corresponding data on the physical disk, the length of the corresponding data, the checksum algorithm used, and the configured checksum block size.
Further, the data structure is persisted to a RocksDB database.
Further, before the read data checksum is compared with the write data checksum saved when that data was written, the method further comprises the steps of:
obtaining the data structure of the corresponding data from the database;
and obtaining the required write data checksum from the obtained data structure.
Further, the checksums of the written data and the read data are computed using a CRC algorithm or an XXHASH algorithm.
Further, when a silent data error is determined to have occurred, the erroneous data is recovered according to a replica or erasure-code redundancy strategy.
The technical solution of the invention also includes a device for detecting silent data errors, comprising:
a checking module: computing a checksum of the data written to the target location to obtain the write data checksum, and computing a checksum of the data read from the target location to obtain the read data checksum;
a write data checksum saving module: saving the write data checksum;
a judging module: comparing the read data checksum with the write data checksum saved when that data was written; if the two are the same, the data is correct, and if they differ, a silent data error has occurred.
Further, the write data checksum saving module saves the write data checksum by recording it in a data structure and persisting the data structure to a RocksDB database;
the device further comprises:
a write data checksum obtaining module: obtaining the data structure of the corresponding data from the RocksDB database, and obtaining the required write data checksum from the obtained data structure.
Further, the device further comprises:
a data recovery module: when a silent data error occurs, recovering the erroneous data according to a replica or erasure-code storage redundancy strategy.
The method and the device for detecting silent data errors provided by the invention generate a checksum before data is written and another after it is read, compare the two to discover silent data errors, and preferably recover the erroneous data when a silent data error occurs. This end-to-end check effectively detects whether a silent data error has occurred, avoids data inconsistency, and improves the availability and data safety of the storage system.
Drawings
Fig. 1 is a schematic flowchart of the method according to embodiment one of the present invention.
Fig. 2 is a flowchart of a specific implementation of the method according to embodiment one of the present invention.
Fig. 3 is a schematic block diagram of the device according to embodiment two of the present invention.
Detailed Description
The present invention is described in detail below with reference to the accompanying drawings by way of specific embodiments. The following embodiments illustrate the invention, and the invention is not limited to them.
The English terms used in this description are explained below:
(1) CRC: cyclic redundancy check.
(2) XXHASH: an extremely fast, non-cryptographic hash algorithm.
(3) RocksDB: a high-performance embedded database for key-value data.
(4) ACID: the four properties that a database management system (DBMS) must have to ensure that a transaction is correct and reliable when writing or updating data: atomicity, consistency, isolation and durability.
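To make the CRC term concrete, the following is a minimal sketch of a bitwise CRC-32. The patent does not say which CRC-32 variant is used, so the choice here (reflected polynomial 0xEDB88320, the variant used by zlib and Ethernet) and the function name `crc32_ieee` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32 using the reflected polynomial 0xEDB88320.
// Assumption: the source does not specify the CRC-32 variant; this is the
// common zlib/Ethernet one. It is table-free and slow; a production system
// would normally use a table-driven or hardware-accelerated version.
inline std::uint32_t crc32_ieee(const void* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    const unsigned char* p = static_cast<const unsigned char*>(data);
    std::uint32_t crc = 0xFFFFFFFFu;  // initial value
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial when the low bit was set.
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;  // final inversion
}

// Convenience overload for std::string payloads.
inline std::uint32_t crc32_ieee(const std::string& s) {
    return crc32_ieee(s.data(), s.size());
}
```

For this variant, the checksum of the ASCII string "123456789" is the well-known check value 0xCBF43926, which is a convenient way to validate an implementation.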
Example one
This embodiment provides a method for detecting silent data errors, which generates a checksum before data is written and another after it is read, and then compares the two to discover silent data errors.
As shown in Fig. 1, the method comprises the following steps:
SS1, when data is written to the target location, computing a checksum of the written data and saving this write data checksum;
SS2, when data is read from the target location, computing a checksum of the read data, and comparing the resulting read data checksum with the write data checksum saved when that data was written;
SS3, if the two are the same, the data is correct; if they differ, a silent data error has occurred.
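Steps SS1 to SS3 can be sketched as follows. This is an illustrative model rather than the patented implementation: the class `CheckedStore`, the `ReadStatus` enum and the `corrupt` test hook are hypothetical names, two in-memory maps stand in for the physical disk and the checksum store, and CRC-32 (zlib variant) is assumed as the checksum algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Compact bitwise CRC-32 (assumed variant: reflected polynomial 0xEDB88320).
inline std::uint32_t crc32_ieee(const std::string& s) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : s) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc ^ 0xFFFFFFFFu;
}

enum class ReadStatus { Ok, SilentError, NotFound };

// Hypothetical in-memory model of a checksummed store.
class CheckedStore {
public:
    // SS1: checksum the data on write and save the write data checksum.
    void write(std::uint64_t pos, const std::string& data) {
        disk_[pos] = data;
        checksums_[pos] = crc32_ieee(data);
    }

    // SS2 + SS3: checksum the data on read and compare with the saved value.
    ReadStatus read(std::uint64_t pos, std::string& out) const {
        auto d = disk_.find(pos);
        auto c = checksums_.find(pos);
        if (d == disk_.end() || c == checksums_.end()) return ReadStatus::NotFound;
        if (crc32_ieee(d->second) != c->second) return ReadStatus::SilentError;
        out = d->second;
        return ReadStatus::Ok;
    }

    // Test hook: flip one bit on the simulated disk, leaving the checksum as is.
    void corrupt(std::uint64_t pos, std::size_t byte_index) {
        auto it = disk_.find(pos);
        assert(it != disk_.end() && byte_index < it->second.size());
        it->second[byte_index] ^= 0x01;
    }

private:
    std::map<std::uint64_t, std::string> disk_;         // stand-in for the disk
    std::map<std::uint64_t, std::uint32_t> checksums_;  // stand-in for the database
};
```

Writing a value, calling `corrupt` on it, and reading it back yields `ReadStatus::SilentError`, because a CRC detects any single-bit change in the protected data.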
Preferably, after a silent data error is found, the erroneous data can be recovered through a replica or erasure-code redundancy strategy. Through this end-to-end check, the method effectively detects and repairs silent data errors, avoids data inconsistency, and improves the availability and data safety of the storage system.
Specifically, in a concrete implementation, a data structure (named blob_t) is used to manage a fixed-size segment of physical disk space. When data is written to the disk space corresponding to a blob_t data structure, a checksum of the data to be written is calculated with a CRC or XXHASH checksum algorithm and recorded in the data structure. CRC32 or XXHASH64 is used, which balances the storage system's own requirements on collision probability and execution efficiency. The blob_t data structure is persisted to a RocksDB database, and the ACID properties of RocksDB are relied on to ensure that the checksum read back from the database is always reliable. When the data is read, a checksum of the read data is computed again with the same algorithm and compared with the checksum recorded in the blob_t data structure. If the two are inconsistent, a silent data error is confirmed, and the redundancy strategy configured in the distributed storage system can then be used to recover the disk location where the error occurred, thereby repairing the silent data error.
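The per-segment record described above can be sketched with a simplified, self-contained stand-in for blob_t. Several points are assumptions not stated in the source: the type `BlobRecord` and the helper names are hypothetical, the checksum block size is taken to be `1 << csum_chunk_order` bytes, one CRC-32 is stored per block, and the numeric code for the algorithm is invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Compact bitwise CRC-32 (assumed variant: reflected polynomial 0xEDB88320).
inline std::uint32_t crc32_ieee(const char* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<unsigned char>(data[i]);
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc ^ 0xFFFFFFFFu;
}

// Hypothetical, simplified stand-in for the blob_t record.
struct BlobRecord {
    std::vector<std::pair<std::uint64_t, std::uint32_t>> extents;  // (disk offset, length)
    std::uint32_t logical_length = 0;      // length of the data written
    std::uint8_t csum_type = 1;            // hypothetical code meaning "CRC32"
    std::uint8_t csum_chunk_order = 12;    // assumption: block size = 1 << order bytes
    std::vector<std::uint32_t> csum_data;  // one checksum per block

    std::size_t chunk_size() const { return std::size_t{1} << csum_chunk_order; }
};

// Write path: record the length and one checksum per block.
inline void fill_checksums(BlobRecord& blob, const std::string& data) {
    assert(blob.csum_chunk_order < 31);
    blob.logical_length = static_cast<std::uint32_t>(data.size());
    blob.csum_data.clear();
    const std::size_t cs = blob.chunk_size();
    for (std::size_t off = 0; off < data.size(); off += cs) {
        const std::size_t n = std::min(cs, data.size() - off);
        blob.csum_data.push_back(crc32_ieee(data.data() + off, n));
    }
}

// Read path: index of the first block that is missing or whose checksum
// differs from the recorded one, or -1 when every block matches.
inline long first_bad_chunk(const BlobRecord& blob, const std::string& data) {
    const std::size_t cs = blob.chunk_size();
    std::size_t idx = 0;
    for (std::size_t off = 0; off < data.size(); off += cs, ++idx) {
        const std::size_t n = std::min(cs, data.size() - off);
        if (idx >= blob.csum_data.size() ||
            crc32_ieee(data.data() + off, n) != blob.csum_data[idx]) {
            return static_cast<long>(idx);
        }
    }
    return idx == blob.csum_data.size() ? -1 : static_cast<long>(idx);
}
```

Checksumming per block rather than per segment is a design choice that localizes an error: only the damaged block needs to be recovered, at the cost of storing more checksum data.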
Referring to Fig. 2, a specific implementation that combines the above steps is described below to further explain the principle of the invention.
S101, receiving a read or write request from a client;
S102, determining the request type: for a write request, proceeding to step S103; for a read request, proceeding to step S105;
S103, calculating a checksum of the data written to each blob_t data structure to obtain the write data checksum, and recording the write data checksum in the blob_t data structure;
The data structure is named blob_t. In addition to the write data checksum, the blob_t data structure records the location of the corresponding data on the physical disk, the length of the data, the checksum algorithm used, and the configured checksum block size. The corresponding data is the data written to the blob_t structure.
The relevant part of the blob_t structure is defined as follows:
struct blob_t {
  PExtentVector extents;           // location on the physical disk of the data written to the structure
  uint32_t logical_length = 0;     // length of the data written to the structure
  uint8_t csum_type = CSUM_CRC32;  // checksum algorithm used
  uint8_t csum_chunk_order = 0;    // configured checksum block size
  buffer::ptr csum_data;           // recorded checksum data
};
S104, persisting the blob_t data structure to the RocksDB database;
S105, reading the data on disk that corresponds to the blob_t data structure to be read;
S106, calculating a checksum of the read data to obtain the read data checksum, and comparing it with the write data checksum saved when that data was written;
It should be noted that the corresponding write data checksum must be obtained before the comparison. Because the write data checksum is recorded in the blob_t data structure and persisted to the RocksDB database, after the read data checksum is obtained, the blob_t data structure corresponding to the read data is fetched from the RocksDB database and the recorded write data checksum is taken from it.
In addition, this embodiment uses a RocksDB database and relies on its ACID properties to ensure that the checksum read from RocksDB is always reliable.
S108, determining whether the read data checksum and the write data checksum are consistent;
S109, if they are consistent, returning the requested data to the client;
S110, if they are inconsistent, triggering the recovery process and recovering the erroneous data according to the replica or erasure-code redundancy strategy.
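The read branch of the flow (S105 to S110) with replica-based recovery might look like the sketch below. It covers only the replica strategy, not erasure codes, and it is not the patented implementation: the function `read_with_repair`, the `ReadResult` type, and the modelling of replicas as in-memory strings are all hypothetical, and CRC-32 is assumed as the checksum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Compact bitwise CRC-32 (assumed variant: reflected polynomial 0xEDB88320).
inline std::uint32_t crc32_ieee(const std::string& s) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : s) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return crc ^ 0xFFFFFFFFu;
}

struct ReadResult {
    bool ok;               // true if verified data could be returned
    std::size_t repaired;  // number of replicas rewritten during recovery
    std::string data;      // verified payload (empty when ok is false)
};

// Reads the primary replica, verifies it against the saved write data
// checksum, and on mismatch repairs every bad replica from a healthy one.
inline ReadResult read_with_repair(std::vector<std::string>& replicas,
                                   std::size_t primary,
                                   std::uint32_t write_checksum) {
    assert(primary < replicas.size());
    // S106/S108: checksum the data just read and compare.
    if (crc32_ieee(replicas[primary]) == write_checksum) {
        return ReadResult{true, 0, replicas[primary]};  // S109
    }
    // S110: look for any replica that still matches the write data checksum.
    const std::string* good = nullptr;
    for (const std::string& r : replicas) {
        if (crc32_ieee(r) == write_checksum) { good = &r; break; }
    }
    if (good == nullptr) return ReadResult{false, 0, std::string()};
    const std::string healthy = *good;  // copy before overwriting replicas
    std::size_t repaired = 0;
    for (std::string& r : replicas) {
        if (crc32_ieee(r) != write_checksum) { r = healthy; ++repaired; }
    }
    return ReadResult{true, repaired, healthy};
}
```

If no replica matches the saved checksum, the sketch reports failure instead of returning unverified data; how a real system escalates in that case is outside what the source describes.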
Example two
As shown in Fig. 3, on the basis of embodiment one, this embodiment provides a device for detecting silent data errors, which comprises the following functional modules.
Checking module 101: computing a checksum of the data written to the target location to obtain the write data checksum, and computing a checksum of the data read from the target location to obtain the read data checksum;
Write data checksum saving module 102: saving the write data checksum;
Judging module 103: comparing the read data checksum with the write data checksum saved when that data was written; if the two are the same, the data is correct, and if they differ, a silent data error has occurred.
In this embodiment, the checking module 101 computes the checksum of the read and written data with a CRC or XXHASH checksum algorithm.
The write data checksum saving module 102 saves the write data checksum by recording it in a data structure and persisting the data structure to the RocksDB database. It should be noted that, in addition to the write data checksum, the data structure also records the location of the corresponding data on the physical disk (the physical disk being the target of the client's read and write requests), the length of the corresponding data, the checksum algorithm used, and the configured checksum block size.
When the judging module 103 performs the checksum comparison, it needs the corresponding write data checksum. The device therefore further includes a write data checksum obtaining module 104: obtaining the data structure of the corresponding data from the RocksDB database, and obtaining the required write data checksum from the obtained data structure. The ACID properties of the RocksDB database are relied on to ensure that the checksum read from it is always reliable.
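The saving module 102 and the obtaining module 104 can be pictured as an encode/decode pair around a key-value store. In this sketch a `std::map` stands in for the database and no real RocksDB API is used; the `ChecksumRecord` layout, the 17-byte little-endian encoding and the key format are hypothetical choices, not taken from the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical flat record holding the fields the description lists.
struct ChecksumRecord {
    std::uint64_t disk_offset = 0;  // location of the data on the physical disk
    std::uint32_t length = 0;       // length of the data
    std::uint8_t algorithm = 0;     // checksum algorithm code (hypothetical)
    std::uint32_t checksum = 0;     // write data checksum
};

// Append the low `nbytes` bytes of v in little-endian order.
inline void put_le(std::string& out, std::uint64_t v, int nbytes) {
    for (int i = 0; i < nbytes; ++i)
        out.push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
}

// Read `nbytes` little-endian bytes starting at pos.
inline std::uint64_t get_le(const std::string& in, std::size_t pos, int nbytes) {
    assert(pos + static_cast<std::size_t>(nbytes) <= in.size());
    std::uint64_t v = 0;
    for (int i = 0; i < nbytes; ++i)
        v |= static_cast<std::uint64_t>(static_cast<unsigned char>(in[pos + i])) << (8 * i);
    return v;
}

inline std::string encode(const ChecksumRecord& r) {
    std::string out;
    put_le(out, r.disk_offset, 8);
    put_le(out, r.length, 4);
    put_le(out, r.algorithm, 1);
    put_le(out, r.checksum, 4);
    return out;  // 8 + 4 + 1 + 4 = 17 bytes
}

inline bool decode(const std::string& in, ChecksumRecord& r) {
    if (in.size() != 17) return false;  // reject malformed values
    r.disk_offset = get_le(in, 0, 8);
    r.length = static_cast<std::uint32_t>(get_le(in, 8, 4));
    r.algorithm = static_cast<std::uint8_t>(get_le(in, 12, 1));
    r.checksum = static_cast<std::uint32_t>(get_le(in, 13, 4));
    return true;
}

// Stand-in for the key-value database.
using KvStore = std::map<std::string, std::string>;

// Saving module: persist the record under a key.
inline void persist(KvStore& db, const std::string& key, const ChecksumRecord& r) {
    db[key] = encode(r);
}

// Obtaining module: fetch and decode the record; false if absent or malformed.
inline bool load(const KvStore& db, const std::string& key, ChecksumRecord& r) {
    auto it = db.find(key);
    return it != db.end() && decode(it->second, r);
}
```

A fixed-width, explicitly little-endian encoding keeps the stored value independent of the host's byte order and struct padding, which matters once records outlive the process that wrote them.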
In addition, when a silent data error occurs, the device triggers data recovery, for which a data recovery module 105 is provided: when a silent data error occurs, recovering the erroneous data according to a replica or erasure-code storage redundancy strategy.
The above discloses only preferred embodiments of the present invention, and the invention is not limited to them. Any non-inventive change that a person skilled in the art can conceive of, and any modification or refinement made without departing from the principle of the invention, shall fall within the protection scope of the present invention.

Claims (10)

1. A method for detecting silent data errors, comprising the following steps:
when data is written to a target location, computing a checksum of the written data and saving this write data checksum;
when data is read from the target location, computing a checksum of the read data, and comparing the resulting read data checksum with the write data checksum saved when that data was written;
if the two are the same, the data is correct; if they differ, a silent data error has occurred.
2. The method for detecting silent data errors according to claim 1, wherein saving the write data checksum specifically comprises:
recording the write data checksum in a data structure;
and persisting the data structure to a database.
3. The method for detecting silent data errors according to claim 2, wherein the target location is a physical disk;
the data structure also records: the location of the corresponding data on the physical disk, the length of the corresponding data, the checksum algorithm used, and the configured checksum block size.
4. The method for detecting silent data errors according to claim 2 or 3, wherein the data structure is persisted to a RocksDB database.
5. The method for detecting silent data errors according to claim 4, wherein before the read data checksum is compared with the write data checksum saved when that data was written, the method further comprises the steps of:
obtaining the data structure of the corresponding data from the database;
and obtaining the required write data checksum from the obtained data structure.
6. The method for detecting silent data errors according to claim 1, 2, 3 or 5, wherein the checksums of the written data and the read data are computed using a CRC algorithm or an XXHASH algorithm.
7. The method for detecting silent data errors according to claim 1, 2, 3 or 5, wherein when a silent data error is determined to have occurred, the erroneous data is recovered according to a replica or erasure-code redundancy strategy.
8. A device for detecting silent data errors, comprising:
a checking module: computing a checksum of the data written to the target location to obtain the write data checksum, and computing a checksum of the data read from the target location to obtain the read data checksum;
a write data checksum saving module: saving the write data checksum;
a judging module: comparing the read data checksum with the write data checksum saved when that data was written; if the two are the same, the data is correct, and if they differ, a silent data error has occurred.
9. The device for detecting silent data errors according to claim 8, wherein the write data checksum saving module saves the write data checksum by recording it in a data structure and persisting the data structure to a RocksDB database;
the device further comprises:
a write data checksum obtaining module: obtaining the data structure of the corresponding data from the RocksDB database, and obtaining the required write data checksum from the obtained data structure.
10. The device for detecting silent data errors according to claim 9, further comprising:
a data recovery module: when a silent data error occurs, recovering the erroneous data according to a replica or erasure-code storage redundancy strategy.
CN202010664124.5A 2020-07-10 2020-07-10 Method and device for detecting silent data errors Withdrawn CN111858139A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010664124.5A CN111858139A (en) 2020-07-10 2020-07-10 Method and device for detecting silent data errors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010664124.5A CN111858139A (en) 2020-07-10 2020-07-10 Method and device for detecting silent data errors

Publications (1)

Publication Number Publication Date
CN111858139A true CN111858139A (en) 2020-10-30

Family

ID=72982945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010664124.5A Withdrawn CN111858139A (en) 2020-07-10 2020-07-10 Method and device for detecting silent data errors

Country Status (1)

Country Link
CN (1) CN111858139A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040123202A1 (en) * 2002-12-23 2004-06-24 Talagala Nisha D. Mechanisms for detecting silent errors in streaming media devices
CN107807792A (en) * 2017-10-27 2018-03-16 郑州云海信息技术有限公司 A kind of data processing method and relevant apparatus based on copy storage system
CN109918226A (en) * 2019-02-26 2019-06-21 平安科技(深圳)有限公司 A kind of silence error-detecting method, device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377757A (en) * 2021-06-24 2021-09-10 杭州数梦工场科技有限公司 Data reconciliation method and device, electronic equipment and machine-readable storage medium
CN113377757B (en) * 2021-06-24 2023-08-25 杭州数梦工场科技有限公司 Data checking method and device, electronic equipment and machine-readable storage medium

Similar Documents

Publication Publication Date Title
US6629198B2 (en) Data storage system and method employing a write-ahead hash log
US7640412B2 (en) Techniques for improving the reliability of file systems
US7103811B2 (en) Mechanisms for detecting silent errors in streaming media devices
US6535994B1 (en) Method and apparatus for identifying and repairing mismatched data
US7908512B2 (en) Method and system for cache-based dropped write protection in data storage systems
US10643668B1 (en) Power loss data block marking
US6233696B1 (en) Data verification and repair in redundant storage systems
US8572331B2 (en) Method for reliably updating a data group in a read-before-write data replication environment using a comparison file
US7020805B2 (en) Efficient mechanisms for detecting phantom write errors
US9727411B2 (en) Method and processor for writing and error tracking in a log subsystem of a file system
CN112463724B (en) Data processing method and system for lightweight file system
KR20140018393A (en) Apparatus and methods for providing data integrity
KR20140013095A (en) Apparatus and methods for providing data integrity
WO2021135280A1 (en) Data check method for distributed storage system, and related apparatus
US6167485A (en) On-line data verification and repair in redundant storage systems
US9971645B2 (en) Auto-recovery of media cache master table data
CN111858139A (en) Method and device for detecting silent data errors
US7577804B2 (en) Detecting data integrity
CN110222035A (en) A kind of efficient fault-tolerance approach of database page based on exclusive or check and journal recovery
US20160170842A1 (en) Writing to files and file meta-data
CN111428280B (en) SoC (System on chip) security chip key information integrity storage and error self-repairing method
CN112445432B (en) Method and device for maintaining redundant VPD (virtual private device) in double-control system
US20050138526A1 (en) Recovering track format information mismatch errors using data reconstruction
CN109144409B (en) Data processing method and device, storage medium and data system
US10642816B2 (en) Protection sector and database used to validate version information of user data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20201030