CN101599305B - Storage system with data repair function and data repair method thereof

Info

Publication number: CN101599305B
Application number: CN 200810109903
Authority: CN
Inventors: 陈明达; 林传生; 谢祥安; 张惠能
Original assignee: A Data Technology Co Ltd
Current assignee: A Data Technology Co Ltd
Priority date: 2008-06-04
Filing date: 2008-06-04
Publication date: 2013-03-27
Anticipated expiration: 2028-06-04
Also published as: CN101599305A

Abstract

The invention provides a storage system with a data repair function and a data repair method thereof. One or more repeated test and repair passes reduce the errors in a memory medium to a range that commonly used error detection and correction (ECC) functions can repair, which ensures that data is read correctly and improves data reliability. In the preferred mode, a test data generator in the storage system provides test data, the test data is written into the memory block that has data errors, and the data is read back to find the error bits; a repair procedure then corrects the error bits so that the remaining errors fall within the range that ECC can repair. If the number of tests exceeds an upper limit and the error bits still cannot be found, or the errors cannot be reduced to the range that ECC can repair, the memory block is marked as a damaged block.

Description

Storage system with data repair function and data repair method thereof

Technical Field

The present invention relates to a storage system with a data repair function and a data repair method thereof, and more particularly to a method that detects memory errors and corrects data by writing test data one or more times to the location where erroneous data is stored.

Background

Flash memory has been widely used in information storage devices because of its high access speed, low power consumption, small size, and shock resistance, which are superior to those of conventional hard disks.

Because of its structure, flash memory is susceptible to high-voltage disturbance and to aging or damage of its memory cells, any of which can corrupt the stored data. For example, a memory cell that was originally at a high potential may be read by the controller as a low potential, or a cell that was originally at a low potential may be read as a high potential.

To prevent errors in data stored in flash memory and to improve the reliability of the stored data, the prior art mainly uses error checking and correction (ECC) technology to detect and correct erroneous data.

Briefly describing the ECC technique:

When data is written into the flash memory, an ECC unit in the memory controller computes an ECC code for the data, and the ECC code is stored in the flash memory together with the data. When data is read, the controller reads both the data and its ECC code, and the ECC unit first checks for error bits. If no error bit is found, the controller outputs the data. If error bits are found and they are within the range that the ECC can correct, the data is corrected and then output. If the detected error bits exceed the range that the ECC can correct, the controller reports a data read error.
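The read path described above has three outcomes: clean data, corrected data, or a read error. A minimal Python sketch of that behaviour follows, using an extended Hamming (8,4) code as a stand-in for the controller's ECC unit. The choice of code and the names `encode` and `read_with_ecc` are assumptions made for illustration; the text does not say which ECC a real controller uses, and practical flash controllers protect whole pages with stronger codes.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit word: Hamming(7,4) plus overall parity."""
    d = [(nibble >> shift) & 1 for shift in (3, 2, 1, 0)]  # d1..d4
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 5, 6, 7
    word = [p1, p2, d[0], p4, d[1], d[2], d[3]]  # positions 1..7
    overall = 0
    for bit in word:
        overall ^= bit
    return word + [overall]  # position 8 is the overall parity bit


def read_with_ecc(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'read_error'."""
    bits = list(word)
    syndrome = 0
    for position in range(1, 8):
        if bits[position - 1]:
            syndrome ^= position
    overall = 0
    for bit in bits:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"  # no error bit found
    elif overall == 1:
        if syndrome:
            bits[syndrome - 1] ^= 1  # single error: flip the indicated bit
        status = "corrected"  # syndrome 0 means only the parity bit was hit
    else:
        return "read_error", None  # double error: beyond this code's range
    nibble = bits[2] << 3 | bits[4] << 2 | bits[5] << 1 | bits[6]
    return status, nibble
```

With this toy code, one flipped bit is within the correctable range and two flipped bits are detected but not correctable, which corresponds to the read error case that the invention addresses.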

For the detection and repair of data stored in flash memory, reference is made to U.S. Patent Publication No. 20040230879, "Apparatus and method for resetting to data storage loss in a non-volatile memory unit using error checking and correction detection techniques" (published November 18, 2004), and U.S. Pat. No. 6,785,856, "Internal self-test circuit for architectural array" (published August 31, 2004). The former provides a method for detecting and repairing flash memory errors within the correction capability of the ECC; the latter provides a test circuit in the storage system to detect memory errors, as shown in FIG. 1.

FIG. 1 shows the memory device 100 with a self-tester 104 proposed in U.S. Pat. No. 6,785,856. A memory cell 102 is connected to the self-tester 104, which contains an error detection and correction (ECC) circuit 106, a self-test circuit 108 and a register 110; the memory cell 102 is divided into a plurality of memory pages. The self-tester 104 is responsible for detecting and correcting errors in the memory. The ECC circuit 106 detects errors in each memory page in groups, for example using a cyclic redundancy check (CRC) method based on the Reed-Solomon algorithm. The self-test circuit 108 counts the errors and stores the count in the register 110, and can further use different data to test the read/write status of each memory cell for data correction.

However, in the above prior art, a data read error still occurs if the number of error bits exceeds the range that the ECC can correct.

Disclosure of Invention

The invention provides a data repair technique for a memory medium. It corrects errors in the memory medium through one or more test passes so that the number of error bits is reduced to the range that the ECC can correct, and it marks the erroneous memory block as a damaged block if the number of tests exceeds an upper limit, thereby improving the data reliability of the memory medium.

The storage system with a data repair function provided by the invention mainly comprises a control unit and a memory unit, wherein the control unit comprises a test data generator, a comparison unit, a repair unit, a data buffer and an ECC unit.

The test data generator provides test data. The comparison unit compares the test data that the test data generator wrote into the memory unit with the data read back from the memory unit to judge whether the storage space has error bits. The data buffer temporarily stores the erroneous data in the memory unit that ECC cannot correct. The repair unit corrects the bits in the data buffer that correspond to the error bits, according to the error bit information provided by the comparison unit. The ECC unit performs error detection and correction on data during normal read and write operations and, in addition, performs error checking and correction on the data in the data buffer.

In the preferred embodiment of the data repair method in the storage system, when one or more pages of a first memory block in the memory unit contain erroneous data that ECC cannot correct, the data of the first memory block is copied to a redundant second memory block and the first memory block is erased. The test data generator in the control unit then provides test data, which is written into the first memory block. The data in the pages of the first memory block that had data errors is read back and compared with the test data to check whether the two differ.

If no error bit is found, the method continues with the next test pass: the test data generator generates another, different set of test data, and a second test step is executed to detect error bits on the erroneous memory pages of the first memory block. If the test step has been repeated more than an upper limit number of times and the error bits still cannot be found, the first memory block is marked as a damaged block. If error bits are found after the first or a later test pass, the data corresponding to the error bits in the data buffer is corrected so that the remaining errors fall within the range that ECC can repair. If the data still cannot be repaired by ECC at this point, the next test pass is carried out to further correct the erroneous data in the data buffer and reduce the error bits to the range that ECC can repair.

Compared with the prior art, the invention has the following beneficial effects:

The invention detects and corrects the error bits of data by writing test data to the location where the erroneous data is stored, which reduces the number of error bits in the data to the range that the ECC can repair. When the ECC cannot repair the erroneous data, the errors in the memory medium are corrected through one or more test passes, and the erroneous memory block is marked as a damaged block if the number of tests exceeds an upper limit, thereby ensuring the correctness of data reading and improving data reliability.

Drawings

FIG. 1 is a block diagram of a prior art self-healing memory device;

FIG. 2 is a functional block diagram of a storage system according to the present invention;

FIG. 3A to 3I are schematic diagrams illustrating a state of interaction between memory blocks according to the present invention;

FIG. 4 is a flowchart illustrating the data recovery process according to a preferred embodiment of the present invention;

FIG. 5 is a flowchart illustrating the data repair process according to another preferred embodiment of the present invention;

FIG. 6 shows steps of a retest process performed during data repair;

FIG. 7A to 7H are schematic diagrams illustrating another state of interaction between memory blocks according to the present invention.

Wherein:

Memory device 100; memory cell 102

Self-tester 104; ECC circuit 106

Self-test circuit 108; register 110

Storage system 20; control unit 22

Memory unit 24; test data generator 221

Comparison unit 223; repair unit 225

Data buffer 227; ECC unit 229

Detailed Description

The invention provides a storage system with a data repair function and a data repair method thereof. In particular, when the commonly used error detection and correction (ECC) technology cannot repair the error bits of stored data, the invention detects the error bits and then repairs them, so that data is read correctly and data reliability is improved.

In the prior art, to ensure data reliability, the control unit uses the ECC function to check and repair erroneous data when reading from the memory. However, the error repair capability of ECC is limited, and a data read error still occurs if the error bits of the data exceed the correction range of the ECC. The storage system of the present invention therefore detects and corrects the error bits by writing test data to the location where the erroneous data is stored, reducing the number of error bits to the range that the ECC can repair. In this way the correct data can still be read when the ECC alone cannot repair the erroneous data.

Referring to FIG. 2, a functional block diagram of the storage system of the present invention, the storage system 20 includes a control unit 22 and a non-volatile memory unit 24 that are electrically connected. The control unit 22 includes a test data generator 221, a comparison unit 223, a repair unit 225, a data buffer 227 and an error detection and correction (ECC) unit 229.

The test data generator 221 generates test data and writes it into the memory unit 24. The test data may be all "0" data such as 0x00, all "1" data such as 0xFF, data with alternating "0" and "1" bits such as 0x55 or 0xAA, or randomly generated binary data. The comparison unit 223 compares the test data written into the memory unit 24 (which in this embodiment can be a non-volatile memory such as a flash memory) by the test data generator 221 with the data read back from the memory unit 24, determines whether the memory unit 24 has error bits, and determines the error bit information, such as the address of each error bit in the memory unit.

For example, when the test data written into the memory unit 24 is all "0" data and one bit of the data read from the memory unit 24 is "1", that bit is an error bit. The data buffer 227 temporarily stores the erroneous data in the memory unit 24 that ECC cannot correct. The repair unit 225 corrects the bits in the data buffer 227 that correspond to the error bits, according to the error bit information provided by the comparison unit 223. The data buffer 227 may be a random access memory (RAM), a non-volatile memory (NVM), a phase change memory (PCM), a ferroelectric random access memory (ferroelectric RAM) or a magnetic random access memory (MRAM).

The ECC unit 229 performs error detection and correction on the data in the memory unit 24 during normal read and write operations and, in addition, performs error checking and correction on the data in the data buffer 227.
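The roles of the test data generator and the comparison unit can be illustrated with a short Python sketch. The function names `make_test_data` and `find_error_bits`, the pattern names, and the (byte index, bit index) form of the error bit information are assumptions chosen for the example; the text only lists the kinds of test data and says that the address of each error bit is reported.

```python
import random


def make_test_data(kind, size, seed=0):
    """Return `size` bytes of test data of the named kind."""
    if kind == "zeros":
        return bytes([0x00] * size)
    if kind == "ones":
        return bytes([0xFF] * size)
    if kind == "alt55":
        return bytes([0x55] * size)
    if kind == "altAA":
        return bytes([0xAA] * size)
    if kind == "random":
        rng = random.Random(seed)  # seeded so the same data can be regenerated
        return bytes(rng.randrange(256) for _ in range(size))
    raise ValueError(f"unknown test data kind: {kind}")


def find_error_bits(written, read_back):
    """Compare written and read-back data; return (byte_index, bit_index) pairs."""
    errors = []
    for byte_index, (w, r) in enumerate(zip(written, read_back)):
        diff = w ^ r  # set bits mark positions where the two values disagree
        for bit in range(8):
            if diff & (1 << bit):
                errors.append((byte_index, bit))
    return errors
```

For instance, if all "0" data is written and byte 2 reads back as 0x10, `find_error_bits` reports `[(2, 4)]`, matching the all "0" example in the text.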

Fig. 3A to 3I illustrate the interaction state between the memory blocks according to the present invention, and mainly illustrate the flow of checking and correcting the error bits by writing test data.

First, the erroneous data is discovered. FIG. 3A shows three blocks of the memory unit: block A, block B and block C. When the control unit reads block A and finds a memory page containing erroneous data that ECC cannot correct (here the first page, marked "defect" in the drawing), the control unit copies the data in block A to block B. The data of the first page marked "defect" is not processed by ECC but is copied unchanged to the first page of block B; the other pages are copied from block A to block B through the ECC function.

Next, the data is copied and the block is erased. As shown in FIG. 3B, after the second page (marked "original data") is copied successfully, the third page of block A is found to be defective and is copied unchanged into the third page of block B. After the fourth page is copied successfully, the error bits of the fifth page are found to exceed the correctable range of the ECC; the page is marked "defect" and is copied unchanged into the fifth page of block B. Following this copying rule, all the data in block A is copied to block B, and block A is then erased.
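The copying rule above can be sketched in Python as follows. The function `copy_block` and the callback `try_ecc` are hypothetical names introduced for this example: `try_ecc` stands in for the ECC unit and is assumed to return the corrected page, or `None` when the errors exceed what ECC can correct.

```python
def copy_block(source_pages, try_ecc):
    """Copy every page; pages ECC cannot fix are copied raw and reported.

    try_ecc(raw) is a hypothetical callback returning corrected bytes,
    or None when the errors exceed what ECC can correct.
    """
    dest_pages = []
    defect_pages = []
    for index, raw in enumerate(source_pages):
        corrected = try_ecc(raw)
        if corrected is None:
            dest_pages.append(raw)        # ECC bypassed: copy the page unchanged
            defect_pages.append(index)    # remember which pages need testing
        else:
            dest_pages.append(corrected)  # normal copy through the ECC function
    return dest_pages, defect_pages
```

The returned list of defect page indices corresponds to the pages marked "defect" in FIG. 3B, which are the pages examined in the later test steps.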

Next, test data is written and error bits are detected. After block A is erased, the test data generator provides first test data, which is written into every page of block A; as shown in FIG. 3C, each page is marked "sample 1". The control unit then reads the first, third and fifth pages of block A and compares the data read with the first test data to detect error bits. If error bits are found in a memory page, the data of the corresponding page in block B is copied to the data buffer of the control unit and corrected according to the error bits. After the erroneous data is corrected, the control unit finds another block in the flash memory, such as block C, and stores the corrected data from the data buffer, together with the data that needs no correction, into block C.

As shown in FIG. 3D, the control unit compares the data in the first page of block A with the test data. If an error bit is found, it reads the first page of block B back into the data buffer, corrects the data in the data buffer according to the error bit information, and writes the corrected data into block C, where the page is marked "corrected". The second page of block B holds correct data, marked "original data", and is copied directly to block C.

The control unit then compares the data in the third page of block A, which also had a data error. If an error bit is found, it reads the third page of block B back into the data buffer, corrects the error bit there, and writes the corrected data into block C, as in the third page marked "corrected".

The fourth page of block B holds correct data and is copied directly to block C, marked "original data". The control unit then compares the data in the fifth page of block A; if an error bit is found, it reads the fifth page of block B back into the data buffer, corrects the error bit there, and writes the corrected data into block C. In this way, the originally correct data and the corrected data of block B are all written to block C.

Then, as shown in FIG. 3E, after all the data has been written into block C, block B is erased. The first, third and fifth pages in block C have all been corrected, with the error bits of each page reduced to the range that the ECC can correct, and block A is marked as a damaged block.

However, after the first test data is written into block A, the error bits may not be found, or they may be found but the correction may not reduce them to the range that the ECC can correct.

FIG. 3F shows block A holding the first test data, labeled "sample 1"; block B holding the erroneous data, labeled "defect", and the original data, labeled "original data"; and block C holding the corrected memory pages (labeled "corrected"), the original data (labeled "original data"), and the data that is still erroneous (labeled "defect"), either because the correction has not yet reached the range that ECC can repair or because no error bits were found.

After the control unit writes the first test data into block A, assume that the error bits of the first page in block A are detected successfully. The data of the first page in block B is read back and corrected, which reduces the error bits to the range that the ECC can repair, and the corrected data is written into block C. The second page of block B is then copied to block C.

When the first test data is used to detect the error bits of the third page in block A, either the comparison unit finds no error bits, or it finds error bits but the number of error bits cannot be reduced to the range that the ECC can repair even after correction. In the former case (no error bit found), the control unit copies the data in the third page of block B into block C; in the latter case (error bits found), the partially repaired data in the data buffer of the control unit is copied to block C. Either way, the page is marked "defect" in block C. The fourth page of block B is then copied to block C.

The error bits of the fifth page in block A are then detected. Assuming that the erroneous data of the fifth page in block B can be repaired and its error bits reduced to the range that ECC can correct, the data of that page is read into the data buffer and corrected, and the repaired data is written into block C. The next page of block B is then copied to block C.

After all the data has been written into block C in FIG. 3F, block A and block B are erased. Referring to FIG. 3G, the test data generator then generates second test data (labeled "sample 2") and writes it into block A in order to test the error bits in the third page of block A. As shown in FIG. 3H, the data in the first and second pages of block C is copied directly to another block in the flash memory, block D. The control unit then detects the error bits of the third page of block A using the second test data. If the error bits are found, it reads the third page of block C back into the data buffer and corrects the data there according to the error bits; assuming that the correction reduces the error bits to the range that the ECC can correct, it writes the corrected data into block D. The next page of block C is then copied to block D. Finally, as shown in FIG. 3I, after all the data has been written into block D, block C is erased and block A is marked as a damaged block.

Based on the storage system architecture and the interaction between memory blocks described above, the data repair method comprises at least the following steps. First, it detects that a memory block contains erroneous data that error detection and correction (ECC) cannot correct. The data in the memory block is then copied to a temporary storage space, which may be another redundant memory block or a random access memory (RAM), and the memory block is erased. The locations of the error bits are then found by the test procedure, that is, by comparing the written test data with the data read back. The data at the corresponding bit positions in the temporary storage space is corrected according to the error bit information, and it is judged whether the corrected data still contains errors that ECC cannot repair; if so, one or more further test passes reduce the number of error bits until it reaches the range that ECC can handle, so that error detection and correction can be executed. Finally, the erroneous memory block is marked as a damaged block to prevent the error from recurring and to improve the reading efficiency of the storage medium. The preferred embodiment is the process shown in FIG. 4.

In step S401, the control unit in the storage system reads a physical block of the memory unit (e.g., a flash memory), referred to as the first memory block, and detects that one or more of its pages contain erroneous data that ECC cannot correct. The data in the original block is then copied to a temporary storage space (such as the aforementioned data buffer, which may be RAM, NVM, PCM, RRAM or MRAM). In step S403, the control unit uses an unused redundant block, referred to as the second memory block, and copies the data in the first memory block to it; it then erases the first memory block in step S405. Note that while the data of the first memory block is being copied to the second memory block, the control unit directs the ECC unit to stop performing error detection and repair on the erroneous pages, so that the transfer is not affected by the ECC and the data of those pages is copied to the second memory block unchanged.

Next, a test procedure is performed. The test data generator in the control unit generates test data (the first test data if this is the first test pass) and writes it into the first memory block (step S407), as with the data "sample 1" in block A in FIG. 3C. The control unit then reads the data in the erroneous memory pages of the first memory block (step S409), and the comparison unit compares the first test data provided by the test data generator with the data read from those pages (step S411) to check whether the two differ.

In step S413, the comparison result is used to determine whether there is an error bit, that is, a faulty hardware address. If corresponding bits of the two data differ, that bit position of the memory page is an error bit; if no corresponding bits differ, no error bit has been found.

If the comparison unit detects no error bit, the method continues with another test pass (step S415). The test data generator generates another, different set of test data (second test data), and a second test step is performed: the second test data is likewise written into the first memory block, including the erroneous memory pages, and detection of error bits on those pages continues.

An upper limit (greater than 1) is set on the number of test passes in the test loop. If the error bits still cannot be found after that number of tests, the control unit returns a read error message, and the first memory block is then marked as a damaged block (bad block).

However, if error bits are found after the first or a later test pass (comprising steps S405, S407, S409, S411 and S413 above), the error bit information is transmitted to the repair unit, and the control unit copies the data of the corresponding erroneous page in the second memory block to the data buffer (step S417).

The repair unit then corrects the data in the data buffer that corresponds to the error bits, according to the error bit information (step S419). For example, if the error bits are the first and third bits and the page data in the data buffer is 11110101, each erroneous bit is inverted (1 to 0 or 0 to 1), giving 01010101. After the data in the data buffer is corrected, the ECC unit in the control unit performs error detection and repair on it, and it is determined whether the remaining errors are within the range that the ECC unit can repair (step S421). If the number of error bits in the data buffer still exceeds that range, that is, the ECC unit cannot repair the errors it detects in the data buffer, the test data generator generates another set of test data, which is written into the erroneous memory page of the first memory block (step S415). Detection of error bits in that page continues so that the erroneous data in the data buffer can be corrected further and the error bits reduced to the range that the ECC can correct.

Similarly, if the number of test passes exceeds the upper limit and the error bits in the data buffer still cannot be reduced to the range that ECC can correct, the control unit returns a read error message and then marks the erroneous first memory block as a damaged block.
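The bit inversion performed in step S419 can be written out as a few lines of Python. The helper name `invert_bits` is an assumption for this example, and positions are counted from the left starting at 1, which is the convention under which the 11110101 example in the text yields 01010101.

```python
def invert_bits(page_bits, error_positions):
    """Invert the bits of a '0'/'1' string at the given 1-based positions."""
    bits = list(page_bits)
    for position in error_positions:
        index = position - 1  # positions count from the left, starting at 1
        bits[index] = "0" if bits[index] == "1" else "1"
    return "".join(bits)


repaired = invert_bits("11110101", [1, 3])
print(repaired)  # prints 01010101
```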

If the data in the data buffer has been repaired and the number of error bits reduced to the range that the ECC unit can repair, in other words the ECC unit can repair the errors it detects in the data buffer, the control unit finds another redundant block in the memory unit, referred to as the third memory block (e.g., block C in FIG. 3E). It writes the corrected data in the data buffer into the third memory block at the position of the erroneous memory page, and copies the other memory pages of the third memory block from the corresponding pages of the second memory block (step S423). When the data of the third memory block has been copied, the control unit erases the second memory block (step S425) and marks the first memory block as a damaged block (step S427).

Therefore, even if a memory block has multiple memory pages whose erroneous data ECC cannot repair, the storage system of the present invention can still detect and repair the error bits by writing test data according to the above process.
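The test loop of FIG. 4 for a single page can be sketched as a small simulation. Everything here is an illustrative assumption rather than the patented implementation: `SimulatedPage` models faulty cells as stuck at a fixed value, `ecc_can_repair` is a hypothetical callback standing in for the check in step S421, and the sketch keeps track of bits it has already inverted so that a later test pass does not invert them back (the text does not say how repeated detections are handled). The erase between passes (step S405) is omitted because the simulated write overwrites the page.

```python
class SimulatedPage:
    """Toy flash page in which some cells are stuck at a fixed value."""

    def __init__(self, stuck):
        self.stuck = dict(stuck)  # bit index -> value the cell is stuck at
        self.cells = []

    def write(self, bits):
        # Stuck cells ignore the written value.
        self.cells = [self.stuck.get(i, b) for i, b in enumerate(bits)]

    def read(self):
        return list(self.cells)


def repair_page(page, raw_copy, ecc_can_repair, patterns, max_tests):
    """Return the repaired buffer, or None if the test limit is reached."""
    buffer = list(raw_copy)   # data buffer holding the uncorrected copy
    already_inverted = set()  # assumption: never invert the same bit twice
    for pattern in patterns[:max_tests]:
        page.write(pattern)                       # S407: write test data
        read_back = page.read()                   # S409: read it back
        errors = [i for i, (w, r) in enumerate(zip(pattern, read_back))
                  if w != r]                      # S411/S413: compare
        new_errors = [i for i in errors if i not in already_inverted]
        if not new_errors:
            continue                              # S415: try other test data
        for i in new_errors:                      # S417/S419: correct buffer
            buffer[i] ^= 1
            already_inverted.add(i)
        if ecc_can_repair(buffer):                # S421: within ECC range?
            return buffer                         # caller performs S423-S427
    return None  # caller reports a read error and marks the block as damaged
```

In this model an all "0" pattern exposes cells stuck at 1 and an all "1" pattern exposes cells stuck at 0, which is why more than one test pass can be needed before the remaining errors fall within the ECC range.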

Another embodiment is shown in the flowchart of the data repair method in FIG. 5. In this example, the control unit reads one of the physical blocks of the memory unit (e.g., the flash memory), referred to as the first memory block, and finds that a page of the block contains erroneous data that ECC cannot correct (step S501). The data in the original block must then be copied into a temporary storage space: here, the control unit copies all the data in the first memory block into a data buffer (step S503), such as a memory space in the storage system or a memory of a computer system connected to the storage system, and then erases the first memory block (step S505). While the data of the first memory block is being copied to the data buffer, the control unit directs the ECC unit to stop performing error detection and repair on the memory pages whose errors ECC cannot correct, so that the transfer is not affected by the ECC and the data in the erroneous memory pages of the first memory block is copied to the data buffer unchanged.

A test procedure is then performed. In step S507, the test data generator provides test data, which is written into the first memory block, including the erroneous memory page. The control unit reads the data of the erroneous memory page in the first memory block (step S509) and passes it to the comparison unit (step S511), which compares the test data provided by the test data generator with the data read from the memory page and checks whether the two differ. Step S513 determines whether there is an error bit, which gives two cases:

First case:

If corresponding bits of the two data differ, the first memory block is determined to have an error bit, and the data recorded at that bit must be corrected; that is, the data written into the temporary storage area in step S503 is corrected.

Second case:

If no corresponding bits of the two data differ, no error bit has been found, and the next test pass is performed (step S515): the test data generator generates another set of test data, which is written into the first memory block, including the erroneous memory page, to continue detecting error bits on that page. This repeats steps S505, S507, S509, S511 and S513.

In the second case, if the number of test data sets provided by the test data generator exceeds a preset upper limit after the above steps have been repeated several times and the error bits of the memory page in the first memory block still cannot be found, the control unit finds a redundant block in the memory unit, referred to as the second memory block, and writes the block data in the data buffer into it. The control unit then reports a read error and marks the first memory block as a damaged block, as described in FIG. 6.

In the first case, when the comparing unit finds the error bit, the information of the error bit is transmitted to the repairing unit, then, the repair unit corrects the bit data corresponding to the error bit on the block data page in the data register according to the information of the error bit (step S517), after correcting the data in the data register, the ECC unit performs error detection and repair procedures on the data in the block data memory page in the data register, and it is determined whether the ECC is within the repairable range (step S519), if the bit number of the error data of the memory page (the memory page with the error in the original first memory block) in the data buffer still exceeds the repairable range of the ECC unit, then, the test data generator generates another set of test data to be written into the page of the first bank, and continues the detection and correction process, as described in step S515.

If, after a predetermined number of test procedures, the error bits on the memory page in the data buffer still cannot be reduced to the range the ECC can correct, the control unit finds a redundant block in the memory unit (assumed here to be a second memory block), writes the block data in the data buffer into the second memory block, returns the read error information, and marks the first memory block, in which the error occurred, as a damaged block (bad block), as described in FIG. 6.

If, after the data in the data buffer has been repaired, the number of erroneous bits in the data buffer is reduced to the range the ECC unit can repair, the control unit finds a redundant block in the memory unit (assumed here to be the second memory block) and writes the data in the data buffer into that block (step S521). After all the data has been copied to the second memory block, the control unit marks the first memory block, in which the data error occurred, as a damaged block (step S523).

Further, if the storage system encounters a block in which a plurality of memory pages have errors that cannot be corrected by ECC, the storage system of the present invention copies all the data in the block into the data buffer and then performs the above error detection and repair procedure, one page at a time, on each memory page whose errors cannot be corrected by ECC. When the error bits of every such page in the data buffer have been reduced to the range the ECC can repair, the data in the data buffer is written back into the memory unit, the memory pages are then repaired by the ECC unit, and the block in which the ECC-uncorrectable errors occurred is marked as a damaged block.

In the processes described above for step S415 of FIG. 4 and step S515 of FIG. 5, when the comparing unit does not find an error bit, the present invention repeats the test process: the test data generator generates the next, different set of test data, and one or more further test steps are executed to detect the error bit on the memory page with the data error in the memory block, until the error bit is found.
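One way to see why a different set of test data can reveal an error bit that the previous set missed is the toy model below. It assumes a stuck-at fault model, in which a faulty cell always reads back one fixed value; the patent does not name a fault model, and the helper names and the fixed all-zeros and all-ones patterns are illustrative only (the source describes generated test data, not these two patterns).

```python
def write_then_read(pattern: bytes, stuck: dict[int, int]) -> bytes:
    """Model a page whose cells at the given bit offsets always read back a fixed value."""
    page = bytearray(pattern)
    for pos, value in stuck.items():
        mask = 0x80 >> (pos % 8)
        if value:
            page[pos // 8] |= mask          # cell stuck at 1
        else:
            page[pos // 8] &= ~mask & 0xFF  # cell stuck at 0
    return bytes(page)


def differing_bits(a: bytes, b: bytes) -> set[int]:
    """Bit offsets (0 = most significant bit of byte 0) where a and b differ."""
    return {
        i * 8 + bit
        for i, (x, y) in enumerate(zip(a, b))
        for bit in range(8)
        if (x ^ y) & (0x80 >> bit)
    }


# Hypothetical page with one cell stuck at 1 and another stuck at 0.
stuck_cells = {3: 1, 10: 0}
found = set()
for pattern in (bytes(2), b"\xff\xff"):
    found |= differing_bits(pattern, write_then_read(pattern, stuck_cells))
```

Under this model a faulty cell is visible only when the test data holds the opposite value at that position, so each pattern exposes a different cell and the union over both patterns covers both faults.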

In the present invention, an upper limit on the number of tests (greater than 1) is set for the test cycle. When no error bit has been found or the ECC cannot repair the data (step S601), the next test process is to be performed, but first it is determined whether the upper limit on the number of tests has been exceeded (step S603). If the number of test processes performed exceeds the predetermined number and the error bits still cannot be found, step S611 is performed: the control unit returns the read error result to the storage system and marks the memory block under test as a damaged block (step S613). If the number of test processes is still less than or equal to the predetermined number, the next test process is performed (step S605); for the test process itself, refer to FIG. 4 or FIG. 5. It is then determined whether an error bit was found in the test (step S607) and, after repair, whether the data is within the range the ECC can repair (step S609). If no error bit has been found, or the data still cannot be repaired by ECC after the repair, the steps described in FIG. 6 are repeated.
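The bounded test cycle of FIG. 6 can be sketched as a loop. The names `Outcome`, `run_test_cycles`, and `run_one_test` are hypothetical; `run_one_test` stands in for one pass of the test process of FIG. 4 or FIG. 5 and is assumed to return True only when an error bit was found and the repaired data is within the ECC range.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    REPAIRED = "repaired"
    READ_ERROR = "read_error"


def run_test_cycles(run_one_test: Callable[[int], bool],
                    max_tests: int) -> tuple[Outcome, int]:
    """Repeat the test process until it succeeds or the upper limit is exceeded.

    Returns the outcome and the number of test processes actually performed.
    """
    if max_tests <= 1:
        raise ValueError("the upper limit on the number of tests must be greater than 1")
    attempt = 0
    while True:
        attempt += 1
        if attempt > max_tests:            # step S603: upper limit exceeded
            # steps S611 and S613: report the read error, block marked as damaged
            return Outcome.READ_ERROR, attempt - 1
        if run_one_test(attempt):          # steps S605, S607 and S609
            return Outcome.REPAIRED, attempt
```

For example, a test process that first succeeds on its third pass ends the cycle after three tests, while one that never succeeds ends it with a read error once the limit is reached.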

According to the above embodiments, the storage system with data recovery function provided by the present invention exchanges data through a built-in temporary memory, in particular a temporary memory used for data recovery. FIGS. 7A to 7H instead show an embodiment in which a random access memory in a computer system serves as the temporary storage location for data buffering and recovery, rather than a memory block in the memory unit of the storage system. This applies in particular when the storage system is connected to a computer system, such as a desktop computer, a notebook computer, or a portable computer system, whose random access memory is large enough to provide temporary storage for the storage system.

Please refer to FIGS. 7A to 7C. During data reading, when the ECC cannot correct a page, such as the first page of block A in FIG. 7A (the page marked "defect"), the control unit copies the data of block A into the data register; in this example, the data of block A is copied into the RAM temporary block. Assuming that the third page and the fifth page are also found during copying to contain errors that the ECC cannot correct, the data of the first, third, and fifth pages, together with the other pages that have no ECC-uncorrectable errors, is copied directly from block A to the RAM temporary block, as shown in FIG. 7B.

After the data transfer is completed, the original data of block A is erased and the first test data is written into block A, as indicated by "sample 1" in FIG. 7C. Then, as shown in FIG. 7D, the first, third, and fifth pages of block A are checked for error bits against the first test data written into each page. If an error bit is found, the data of the corresponding page is modified directly in the RAM temporary block according to the error bit found. When the number of erroneous bits in the first, third, and fifth pages in the data register has been reduced to the range the ECC can correct (for example, the pages marked "corrected" in the RAM temporary block in FIG. 7D, the rest being "original data"), the control unit finds a block in the memory unit (assumed here to be block B), copies the data in the RAM temporary block into block B, and finally marks block A as a damaged block.

However, as shown in FIG. 7E, when the control unit tries to detect the error bits of the third page of block A in order to correct the third page in the data register, one of two conditions may arise. In the first condition, the test process using the first test data ("sample 1") cannot detect the error bits of the third page of block A, as with the third page marked "defect". In the second condition, an error bit is detected, but the errors in the corrected data still exceed the correction range of the ECC.

In the second condition, the control unit still modifies the third page in the data register according to the detected error bit, and then reads the fifth page of block A for comparison, so as to correct the error bits in the fifth page of the RAM temporary block. In the first condition, where no error bit can be detected, the third page in the RAM temporary block is left uncorrected, as with the memory page marked "defect" in the RAM temporary block in FIG. 7E, and the fifth page of block A is then read for comparison, so as to correct the error bits in the fifth page of the RAM temporary block.

Next, as shown in FIG. 7F, after the first, third, and fifth pages of block A have been tested with the first test data, block A is erased and the second test data (labeled "sample 2") is written. Referring to FIGS. 7G and 7H, after the second test data has been written into block A, the control unit reads the third page of block A and compares it with the second test data to find the error bits, and then corrects the third page in the RAM temporary block accordingly, as with the third page marked "corrected" in the RAM temporary block of FIG. 7G.

However, if the second test data still does not allow the third page in the data register to be corrected, further test data may be written into block A and the above procedure repeated to repair the third page in the data register. If the number of tests exceeds a predetermined upper limit and the error bits still cannot be reduced to the range the ECC can correct, the control unit outputs read error information, stores the data in the RAM temporary block of this example back into another block of the memory unit, and marks block A as a damaged block, as indicated by the "damaged area" in block A in FIG. 7H.

In the other case, when the first, third, and fifth pages in the RAM temporary block have all been corrected so that the error bits of each page are reduced to the range the ECC can correct, the control unit finds a block B in the memory unit, writes the data in the data buffer into block B, and then marks block A as a damaged block.
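The page-level bookkeeping in the walk-through of FIGS. 7A to 7H can be summarized with the sketch below. It is a simplified model, not the patent's implementation: `repair_block`, `BlockRepairResult`, and the idea of listing in advance which pages each test sample manages to bring within the ECC range are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BlockRepairResult:
    read_error: bool              # True when the test limit was reached with pages still defective
    samples_used: int             # number of test data sets written into the block
    uncorrected_pages: set[int]   # pages still outside the ECC range at the end


def repair_block(defect_pages: set[int],
                 pages_fixed_per_sample: list[set[int]],
                 max_samples: int) -> BlockRepairResult:
    """Track which defective pages in the RAM temporary block remain uncorrected.

    pages_fixed_per_sample[i] lists the pages that sample i+1 brings within
    the ECC range (hypothetical input standing in for the real test results).
    """
    remaining = set(defect_pages)
    samples_used = 0
    for fixed in pages_fixed_per_sample:
        if not remaining or samples_used >= max_samples:
            break
        samples_used += 1
        remaining -= set(fixed)
    return BlockRepairResult(bool(remaining), samples_used, remaining)


# The walk-through: sample 1 corrects pages 1 and 5, sample 2 corrects page 3.
walkthrough = repair_block({1, 3, 5}, [{1, 5}, {3}], max_samples=4)
```

In both outcomes described in the text, the buffered data is written to another block and block A is marked as damaged; the `read_error` flag only distinguishes whether a read error is also reported.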

Therefore, when an error that the ECC cannot correct occurs in a block and causes a data read error, the data repair method of the invention can reduce the erroneous bits in the block's data to the range the ECC can correct, store the repaired block data in another block, and mark the block whose errors the ECC could not correct as a damaged block so that it is not reused, thereby effectively improving the reliability of the data in the storage system.

In summary, the present invention discloses a storage system with data recovery function and a data recovery method thereof, which utilize one or more repeated testing and recovery processes to reduce errors in a memory medium to a range that can be recovered by commonly used error detection and correction (ECC) functions, so as to ensure data reading correctness and effectively improve data reliability.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims

1. A storage system having data repair functionality, the system comprising:

a non-volatile memory unit; and

a control unit, the control unit comprising:

a test data generator for generating test data and writing the test data into the memory unit at the location where the error data is stored;

a comparing unit for comparing the test data read from the memory unit with the original write test data to determine the address of the error bit of the memory unit; and

an error detection and correction unit for performing error detection and correction of data in the memory unit and for performing error checking and correction of data in the data register;

when the control unit detects that a memory block of the non-volatile memory unit has the condition that the error detection and correction unit cannot correct error data, the control unit copies the data in the memory block into a temporary storage space and erases the data of the memory block, and the test data generator performs a test process to generate test data and writes the test data into a position of the memory block for storing the error data; comparing the read test data in the memory block with the original write test data by the comparison unit to obtain the position of an error bit and repairing the error bit;

the repairing also comprises correcting error bits of the data stored in the temporary storage space; judging whether the corrected error data which can not be repaired by the error detection and correction unit still exist, if the error bit number of the data stored in the temporary storage space still exceeds the range which can be repaired by the error detection and correction unit, carrying out the next test flow; if the error bit number of the data stored in the temporary storage space is reduced to the range which can be repaired by the error detection and correction unit, performing error detection and correction; and marking the error memory block as a damaged block.

2. The storage system with data recovery function as claimed in claim 1, further comprising a recovery unit for correcting the erroneous bits of the data stored in the data register according to the information of the erroneous bits provided by the comparison unit.

3. The storage system according to claim 2, wherein the data register is configured to temporarily store data in the memory unit that cannot be repaired by the error detection and correction unit.

4. The storage system with data recovery function as claimed in claim 3, wherein the data register is selected from the group consisting of random access memory, non-volatile memory, phase change memory, ferroelectric random access memory, and magnetic random access memory.

5. The storage system with data recovery function of claim 1, wherein the test data is randomly generated two-bit random number data.

6. A data recovery method applied to a storage system, the method comprising:

detecting the condition that the error detection and correction unit can not correct the error data of a memory block;

copying the data in the memory block to a temporary storage space and erasing the data in the memory block;

performing a test procedure including generating a test data and writing the test data into the memory block at a location where the error data is stored;

by comparing the test data read from the memory block with the original write test data to obtain the location of the error bit and repairing,

the process of repairing further comprises:

correcting error bits of data stored in the temporary storage space;

judging whether the corrected error data which can not be repaired by the error detection and correction unit still exists, if the error bit number of the data stored in the temporary storage space still exceeds the range which can be repaired by the error detection and correction unit, performing the next test flow; if the number of error bits of the data stored in the temporary storage space has fallen to the extent that the error detection and correction unit can repair,

performing error detection and correction; and

the memory block is marked as a corrupted block.

7. The method as claimed in claim 6, wherein the temporary storage space is selected from the group consisting of random access memory, non-volatile memory, phase change memory, ferroelectric random access memory, and magnetic random access memory.

8. The method as claimed in claim 6, wherein, during the step of copying the data of the memory block to the temporary storage space, the original error detection and repair functions of the storage system are stopped.

9. The method of claim 6, wherein the generated test data is randomly generated two-bit random number data.

10. The method as claimed in claim 6, wherein when no error bit is found, it is determined whether the number of next tests exceeds an upper limit number of tests.

11. The method as claimed in claim 10, wherein if the number of next tests exceeds the upper limit number of tests, a control unit reports a read error message and marks the memory block as a damaged block.