CN109388519B - Error recovery method and device and processor - Google Patents


Info

Publication number
CN109388519B
CN109388519B (application CN201710668375.9A)
Authority
CN
China
Prior art keywords
important, data, bits, error, error recovery
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710668375.9A
Other languages
Chinese (zh)
Other versions
CN109388519A (en
Inventor
Inventor not announced
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110625444.4A priority Critical patent/CN113419899A/en
Priority to CN201710668375.9A priority patent/CN109388519B/en
Publication of CN109388519A publication Critical patent/CN109388519A/en
Application granted granted Critical
Publication of CN109388519B publication Critical patent/CN109388519B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Detection And Correction Of Errors (AREA)
  • Retry When Errors Occur (AREA)

Abstract

An error recovery method, an error recovery device, and a processor. The error recovery method includes: dividing data in which an error occurs into important data of M importance levels and non-important data; and performing error recovery on the important data, and if the important data cannot be recovered, performing error recovery on both the important data and the non-important data, wherein M is a positive integer.

Description

Error recovery method and device and processor
Technical Field
The present invention relates to the field of data processing, and in particular, to an error recovery method, an error recovery device, and a processor.
Background
Neural networks and neural network processors have been applied with great success. However, neural network applications involve very large numbers of parameters and very large amounts of computation, which places high demands on the safety and reliability of the storage and computation units.
An error recovery mechanism helps a system return from an error state when an error occurs. However, conventional error recovery mechanisms require many recovery cycles and are costly, which greatly reduces system throughput and makes them unsuitable for high-throughput neural network processors. How to exploit the fault-tolerant nature of neural networks for error recovery has therefore become an urgent problem.
Disclosure of Invention
In view of the problems in the prior art, the present invention provides an error recovery method, an error recovery device, and a processor to overcome the above-mentioned deficiencies in the prior art.
According to an aspect of the present invention, there is provided an error recovery method including:
dividing data in which an error occurs into important data of M importance levels and non-important data; and performing error recovery on the important data, and if the important data cannot be recovered, performing error recovery on both the important data and the non-important data, wherein M is a positive integer.
In some embodiments, the important data of each importance level is divided into important bits and non-important bits, and performing error recovery on the important data includes: performing error recovery on the important bits, and if the important bits cannot be recovered, performing error recovery on both the important bits and the non-important bits of the important data.
In some embodiments, before dividing the data in which the error occurs, the method includes: monitoring the operating timing inside each module of the processor, and generating an error signal if an error is found; and locating, according to the error signal, the module in which the error occurred, the pipeline position, and the error type in the processor.
In some embodiments, dividing the data in which the error occurs into important data and non-important data includes dividing according to at least one of: the magnitude of the data, the magnitude of the absolute value of the data, the type of the data, the read-operation frequency of the data, and the write-operation frequency of the data.
In some embodiments, dividing the important data of each importance level into important bits and non-important bits comprises: extracting the important bits from the important data of the i-th importance level; if the important data has Xi bits and Yi of those bits are designated as important bits, the important data has Xi − Yi non-important bits, where i = 1, 2, …, M, Xi and Yi are positive integers, and 0 < Yi ≤ Xi.
In some embodiments, the Yi bits comprise consecutive bits or non-consecutive bits.
In some embodiments, the data includes neural network parameters, and the error recovery method is for a neural network processor.
According to another aspect of the present invention, there is provided an error recovery apparatus including: an importance-level dividing unit, which divides data in which an error occurs into important data of M importance levels and non-important data; and an error recovery control unit, which performs error recovery on the important data and, if the important data cannot be recovered, performs error recovery on both the important data and the non-important data, wherein M is a positive integer.
In some embodiments, the importance-level dividing unit further includes an important-bit dividing unit configured to divide the important data of each importance level into important bits and non-important bits; the error recovery control unit performs error recovery on the important data by performing error recovery on the important bits and, if the important bits cannot be recovered, performing error recovery on both the important bits and the non-important bits of the important data.
In some embodiments, the error recovery apparatus includes: an error monitoring unit for monitoring the operating timing inside each module of the processor and generating an error signal if an error is found; and an error locating unit for locating, according to the error signal, the module in which the error occurred, the pipeline position, and the error type in the processor.
In some embodiments, the importance-level dividing unit divides the data in which the error occurs into the important data and the non-important data according to at least one of: the magnitude of the data, the magnitude of the absolute value of the data, the type of the data, the read-operation frequency of the data, and the write-operation frequency of the data.
In some embodiments, the importance-level dividing unit divides the important data of each importance level into the important bits and the non-important bits as follows: the important bits are extracted from the important data of the i-th importance level; if the important data has Xi bits and Yi of those bits are designated as important bits, the important data has Xi − Yi non-important bits, where i = 1, 2, …, M, Xi and Yi are positive integers, and 0 < Yi ≤ Xi.
In some embodiments, the Yi bits comprise consecutive bits or non-consecutive bits.
In some embodiments, the data includes neural network parameters, and the error recovery device is for a neural network processor.
According to yet another aspect of the present invention, there is provided a processor comprising: at least one of a preprocessing module, a DMA, a storage unit, an input cache unit, an instruction control unit, and an arithmetic unit; and at least one error recovery apparatus as described above.
In some embodiments, the error recovery device is connected to at least one of the preprocessing module, the DMA, the storage unit, the input cache unit, the instruction control unit, and the arithmetic unit, and is configured to perform error recovery on an error of at least one of the preprocessing module, the DMA, the storage unit, the input cache unit, the instruction control unit, and the arithmetic unit.
In some embodiments, the error recovery devices are connected to the preprocessing module, the DMA, the storage unit, the input buffer unit, the instruction control unit, and the arithmetic unit in a one-to-one correspondence manner, and each error recovery device performs error recovery on an error in the preprocessing module, the DMA, the storage unit, the input buffer unit, the instruction control unit, and the arithmetic unit connected thereto.
In some embodiments, an error recovery device is embedded in at least one of the pre-processing module, the DMA, the storage unit, the input cache unit, the instruction control unit, and the arithmetic unit.
In some embodiments, error recovery for errors of the arithmetic unit includes re-executing the operation on the parameters.
In some embodiments, the processor comprises a neural network processor, the storage unit is configured to store neurons, weights, and/or instructions of a neural network; the instruction control unit is used for receiving the instruction, decoding the instruction and generating control information to control the operation unit; and the operation unit is used for receiving the weight and the neurons, finishing the neural network training operation and retransmitting the output neurons to the storage unit.
According to the technical scheme, the invention has at least the following beneficial effects:
the importance levels of data in which errors occur are distinguished, important bits are set within each importance level, and the important bits are recovered preferentially; the important data is recovered when the important bits cannot be recovered, and all the data is recovered when the important data cannot be recovered. When applied to a neural network processor, this yields short recovery cycles and good power efficiency, making the method suitable for high-throughput neural network processors;
the storage unit, the instruction control unit, and the arithmetic unit of the neural network processor each correspond to their own error recovery apparatus, which further shortens the error recovery cycle.
Drawings
FIG. 1 is a flow chart of an error recovery method according to an embodiment of the present invention;
FIG. 2 is a block diagram of an error recovery apparatus according to another embodiment of the present invention;
FIG. 3 is a block diagram of a processor according to yet another embodiment of the invention;
FIG. 4 is a block diagram of a processor according to another embodiment of the invention.
Detailed Description
Certain embodiments of the invention now will be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all embodiments of the invention are shown. Indeed, various embodiments of the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements.
In this specification, the various embodiments described below which are meant to illustrate the principles of this invention are illustrative only and should not be construed in any way to limit the scope of the invention. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for similar functions and operations.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
The embodiment of the invention provides an error recovery method that distinguishes the importance levels of erroneous data, sets important bits within each importance level, recovers the important bits preferentially, recovers the important data when the important bits cannot be recovered, and recovers all the data when the important data cannot be recovered.
Specifically, fig. 1 shows a flowchart of an error recovery method, and as shown in fig. 1, the error recovery method includes the following specific steps:
step S101: monitoring the working time sequence in each module in the processor, and generating an error signal if an error is found;
step S102: according to the error signal, positioning a module with an error, a pipeline position and an error type in the processor;
the errors include, but are not limited to: uncorrectable errors, correctable errors, non-fatal errors, and other types of errors. Uncorrectable errors include, but are not limited to, error conditions for the functionality of the hardware interface. Correctable errors include, but are not limited to, error conditions where hardware can recover without any loss of information. Fatal errors include, but are not limited to, uncorrectable error conditions that render hardware unreliable. Non-fatal errors include, but are not limited to, uncorrectable errors that render a particular task unreliable, but with fully functioning hardware.
Step S103: dividing the data in which the error occurs into important data of M importance levels and non-important data;
the division of the important data and the non-important data and the division of the important level of the important data may be performed according to at least one of the size of the parameter, the size of the absolute value of the parameter, the type of the parameter (shaping, floating point type), the read operation frequency of the parameter, the write operation frequency of the parameter, and the like.
Step S104: dividing the important data of each importance level into important bits and non-important bits;
Specifically, the bits of the data are divided into important bits and non-important bits: the important bits are extracted from the important data of the i-th importance level; if the important data has Xi bits and Yi of those bits are designated as important bits, the important data has Xi − Yi non-important bits, where i = 1, 2, …, M, Xi and Yi are positive integers, and 0 < Yi ≤ Xi. The positions of the Yi important bits may be consecutive or non-consecutive.
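A minimal sketch of designating Yi important bits within an Xi-bit datum follows. Choosing the high-order bits as the important ones is an assumption made for illustration; as stated above, the Yi positions may equally be non-consecutive.

```python
# Hypothetical sketch of splitting an Xi-bit value into Yi important bits and
# Xi - Yi non-important bits. Taking the high-order bits as important is an
# illustrative assumption; a non-contiguous mask would work the same way.

def split_bits(value, xi, yi):
    """Split an xi-bit value into (important_part, non_important_part),
    taking the yi high-order bits as important."""
    assert 0 < yi <= xi
    important_mask = ((1 << yi) - 1) << (xi - yi)
    return value & important_mask, value & ~important_mask & ((1 << xi) - 1)

# An 8-bit value with the top 4 bits treated as important:
imp, non_imp = split_bits(0b10110110, xi=8, yi=4)
assert imp == 0b10110000
assert non_imp == 0b00000110
```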
Step S105: performing error recovery on the erroneous data.
Specifically, error recovery is first attempted on the important bits of the important data of each importance level; if that fails, error recovery is attempted on both the important bits and the non-important bits of the important data of each importance level; and if that still fails, error recovery is performed on both the important data and the non-important data of the erroneous data.
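The escalation just described can be sketched as three tiers tried from the narrowest scope outward. Here `recover_fn` stands in for whatever underlying mechanism (ECC, CRC, backup copies) is used; all names are illustrative assumptions rather than the patent's API.

```python
# Hypothetical sketch of the three recovery tiers: important bits first,
# then the important data, then all data. recover_fn(scope) -> bool is a
# stand-in for the actual recovery mechanism.

def tiered_recovery(recover_fn, important_bits, important_data, all_data):
    """Attempt recovery from the narrowest scope outward; return the name of
    the tier that succeeded, or None if every tier failed."""
    for name, scope in (("important bits", important_bits),
                        ("important data", important_data),
                        ("all data", all_data)):
        if recover_fn(scope):
            return name
    return None

# A recover_fn that (for illustration) only succeeds on the full data set:
assert tiered_recovery(lambda s: s == "ALL", "BITS", "IMPORTANT", "ALL") == "all data"
```

Because the narrow tiers touch far less state, most errors are resolved cheaply and the full-data tier is reached only rarely, which is the source of the short recovery cycles claimed for this method.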
Not all of the foregoing steps are necessary. In some embodiments, the data is divided only into important data and non-important data, without distinguishing important bits and non-important bits within the important data; in that case, error recovery is preferentially performed on the important data, and if recovery fails, error recovery is performed on both the important data and the non-important data of the erroneous data.
The data in the above embodiments includes neural network parameters, the neural network parameters include neurons and weights of the neural network, the topology of the neural network, and instructions, and the error recovery method may be applied to the neural network processor.
Fig. 2 is a block diagram illustrating a structure of an error recovery apparatus according to another embodiment of the present invention, and as shown in fig. 2, the error recovery apparatus 100 includes a monitoring unit 10, an error locating unit 20, an importance ranking unit 30, and an error recovery control unit 40.
The monitoring unit 10 monitors the timing of operations within the various modules in the processor and generates an error signal when an error is found. The error locating unit 20 receives the error signal generated by the monitoring unit 10, and locates the module in which the error occurred, the pipeline position, and the error type in the processor according to the error signal.
The errors include, but are not limited to: uncorrectable errors, correctable errors, non-fatal errors, and other types of errors. Uncorrectable errors include, but are not limited to, error conditions for the functionality of the hardware interface. Correctable errors include, but are not limited to, error conditions where hardware can recover without any loss of information. Fatal errors include, but are not limited to, uncorrectable error conditions that render hardware unreliable. Non-fatal errors include, but are not limited to, uncorrectable errors that render a particular task unreliable, but with fully functioning hardware.
The importance-level dividing unit 30 divides the data in which the error occurs into important data of M importance levels and non-important data. The division into important data and non-important data, and the division of the important data into importance levels, may be performed according to at least one of: the magnitude of the parameter, the magnitude of the absolute value of the parameter, the type of the parameter (integer or floating point), the read-operation frequency of the parameter, and the write-operation frequency of the parameter.
The importance-level dividing unit 30 may further include an important-bit dividing unit 31, which divides the important data of each importance level into important bits and non-important bits. Specifically, the bits of the data are divided into important bits and non-important bits: the important bits are extracted from the important data of the i-th importance level; if the important data has Xi bits and Yi of those bits are designated as important bits, the important data has Xi − Yi non-important bits, where i = 1, 2, …, M, Xi and Yi are positive integers, and 0 < Yi ≤ Xi. The positions of the Yi important bits may be consecutive or non-consecutive.
The error recovery control unit 40 performs error recovery on the error data.
Specifically, the error recovery control unit 40 first attempts error recovery on the important bits of the important data of each importance level; if that fails, it attempts error recovery on both the important bits and the non-important bits of the important data of each importance level; and if that still fails, it performs error recovery on both the important data and the non-important data of the erroneous data.
Not all of the above units are necessary. In some embodiments, the importance-level dividing unit 30 does not include the important-bit dividing unit 31 and divides the data only into important data and non-important data, without distinguishing important bits and non-important bits within the important data; in that case, the error recovery control unit 40 preferentially performs error recovery on the important data, and if recovery fails, performs error recovery on both the important data and the non-important data of the erroneous data.
The data in the above embodiments includes neural network parameters; the neural network parameters include the neurons and weights of the neural network, the topology of the neural network, and instructions, and the error recovery apparatus may be applied to a neural network processor.
Yet another embodiment of the present invention provides a processor, including: at least one of a memory unit, an instruction control unit, and an arithmetic unit, and at least one of the error recovery apparatus 100 described above.
The processor may be a neural network processor 1000, fig. 3 shows a block diagram of a neural network processor in another embodiment, and as shown in fig. 3, the neural network processor 1000 includes a storage unit 200, an instruction control unit 300, and an arithmetic unit 400.
The storage unit 200 receives external input data, stores neurons, weights and/or instructions of the neural network, sends the instructions to the instruction control unit 300, and sends the neurons and weights to the operation unit 400.
The instruction control unit 300 receives the instruction transmitted from the storage unit 200, decodes the instruction, and generates control information to control the arithmetic unit 400.
And the operation unit 400 is configured to receive the weight and the neuron sent by the storage unit 200, complete neural network training operation, and retransmit the output neuron to the storage unit 200 for storage.
As shown in fig. 3, the neural network processor 1000 further includes error recovery apparatuses 100 corresponding to the storage unit 200, the instruction control unit 300, and the operation unit 400, respectively; the error recovery apparatuses 100 are embedded in the corresponding units and are respectively used for performing error recovery on the storage unit 200, the instruction control unit 300, and the operation unit 400.
In some embodiments, the error recovery apparatus 100 corresponding to the storage unit 200, the instruction control unit 300 and the operation unit 400 can be connected to the storage unit 200, the instruction control unit 300 and the operation unit 400, respectively, and need not be embedded.
In some embodiments, the neural network processor 1000 may include only one error recovery device 100, which is connected to the storage unit 200, the instruction control unit 300, and the operation unit 400, and is used to perform error recovery on the storage unit 200, the instruction control unit 300, and the operation unit 400.
For the neural network processor 1000 shown in fig. 3, the data includes neural network parameters, and the neural network parameters include the neurons and weights of the neural network, the topology of the neural network, and instructions.
When recovering an error of the storage unit 200, error recovery is preferably performed first on the important bits of the important parameters; if that fails, error recovery is performed on the parameters containing the important bits; and if that still fails, error recovery is performed on all the parameters.
Important parameters, and the important bits within the neural network parameters, can be stored redundantly using error correction coding. Error correction coding includes, but is not limited to, cyclic redundancy check (CRC) and error checking and correction (ECC).
An ECC check can correct 1-bit errors; errors of more than one bit cannot be recovered.
The CRC check includes CRC-12. The error detection capability of CRC-12 is: first, errors with an odd number of error bits can be detected; second, errors with five or fewer error bits can be detected; third, a single burst error of length 12 or less can be detected; and fourth, two burst errors each of length 2 or less can be detected. Errors beyond these four cases cannot be recovered.
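As a concrete illustration of a CRC-12 check, the sketch below implements a bit-serial CRC using the common CRC-12 polynomial x^12 + x^11 + x^3 + x^2 + x + 1 (which contains the factor x + 1, hence the odd-bit-count property above) and shows that a single-bit error and a short burst both change the checksum. The implementation details are assumptions, not taken from the patent.

```python
# Hypothetical sketch: bit-serial CRC-12. poly is the polynomial
# x^12 + x^11 + x^3 + x^2 + x + 1 with the x^12 term dropped (0x80F).

def crc12(data: bytes, poly: int = 0x80F) -> int:
    """Compute a 12-bit CRC over data, MSB first."""
    reg = 0
    for byte in data:
        for i in range(7, -1, -1):
            top = (reg >> 11) & 1                    # bit about to shift out
            reg = ((reg << 1) | ((byte >> i) & 1)) & 0xFFF
            if top:
                reg ^= poly
    return reg

msg = b"weights"
good = crc12(msg)
one_bit_flip = bytes([msg[0] ^ 0x01]) + msg[1:]   # odd number of error bits
assert crc12(one_bit_flip) != good                # detected
burst = bytes([msg[0] ^ 0xFF]) + msg[1:]          # burst of length 8 <= 12
assert crc12(burst) != good                       # detected
```

Note that a CRC by itself only detects corruption; recovery then proceeds via re-reading, ECC, or the backup copies described below.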
Important parameters, and the important bits within the neural network parameters, can also be protected by replica backup; replicas may be stored in the same storage medium or in different storage media. The data may be backed up as N copies, where N is a positive integer.
When an error cannot be recovered using error correction coding, error recovery can be performed using the backup data.
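One way to combine the two mechanisms is sketched below: try the error-correction decoder first, and fall back to the N replicas (here resolved by majority vote, an assumption for illustration) when decoding fails. `ecc_decode` is a stand-in for whichever ECC/CRC decoder protects the word.

```python
# Hypothetical sketch: ECC first, then fall back to the backup replicas.
from collections import Counter

def recover_with_backups(ecc_decode, word, backups):
    """ecc_decode(word) returns the corrected word, or None if uncorrectable."""
    corrected = ecc_decode(word)
    if corrected is not None:
        return corrected
    # ECC failed: take the most common value among the backup replicas
    return Counter(backups).most_common(1)[0][0]

# ECC cannot correct, so the two agreeing replicas win:
assert recover_with_backups(lambda w: None, 0b1010, [7, 7, 3]) == 7
```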
When the error frequency of a physical address exceeds a threshold T, the physical address is marked invalid, the invalid physical address is released, and the instruction corresponding to the invalid physical address is killed.
When recovering an error of the instruction control unit, the erroneous instruction is decoded again and re-executed.
When recovering an error of the arithmetic unit, preferably the operation results of the non-important parameters and of the non-important bits are retained, and the operations on the important bits of the important parameters are re-executed; if this preferred approach cannot recover the error, the operation results of the non-important parameters are retained and the operations on the important parameters are re-executed; and if the error still cannot be recovered, the operations on all the parameters are re-executed.
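The selective re-execution above (keep cheap results, recompute only what matters at the current tier) can be sketched as follows. The function name, dictionary layout, and string labels are illustrative assumptions.

```python
# Hypothetical sketch of selective re-execution for arithmetic-unit errors:
# results for non-important parameters are kept, operations on important
# parameters are re-run.

def recompute(op, params, importance):
    """Re-run op only for parameters marked 'important'; keep other results."""
    return {name: op(value) if importance[name] == "important" else value
            for name, value in params.items()}

results = recompute(lambda v: v * 2,
                    {"w1": 3, "b1": 5},
                    {"w1": "important", "b1": "non-important"})
assert results == {"w1": 6, "b1": 5}
```

Escalating to the next tier corresponds to widening the set of parameters marked `"important"` until, in the worst case, everything is recomputed.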
Fig. 4 is a block diagram of a processor according to another embodiment of the present invention. In this embodiment, the processor 2000 includes the units of all the above embodiments and is divided, as a whole, into a preprocessing module 2001 and a neural network operation module 2002.
The preprocessing module 2001 preprocesses the raw input data, including segmentation, Gaussian filtering, binarization, regularization, normalization, etc., and feeds the processed data into the neural network operation module 2002.
The neural network operation module 2002 performs the neural network operation and outputs the final operation result.
The neural network operation module 2002 includes the same storage unit 200, control unit 300, and operation unit 400 as in the previous embodiments.
The storage unit 200 is mainly used to store the neurons, weights, and instructions of the neural network. When storing weights, only the non-zero weights and the position information of the non-zero weights are stored; when storing quantized non-zero weights, only the non-zero weight codebook and the non-zero weight dictionary are stored. The control unit 300 receives the instruction sent by the storage unit 200, decodes it, and generates control information to control the operation unit 400. The operation unit 400 is configured to perform the corresponding operation on the data according to the instructions stored in the storage unit 200.
The neural network operation module 2002 further includes a direct memory access (DMA) unit 500, an input buffer unit 600, a lookup table unit 700, a number selection unit 800, and an output neuron buffer 900.
The DMA500 is used for reading and writing data or instructions in the memory unit 200, the input buffer unit 600, and the output neuron buffer 900.
The input buffer unit 600 includes an instruction buffer 601, a non-zero weight codebook buffer 602, a non-zero weight dictionary buffer 603, a non-zero weight location buffer 604, and an input neuron buffer 605.
An instruction cache 601 for storing dedicated instructions;
a non-zero weight codebook cache 602 for caching a non-zero weight codebook;
a non-zero weight dictionary cache 603 for caching a non-zero weight dictionary;
a non-zero weight location cache 604 for caching non-zero weight position data; the non-zero weight position cache maps each connection weight in the input data one-to-one to the corresponding input neuron.
In one case, the one-to-one correspondence used by the non-zero weight position cache is: 1 denotes a connection and 0 denotes no connection, and the connection states of each output with all inputs form a string of 0s and 1s representing that output's connection relation. In another case, the correspondence is: 1 denotes a connection and 0 denotes no connection, and the connection states of each input with all outputs form a string of 0s and 1s representing that input's connection relation. In yet another case, the correspondence is a distance encoding: the distance from the input neuron of an output's first connection to the first input neuron, then the distance from the output's second connected input neuron to the previous connected input neuron, then the distance from the output's third connected input neuron to the previous connected input neuron, and so on, until all of that output's inputs are exhausted, together represent that output's connection relation.
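A minimal sketch of two of these position encodings follows: the 0/1 connection string, and the distance encoding in which the first entry is the offset of the first connected input neuron and each later entry is the distance from the previous connected one. Function names and the 0-based indexing are illustrative assumptions.

```python
# Hypothetical sketch of two non-zero weight position encodings for one
# output neuron: a 0/1 connection string and a distance (delta) encoding.

def to_bitmask(connected, n_inputs):
    """Connection string for one output: '1' = connected, '0' = not."""
    return "".join("1" if i in connected else "0" for i in range(n_inputs))

def to_deltas(connected):
    """Distance encoding of the connected input-neuron indices."""
    deltas, prev = [], 0
    for idx in sorted(connected):
        deltas.append(idx - prev)
        prev = idx
    return deltas

conn = {2, 5, 6}              # inputs 2, 5 and 6 carry non-zero weights
assert to_bitmask(conn, 8) == "00100110"
assert to_deltas(conn) == [2, 3, 1]
```

The bitmask costs one bit per input regardless of sparsity, while the delta form grows only with the number of non-zero weights, which is why the latter suits highly sparse layers.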
and an input neuron buffer 605 for buffering the input neurons fed to the number selection unit.
The lookup table unit 700 is configured to parse the quantized weights of the neural network: it receives the weight dictionary and the weight codebook, obtains the weights through a lookup operation, and passes unquantized weights directly to the operation unit through a bypass.
The number selection unit 800 is configured to receive the input neurons and the non-zero weight position information and to select the neurons that need to be computed, i.e., the neurons corresponding to non-zero weights. That is, for each output neuron, the number selection unit discards the input neuron data that has no corresponding non-zero weight for that output neuron.
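The selection step can be sketched in a few lines: for one output neuron, keep only the input neurons at the non-zero weight positions. The function name and list-based layout are illustrative assumptions.

```python
# Hypothetical sketch of the number selection step: keep only the input
# neurons that have a corresponding non-zero weight for one output neuron.

def select_neurons(input_neurons, nonzero_positions):
    """Return the input neurons at the given non-zero weight positions."""
    return [input_neurons[i] for i in nonzero_positions]

neurons = [0.5, 0.1, 0.9, 0.3]
assert select_neurons(neurons, [0, 2]) == [0.5, 0.9]
```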
The error recovery apparatus may be embedded in at least one of the preprocessing module 2001, the storage unit 200, the control unit 300, the arithmetic unit 400, the DMA 500, the input buffer unit 600, the lookup table unit 700, and the selection unit 800 to perform error recovery.
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software carried on a non-transitory computer-readable medium), or a combination thereof. Although the processes or methods are described above in terms of certain sequential operations, it should be understood that some of the operations described may be performed in a different order. Further, some operations may be performed in parallel rather than sequentially.
It should be noted that implementations not shown or described in the drawings or the description are forms known to those of ordinary skill in the art and are not described in detail. Further, the above definitions of the various elements and methods are not limited to the specific structures, shapes, or arrangements of parts mentioned in the embodiments, which may readily be modified or substituted by those of ordinary skill in the art.
The above-described embodiments further illustrate the objects, technical solutions, and advantages of the present invention in detail. It should be understood that they are merely exemplary embodiments and are not intended to limit the present invention; any modifications, equivalents, improvements, and the like made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (18)

1. An error recovery method, comprising:
dividing data in which an error occurs into important data of M importance levels and non-important data; and
performing error recovery on the important data, and if the important data cannot be recovered, performing error recovery on the important data and the non-important data, wherein M is a positive integer,
wherein the important data of each importance level is divided into important bits and non-important bits, and the performing error recovery on the important data comprises: performing error recovery on the important bits, and if the important bits cannot be recovered, performing error recovery on the important bits and the non-important bits of the important data.
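As an illustrative aside (not part of the claims), the staged recovery order recited in claim 1 for one important datum can be sketched as follows; `try_recover` is an assumed recovery primitive that reports success, and the data structure is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ImportantDatum:
    bits: List[int]               # the Xi bit values of one important datum
    important_bit_idx: List[int]  # positions of its Yi important bits

def recover_important(d: ImportantDatum,
                      try_recover: Callable[[List[int]], bool]) -> bool:
    # First attempt: error recovery on the important bits only.
    if try_recover(d.important_bit_idx):
        return True
    # Fallback: error recovery on the important and non-important bits together.
    return try_recover(list(range(len(d.bits))))
```

Recovering the small set of important bits first keeps the common case cheap; the full-width recovery runs only when the cheap attempt fails.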
2. The error recovery method of claim 1, wherein the dividing of the data in which the error occurs into important data of M importance levels and non-important data comprises:
monitoring the working timing in each module of the processor, and generating an error signal if an error is found; and
locating, within the processor and according to the error signal, the module in which the error occurred, the pipeline position, and the error type.
3. The error recovery method of claim 1, wherein the data in which the error occurs is divided into important data and non-important data according to at least one of: the size of the data, the size of the absolute value of the data, the type of the data, the read-operation frequency of the data, and the write-operation frequency of the data.
4. The error recovery method of claim 1, wherein dividing the important data of each importance level into important bits and non-important bits comprises:
extracting important bits from the important data of the i-th importance level: if the important data has Xi bits and Yi bits are designated as important bits, the important data has Xi-Yi non-important bits, wherein i = 1, 2, ..., M; Xi and Yi are positive integers; and 0 < Yi ≤ Xi.
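As an illustrative aside (not part of the claims), the Xi/Yi split can be sketched as follows; the function name is hypothetical, and the bit positions need not be consecutive, matching claim 5.

```python
def split_bits(value: int, xi: int, important_idx):
    """Split an Xi-bit word into its Yi important bits and its Xi-Yi
    non-important bits; important_idx names the important bit positions
    (bit 0 is the least significant) and may be non-consecutive."""
    bits = [(value >> i) & 1 for i in range(xi)]
    important = {i: bits[i] for i in important_idx}
    non_important = {i: bits[i] for i in range(xi) if i not in important_idx}
    return important, non_important

# A 4-bit datum (Xi = 4) with bits 3 and 1 designated important (Yi = 2).
imp, non = split_bits(0b1011, 4, {1, 3})
print(imp)  # {1: 1, 3: 1}
print(non)  # {0: 1, 2: 0}
```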
5. The error recovery method of claim 4, wherein the Yi bits comprise consecutive bits or non-consecutive bits.
6. The error recovery method of claim 1, wherein the data comprises neural network parameters, the error recovery method being for a neural network processor.
7. An error recovery apparatus, comprising:
an importance level dividing unit configured to divide data in which an error occurs into important data of M importance levels and non-important data; and
an error recovery control unit configured to perform error recovery on the important data and, if the important data cannot be recovered, to perform error recovery on the important data and the non-important data, wherein M is a positive integer,
wherein the importance level dividing unit further comprises an important bit dividing unit configured to divide the important data of each importance level into important bits and non-important bits, and the error recovery control unit performs error recovery on the important bits of the important data and, if they cannot be recovered, performs error recovery on the important bits and the non-important bits of the important data.
8. The error recovery apparatus of claim 7, further comprising:
an error monitoring unit configured to monitor the working timing in each module of the processor and to generate an error signal if an error is found; and
an error locating unit configured to locate, within the processor and according to the error signal, the module in which the error occurred, the pipeline position, and the error type.
9. The error recovery apparatus of claim 7, wherein the importance level dividing unit divides the data in which the error occurs into important data and non-important data according to at least one of: the size of the data, the size of the absolute value of the data, the type of the data, the read-operation frequency of the data, and the write-operation frequency of the data.
10. The error recovery apparatus of claim 7, wherein the importance level dividing unit dividing the important data of each importance level into important bits and non-important bits comprises:
extracting important bits from the important data of the i-th importance level: if the important data has Xi bits and Yi bits are designated as important bits, the important data has Xi-Yi non-important bits, wherein i = 1, 2, ..., M; Xi and Yi are positive integers; and 0 < Yi ≤ Xi.
11. The error recovery apparatus of claim 10, wherein the Yi bits comprise consecutive bits or non-consecutive bits.
12. The error recovery apparatus of claim 10, wherein the data comprises neural network parameters, the error recovery apparatus for a neural network processor.
13. A processor, comprising:
at least one of a preprocessing module, a DMA, a storage unit, an input buffer unit, an instruction control unit, and an arithmetic unit; and
At least one error recovery apparatus as claimed in any one of claims 7 to 12.
14. The processor according to claim 13, wherein the error recovery device is connected to at least one of the preprocessing module, the DMA, the storage unit, the input buffer unit, the instruction control unit, and the arithmetic unit, and is configured to perform error recovery on an error of at least one of the preprocessing module, the DMA, the storage unit, the input buffer unit, the instruction control unit, and the arithmetic unit.
15. The processor according to claim 13, wherein the plurality of error recovery devices are connected in a one-to-one correspondence with the preprocessing module, the DMA, the storage unit, the input cache unit, the instruction control unit, and the arithmetic unit, and each error recovery device performs error recovery on an error in the preprocessing module, the DMA, the storage unit, the input cache unit, the instruction control unit, and the arithmetic unit connected thereto.
16. The processor of claim 13, wherein an error recovery device is embedded in at least one of the pre-processing module, the DMA, the storage unit, the input cache unit, the instruction control unit, and the arithmetic unit.
17. The processor of claim 13, wherein the error recovery for errors of the arithmetic unit comprises re-executing the operations on the parameters.
18. The processor of claim 13, wherein the processor comprises a neural network processor, and the storage unit is configured to store neurons, weights, and/or instructions of a neural network; the instruction control unit is configured to receive an instruction, decode it, and generate control information to control the operation unit; and the operation unit is configured to receive the weights and the neurons, complete the neural network training operation, and transmit the output neurons back to the storage unit.
CN201710668375.9A 2017-08-07 2017-08-07 Error recovery method and device and processor Active CN109388519B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110625444.4A CN113419899A (en) 2017-08-07 2017-08-07 Error recovery method and device and processor
CN201710668375.9A CN109388519B (en) 2017-08-07 2017-08-07 Error recovery method and device and processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710668375.9A CN109388519B (en) 2017-08-07 2017-08-07 Error recovery method and device and processor

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202110625444.4A Division CN113419899A (en) 2017-08-07 2017-08-07 Error recovery method and device and processor

Publications (2)

Publication Number Publication Date
CN109388519A CN109388519A (en) 2019-02-26
CN109388519B true CN109388519B (en) 2021-06-11

Family

ID=65413441

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110625444.4A Pending CN113419899A (en) 2017-08-07 2017-08-07 Error recovery method and device and processor
CN201710668375.9A Active CN109388519B (en) 2017-08-07 2017-08-07 Error recovery method and device and processor

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202110625444.4A Pending CN113419899A (en) 2017-08-07 2017-08-07 Error recovery method and device and processor

Country Status (1)

Country Link
CN (2) CN113419899A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489852A (en) * 2019-08-14 2019-11-22 北京天泽智云科技有限公司 Improve the method and device of the wind power system quality of data

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101547144A (en) * 2008-12-29 2009-09-30 华为技术有限公司 Method, device and system for improving data transmission quality
CN102017498A (en) * 2008-05-06 2011-04-13 阿尔卡特朗讯公司 Recovery of transmission errors
CN106648968A (en) * 2016-10-19 2017-05-10 盛科网络(苏州)有限公司 Data recovery method and device when ECC correction failure occurs on chip
CN107025148A (en) * 2016-10-19 2017-08-08 阿里巴巴集团控股有限公司 A kind for the treatment of method and apparatus of mass data

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
EP1750426A1 (en) * 2000-12-07 2007-02-07 Sony United Kingdom Limited Methods and apparatus for embedding data and for detecting and recovering embedded data
JP5161696B2 (en) * 2008-08-07 2013-03-13 株式会社日立製作所 Virtual computer system, error recovery method in virtual computer system, and virtual computer control program
US9274898B2 (en) * 2011-09-09 2016-03-01 Nokia Technologies Oy Method and apparatus for providing criticality based data backup

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN102017498A (en) * 2008-05-06 2011-04-13 阿尔卡特朗讯公司 Recovery of transmission errors
CN101547144A (en) * 2008-12-29 2009-09-30 华为技术有限公司 Method, device and system for improving data transmission quality
CN106648968A (en) * 2016-10-19 2017-05-10 盛科网络(苏州)有限公司 Data recovery method and device when ECC correction failure occurs on chip
CN107025148A (en) * 2016-10-19 2017-08-08 阿里巴巴集团控股有限公司 A kind for the treatment of method and apparatus of mass data

Non-Patent Citations (1)

Title
Research on Lightweight Error Recovery Techniques for Many-Core Processors for High-Performance Computing; Zheng Fang et al.; Journal of Computer Research and Development; 2015-06-30; Vol. 52, No. 6; Sections 1.1-1.2 *

Also Published As

Publication number Publication date
CN113419899A (en) 2021-09-21
CN109388519A (en) 2019-02-26

Similar Documents

Publication Publication Date Title
CN111553473B (en) Data redundancy method and neural network processor for executing the same
US9619324B2 (en) Error correction in non—volatile memory
CN101937724B (en) Method for performing copy back operations and flash storage device
US9411683B2 (en) Error correction in memory
CN112860475B (en) Method, device, system and medium for recovering check block based on RS erasure code
US8566672B2 (en) Selective checkbit modification for error correction
KR101819152B1 (en) Method and associated decoding circuit for decoding an error correction code
CN102789806B (en) Anti-irradiation protection method for TCAM of space devices
CN103594120A (en) Memorizer error correction method adopting reading to replace writing
JP6799262B2 (en) Arithmetic processing unit and control method of arithmetic processing unit
US20190089384A1 (en) Memory system
CN112000512B (en) Data restoration method and related device
CN109388519B (en) Error recovery method and device and processor
CN109785895B (en) ECC device and method for correcting multi-bit errors in NAND Flash
CN105320575A (en) Self-checking and recovering device and method for dual-modular redundancy assembly lines
CN210110352U (en) ECC device for correcting multi-bit errors in NAND Flash
KR20170064978A (en) Method and apparatus for correcting data in multiple ecc blocks of raid memory
CN105023616A (en) Method for storing and retrieving data based on Hamming code and integrated random access memory
WO2020199490A1 (en) Dual-mode error detection memory and dual-mode error detection method
WO2016119120A1 (en) Fec decoding apparatus and method
CN109254867B (en) Data redundancy method and device
CN102684841A (en) Coding computation unit and decoding data verification method
KR101496052B1 (en) Decoding circuit and method for improved performance and lower error floors of block-wise concatenated BCH codes with cyclic shift of constituent BCH codes
US10379952B2 (en) Data recovery and regeneration using parity code
CN106877975B (en) Hard decision joint decoding method in distributed storage capable of zigzag decoding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant