CN109933478B - Storage system and fault processing method thereof - Google Patents

Storage system and fault processing method thereof

Info

Publication number
CN109933478B
Authority
CN
China
Prior art keywords
link
storage
branch
storage controller
controller
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711377004.1A
Other languages
Chinese (zh)
Other versions
CN109933478A (en)
Inventor
刘玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201711377004.1A
Publication of CN109933478A
Application granted
Publication of CN109933478B
Active
Anticipated expiration

Landscapes

  • Techniques For Improving Reliability Of Storages (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to a fault processing method for a storage system, and to the storage system. It addresses the problem that, in existing storage systems with a plurality of links between a storage controller and a storage disk, the only responses to an IO timeout on the main link are to switch to a standby link for subsequent IO operations or to repair the storage disk.

Description

Storage system and fault processing method thereof
Technical Field
The present invention relates to the field of storage technologies, and in particular, to a storage system and a fault handling method thereof.
Background
A hard disk array in a storage system uniformly manages a plurality of hard disks connected by interfaces and protocols through a storage controller so as to provide mature, reliable and large-capacity data storage service.
The typical hard disk array architecture is a dual-control system, and two controllers are connected together through a mirror image channel to exchange data. The hard disks are typically dual-port disks, and the back-end of both controllers may be connected to each hard disk separately, the hard disks being visible to both controllers. Two controllers of the hard disk array can be connected with a plurality of hard disks through PCIe chips, and for each hard disk, two links are respectively connected with the two controllers.
In the prior art, the controller manages all hard disks through hard disk management software, stores hard disk link information, finds out abnormality in time and makes diagnosis by detecting the state of the hard disks so as to repair the abnormality as much as possible and ensure the reliability of data storage of the hard disk array.
However, the fault processing scheme in the prior art detects and repairs only hard disk states and hard disk faults. A fault in the link itself is not repaired, which increases IO processing delay.
Disclosure of Invention
To address the prior-art limitation that only hard disk states and hard disk faults are repaired, the invention provides a method for repairing a fault of a connection chip in a link, and a corresponding storage system.
In a first aspect, the present application provides a storage failure handling method, which is applied to a storage system, where the storage system includes at least one storage disk and at least two storage controllers, each storage controller includes a connection chip, each controller is connected to each storage disk through its own connection chip, and the at least two storage controllers are connected to each other;
the method comprises the following steps:
a first storage controller receives a first data operation request, and sends a first operation instruction to a first storage disk corresponding to the first data operation request through a first branch of a first link, wherein the first link is a link including a connection chip of the first storage controller, the first branch of the first link is the connection in the first link whose target end is the first storage disk, and the first storage controller is any one of the at least two storage controllers;
when the first storage controller monitors that execution of the first operation instruction has timed out, the first storage controller forwards the first operation instruction to the first storage disk through a first branch of a second link, wherein the second link is a link including a connection chip of a second storage controller, the first branch of the second link is the connection in the second link whose target end is the first storage disk, and the second storage controller is any storage controller connected with the first storage controller;
the first storage controller receives an operation success response of the first operation instruction transmitted through the first branch of the second link, and counts the number of operation abnormalities according to the operation success response, wherein an operation abnormality indicates that an operation instruction received by the first storage controller timed out when executed through the first link but executed successfully through the second link;
and when the first storage controller determines that the counted number of operation abnormalities exceeds a preset threshold within a preset time, the first storage controller performs fault repair on the connection chip in the first link.
According to the method, under the condition that the first link is switched to the second link, the storage controller can count the times of the operation abnormity according to the actual execution condition of the operation instruction, and after the operation abnormity reaches the preset threshold value, the first link is repaired, so that the link fault is identified and repaired, and the stability of the storage system and the execution efficiency of subsequent data operation are improved.
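The steps of the first aspect can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the class `AbnormalityWindow`, the function `handle_request`, and the callables `send_first` and `send_second` are hypothetical names, and a sliding time window is assumed as one way to realize "exceeds a preset threshold within a preset time".

```python
from collections import deque


class AbnormalityWindow:
    """Counts operation abnormalities inside a sliding time window (assumed design)."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now: float) -> bool:
        """Record one abnormality at time `now`; return True if repair is due."""
        self._timestamps.append(now)
        # Discard abnormalities older than the window that ends at `now`.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        # The source text says the count must exceed the threshold.
        return len(self._timestamps) > self.threshold


def handle_request(send_first, send_second, window, now):
    """Try the first link; on timeout, fail over to the second link.

    `send_first` and `send_second` are hypothetical callables that return
    True on success and raise TimeoutError when the instruction times out.
    Returns (succeeded, repair_needed).
    """
    try:
        return send_first(), False
    except TimeoutError:
        ok = send_second()
        # Abnormality: timed out on the first link, succeeded on the second.
        repair_needed = window.record(now) if ok else False
        return ok, repair_needed
```

With a threshold of 2 and a 10-second window, the third abnormality inside the window is the one that signals repair; abnormalities that have aged out of the window no longer count.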
For the first aspect described above, one possible way to count operation abnormalities is as follows: according to a statistical rule and the operation success response, the first storage controller either increases the number of operation abnormalities by one or leaves the previously counted number unchanged. The statistical rule is that an operation abnormality appearing on any given branch of the first link is counted only once; accordingly, the predetermined threshold is less than or equal to the number N of storage disks in the storage system. In one specific embodiment, the predetermined threshold is set to N, that is, to the number of branches of the first link. An operation abnormality that recurs on a branch already counted is not counted again, so once an operation abnormality has occurred on every branch, the count reaches N, the predetermined threshold, and the first link can be determined to have a link failure.
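The count-once-per-branch rule can be illustrated with a small sketch. The class name `BranchAbnormalityCounter` and its interface are hypothetical. The sketch also assumes that repair is triggered when the count reaches the threshold, as in the embodiment where the threshold equals N.

```python
class BranchAbnormalityCounter:
    """Counts each branch of the first link at most once (illustrative only)."""

    def __init__(self, num_disks: int, threshold: int):
        # The text requires the threshold to be at most N, the number of disks.
        if not 1 <= threshold <= num_disks:
            raise ValueError("threshold must be between 1 and the number of disks")
        self.threshold = threshold
        self._branches_seen = set()

    @property
    def count(self) -> int:
        return len(self._branches_seen)

    def record(self, branch: int) -> bool:
        """Record an abnormality on `branch`.

        A branch that was already counted leaves the count unchanged.
        Returns True once the count reaches the threshold (assumption).
        """
        self._branches_seen.add(branch)
        return self.count >= self.threshold
```

Using a set makes the "increase by one or keep unchanged" behaviour fall out naturally: adding a branch that is already present does not change the set's size.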
For the first aspect, in one possible implementation manner, the method further comprises the following steps: after determining that an operation abnormality occurs on an nth branch of the first link, the first storage controller sets a fault tag for the nth branch of the first link, wherein the fault tag indicates that the nth branch of the first link is unavailable or that the level of the nth branch of the first link is reduced, n is a natural number, and n is greater than or equal to 1 and less than or equal to N; after receiving a subsequent data operation request for the nth storage disk, the first storage controller directly sends a subsequent operation instruction to the nth storage disk through the nth branch of the second link according to the fault tag of the nth branch of the first link.
Until the first link failure is repaired, this implementation manner avoids execution delay or execution failure of subsequent operation instructions.
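A minimal sketch of tag-based link selection follows. The class `LinkSelector` and its method names are hypothetical, and links are represented simply by the strings "first" and "second".

```python
class LinkSelector:
    """Chooses a link per branch based on fault tags (illustrative only)."""

    def __init__(self):
        self._faulty_branches = set()

    def tag_fault(self, branch: int) -> None:
        """Mark the given branch of the first link as unavailable."""
        self._faulty_branches.add(branch)

    def clear_all(self) -> None:
        """Delete every fault tag, as the text describes doing after repair."""
        self._faulty_branches.clear()

    def choose_link(self, branch: int) -> str:
        """Return 'second' for a tagged branch, otherwise 'first'."""
        return "second" if branch in self._faulty_branches else "first"
```

A request for a tagged branch goes straight to the second link without first waiting for a timeout on the first link, which is the delay this implementation manner is meant to avoid.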
Further, after performing fault repair on the connection chip in the first link, the method further includes: deleting the fault tag of each branch of the first link, or setting a normal tag for each branch of the first link, wherein the normal tag indicates that the first link is available or that the level of the first link is normal; after receiving a subsequent data operation request, the first storage controller switches back to the first link, according to the state of the fault tag or the normal tag of the first link, to send an operation instruction to the storage disk to which the subsequent data operation request is directed.
According to this implementation manner, after receiving a subsequent data operation request to be sent to the storage disk, the first storage controller directly selects the first link to send the operation instruction according to the state of the first link (the fault tag has been deleted or the normal tag has been set). Because the path of the first link is shorter than that of the second link, the subsequent operation instruction is processed more quickly. This avoids the delay caused by executing subsequent operation instructions through the second link and improves the processing efficiency of operation instructions.
Optionally, after performing fault repair on the connection chip of the first link, the method further includes: detecting whether a connection chip in the first link is successfully repaired; and deleting the fault label of the first link or setting the normal label of the first link after detecting that the connection chip in the first link is successfully repaired.
According to the method, after the first link is repaired, the first link is further detected, so that the real state of the link is obtained, and the fact that subsequent operation execution can be carried out according to the real link state is guaranteed.
Optionally, the method further comprises: and sending out a fault maintenance notice of the connection chip in the first link after detecting that the repair of the connection chip in the first link is unsuccessful.
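The repair, detection, and notification steps can be sketched as below. All four parameters are hypothetical callables or containers introduced only for illustration; the text does not specify how the chip is probed or how the maintenance notice is delivered.

```python
def repair_and_verify(repair, probe, fault_tags, notify):
    """Repair the connection chip, then verify the result (illustrative only).

    repair:     callable that performs the repair action on the chip
    probe:      callable returning True if the chip is healthy afterwards
    fault_tags: set of branch numbers currently tagged as faulty
    notify:     callable that takes a maintenance notice string
    Returns True if the repair was verified as successful.
    """
    repair()
    if probe():
        # Repair confirmed: delete the fault tags so traffic can switch back.
        fault_tags.clear()
        return True
    # Repair not confirmed: leave the tags in place and ask for maintenance.
    notify("connection chip in the first link requires maintenance")
    return False
```

Leaving the fault tags in place on a failed repair means requests keep using the second link until someone services the chip.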
Specifically, the performing fault repair on the connection chip in the first link includes: restarting the connection chip of the first storage controller; or isolating the connection chip of the first storage controller; or repairing a queue on the connection chip of the first storage controller; or repairing a port on the connection chip of the first storage controller.
The link repair method focuses on repairing the connection chip on the link, so that the hardware problem is thoroughly checked, and the repair efficiency is guaranteed.
Optionally, after the first storage controller monitors that the first operation instruction is executed for a timeout, the method further includes: the first storage controller records a first flag indicating that the first operation instruction is executed over a first branch of the first link for a timeout; before counting the number of times of operation abnormality according to the operation success response, the method further comprises the following steps: the first storage controller records a second mark, and the second mark indicates that the first operation instruction is successfully executed through the first branch of the second link; determining whether the first operating instruction is provided with the first flag and the second flag at the same time; and if the first operation instruction has the first mark and the second mark at the same time, determining that the operation is abnormal.
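The two-flag check can be sketched with a small record per operation instruction. The names `InstructionRecord`, `timed_out_on_first`, and `succeeded_on_second` are hypothetical stand-ins for the first flag and the second flag.

```python
from dataclasses import dataclass


@dataclass
class InstructionRecord:
    """Per-instruction flags used to detect an operation abnormality."""

    timed_out_on_first: bool = False   # the "first flag" in the text
    succeeded_on_second: bool = False  # the "second flag" in the text

    def is_operation_abnormality(self) -> bool:
        # An abnormality requires both flags to be set on the same instruction.
        return self.timed_out_on_first and self.succeeded_on_second
```

Requiring both flags separates a link problem from a disk problem: an instruction that times out on the first link and also fails on the second link is not counted.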
In a second aspect, the present application provides a storage system, comprising: at least one storage disk and at least two storage controllers; each storage controller comprises a connection chip, and each controller is connected to each storage disk through the connection chip of the controller; the at least two storage controllers are connected to each other;
a first storage controller, configured to receive a first data operation request, send a first operation instruction to a first storage disk corresponding to the first data operation request through a first branch of a first link, and forward the first operation instruction to the first storage disk through a first branch of a second link when it is monitored that execution of the first operation instruction has timed out, where the first storage controller is any one of the at least two storage controllers, the second storage controller is any storage controller connected to the first storage controller, the first link is a link including a connection chip of the first storage controller, the first branch of the first link is the connection in the first link whose target end is the first storage disk, the second link is a link including a connection chip of the second storage controller, and the first branch of the second link is the connection in the second link whose target end is the first storage disk; the first storage controller is further configured to receive an operation success response of the first operation instruction transmitted through the first branch of the second link, count the number of operation abnormalities according to the operation success response, and perform fault repair on the connection chip in the first link after the number of operation abnormalities exceeds a preset threshold within a preset time, where an operation abnormality indicates that an operation instruction received by the first storage controller timed out when executed through the first link but executed successfully through the second link.
The first storage controller in the storage system is further configured to perform a function related to the failure handling method performed by the first storage controller in the first aspect.
In a third aspect, the present application provides a storage controller comprising:
a storage processing module, configured to receive a first data operation request, and send a first operation instruction to a first storage disk corresponding to the first data operation request through a first branch of a first link, where the first link is a link including a connection chip of the first storage controller, and the first branch of the first link is the connection in the first link whose target end is the first storage disk.
A link failure processing module, configured to forward the first operation instruction to the first storage disk through a first branch of a second link after monitoring that the execution of the first operation instruction is overtime, where the second link is a link including a connection chip of the second storage controller, and the first branch of the second link is a connection of a target end of the second link to the first storage disk; and receiving an operation success response of the first operation instruction transmitted through the first branch of the second link, counting the times of operation abnormity according to the operation success response, and performing fault repair on a connection chip in the first link after the counted times of the operation abnormity exceed a preset threshold value in preset time, wherein the operation abnormity indicates that the execution of the first operation instruction through the first link is overtime, but the execution is successful through the second link.
Optionally, the executing, by the link failure processing module, the operation exception statistics specifically includes: according to the statistical rule and the operation success response, increasing the number of operation abnormity once or keeping the original statistical operation abnormity number unchanged, wherein the statistical rule comprises the following steps: counting the operation abnormity appearing on each branch of the first link only once; accordingly, the predetermined threshold is the number N of the storage disks in the storage system.
Optionally, the link failure processing module is further configured to set a fault tag for the nth branch of the first link after determining that an operation abnormality occurs on the nth branch of the first link, where the fault tag indicates that the nth branch of the first link is unavailable or that the level of the nth branch of the first link is reduced, n is a natural number, and n is greater than or equal to 1 and less than or equal to N;
the storage processing module is further configured to, after receiving a subsequent data operation request for the nth storage disk, directly send a subsequent operation instruction to the nth storage disk through the nth branch of the second link according to the failure tag of the nth branch of the first link.
Optionally, the link failure processing module is further configured to delete the fault tag of each branch of the first link or set a normal tag for each branch of the first link after performing fault repair on the connection chip in the first link, where the normal tag indicates that the first link is available or that the level of the first link is normal; the storage processing module is further configured to, after receiving a subsequent data operation request, switch back to the first link, according to the state of the fault tag or the normal tag of the first link, to send an operation instruction to the storage disk targeted by the subsequent data operation request.
In a fourth aspect, the present application provides a storage controller comprising:
a storage device to store instructions; and
at least one processor coupled to the storage device;
wherein the instructions, when executed by the at least one processor, cause the processor to perform the method of the first aspect.
The method, the storage system and the storage controller provided by the aspects of the application address the execution delay of operation instructions caused by link failure in the storage system. They avoid the prior-art problem that handling the failure by switching paths lets the operation succeed but at low efficiency, and thereby improve the efficiency of the storage system.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of an architecture of a storage system provided herein;
FIG. 2 is a schematic diagram of a storage controller provided herein;
FIG. 3 is a schematic flow chart of a fault handling method provided in the present application, operating in a storage system;
FIG. 4 is another schematic diagram of the storage controller components provided herein.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Fig. 1 is an architecture diagram of a storage system provided by an embodiment of the present invention. The storage system includes two storage controllers (10, 20) and three storage disks (30, 40, 50). The storage controller 10 includes a connection chip 60, and the storage controller 20 includes a connection chip 70. The storage controller 10 is connected to the storage disks 30, 40, and 50 through the connection chip 60, and the storage controller 20 is connected to the storage disks 30, 40, and 50 through the connection chip 70. The storage controller 10 and the storage controller 20 are further connected to each other by a link, which may be a network connection or a direct bus connection. The storage controller 10 has 3 links, one to each of the 3 storage disks: link 01 connects the storage controller 10 to the storage disk 30 through the connection chip 60, link 02 connects it to the storage disk 40 through the connection chip 60, and link 03 connects it to the storage disk 50 through the connection chip 60. The storage controller 20 likewise has 3 links, one to each of the 3 storage disks: link 04 connects the storage controller 20 to the storage disk 30 through the connection chip 70, link 05 connects it to the storage disk 40 through the connection chip 70, and link 06 connects it to the storage disk 50 through the connection chip 70.
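The fig. 1 topology can be written down as a small lookup table. This is only an illustrative representation; the dictionary `LINKS` and the helper `links_to_disk` are hypothetical names, and the entries reflect the reading that links 01 to 03 belong to storage controller 10 and links 04 to 06 to storage controller 20.

```python
# Link id -> (storage controller, connection chip, storage disk), per fig. 1.
LINKS = {
    "01": (10, 60, 30),
    "02": (10, 60, 40),
    "03": (10, 60, 50),
    "04": (20, 70, 30),
    "05": (20, 70, 40),
    "06": (20, 70, 50),
}


def links_to_disk(disk: int) -> list:
    """Return the ids of all links whose target end is the given storage disk."""
    return sorted(link for link, (_, _, target) in LINKS.items() if target == disk)
```

For example, `links_to_disk(30)` returns `["01", "04"]`: one link through each controller's connection chip, which is what makes failover between controllers possible.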
The connection chips (60, 70) may be a peripheral component interconnect express (PCIe) chip or a Small Computer System Interface (SCSI) chip, where the connection chip in the embodiment of the present invention refers to a connection chip on the storage controller for connecting the storage disk.
The numbers of storage controllers and storage disks in the storage system of fig. 1 are examples only; the storage system provided by the embodiment of the invention includes at least two storage controllers and at least one storage disk. The hardware configuration of the storage controllers and storage disks is flexible. For example, it may be a disk-controller-separated configuration ("disk" refers to a storage disk and "controller" to a storage controller), in which the storage controllers are grouped together to form a controller enclosure and the storage disks are grouped together to form a disk enclosure; or it may be a disk-controller-integrated configuration, in which the storage controllers and the storage disks are housed together.
The storage disk in fig. 1 may be a conventional magnetic hard disk, a solid state disk (SSD), or a storage disk formed of another storage medium. In use, the storage disks in fig. 1 may form a redundant array of independent disks (RAID); storing data in such a disk array provides a more reliable storage system.
The storage controller in fig. 1 may be a hardware entity device. As shown in fig. 2, the storage controller 200 may include a processing unit 201 and a communication interface 202. The processing unit 201 is configured to perform the data storage function of the storage controller 200, and the communication interface 202 is configured to communicate with other devices, which may be an accessing host or other storage systems. For example, the processing unit 201 receives a data read request or a data write request sent by the accessing host through the communication interface 202; specifically, the communication interface 202 may be a network adapter card. Optionally, the hardware-based storage controller 200 may further include an input/output interface 203, which is connected with an input/output device for receiving input information and outputting an operation result. The input/output device may be a mouse, a keyboard, a display, an optical drive, or the like. Optionally, the hardware form of the storage controller 200 may further include a secondary storage 204, also commonly referred to as an external storage, and the storage medium of the secondary storage 204 may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., an optical disk), a semiconductor medium (e.g., a solid state disk), or the like.
The processing unit 201 is configured to perform a data storage function of the storage controller 200, and may have various specific implementation forms, for example, the processing unit 201 may include a processor 2011 and a memory 2012, the processor 2011 executes related data storage processing according to a program unit stored in the memory 2012, the processor 2011 may be a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU), and the processor 2011 may be a single-core processor or a multi-core processor. The processing unit 201 may also be implemented by using a logic device with built-in processing logic, such as a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), or the like.
The storage controller in fig. 1 may also be described in terms of its processing logic, which may be implemented as program code residing in a memory or by logic circuits. As shown schematically by the processing logic in the memory in fig. 2, the processing logic of the storage controller may include storage processing logic and storage management logic. The storage processing logic is used for executing the relevant data storage operation after receiving a data read or data write request sent by an accessing host. The storage management logic is used for managing the storage controller, the storage disks and the data processing procedures, monitoring device state and/or flow faults, and carrying out corresponding fault processing during execution of the storage processing logic.
In the storage system shown in fig. 1, for each storage disk, two links are connected to two storage controllers respectively. The storage management logic in the storage controller may be configured to manage all the storage disks, for example, to maintain link information of the storage disks, detect states of the storage disks, discover an abnormality in time, and make a diagnosis, so as to repair the abnormality as much as possible and ensure reliability of data storage. The storage management logic in the storage controller in the prior art performs failure handling in the following manner: the storage controller 10 receives the storage operation request, determines that the storage operation request needs to read the storage disk 30, and sends a read instruction to the storage disk 30 through its own connection chip 60, after determining that the sent read instruction is processed overtime, the storage controller 10 determines that IO overtime occurs, and then sends the read instruction to the storage controller 20 according to an internal predetermined fault processing strategy, so that the storage controller 20 sends the read instruction to the storage disk 30 again through the connection chip 70, and after the storage controller 20 succeeds in processing, the storage controller 20 returns a response message to the storage controller 10. Although the storage operation request (IO request) can be successfully processed by replacing the link in the end, the IO link is long and has a long time delay, which may cause the overall external service interruption phenomenon. In the prior art, in a scenario where a service has a large delay or interruption, for example, the storage controller 10 determines that an IO timeout sent to the storage disk 30 occurs, the storage disk 30 may also be repaired, however, if the IO timeout is caused by a failure of a link between the storage controller 10 and the storage disk 30, the repairing of the storage disk 30 is useless.
The present invention provides a method and an apparatus for processing a failure of a storage system, which are used to solve an IO delay or a service interruption caused by a link failure. According to the invention, the processing module of the link failure is added in the storage controller to identify the IO processing abnormity caused by the link failure (the failure of the connection chip), and the link is repaired correspondingly, so that the subsequent IO request can be switched back to the original link for processing, and the processing efficiency is improved.
The link failure processing module provided by the storage controller in the embodiment of the present invention may be a functional enhancement of the storage management logic, or may be a separate processing logic independent of the storage management logic, and the link failure processing module is used to identify a link failure and process a repair of the link failure. The specific fault handling procedure and details will be described in detail in the following embodiments.
Before describing particular embodiments, the naming of links used in the embodiments of the present invention is unified here for convenience and clarity. In the storage system of the embodiments, each storage disk has at least two links, connected to different storage controllers. In the storage system shown in fig. 1, since there are two storage controllers (10, 20), each storage disk has two links, one through each storage controller. For example, two links exist between the storage disk 30 and the storage controller 10: the first is link 01, over which the storage controller 10 is connected to the storage disk 30 through the connection chip 60; the second is link 04, over which the storage controller 10 is connected to the storage disk 30 through the connection chip 70 of the storage controller 20. Likewise, two links exist between the storage disk 40 and the storage controller 10: the first is link 02, over which the storage controller 10 is connected to the storage disk 40 through the connection chip 60, and the second is link 05, over which the storage controller 10 is connected to the storage disk 40 through the connection chip 70 of the storage controller 20. It can be seen that there are n links between each storage disk and each storage controller, where n is the number of storage controllers in the storage system. To distinguish these links in this document, a link including a connection chip of a first storage controller is referred to as a first link (for example, links 01, 02, 03 of connection chip 60 of storage controller 10 in fig. 1), a link including a connection chip of a second storage controller is referred to as a second link (for example, links 04, 05, 06 of connection chip 70 of storage controller 20 in fig. 1), and a link including a connection chip of an nth storage controller is referred to as an nth link. The branches of a link are distinguished by the target end of the connection. For example, the connection of the first link whose target end is the storage disk 30 is referred to as a first branch of the first link (for example, link 01), the connection of the second link whose target end is the storage disk 30 is referred to as a first branch of the second link (for example, link 04), the connection of the first link whose target end is the storage disk 40 is referred to as a second branch of the first link (for example, link 02), and the connection of the second link whose target end is the storage disk 40 is referred to as a second branch of the second link (for example, link 05).
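The naming convention can be made concrete with a short sketch that maps "kth branch of the nth link" to the link labels used in fig. 1. The functions `link_label` and `branch_target` are hypothetical, and the numbering formula is an assumption inferred from the fig. 1 examples (01 to 03 for the first link, 04 to 06 for the second), not something the text states as a rule.

```python
# Storage disks of fig. 1, in branch order: branch 1 -> disk 30, and so on.
DISKS = [30, 40, 50]


def branch_target(branch_index: int) -> int:
    """Return the storage disk at the target end of the given branch (1-based)."""
    if not 1 <= branch_index <= len(DISKS):
        raise ValueError("branch index out of range")
    return DISKS[branch_index - 1]


def link_label(link_index: int, branch_index: int) -> str:
    """Return the fig. 1 label for a branch of the nth link (both 1-based).

    Assumes labels are numbered consecutively, link by link.
    """
    if link_index < 1:
        raise ValueError("link index must be at least 1")
    branch_target(branch_index)  # validates the branch index
    return f"{(link_index - 1) * len(DISKS) + branch_index:02d}"
```

Under this assumption, the first branch of the second link is `link_label(2, 1)`, that is "04", and its target end is `branch_target(1)`, that is storage disk 30, matching the examples in the text.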
Fig. 3 shows a specific implementation process of the link failure processing method provided by the embodiment of the present invention. It should be noted that the storage system shown in fig. 3 is a simplified version of the storage system shown in fig. 1: the connection chips on the storage controllers, the other storage disks, and the connection relationships between the storage disks and the storage controllers are omitted, because fig. 3 is mainly used to explain the processing flow. In step 301, the accessing host initiates a first data operation request to the first storage controller; the first data operation request is used for data access to the first storage disk. In step 302, the first storage controller sends a first operation instruction to the first storage disk through the first branch of the first link, requesting a read operation or a write operation on the corresponding data, where the first link includes the connection chip of the first storage controller. In step 303, on detecting that execution of the first operation instruction has timed out, the first storage controller switches links and forwards the first operation instruction to the second storage controller over the connection between the two storage controllers. In step 304, the second storage controller sends the first operation instruction to the first storage disk through its own connection chip. Taken together, steps 303 and 304 mean that the first storage controller forwards the first operation instruction to the first storage disk through the first branch of the second link, where the second link includes the connection chip of the second storage controller. After the first storage disk completes the operation, it sends an operation success response for the first operation instruction to the second storage controller in step 305, and the second storage controller forwards the operation success response to the first storage controller in step 306. In step 307, the first storage controller sends an access response to the host.
In step 308 (steps 307 and 308 need not be executed in a fixed order, and their order may be interchanged), after receiving the operation success response of the first operation instruction through the first branch of the second link, the first storage controller counts the number of operation anomalies. According to a predetermined statistical rule, the count is increased by one when the rule determines that it needs to be increased, where an operation anomaly indicates that an operation instruction received by the first storage controller timed out when executed through the first link but was executed successfully through the second link. Specifically, there may be two statistical rules. Under the first statistical rule, operation anomalies on each branch of a link are counted only once: the count is increased by one for the first operation anomaly on each branch of the first link, and later operation anomalies on the same branch are not counted. Under the second statistical rule, the count is increased by one every time an operation anomaly occurs, regardless of whether it is the first operation anomaly on its branch. In this embodiment, under the first statistical rule, the first storage controller determines that this is the first operation anomaly on the first branch of the first link, and the count of operation anomalies is updated from zero to one.
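The two statistical rules differ only in whether repeated anomalies on the same branch are counted. A minimal sketch follows, assuming a hypothetical `AnomalyCounter` class and integer branch indices; neither appears in the original text.

```python
class AnomalyCounter:
    """Counts operation anomalies on the branches of one link."""

    def __init__(self, once_per_branch: bool):
        # True selects the first statistical rule, False the second.
        self.once_per_branch = once_per_branch
        self.count = 0
        self._branches_seen = set()

    def record(self, branch_index: int) -> int:
        """Record one anomaly on the given branch and return the updated count."""
        if self.once_per_branch:
            if branch_index in self._branches_seen:
                return self.count  # not the first anomaly on this branch: keep the count
            self._branches_seen.add(branch_index)
        self.count += 1
        return self.count


rule_one = AnomalyCounter(once_per_branch=True)
rule_two = AnomalyCounter(once_per_branch=False)
for branch in (1, 1, 2):  # two anomalies on branch 1, one on branch 2
    rule_one.record(branch)
    rule_two.record(branch)
```

For this sequence the first rule ends at a count of two and the second rule at three, which is why the two rules call for different thresholds.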
An operation anomaly may be identified as follows. After detecting in step 303 that execution of the first operation instruction has timed out, the first storage controller may record a first flag, which indicates that execution of the first operation instruction through the first branch of the first link timed out. In step 306, after receiving the operation success response forwarded by the second storage controller, the first storage controller records a second flag, which indicates that the first operation instruction was executed successfully through the first branch of the second link. The first storage controller then determines whether the first operation instruction carries both the first flag and the second flag; if it does, an operation anomaly is determined to have occurred during execution of the first operation instruction. After determining the operation anomaly, the first storage controller either increases the count of operation anomalies by one or leaves it unchanged, according to the first statistical rule or the second statistical rule.
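The two-flag check can be sketched with a bit-flag type. The names `OpFlag` and `is_operation_anomaly` are hypothetical and chosen only for illustration.

```python
from enum import Flag, auto


class OpFlag(Flag):
    NONE = 0
    FIRST_LINK_TIMEOUT = auto()   # first flag: timed out on the first link branch
    SECOND_LINK_SUCCESS = auto()  # second flag: succeeded on the second link branch


def is_operation_anomaly(flags: OpFlag) -> bool:
    """An operation anomaly requires both flags on the same operation instruction."""
    required = OpFlag.FIRST_LINK_TIMEOUT | OpFlag.SECOND_LINK_SUCCESS
    return (flags & required) == required


flags = OpFlag.NONE
flags |= OpFlag.FIRST_LINK_TIMEOUT    # recorded in step 303
flags |= OpFlag.SECOND_LINK_SUCCESS   # recorded in step 306
```

An instruction that only timed out, or that simply succeeded without a prior timeout, carries a single flag and is not treated as an operation anomaly.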
After determining in step 308 that an operation anomaly has occurred on the first branch of the first link, the first storage controller may also set a failure tag for the first branch of the first link, the failure tag indicating that the first branch of the first link is unavailable or that its level is reduced. Suppose that, after the failure tag has been set, the accessing host initiates a second data operation request for accessing the first storage disk. On receiving this subsequent request, the first storage controller may perform direct link switching or link offloading according to the failure tag of the first branch of the first link; for example, subsequent operation instructions are no longer forwarded through the first branch of the first link but are forwarded directly through the first branch of the second link. In this embodiment, based on the failure tag of the first branch of the first link, the first storage controller forwards the second operation instruction directly through the second link, obtains the response to the second operation instruction through the second link, and sends a response to the second data operation request to the accessing host (steps 309 to 314). Until the failure of the first link is repaired, this direct link switching or link offloading avoids execution delay or execution failure of subsequent operation instructions.
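Routing by failure tag amounts to a per-branch lookup before each operation instruction is sent. The sketch below assumes a hypothetical `BranchRouter` class that keeps the tags in a set and handles only the two-link case described in this embodiment.

```python
FIRST_LINK = 1
SECOND_LINK = 2


class BranchRouter:
    """Selects a link for each target disk based on per-branch failure tags."""

    def __init__(self):
        self.failed_branches = set()  # branch indices of the first link carrying a failure tag

    def set_failure_tag(self, branch_index: int) -> None:
        self.failed_branches.add(branch_index)

    def select_link(self, branch_index: int) -> int:
        # A tagged branch is bypassed without waiting for another timeout.
        if branch_index in self.failed_branches:
            return SECOND_LINK
        return FIRST_LINK


router = BranchRouter()
router.set_failure_tag(1)  # anomaly seen on the first branch of the first link
```

Requests for the first storage disk now go straight to the second link, while requests for disks whose branches carry no tag still use the first link.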
Next, in step 315, the accessing host initiates a third data operation request to the first storage controller; the third data operation request is used for data access to the second storage disk. In step 316, the first storage controller sends a third operation instruction to the second storage disk through the second branch of the first link, requesting a read operation or a write operation on the corresponding data. In step 317, on detecting that execution of the third operation instruction has timed out, the first storage controller switches links and forwards the third operation instruction to the second storage controller over the connection between the two storage controllers. In step 318, the second storage controller sends the third operation instruction to the second storage disk through its own connection chip. Taken together, steps 317 and 318 mean that the first storage controller forwards the third operation instruction to the second storage disk through the second branch of the second link, where the second link includes the connection chip of the second storage controller. After the second storage disk completes the operation, it sends an operation success response for the third operation instruction to the second storage controller in step 319, and the second storage controller forwards that response to the first storage controller in step 320. In step 321, the first storage controller sends the corresponding access response to the host.
In step 322, after receiving the operation success response of the third operation instruction through the second branch of the second link, the first storage controller counts the number of operation anomalies again. According to the predetermined statistical rule, the count is increased by one when the rule determines that it needs to be increased, where an operation anomaly indicates that an operation instruction received by the first storage controller timed out when executed through the first link but was executed successfully through the second link. In this embodiment, under the first statistical rule, the first storage controller determines that this is the first operation anomaly on the second branch of the first link, and the count of operation anomalies is updated from one to two. After determining in step 322 that an operation anomaly has occurred on the second branch of the first link, the first storage controller may also set a failure tag for the second branch of the first link, the failure tag indicating that the second branch of the first link is unavailable or that its level is reduced. If, after the failure tag of the second branch of the first link has been set, the accessing host initiates another data operation request for the second storage disk, the first storage controller, on receiving that request, may perform direct link switching or link offloading according to the failure tag of the second branch of the first link.
Further, the first storage controller may start a timer, either before step 301 or at any time from step 301 to step 322, to monitor the count of operation anomalies over a period of time. When the count of operation anomalies reaches a predetermined threshold within a predetermined time, the count indicates that the first link itself has failed, and the first storage controller may then perform fault repair of the first link. For example, in step 323 the timer expires, the first storage controller determines whether the count of operation anomalies has reached the predetermined threshold, and if it has, the first storage controller performs the fault repair of the first link. The first statistical rule counts only the first operation anomaly on each branch of the link; accordingly, the predetermined threshold is set to the number of storage disks, that is, the number of branches of the link. When an operation anomaly has occurred on every branch of the link (in other words, on the data operation instructions sent to every storage disk), the predetermined threshold is met and the link can be determined to have failed; repairing the connection chip of the link then resolves the failure. Under the second statistical rule, every operation anomaly on the link is counted regardless of whether it is the first on its branch; in this case the predetermined threshold is set to an empirical value, the link failure is identified with a certain probability, and the connection chip of the link is repaired once the link failure has been identified.
Under the first statistical rule, the predetermined threshold may also be smaller than the number of storage disks. For example, the predetermined threshold may be set to two thirds of the number of storage disks; when the count of operation anomalies reaches two thirds of the number of storage disks, that is, when operation anomalies have occurred on two thirds of the branches of the link, the link can likewise be identified as having failed.
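As a worked illustration of the threshold arithmetic under the first statistical rule, the sketch below derives the threshold from the number of storage disks. The function name is hypothetical, and rounding up when the product is not a whole number is an assumption; the text does not say how a fractional result is handled.

```python
import math
from fractions import Fraction


def rule_one_threshold(disk_count: int, fraction: Fraction = Fraction(1)) -> int:
    """Threshold for the first statistical rule: a fraction of the branch count.

    Rounding up and the floor of 1 are assumptions of this sketch.
    """
    return max(1, math.ceil(disk_count * fraction))


full = rule_one_threshold(6)                         # every branch must show an anomaly
two_thirds = rule_one_threshold(6, Fraction(2, 3))   # two thirds of the branches
```

With six storage disks the full threshold is 6 and the two-thirds threshold is 4; exact fractions are used instead of floating-point values so that the rounding is predictable.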
The predetermined time may be tracked by a timer or implemented in another manner. If the count of operation anomalies reaches the predetermined threshold before the predetermined time has elapsed, a link failure may likewise be deemed to have occurred.
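The time-bounded threshold check can be sketched as a small monitor that is told the current time on each anomaly. The class name and the injected timestamps are hypothetical, and ignoring anomalies that arrive after the window has closed is an assumption of this sketch rather than something the text specifies. The comparison uses "reaches" (greater than or equal), following the wording of the description.

```python
class AnomalyWindow:
    """Tracks the anomaly count of one link within a predetermined time."""

    def __init__(self, threshold: int, window_seconds: float, started_at: float):
        self.threshold = threshold
        self.deadline = started_at + window_seconds
        self.count = 0

    def record_anomaly(self, now: float) -> bool:
        """Count one anomaly and report whether the link is deemed failed."""
        if now <= self.deadline:
            self.count += 1  # anomalies after the window closes are ignored (assumption)
        return self.link_failed()

    def link_failed(self) -> bool:
        # May be checked when the timer expires, or earlier after any anomaly.
        return self.count >= self.threshold


window = AnomalyWindow(threshold=2, window_seconds=10.0, started_at=0.0)
first = window.record_anomaly(now=1.0)
second = window.record_anomaly(now=2.0)
```

Passing timestamps in, rather than reading a clock inside the class, keeps the sketch deterministic and lets the same check run either at timer expiry or as soon as the threshold is reached.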
Specifically, since the first link includes the connection chip of the first storage controller, the failure of the first link can be repaired by the first storage controller repairing that connection chip. More specifically, the first storage controller may restart the connection chip of the first storage controller; or isolate the connection chip of the first storage controller; or repair a queue on the connection chip of the first storage controller; or repair a port on the connection chip of the first storage controller.
After the fault processing, the embodiment of the present invention may further perform the following operations. After performing the fault repair of the first link, the first storage controller may delete the failure tag of the first link, or set a normal tag for the first link, the normal tag indicating that the first link is available or that the level of the first link is normal. Then, on receiving a subsequent data operation request for the first storage disk, the first storage controller selects the first link directly to send the operation instruction, according to the state of the first link (the failure tag has been deleted, or the normal tag has been set). Because the path of the first link is shorter than that of the second link, subsequent operation instructions are processed more quickly; this avoids the delay that would be incurred by executing subsequent operation instructions through the second link and improves the processing efficiency of operation instructions. The embodiment of the present invention thus addresses the execution delay of operation instructions caused by a link failure in the storage system, avoids the inefficiency of the prior-art approach of handling the failure only by switching paths (in which operations succeed but slowly), and further improves the efficiency of the storage system.
In the embodiment of the present invention, after the first link has been repaired, the first storage controller may further detect whether the repair succeeded, and delete the failure tag of the first link or set the normal tag of the first link only after a successful repair. For example, the first storage controller detects whether the connection chip in the first link has been repaired successfully, and deletes the failure tag of the first link or sets the normal tag of the first link once the state of the connection chip is normal. If the first link is detected not to have been repaired successfully, for example because the repair of the connection chip in the first link failed, the first storage controller may further send a fault maintenance notification for the first link, specifically a notification to maintain or replace the connection chip in the first link, so that the hardware fault can be resolved completely.
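The repair, verification, and follow-up steps can be sketched as one routine. All names here are hypothetical: the four `RepairAction` members mirror the repair options listed in the description, while `apply_repair`, `chip_is_healthy`, and `notify` are injected callbacks standing in for hardware-specific operations that the text does not detail.

```python
from enum import Enum


class RepairAction(Enum):
    RESTART_CHIP = "restart"
    ISOLATE_CHIP = "isolate"
    REPAIR_QUEUE = "repair_queue"
    REPAIR_PORT = "repair_port"


def repair_and_verify(action, apply_repair, chip_is_healthy, failure_tags, notify):
    """Repair the connection chip, then clear the tags or raise a maintenance notice.

    failure_tags is the mutable set of tagged branch indices of the first link.
    Returns True if the repair was verified as successful.
    """
    apply_repair(action)
    if chip_is_healthy():
        failure_tags.clear()  # delete the failure tag of every branch
        return True
    notify("connection chip in the first link needs maintenance or replacement")
    return False


tags = {1, 2}
notices = []
repaired = repair_and_verify(
    RepairAction.RESTART_CHIP, lambda action: None, lambda: True, tags, notices.append
)
```

Clearing the tags only after the health check passes keeps traffic on the second link while the connection chip is still faulty.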
The functional modules of the storage controller according to the embodiment of the present invention are described below. As shown in fig. 4, the storage controller includes a storage processing module 401 and a link failure processing module 402. The link failure processing module 402 may be a functional enhancement of the storage management logic, or may be separate processing logic independent of the storage management logic; those skilled in the art may implement it flexibly according to the description of the embodiment of the present invention.
The storage processing module 401 is configured to receive a first data operation request and to send a first operation instruction, through a first branch of a first link, to a first storage disk corresponding to the first data operation request, where the first link is a link including the connection chip of the first storage controller, and the first branch of the first link is the connection of the first link whose target end is the first storage disk.
The link failure processing module 402 is configured to forward the first operation instruction to the first storage disk through a first branch of a second link after detecting that execution of the first operation instruction has timed out, where the second link is a link including the connection chip of the second storage controller, and the first branch of the second link is the connection of the second link whose target end is the first storage disk. It is further configured to receive an operation success response of the first operation instruction transmitted through the first branch of the second link, to count the number of operation anomalies according to the operation success response, and to perform fault repair on the connection chip in the first link after the counted number of operation anomalies exceeds a predetermined threshold within a predetermined time, where an operation anomaly indicates that the first operation instruction timed out when executed through the first link but was executed successfully through the second link.
When counting operation anomalies, the link failure processing module 402 specifically increases the number of operation anomalies by one, or keeps the previously counted number unchanged, according to a statistical rule and the operation success response, where the statistical rule includes: operation anomalies occurring on each branch of the first link are counted only once. Accordingly, the predetermined threshold is the number N of storage disks in the storage system.
The link failure processing module 402 is further configured to set a failure tag for the nth branch of the first link after determining that an operation anomaly has occurred on the nth branch of the first link, where the failure tag indicates that the nth branch of the first link is unavailable or that the level of the nth branch of the first link is reduced, n is a natural number variable, and n is greater than or equal to 1 and less than or equal to N. After receiving a subsequent data operation request for the nth storage disk, the storage processing module 401 sends the subsequent operation instruction to the nth storage disk directly through the nth branch of the second link, according to the failure tag of the nth branch of the first link.
The link failure processing module 402 is further configured to delete the failure tag of each branch of the first link, or to set a normal tag for each branch of the first link, after performing fault repair on the connection chip in the first link, where the normal tag indicates that the first link is available or that the level of the first link is normal. After receiving a subsequent data operation request, the storage processing module 401 switches back to the first link, according to the state of the failure tag or the normal tag of the first link, to send the operation instruction to the storage disk targeted by the subsequent data operation request.
The specific functions of each functional module are also described in the embodiment shown in fig. 3, and are not described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.

Claims (13)

1. A storage fault processing method, applied to a storage system, wherein the storage system comprises at least one storage disk and at least two storage controllers, each storage controller comprises a connection chip, each storage controller is connected to each storage disk through the connection chip of that storage controller, and the at least two storage controllers are connected with each other;
the method comprises the following steps:
a first storage controller receives a first data operation request and sends a first operation instruction, through a first branch of a first link, to a first storage disk corresponding to the first data operation request, wherein the first link is a link including the connection chip of the first storage controller, the first branch of the first link is the connection of the first link whose target end is the first storage disk, and the first storage controller is any one of the at least two storage controllers;
when the first storage controller detects that execution of the first operation instruction has timed out, the first storage controller forwards the first operation instruction to the first storage disk through a first branch of a second link, wherein the second link is a link including the connection chip of a second storage controller, the first branch of the second link is the connection of the second link whose target end is the first storage disk, and the second storage controller is any storage controller connected with the first storage controller;
the first storage controller receives an operation success response of the first operation instruction transmitted through the first branch of the second link, and counts the number of operation anomalies according to the operation success response, wherein an operation anomaly indicates that an operation instruction received by the first storage controller timed out when executed through the first link but was executed successfully through the second link;
and the first storage controller determines that the counted number of operation anomalies exceeds a predetermined threshold within a predetermined time, and performs fault repair on the connection chip in the first link.
2. The method of claim 1, wherein said counting the number of operation anomalies according to said operation success response comprises:
the first storage controller increases the number of operation anomalies by one, or keeps the previously counted number of operation anomalies unchanged, according to a statistical rule and the operation success response, wherein the statistical rule comprises: operation anomalies occurring on each branch of the first link are counted only once; accordingly, the predetermined threshold is less than or equal to the number N of storage disks in the storage system.
3. The method of claim 1, wherein the method further comprises:
after determining that an operation anomaly has occurred on an nth branch of the first link, the first storage controller sets a failure tag for the nth branch of the first link, wherein the failure tag indicates that the nth branch of the first link is unavailable or that the level of the nth branch of the first link is reduced, n is a natural number variable, and n is greater than or equal to 1 and less than or equal to N;
after receiving a subsequent data operation request for the nth storage disk, the first storage controller sends a subsequent operation instruction to the nth storage disk directly through the nth branch of the second link, according to the failure tag of the nth branch of the first link.
4. The method of claim 3, wherein after performing fault repair on the connection chip in the first link, the method further comprises:
deleting the failure tag of each branch of the first link, or setting a normal tag for each branch of the first link, wherein the normal tag indicates that the branch of the first link is available or that the level of the branch of the first link is normal;
after receiving a subsequent data operation request, the first storage controller switches back to the first link, according to the state of the failure tag or the normal tag of the first link, to send an operation instruction to the storage disk targeted by the subsequent data operation request.
5. The method of any of claims 1-4, wherein after performing fault repair on the connection chip in the first link, the method further comprises:
detecting whether the connection chip in the first link has been repaired successfully;
and sending a fault maintenance notification for the connection chip in the first link after detecting that the repair of the connection chip in the first link was unsuccessful.
6. The method of any of claims 1-4, wherein the performing fault repair on the connection chip in the first link comprises:
restarting the connection chip of the first storage controller; or isolating the connection chip of the first storage controller; or repairing a queue on the connection chip of the first storage controller; or repairing a port on the connection chip of the first storage controller.
7. The method of any of claims 1-4, wherein after the first storage controller detects that execution of the first operation instruction has timed out, the method further comprises:
the first storage controller records a first flag, the first flag indicating that execution of the first operation instruction through the first branch of the first link timed out;
before counting the number of operation anomalies according to the operation success response, the method further comprises the following steps:
the first storage controller records a second flag, the second flag indicating that the first operation instruction was executed successfully through the first branch of the second link;
determining whether the first operation instruction has both the first flag and the second flag;
and if the first operation instruction has both the first flag and the second flag, determining that an operation anomaly has occurred.
8. A storage system, comprising:
at least one storage disk and at least two storage controllers;
each storage controller comprises a connection chip, and each storage controller is connected to each storage disk through the connection chip of that storage controller;
the at least two storage controllers are connected to each other;
a first storage controller, configured to receive a first data operation request, send a first operation instruction through a first branch of a first link to a first storage disk corresponding to the first data operation request, and forward the first operation instruction to the first storage disk through a first branch of a second link when detecting that execution of the first operation instruction has timed out, wherein the first storage controller is any one of the at least two storage controllers, a second storage controller is any storage controller connected to the first storage controller, the first link is a link including the connection chip of the first storage controller, the first branch of the first link is the connection of the first link whose target end is the first storage disk, the second link is a link including the connection chip of the second storage controller, and the first branch of the second link is the connection of the second link whose target end is the first storage disk; and
the first storage controller is further configured to receive an operation success response of the first operation instruction transmitted through the first branch of the second link, count the number of operation anomalies according to the operation success response, and perform fault repair on the connection chip in the first link after the number of operation anomalies exceeds a predetermined threshold within a predetermined time, wherein an operation anomaly indicates that an operation instruction received by the first storage controller timed out when executed through the first link but was executed successfully through the second link.
9. The storage system of claim 8, wherein the first storage controller counting the number of operation anomalies specifically comprises: increasing the number of operation anomalies by one, or keeping the previously counted number of operation anomalies unchanged, according to a statistical rule and the operation success response, wherein the statistical rule comprises: operation anomalies occurring on each branch of the first link are counted only once; accordingly, the predetermined threshold is less than or equal to the number N of storage disks in the storage system.
10. The storage system of claim 8, wherein the first storage controller is further configured to set a failure tag for the nth branch of the first link after determining that an operation anomaly has occurred on the nth branch of the first link, the failure tag indicating that the nth branch of the first link is unavailable or that the level of the nth branch of the first link is reduced;
the first storage controller is further configured to send a subsequent operation instruction to the nth storage disk directly through the nth branch of the second link, according to the failure tag of the nth branch of the first link, after receiving a subsequent data operation request for the nth storage disk, where n is a natural number variable, and n is greater than or equal to 1 and less than or equal to N.
11. The storage system according to claim 10, wherein the first storage controller is further configured to delete the failure tag of each branch of the first link, or set a normal tag for each branch of the first link, after performing fault repair on the connection chip in the first link, the normal tag indicating that the branch of the first link is available or that the level of the branch of the first link is normal;
the first storage controller is further configured to, after receiving a subsequent data operation request, switch back to the first link, according to the state of the failure tag or the normal tag of the first link, to forward the operation instruction to the storage disk corresponding to the subsequent data operation request.
12. The storage system according to any one of claims 8 to 11, wherein the first storage controller is further configured to detect whether the connection chip in the first link has been repaired successfully, and to send a fault maintenance notification for the connection chip of the first storage controller after detecting that the repair of the connection chip in the first link was unsuccessful.
13. The storage system according to any one of claims 8 to 11, wherein the performing, by the first storage controller, of fault repair on the connection chip of the first storage controller specifically includes: restarting the connection chip of the first storage controller; or isolating the connection chip of the first storage controller; or repairing a queue on the connection chip of the first storage controller; or repairing a port on the connection chip of the first storage controller.
CN201711377004.1A 2017-12-19 2017-12-19 Storage system and fault processing method thereof Active CN109933478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711377004.1A CN109933478B (en) 2017-12-19 2017-12-19 Storage system and fault processing method thereof

Publications (2)

Publication Number Publication Date
CN109933478A CN109933478A (en) 2019-06-25
CN109933478B CN109933478B (en) 2021-02-26

Family

ID=66983970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711377004.1A Active CN109933478B (en) 2017-12-19 2017-12-19 Storage system and fault processing method thereof

Country Status (1)

Country Link
CN (1) CN109933478B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782137A (en) * 2020-06-17 2020-10-16 杭州宏杉科技股份有限公司 Path fault processing method and device
CN111858122A (en) * 2020-07-29 2020-10-30 北京浪潮数据技术有限公司 Fault detection method, device, equipment and storage medium of storage link
CN112286743B (en) * 2020-10-23 2023-01-06 苏州浪潮智能科技有限公司 Storage equipment backboard link detection and diagnosis device and method
CN114020661B (en) * 2021-10-27 2023-07-25 浪潮(北京)电子信息产业有限公司 Storage device and configuration method thereof
CN113986142B (en) * 2021-11-09 2023-08-08 苏州浪潮智能科技有限公司 Disk fault monitoring method, device, computer equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN103001998A (en) * 2011-12-19 2013-03-27 深圳市安云信息科技有限公司 FC-SAN (fiber channel-storage area network) storage system and method for improving stability of fiber channel
CN103428333A (en) * 2012-05-15 2013-12-04 宇龙计算机通信科技(深圳)有限公司 Mobile terminal, server and error restoration method
CN104407999A (en) * 2014-11-04 2015-03-11 浪潮(北京)电子信息产业有限公司 Information security access architecture, method and system
CN104917624A (en) * 2014-03-10 2015-09-16 华耀(中国)科技有限公司 Health check system and method for link aggregation path
CN105389127A (en) * 2015-11-04 2016-03-09 华为技术有限公司 Method and apparatus for transmitting message in storage system, storage system and controller

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
JP5142669B2 (en) * 2007-11-02 2013-02-13 株式会社東芝 Communication device and method and program for identifying cause of failure
US9794808B2 (en) * 2016-02-17 2017-10-17 King Fahd University Of Petroleum And Minerals Route repair of Ad hoc On-demand Distance Vector routing protocol in a wireless sensor network

Non-Patent Citations (1)

Title
Research on Fault Detection Technology for Big Data Storage Systems; Li Peichun (黎沛春); China Master's Theses Full-text Database; 20170515; full text *

Also Published As

Publication number Publication date
CN109933478A (en) 2019-06-25

Similar Documents

Publication Publication Date Title
CN109933478B (en) Storage system and fault processing method thereof
CN109725822B (en) Method, apparatus and computer program product for managing a storage system
US7603583B2 (en) Fault recovery method in a system having a plurality of storage system
CN102880522B (en) Hardware fault-oriented method and device for correcting faults in key files of system
US9886451B2 (en) Computer system and method to assist analysis of asynchronous remote replication
US8122158B1 (en) Method for improving I/O performance of host systems by applying future time interval policies when using external storage systems
WO2021047234A1 (en) Hard disk management method and apparatus
CN105468484A (en) Method and apparatus for determining fault location in storage system
CN113051104B (en) Method and related device for recovering data between disks based on erasure codes
US9069712B2 (en) Communication of conditions at a primary storage controller to a host
US10606490B2 (en) Storage control device and storage control method for detecting storage device in potential fault state
CN105607973B (en) Method, device and system for processing equipment fault in virtual machine system
JP2015099487A (en) Information processing device, control device and control program
CN113176963A (en) PCIe fault self-repairing method, device, equipment and readable storage medium
CN103870367A (en) SAS (Serial Attached SCSI (small computer system interface)) expander automatic switching system and method
US7861112B2 (en) Storage apparatus and method for controlling the same
US11704180B2 (en) Method, electronic device, and computer product for storage management
CN111240903A (en) Data recovery method and related equipment
US9251016B2 (en) Storage system, storage control method, and storage control program
JP2018005826A (en) Control apparatus and storage device
US11747990B2 (en) Methods and apparatuses for management of raid
TWI756007B (en) Method and apparatus for performing high availability management of all flash array server
US9990382B1 (en) Secure erasure and repair of non-mechanical storage media
CN111190781A (en) Test self-check method of server system
CN113868000B (en) Link fault repairing method, system and related components

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200426

Address after: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Applicant after: HUAWEI TECHNOLOGIES Co.,Ltd.

Address before: 301, A building, room 3, building 301, foreshore Road, No. 310052, Binjiang District, Zhejiang, Hangzhou

Applicant before: Hangzhou Huawei Digital Technology Co.,Ltd.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220223

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Patentee after: Huawei Cloud Computing Technologies Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.
