CN114064401A - Method and device for positioning hard disk fault, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114064401A
Authority
CN
China
Prior art keywords
hard disk
positioning
fault
information
state information
Prior art date
Legal status
Pending
Application number
CN202111294917.3A
Other languages
Chinese (zh)
Inventor
李杨杨
范晓晋
刘星星
张孟威
何永占
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111294917.3A
Publication of CN114064401A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3037 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a method and a device for locating hard disk faults, an electronic device and a storage medium, and relates to the technical field of computers. It at least solves the technical problem in the prior art that, because faulty hard disks are located and replaced by relying on the hard disk topology provided by the server manufacturer, hard disk replacement in server hard disk operation and maintenance is inefficient and inaccurate. The specific implementation scheme is as follows: acquiring hard disk fault state information of a target device, where the hard disk fault state information indicates that a hard disk fault that has not been automatically repaired exists on the target device; uploading the hard disk fault state information to a target server; receiving a first positioning control instruction issued by the target server based on the hard disk fault state information; and executing a first positioning operation according to the first positioning control instruction, so as to indicate the first hard disk on the target device that has the hard disk fault.

Description

Method and device for positioning hard disk fault, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for locating a hard disk fault, an electronic device, and a storage medium.
Background
In the storage equipment of a data center, the hard disk is, apart from memory, the component with the highest failure rate, so the large-scale replacement of failed hard disks carried out every year is a very important task.
In the existing scheme, server hard disks are replaced based on the overall server hard disk topology provided by the server manufacturer: after-sales personnel of the manufacturer find the failed hard disk from the hard disk configuration records kept by the manufacturer and then replace it, where a configuration record includes the SLOT number of the hard disk, its serial number (SN for short), and the like. However, this approach has the following disadvantages: it depends on the manufacturer's hard disk topology, relies on manual identification during replacement, and lacks a unified processing tool, so replacing hard disks in high-density hard disk servers (such as 38-disk, 68-disk and 98-disk servers) is inefficient and inaccurate.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The present disclosure provides a method, an apparatus, an electronic device and a storage medium for locating a hard disk fault, so as to at least solve the technical problem in the prior art that hard disk replacement in server hard disk operation and maintenance is inefficient and inaccurate because faulty hard disks are located and replaced by relying on the hard disk topology of the server manufacturer.
According to an aspect of the present disclosure, there is provided a method for locating a hard disk fault, including: acquiring hard disk fault state information of a target device, where the hard disk fault state information indicates that a hard disk fault that has not been automatically repaired exists on the target device; uploading the hard disk fault state information to a target server; receiving a first positioning control instruction issued by the target server based on the hard disk fault state information; and executing a first positioning operation according to the first positioning control instruction, so as to indicate the first hard disk on the target device that has the hard disk fault.
According to another aspect of the present disclosure, there is provided an apparatus for locating a hard disk fault, including: an acquisition module, configured to acquire hard disk fault state information of a target device, where the hard disk fault state information indicates that a hard disk fault that has not been automatically repaired exists on the target device; a sending module, configured to upload the hard disk fault state information to a target server; a receiving module, configured to receive a first positioning control instruction issued by the target server based on the hard disk fault state information; and a positioning module, configured to execute a first positioning operation according to the first positioning control instruction, so as to indicate the first hard disk on the target device that has the hard disk fault.
According to still another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method for locating a hard disk fault proposed by the present disclosure.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method for locating hard disk failure proposed by the present disclosure.
According to yet another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the method for locating a hard disk failure as set forth in the present disclosure.
In the embodiments of the present disclosure, hard disk fault state information of a target device is acquired, where the information indicates that a hard disk fault that has not been automatically repaired exists on the target device; the information is uploaded to a target server; a first positioning control instruction issued by the target server based on that information is received; and a first positioning operation is executed according to the instruction to indicate the first hard disk on the target device that has the fault. This locates hard disk faults quickly, accurately and automatically, enables accurate location and fast replacement of faulty hard disks in automated server hard disk operation and maintenance, and thereby solves the technical problem in the prior art that hard disk replacement is inefficient and inaccurate because faulty hard disks are located and replaced by relying on the server manufacturer's hard disk topology.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:
FIG. 1 is a block diagram of a hardware structure of a computer terminal (or mobile device) for implementing a method for locating a hard disk fault according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for locating a hard disk fault according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of optional system components and their interaction according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of optional hard disk state monitoring according to an embodiment of the present disclosure;
FIG. 5 is a flowchart of an optional hard disk replacement action according to an embodiment of the present disclosure;
FIG. 6 is a structural block diagram of an apparatus for locating a hard disk fault according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In server hard disk operation and maintenance, faulty hard disks need to be located accurately in batches and replaced quickly, so that the hard disk operation and maintenance process can be managed automatically and efficiently, while preventing hard disk replacement errors that would put services at risk.
In existing practice, there are two schemes for automated server hard disk operation and maintenance, aimed at different usage scenarios:
The first scheme locates hard disks through the joint configuration of an intelligent hard disk backplane and a Baseboard Management Controller (BMC): the BMC is connected to the intelligent hard disk backplane through an I2C bus and sends a locate or cancel-locate instruction to the microcontroller on the backplane, which turns the locate LED of a specific SLOT on or off. Server manufacturers currently support this scheme by default. However, in this mode the hard disk is located only by its SLOT number, which cannot be associated with the disk identifier or SN of the hard disk, so unless the scheme is effectively coordinated with the Basic Input Output System (BIOS) or the Operating System (OS), the faulty hard disk cannot be fully located.
The second scheme locates hard disks using positioning information configured in advance for the server: the SLOT corresponding to each disk identifier is stored in the OS, and once the hard disk with a given disk identifier fails, the OS can immediately look up the positioning information of the failed hard disk. However, this scheme suits servers with a simple hard disk topology and does not suit servers with a complex hard disk topology or high-density servers. In addition, it cannot turn the locate LED of a specific SLOT on the intelligent hard disk backplane on or off.
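To make the limitation of the second scheme concrete, the following is a minimal Python sketch of a pre-configured lookup table, assuming a plain dictionary from disk identifier to SLOT number; the table contents and the function name `lookup_slot` are hypothetical. Every identifier must be recorded in advance, and Linux `sdX` names are assigned in enumeration order, so they are not guaranteed to stay the same across reboots or hot swaps.

```python
from typing import Dict, Optional

# Hypothetical table written into the OS in advance: disk identifier -> SLOT.
PRECONFIGURED_SLOTS: Dict[str, int] = {"sda": 0, "sdb": 1, "sdc": 2}


def lookup_slot(disk_id: str,
                table: Dict[str, int] = PRECONFIGURED_SLOTS) -> Optional[int]:
    """Return the pre-recorded SLOT for a disk identifier, or None if absent."""
    return table.get(disk_id)


print(lookup_slot("sdb"))   # 1
print(lookup_slot("sdaa"))  # None: never recorded, e.g. on a high-density server
```

The lookup itself is trivial; the weakness is keeping the table correct as the topology grows or changes.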
The existing schemes cannot achieve automated server hard disk operation and maintenance, and suffer from the technical problem that hard disk replacement is inefficient and inaccurate because faulty hard disks are located and replaced by relying on the server manufacturer's hard disk topology.
According to an embodiment of the present disclosure, a method for locating a hard disk fault is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
The method embodiments provided by the embodiments of the present disclosure may be executed in a mobile terminal, a computer terminal or similar electronic devices. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein. Fig. 1 shows a hardware structure block diagram of a computer terminal (or mobile device) for implementing the method for locating a hard disk failure.
As shown in FIG. 1, the computer terminal 100 includes a computing unit 101 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 102 or a computer program loaded from a storage unit 108 into a Random Access Memory (RAM) 103. In the RAM 103, various programs and data necessary for the operation of the computer terminal 100 can also be stored. The computing unit 101, the ROM 102, and the RAM 103 are connected to each other via a bus 104. An input/output (I/O) interface 105 is also connected to bus 104.
A number of components in the computer terminal 100 are connected to the I/O interface 105, including: an input unit 106 such as a keyboard, a mouse, and the like; an output unit 107 such as various types of displays, speakers, and the like; a storage unit 108, such as a magnetic disk, optical disk, or the like; and a communication unit 109 such as a network card, modem, wireless communication transceiver, etc. The communication unit 109 allows the computer terminal 100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
Computing unit 101 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 101 performs the method of locating hard disk failures described herein. For example, in some embodiments, the method of locating hard disk failures may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 108. In some embodiments, part or all of the computer program may be loaded and/or installed onto the computer terminal 100 via the ROM 102 and/or the communication unit 109. When the computer program is loaded into RAM 103 and executed by computing unit 101, one or more steps of the method of locating a failed hard disk described herein may be performed. Alternatively, in other embodiments, the computing unit 101 may be configured to perform the method of locating a failed hard disk by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
It should be noted that, in some alternative embodiments, the electronic device shown in FIG. 1 may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both. FIG. 1 is only one specific example, intended to illustrate the types of components that may be present in the electronic device described above.
In the above operating environment, the present disclosure provides a method for locating a hard disk failure as shown in fig. 2, which may be executed by a computer terminal or similar electronic device as shown in fig. 1. FIG. 2 is a flowchart of a method of locating hard disk failures according to an embodiment of the present disclosure. As shown in fig. 2, the method may include the steps of:
step S20, acquiring hard disk fault state information of the target device, wherein the hard disk fault state information indicates that a hard disk fault which is not automatically repaired exists on the target device;
The target device may be a server in a data center. It includes a CPU, hard disks, memory, a system bus and the like, and may be used to manage the computing resources of the data center. While the data center is running, the target device can detect faults of its internal hard disks, try to repair them automatically, mark the faults that cannot be repaired automatically, and store the unrepaired hard disk faults on the target device as the hard disk fault state information.
The hard disk is one of the important devices in a data center; common hard disk interfaces include SATA, SAS and NVMe. Actual data from the operation and maintenance platform show that, apart from memory, the hard disk is the component with the highest failure rate. Failed hard disks are replaced manually on a large scale in data centers every year, and the replacement process requires accurate location and fast replacement of the failed hard disks.
Hard disk faults include hardware faults and software faults. A hardware fault, that is, a physical fault, is caused by physical damage to the mechanical parts or electronic components of the hard disk, such as a bad track. A software fault is a non-physical fault, for example damage to the master boot record, the partition table or the boot files that prevents the system from booting, a virus infection of the hard disk that prevents the system from running, or faults caused by illegal operations, improper maintenance and the like. Software faults can generally be repaired automatically by software.
Step S22, uploading the hard disk fault state information to a target server;
the target server can acquire the hard disk fault state information from the target equipment, and the hard disk fault state information can be used for subsequent accurate positioning and automatic replacement processes of hard disk faults in the target equipment.
Step S24, receiving a first positioning control instruction issued by a target server based on the hard disk fault state information;
after the target server obtains the hard disk failure state information in the target device, a first positioning control instruction can be generated and sent based on the hard disk failure state information. The first positioning control instruction is used for positioning the hard disk fault in the target device.
And step S26, executing a first positioning operation according to the first positioning control instruction to indicate that the first hard disk with the hard disk fault exists on the target device.
Optionally, the first positioning control instruction includes a first positioning operation, the first hard disk may be a hard disk with a fault on the target device, and the performing of the first positioning operation may implement an indication of a hard disk fault in the target device.
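Steps S20 to S26 can be sketched as one agent-side cycle. This is a minimal, synchronous Python sketch under stated assumptions: the classes `FaultState` and `LocateInstruction` and the three callables are hypothetical stand-ins for the acquisition, transport and positioning mechanisms, which the disclosure does not specify at this level; a real agent would likely upload and receive asynchronously.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FaultState:
    """Hard disk fault state information for one unrepaired fault (S20)."""
    host: str
    disk_id: str  # e.g. "sdb"
    sn: str       # serial number of the faulty disk


@dataclass
class LocateInstruction:
    """First positioning control instruction (S24); fields are assumed."""
    sn: str
    light_on: bool


def run_agent_cycle(
    acquire: Callable[[], List[FaultState]],
    upload: Callable[[FaultState], Optional[LocateInstruction]],
    locate: Callable[[LocateInstruction], None],
) -> int:
    """Run S20 -> S22 -> S24 -> S26 once; return how many locate ops ran."""
    executed = 0
    for state in acquire():            # S20: unrepaired faults only
        instruction = upload(state)    # S22 upload; S24 instruction back (if any)
        if instruction is not None:
            locate(instruction)        # S26: e.g. light the SLOT locate LED
            executed += 1
    return executed


lit: List[str] = []
n = run_agent_cycle(
    acquire=lambda: [FaultState("host-01", "sdb", "SN123")],
    upload=lambda s: LocateInstruction(sn=s.sn, light_on=True),
    locate=lambda ins: lit.append(ins.sn),
)
print(n, lit)  # 1 ['SN123']
```

Injecting the three steps as callables keeps the control flow of the method separate from any particular transport or LED-control tool.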
FIG. 3 is a schematic diagram of optional system components and their interaction according to an embodiment of the present disclosure. As shown in FIG. 3, this embodiment is a general scheme for accurately locating and replacing faulty server hard disks. It relies on an agent tool (an autonomously acting software or hardware entity) run as a system process on the server OS to report hard disk fault state information, replace the hard disk, locate the hard disk, confirm replacement completion and clear the fault, thereby fully automating operation and maintenance. The system of this embodiment is mainly divided into a hard disk group, an OS agent and an upper-layer integrated operation and maintenance platform, where the target device comprises the hard disk group and the OS agent, and the target server is the upper-layer integrated operation and maintenance platform.
As shown in FIG. 3, the OS agent is a real-time process running in the system. In this embodiment, the OS agent queries the state of the hard disks at a set frequency, tries to handle hard disk faults automatically, sends hard disk fault alarm information to the operation and maintenance platform in real time, receives "locate" or "cancel locate" instructions for a hard disk from the platform, and uses a hard disk positioning tool to locate, or cancel locating, the faulty hard disk. The hard disk fault alarm information includes the SN of the hard disk.
Still as shown in FIG. 3, the upper-layer integrated operation and maintenance platform performs integrated state monitoring and operation and maintenance for the servers of the data center. In this embodiment, it manages and controls the whole automated hard disk process in a unified way, including: receiving hard disk fault alarm information from the OS agent, determining whether to issue a work order for replacing the hard disk, making the corresponding SLOT locate LED blink after the work order is issued, and turning the corresponding SLOT locate LED off after the hard disk has been replaced successfully.
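The OS agent's two message paths described above (alarm out, locate or cancel-locate in) might look like the following sketch. The JSON field names, the action strings `locate` and `cancel_locate`, and the `set_locate_led` callback are assumptions for illustration; the only field the description requires in the alarm is the SN.

```python
import json
import time
from typing import Callable, Dict


def build_alarm(host: str, disk_id: str, sn: str, reason: str) -> str:
    """Serialize a hard disk fault alarm (field names are hypothetical)."""
    return json.dumps(
        {"host": host, "disk": disk_id, "sn": sn, "reason": reason,
         "ts": int(time.time())},
        sort_keys=True,
    )


def dispatch(instruction: Dict[str, str],
             set_locate_led: Callable[[str, bool], None]) -> None:
    """Map a platform instruction onto the (hypothetical) positioning tool."""
    action = instruction["action"]
    if action == "locate":
        set_locate_led(instruction["sn"], True)
    elif action == "cancel_locate":
        set_locate_led(instruction["sn"], False)
    else:
        raise ValueError(f"unknown action: {action!r}")


leds: Dict[str, bool] = {}
msg = json.loads(build_alarm("host-01", "sdb", "SN123", "smart_failed"))
dispatch({"action": "locate", "sn": msg["sn"]},
         lambda sn, on: leds.__setitem__(sn, on))
print(leds)  # {'SN123': True}
```

Rejecting unknown actions explicitly, rather than ignoring them, makes a protocol mismatch between agent and platform visible early.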
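On the platform side, the flow (alarm, work order, blinking locate LED, LED off after replacement) can be modeled as a small per-disk state machine. The sketch below is hypothetical: the class `Platform`, its states and the `send(sn, action)` callback are illustrative names, and the decision of whether to issue a work order is reduced to an explicit method call.

```python
from enum import Enum, auto
from typing import Callable, Dict, List, Tuple


class DiskCase(Enum):
    ALARMED = auto()    # alarm received from the OS agent
    LOCATING = auto()   # work order issued, SLOT locate LED blinking
    REPLACED = auto()   # disk replaced, locate LED turned off


class Platform:
    """Hypothetical platform-side bookkeeping, keyed by disk SN."""

    def __init__(self, send: Callable[[str, str], None]) -> None:
        self.send = send  # send(sn, action) delivers an instruction to the OS agent
        self.cases: Dict[str, DiskCase] = {}

    def on_alarm(self, sn: str) -> None:
        # Repeated alarms for a disk that is already being handled are ignored.
        self.cases.setdefault(sn, DiskCase.ALARMED)

    def issue_work_order(self, sn: str) -> None:
        if self.cases.get(sn) is DiskCase.ALARMED:
            self.send(sn, "locate")
            self.cases[sn] = DiskCase.LOCATING

    def on_replaced(self, sn: str) -> None:
        if self.cases.get(sn) is DiskCase.LOCATING:
            self.send(sn, "cancel_locate")
            self.cases[sn] = DiskCase.REPLACED


sent: List[Tuple[str, str]] = []
p = Platform(lambda sn, action: sent.append((sn, action)))
p.on_alarm("SN123")
p.issue_work_order("SN123")
p.on_replaced("SN123")
print(sent)  # [('SN123', 'locate'), ('SN123', 'cancel_locate')]
```

Guarding each transition on the current state means a late or duplicated event cannot, for example, switch off the LED of a disk that was never located.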
According to the present disclosure, through steps S20 to S26, hard disk fault state information of the target device is acquired, where the information indicates that a hard disk fault that has not been automatically repaired exists on the target device; the information is uploaded to the target server; a first positioning control instruction issued by the target server based on that information is received; and a first positioning operation is executed according to the instruction to indicate the first hard disk on the target device that has the fault. This locates hard disk faults quickly, accurately and automatically, enables accurate location and fast replacement of faulty hard disks in automated server hard disk operation and maintenance, and thereby solves the technical problem in the prior art that hard disk replacement is inefficient and inaccurate because faulty hard disks are located and replaced by relying on the server manufacturer's hard disk topology.
The above-described method of this embodiment is further described below.
As an optional implementation manner, the hard disk failure status information is obtained by at least one of the following manners: the hard disk fault state information is obtained through a first thread configured on target equipment, wherein the first thread is used for scanning a drive letter of each hard disk in a plurality of hard disks configured on the target equipment according to a preset time interval and carrying out fault detection on the hard disk corresponding to the drive letter so as to determine the first hard disk; the hard disk fault state information is obtained through a second thread configured on the target equipment, wherein the second thread is used for analyzing system log information of the target equipment to determine a first hard disk; the hard disk failure state information is obtained through a third thread configured on the target device, wherein the third thread is used for detecting whether the drive letter changes of a plurality of hard disks configured on the target device so as to determine the first hard disk.
The target device may be configured with at least one of the first thread, the second thread and the third thread to acquire the hard disk fault state information on the target device. The first thread scans the drive letter of each of the hard disks configured on the target device at a set frequency, that is, once every preset time interval. The second thread parses the system log information of the target device, which contains hard disk fault state information, to determine the first hard disk, that is, the faulty hard disk. The third thread detects whether the drive letters of the hard disks configured on the target device change, so as to determine the first hard disk. A drive letter is the identifier of a storage device scanned by the OS, so the hard disk fault state can be determined by analyzing changes in the drive letters.
FIG. 4 is a schematic diagram of optional hard disk state monitoring according to an embodiment of the present disclosure. As shown in FIG. 4, the target device's monitoring logic for hard disk state is configured with three threads: a periodic hard disk state query thread, a real-time system log query thread, and a hard disk hot-plug state monitoring thread.
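The three monitoring threads can share one skeleton: each produces disk identifiers that need a health check and pushes them onto a common queue. This Python sketch assumes that structure (the function `start_monitors` and the queue hand-off are hypothetical) and polls every source; in the embodiment the hot-plug thread is event-driven rather than polled, and the intervals would be 7200 s and 1 s rather than the tiny demo values.

```python
import queue
import threading
import time
from typing import Callable, Iterable, List

DiskSource = Callable[[], Iterable[str]]


def start_monitors(
    sources: List[DiskSource],
    intervals: List[float],
    out: "queue.Queue[str]",
    stop: threading.Event,
) -> List[threading.Thread]:
    """Start one daemon thread per source; each pushes disk IDs onto `out`."""

    def loop(source: DiskSource, interval: float) -> None:
        while not stop.is_set():
            for disk_id in source():
                out.put(disk_id)
            stop.wait(interval)  # sleeps, but wakes up early once stop is set

    threads = []
    for source, interval in zip(sources, intervals):
        t = threading.Thread(target=loop, args=(source, interval), daemon=True)
        t.start()
        threads.append(t)
    return threads


# Demo: stand-ins for the timed scan, log and hot-plug threads.
q: "queue.Queue[str]" = queue.Queue()
stop = threading.Event()
threads = start_monitors(
    [lambda: ["sda"], lambda: [], lambda: ["sdb"]], [0.01, 0.01, 0.01], q, stop
)
time.sleep(0.05)
stop.set()
for t in threads:
    t.join()
seen = set()
while not q.empty():
    seen.add(q.get())
print(sorted(seen))  # ['sda', 'sdb']
```

Funnelling all three sources into one queue lets a single health-check worker serialize checks, so two threads never probe the same disk at once.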
As shown in fig. 4, the hard disk state timing query thread may specify a preset interval according to an actual requirement, and in this embodiment, the preset interval is specified to be 7200 seconds according to an actual situation of occurrence of a hard disk fault state and a utilization rate of the thread to the CPU. And performing dynamic scanning on all hard disks in a device directory (device, abbreviated as/dev) once every 7200 seconds, wherein the dynamic scanning device supports an extended disk identifier of a high-density hard disk server, the extended disk identifier is marked as sd [ a-z ] [ a-z ], and health status detection is performed on each hard disk in all disk identifiers in the dynamic scanning once.
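The dynamic scan of /dev, including the extended `sd[a-z][a-z]` identifiers, reduces to filtering directory entries by name. A minimal sketch, assuming that whole-disk names are exactly `sd` plus one or two letters and that partitions (such as `sda1`) should be skipped; `filter_disks` and `scan_dev` are hypothetical names.

```python
import os
import re
from typing import Iterable, List

# Whole-disk names: sda..sdz plus the extended sdaa..sdzz of high-density
# servers. Partitions such as "sda1" do not match because of the "$" anchor.
DISK_NAME = re.compile(r"^sd[a-z]{1,2}$")

SCAN_INTERVAL_S = 7200  # interval chosen in this embodiment


def filter_disks(names: Iterable[str]) -> List[str]:
    """Keep whole-disk identifiers, sorted in kernel naming order."""
    disks = [n for n in names if DISK_NAME.match(n)]
    # Sort by length first so that "sdz" comes before "sdaa".
    return sorted(disks, key=lambda n: (len(n), n))


def scan_dev(dev_dir: str = "/dev") -> List[str]:
    """One dynamic scan of the device directory."""
    try:
        return filter_disks(os.listdir(dev_dir))
    except FileNotFoundError:
        return []


print(filter_disks(["sda", "sda1", "sdz", "sdaa", "sdab2", "nvme0n1", "tty0"]))
# ['sda', 'sdz', 'sdaa']
```

NVMe disks appear under different names (for example `nvme0n1`), so a scanner that also covers the NVMe interfaces mentioned earlier would need an additional pattern.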
Still as shown in fig. 4, the real-time query thread of the system log takes 1 second as a period, and performs real-time capture and further analysis of the system log information, and if keywords such as input/output errors (IO error) are found, it is determined whether to immediately trigger the health status detection of a certain hard disk according to a certain screening logic. Compared with the regular inquiry thread of the state of the hard disk, the real-time inquiry thread of the system log detects the health state of the failed hard disk according to the system log, and the trigger time of the health state detection is random.
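The log thread's keyword matching and screening could be sketched as below. Both the log line format and the screening rule are assumptions: the pattern targets block-layer messages of the form `I/O error, dev sdb, sector ...`, whose exact wording varies across kernel versions, and `screen` stands in for the unspecified screening logic by skipping disks checked within the last `min_gap_s` seconds.

```python
import re
from typing import Dict, Iterable, Set

# Assumed log shape, e.g. "blk_update_request: I/O error, dev sdb, sector 123456".
IO_ERROR = re.compile(r"I/O error,? dev (sd[a-z]{1,2})\b")


def disks_to_check(log_lines: Iterable[str]) -> Set[str]:
    """Collect disk identifiers mentioned in I/O error log lines."""
    hits: Set[str] = set()
    for line in log_lines:
        match = IO_ERROR.search(line)
        if match:
            hits.add(match.group(1))
    return hits


def screen(hits: Set[str], last_checked: Dict[str, float], now: float,
           min_gap_s: float = 60.0) -> Set[str]:
    """Hypothetical screening: skip disks checked less than min_gap_s ago."""
    due = {d for d in hits
           if now - last_checked.get(d, float("-inf")) >= min_gap_s}
    for d in due:
        last_checked[d] = now
    return due


lines = [
    "blk_update_request: I/O error, dev sdb, sector 123456 op 0x0:(READ)",
    "print_req_error: I/O error, dev sdaa, sector 8",
    "eth0: link up",
]
last: Dict[str, float] = {}
print(sorted(screen(disks_to_check(lines), last, now=100.0)))  # ['sdaa', 'sdb']
print(screen({"sdb"}, last, now=130.0))  # set(): checked only 30 s ago
```

Some debouncing of this kind matters at a 1-second polling period, because a failing disk can emit many error lines per second.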
Still as shown in FIG. 4, the hard disk hot-plug state monitoring thread uses a monitoring and notification mechanism (select + notify) to track changes in the number of hard disk identifiers under /dev. If a hard disk insertion event occurs and the number of disk identifiers increases, a health state check of the newly added hard disk is triggered immediately.
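The decision the hot-plug thread makes when /dev changes can be sketched as a comparison of two snapshots. The event source (select + notify in the embodiment) is not modelled here. The function receives the directory entries after a change notification, and `check_health` is a hypothetical callback. Snapshot diffing is used because it is easy to show in isolation. It is not presented as the patent's code.

```python
import re
from typing import Callable, Iterable, Set, Tuple

DISK_NAME_RE = re.compile(r"sd[a-z]{1,2}")  # whole disks only, no partitions


def disk_set(entries: Iterable[str]) -> Set[str]:
    """Current set of disk identifiers among /dev entries."""
    return {name for name in entries if DISK_NAME_RE.fullmatch(name)}


def handle_dev_change(
    previous: Set[str],
    entries: Iterable[str],
    check_health: Callable[[str], None],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Compare a fresh /dev snapshot with the previous one.

    Every newly inserted disk gets an immediate health check. Returns
    (current, inserted, removed) so the caller can keep `current` for next time.
    """
    current = disk_set(entries)
    inserted = current - previous
    removed = previous - current
    for name in sorted(inserted):
        check_health(name)
    return current, inserted, removed
```

Returning the removed set as well lets the same handler report a disk that has disappeared, which the third thread treats as a drive letter change.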
Still as shown in FIG. 4, when a health state check is performed on a hard disk, a traversal detection tool detects and determines the health state of the hard disk. If a hard disk fault is detected, an automatic repair operation on the hard disk is triggered immediately. If the repair fails, the hard disk fault state information is reported to a sending process in the operating system agent, and the sending process sends it to the upper-layer integrated operation and maintenance platform in a defined format. In this embodiment, the traversal detection tools are a SMART hard disk detection tool and a hard disk failure early-warning tool, and the operating system agent is the OS agent.
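The detect, repair, then report ordering above can be expressed as a small function with injected steps. `detect`, `repair` and `send` are hypothetical callbacks standing in for the SMART and early-warning tools, the automatic repair operation, and the OS agent's sending process. The return labels are likewise made up for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FaultReport:
    disk: str
    detail: str


def inspect_disk(
    disk: str,
    detect: Callable[[str], Optional[str]],
    repair: Callable[[str], bool],
    send: Callable[[FaultReport], None],
) -> str:
    """Detect a fault, attempt repair, and report only faults that remain unrepaired.

    `detect` returns a fault description, or None when the disk is healthy.
    Returns "healthy", "repaired", or "reported".
    """
    fault = detect(disk)
    if fault is None:
        return "healthy"
    if repair(disk):
        return "repaired"  # repaired faults never reach the platform
    send(FaultReport(disk=disk, detail=fault))
    return "reported"
```

Reporting only after a failed repair matches the definition of hard disk fault state information in this disclosure: a fault that has not been automatically repaired.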
As an optional implementation, executing the first positioning operation according to the first positioning control instruction in step S26 includes the following steps:
step S261, parsing the first positioning control instruction to obtain first identification information and first indicator light control information of the first hard disk, where the first indicator light control information is used to switch the indicator light corresponding to the first hard disk on;
step S262, acquiring first positioning information of the first hard disk based on the first identification information;
step S263, locating the first hard disk by using the first positioning information, and turning on the indicator light corresponding to the first hard disk.
Optionally, the first positioning control instruction includes the first identification information and the first indicator light control information of the first hard disk. The first indicator light control information is used to switch the indicator light corresponding to the first hard disk on. The first identification information is used to obtain the first positioning information of the first hard disk, which can be used both to locate the first hard disk and to turn on its indicator light.
The indicator light may be the dedicated SLOT positioning lamp of each hard disk on an intelligent hard disk backplane. Turning a given SLOT positioning lamp on or off indicates whether the hard disk in that SLOT has a fault, which makes subsequent work easier for operation and maintenance personnel.
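Steps S261 to S263 can be sketched as parse, resolve, then switch. This sketch rests on several assumptions. The instruction is taken to be JSON with hypothetical fields `sn` (identification information) and `state` (indicator light control information). An in-memory serial-number-to-slot table stands in for the positioning information. `set_led` is a hypothetical callback that would drive the backplane lamp.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict

VALID_STATES = {"on", "flash", "off"}  # assumed set of lamp states


@dataclass(frozen=True)
class SlotLocation:
    enclosure: str  # hypothetical backplane/enclosure identifier
    slot: int       # SLOT number on the intelligent hard disk backplane


def execute_positioning(
    instruction: str,
    sn_to_slot: Dict[str, SlotLocation],
    set_led: Callable[[SlotLocation, str], None],
) -> SlotLocation:
    """Parse an instruction, resolve the disk's slot, and drive its SLOT lamp."""
    msg = json.loads(instruction)
    serial = msg["sn"]    # identification information (S261)
    state = msg["state"]  # indicator light control information (S261)
    if state not in VALID_STATES:
        raise ValueError(f"unsupported indicator state: {state!r}")
    location = sn_to_slot[serial]  # positioning information (S262); KeyError if unknown
    set_led(location, state)       # locate the disk and switch its lamp (S263)
    return location
```

The same function covers the second positioning operation: an instruction whose state is "off" turns off the lamp of the replacement disk.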
Optionally, the method for locating a hard disk fault further includes the following steps:
step S30, detecting the hard disk mount state of the target device;
step S32, in response to the hard disk mount state changing from a first mount state to a second mount state, uploading the second mount state to a target server, where the first mount state is the mount state of the first hard disk and the second mount state is the mount state of the second hard disk;
step S34, receiving a second positioning control instruction issued by the target server based on the second mount state;
and step S36, executing a second positioning operation according to the second positioning control instruction to cancel the indication of the second hard disk.
The hard disk mount state refers to whether the files and directories on a storage device are accessible to users through the computer's file system.
Optionally, the first hard disk is the failed hard disk and the second hard disk is the new replacement hard disk. The first mount state is the mount state of the first hard disk, and the second mount state is the mount state of the second hard disk. When the system detects that the hard disk mount state has changed from the first mount state to the second mount state, it uploads the second mount state to the target server.
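One way to detect the mount state change in step S32 is to compare two snapshots of /proc/mounts. The sketch below assumes Linux /proc/mounts text (device first, mount point second on each line). Keying by device path is a simplification. A real agent might key by the disk serial number instead, because a replacement disk can reuse the failed disk's /dev name.

```python
from typing import Dict, Optional, Tuple


def parse_mounts(text: str) -> Dict[str, str]:
    """Map mount point -> device from /proc/mounts-style text.

    Note: /proc/mounts writes spaces in paths as octal escapes (\\040);
    this sketch does not unescape them.
    """
    table: Dict[str, str] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            device, mount_point = fields[0], fields[1]
            table[mount_point] = device
    return table


def changed_mounts(
    before: Dict[str, str], after: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {mount_point: (old_device, new_device)} for each mount point that changed."""
    changes = {}
    for mount_point in sorted(before.keys() | after.keys()):
        old, new = before.get(mount_point), after.get(mount_point)
        if old != new:
            changes[mount_point] = (old, new)
    return changes
```

A non-empty result marks the transition from the first mount state to the second, and that result is what would be uploaded to the target server.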
The target server may issue a second positioning control instruction based on the second mount state. The second positioning operation performed according to this instruction cancels the indication of the second hard disk. Optionally, executing the second positioning operation according to the second positioning control instruction in step S36 includes the following steps:
step S361, parsing the second positioning control instruction to obtain second identification information and second indicator light control information of the second hard disk, where the second indicator light control information is used to switch the indicator light corresponding to the second hard disk off;
step S362, acquiring second positioning information of the second hard disk based on the second identification information;
and step S363, locating the second hard disk by using the second positioning information, and turning off the indicator light corresponding to the second hard disk.
Optionally, the second positioning control instruction includes the second identification information and the second indicator light control information of the second hard disk. The second indicator light control information is used to switch the indicator light corresponding to the second hard disk off. The second identification information is used to obtain the second positioning information of the second hard disk, which can be used both to locate the second hard disk and to turn off its indicator light.
Optionally, in step S32, the first hard disk is replaced with the second hard disk when a preset condition is met, and the hard disk mount state changes from the first mount state to the second mount state. The preset condition is either that the storage space occupied by the hard disk fault state information in a preset storage area of the target server exceeds a first preset threshold, or that the number of first hard disks, determined from the hard disk fault state information, reaches a second preset threshold.
The target server includes a preset storage area that temporarily stores the hard disk fault state information it receives from the target device. When the storage space occupied by this information in the preset storage area exceeds the first preset threshold, every hard disk on the target device that has been determined to be a first hard disk is replaced with a second hard disk.
Alternatively, the replacement of every determined first hard disk on the target device with a second hard disk may be triggered when the following condition is met: the target server determines, based on the hard disk fault state information, that the number of first hard disks on the target device has reached the second preset threshold.
Still as shown in FIG. 3, the OS agent uses a hard disk health state monitoring program to monitor the health of all hard disks in the hard disk group and sends the resulting hard disk fault state information to the upper-layer integrated operation and maintenance platform. The platform has a dedicated database for hard disk fault state information, called the fault pool, in which the information from the OS agent is continuously stored. When a preset condition is met, the platform initiates a hard disk replacement work order, and data center operation and maintenance staff replace every hard disk in the group that has been determined to be faulty with a new hard disk. The preset condition may be that the storage space occupied by the hard disk fault state information in the fault pool reaches a preset value, or that the number of faulty hard disks on the target device, determined from the information in the fault pool, reaches another preset value.
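The two alternative trigger conditions can be sketched as one predicate over the fault pool. Three points are assumptions here: the pool is modelled as a list of records tagged with host and disk serial number, storage is measured as payload bytes for the target device, and faulty disks are counted as distinct serial numbers. `FaultRecord` and `should_open_replacement_order` are hypothetical names. Following the wording above, the storage check is a strict "greater than" and the disk count uses "reaches" (greater than or equal).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FaultRecord:
    host: str       # target device that reported the fault
    disk_sn: str    # serial number of the faulty disk
    payload: bytes  # raw fault state information as stored in the fault pool


def should_open_replacement_order(
    pool: List[FaultRecord], host: str, max_bytes: int, max_faulty_disks: int
) -> bool:
    """True when either preset condition holds for `host`."""
    records = [r for r in pool if r.host == host]
    used = sum(len(r.payload) for r in records)     # storage occupied in the pool
    faulty = len({r.disk_sn for r in records})      # distinct faulty disks
    return used > max_bytes or faulty >= max_faulty_disks
```

Counting distinct serial numbers keeps a disk that reports the same fault repeatedly from being counted more than once toward the second threshold, while it still adds to the storage total.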
FIG. 5 is a flowchart of an optional hard disk replacement action according to an embodiment of the present disclosure. As shown in FIG. 5, the hard disk replacement work order initiated by the upper-layer integrated operation and maintenance platform includes a hard disk replacement action flow with the following steps:
step S51, the upper-layer integrated operation and maintenance platform sends a failed hard disk positioning instruction;
step S52, the operating system agent tool receives the failed hard disk positioning instruction, parses it to obtain the physical address of the failed hard disk, and executes the corresponding control instruction so that the corresponding hard disk positioning lamp on the intelligent hard disk backplane flashes;
step S53, the data center operation and maintenance staff identify the failed hard disk by its positioning lamp, replace it with a new hard disk, and send a work order closure request to the operating system agent tool;
step S54, the operating system agent tool receives the closure request, re-detects the hard disk mount state, and sends it to the upper-layer integrated operation and maintenance platform;
step S55, the upper-layer integrated operation and maintenance platform verifies that the new hard disk at every hard disk position covered by this failed hard disk positioning instruction has no fault, and sends a control instruction to the operating system agent tool;
and step S56, the operating system agent tool receives the control instruction, turns off the corresponding hard disk positioning lamp on the intelligent hard disk backplane, and automatically requests closure of the work order.
The operating system agent tool is the OS agent, and the hard disk positioning lamp may be an indicator lamp corresponding to a hard disk position on an intelligent hard disk backplane, that is, a SLOT positioning lamp.
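The S51 to S56 flow can be simulated end to end, which makes the ordering and the closing condition explicit. This sketch collapses the platform and the OS agent into one function. `set_led`, `wait_for_replacement` and `new_disk_healthy` are hypothetical callbacks, and the two order statuses are assumptions. It is meant only to show the sequencing, not the real message exchange.

```python
from enum import Enum
from typing import Callable, List


class OrderStatus(Enum):
    OPEN = "open"      # lamps still lit; a new disk failed verification
    CLOSED = "closed"  # all new disks verified, lamps off, order closed


def run_replacement_order(
    slots: List[int],
    set_led: Callable[[int, str], None],
    wait_for_replacement: Callable[[List[int]], None],
    new_disk_healthy: Callable[[int], bool],
) -> OrderStatus:
    # S51-S52: the positioning instruction arrives; flash every faulty slot's lamp
    for slot in slots:
        set_led(slot, "flash")
    # S53-S54: the technician swaps the disks and files the closure request;
    # the agent re-detects the mount state (modelled as a blocking callback)
    wait_for_replacement(slots)
    # S55: every new disk covered by this instruction must be fault-free
    if not all(new_disk_healthy(slot) for slot in slots):
        return OrderStatus.OPEN
    # S56: turn the lamps off and close the work order
    for slot in slots:
        set_led(slot, "off")
    return OrderStatus.CLOSED
```

Leaving the lamps lit when verification fails keeps the faulty position visible, so the technician can find it again without a new positioning instruction.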
In addition, as shown in FIG. 3 and FIG. 5, instructions between the upper-layer integrated operation and maintenance platform and the OS agent are exchanged over the HTTPS protocol. Specifically, a hard disk positioning script tool in the OS agent is invoked. This tool can output the serial number (SN) of a failed hard disk and parse that SN to obtain the physical address and SLOT number of the failed hard disk and the hard disk group it belongs to. It can also make the SLOT positioning lamp of the failed hard disk flash or go off through the SES (SCSI Enclosure Services) protocol.
For example, the command that makes the SLOT positioning lamp of the failed hard disk flash may be: curl -X POST -d '{"state": "flash"}' $Hostname:$Port/hdd/$hddSN; the command that turns the same lamp off may be: curl -X POST -d '{"state": "off"}' $Hostname:$Port/hdd/$hddSN. In these commands, "Hostname" is the name of the host where the failed hard disk is located, "Port" is the service port number of the OS agent, and "hddSN" is the SN of the failed hard disk.
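The same request can be built programmatically with Python's standard library, which is convenient when the platform issues many lamp commands. The sketch below has the following assumptions. The agent accepts HTTPS on the given port, as the HTTPS description above suggests. The body is sent with a JSON content type; note that curl's -d would send application/x-www-form-urlencoded by default. The serial number is URL-quoted defensively. `build_led_request` is a hypothetical helper, and the example host and port are illustrative only.

```python
import json
import urllib.parse
import urllib.request


def build_led_request(hostname: str, port: int, hdd_sn: str, state: str) -> urllib.request.Request:
    """Build (but do not send) the POST asking the OS agent to set a SLOT lamp."""
    if state not in ("flash", "off"):
        raise ValueError(f"unsupported state: {state!r}")
    # Mirrors $Hostname:$Port/hdd/$hddSN from the curl example, over HTTPS
    url = f"https://{hostname}:{port}/hdd/{urllib.parse.quote(hdd_sn, safe='')}"
    body = json.dumps({"state": state}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

The request would then be sent with `urllib.request.urlopen(req, timeout=10)`. Building and sending are kept separate so the request can be inspected or logged before it reaches the agent.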
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions to enable a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the methods according to the embodiments of the present disclosure.
The present disclosure further provides an apparatus for locating a hard disk fault. The apparatus implements the foregoing embodiments and preferred implementations, and content that has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 6 is a structural block diagram of an apparatus for locating a hard disk fault according to an embodiment of the present disclosure. As shown in Fig. 6, the apparatus 600 for locating a hard disk fault includes an acquisition module 601, a sending module 602, a receiving module 603, and a positioning module 604.
The acquisition module 601 is configured to acquire hard disk fault state information of a target device, where the hard disk fault state information indicates that a hard disk fault that has not been automatically repaired exists on the target device. The sending module 602 is configured to upload the hard disk fault state information to a target server. The receiving module 603 is configured to receive a first positioning control instruction issued by the target server based on the hard disk fault state information. The positioning module 604 is configured to execute a first positioning operation according to the first positioning control instruction, so as to locate the first hard disk that has a hard disk fault on the target device.
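The four modules of apparatus 600 can be pictured as injected callables wired into one pass of the S1 to S4 flow. The class name, the dictionary-shaped fault records and instructions, and the early return when nothing is found are all assumptions made for illustration. The sketch shows only the data flow between modules 601 to 604.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class HardDiskFaultLocator:
    acquire: Callable[[], List[dict]]       # acquisition module 601
    upload: Callable[[List[dict]], None]    # sending module 602
    receive: Callable[[], Optional[dict]]   # receiving module 603
    locate: Callable[[dict], None]          # positioning module 604

    def run_once(self) -> bool:
        """One pass of S1-S4; returns True if a positioning operation ran."""
        faults = self.acquire()           # S1: unrepaired faults on this device
        if not faults:
            return False
        self.upload(faults)               # S2: report to the target server
        instruction = self.receive()      # S3: first positioning control instruction
        if instruction is None:
            return False
        self.locate(instruction)          # S4: light the failed disk's SLOT lamp
        return True
```

Because each module is a plain callable, any of them can be backed by software or hardware, in line with the module definition above.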
Optionally, in the apparatus 600 for locating a hard disk fault, the hard disk fault status information is obtained by at least one of the following methods: the hard disk fault state information is obtained through a first thread configured on target equipment, wherein the first thread is used for scanning a drive letter of each hard disk in a plurality of hard disks configured on the target equipment according to a preset time interval and carrying out fault detection on the hard disk corresponding to the drive letter so as to determine the first hard disk; the hard disk fault state information is obtained through a second thread configured on the target equipment, wherein the second thread is used for analyzing system log information of the target equipment to determine a first hard disk; the hard disk failure state information is obtained through a third thread configured on the target device, wherein the third thread is used for detecting whether the drive letter changes of a plurality of hard disks configured on the target device so as to determine the first hard disk.
Optionally, executing the first positioning operation according to the first positioning control instruction by the positioning module 604 includes: parsing the first positioning control instruction to obtain the first identification information and the first indicator light control information of the first hard disk, where the first indicator light control information is used to switch the indicator light corresponding to the first hard disk on; acquiring the first positioning information of the first hard disk based on the first identification information; and locating the first hard disk by using the first positioning information and turning on the indicator light corresponding to the first hard disk.
Optionally, the apparatus 600 for locating a hard disk fault is further configured to: detect the hard disk mount state of the target device; in response to the hard disk mount state changing from a first mount state to a second mount state, upload the second mount state to the target server, where the first mount state is the mount state of the first hard disk and the second mount state is the mount state of the second hard disk; receive a second positioning control instruction issued by the target server based on the second mount state; and execute a second positioning operation according to the second positioning control instruction to cancel the indication of the second hard disk.
Optionally, executing the second positioning operation according to the second positioning control instruction by the apparatus 600 for locating a hard disk fault includes: parsing the second positioning control instruction to obtain the second identification information and the second indicator light control information of the second hard disk, where the second indicator light control information is used to switch the indicator light corresponding to the second hard disk off; acquiring the second positioning information of the second hard disk based on the second identification information; and locating the second hard disk by using the second positioning information and turning off the indicator light corresponding to the second hard disk.
Optionally, in the apparatus 600 for locating a hard disk fault, when a preset condition is met, the first hard disk is replaced by a second hard disk, where the preset condition is that a storage space occupied by the hard disk fault state information in a preset storage area of the target server is greater than a first preset threshold, or the preset condition is that the number of the first hard disks reaches a second preset threshold based on the hard disk fault state information.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
According to an embodiment of the present disclosure, there is also provided an electronic device including a memory having stored therein computer instructions and at least one processor configured to execute the computer instructions to perform the steps in any of the above method embodiments.
Optionally, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
step S1, acquiring hard disk fault state information of the target device, wherein the hard disk fault state information indicates that a hard disk fault which is not automatically repaired exists on the target device;
step S2, uploading the hard disk fault state information to a target server;
step S3, receiving a first positioning control instruction issued by a target server based on the hard disk fault state information;
and step S4, executing a first positioning operation according to the first positioning control instruction to indicate that the first hard disk with the hard disk fault exists on the target device.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
According to an embodiment of the present disclosure, there is also provided a non-transitory computer readable storage medium having stored therein computer instructions, wherein the computer instructions are arranged to perform the steps in any of the above method embodiments when executed.
Optionally, in the present embodiment, the above-mentioned non-transitory storage medium may be configured to store a computer program for executing the following steps:
step S1, acquiring hard disk fault state information of the target device, wherein the hard disk fault state information indicates that a hard disk fault which is not automatically repaired exists on the target device;
step S2, uploading the hard disk fault state information to a target server;
step S3, receiving a first positioning control instruction issued by a target server based on the hard disk fault state information;
and step S4, executing a first positioning operation according to the first positioning control instruction to indicate that the first hard disk with the hard disk fault exists on the target device.
Optionally, in this embodiment, the non-transitory computer readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The present disclosure also provides a computer program product according to an embodiment of the present disclosure. Program code for implementing the method of locating a failed hard disk of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
The above-mentioned serial numbers of the embodiments of the present disclosure are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present disclosure, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present disclosure, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is merely a preferred embodiment of the present disclosure, and it should be noted that modifications and embellishments could be made by those skilled in the art without departing from the principle of the present disclosure, and these should also be considered as the protection scope of the present disclosure.

Claims (10)

1. A method for locating hard disk faults comprises the following steps:
acquiring hard disk fault state information of target equipment, wherein the hard disk fault state information indicates that hard disk faults which are not automatically repaired exist on the target equipment;
uploading the hard disk fault state information to a target server;
receiving a first positioning control instruction issued by the target server based on the hard disk fault state information;
and executing a first positioning operation according to the first positioning control instruction so as to indicate that the first hard disk with the hard disk fault exists on the target equipment.
2. The method of claim 1, wherein the hard disk failure status information is obtained by at least one of:
the hard disk fault state information is obtained through a first thread configured on the target device, wherein the first thread is used for scanning a drive letter of each hard disk in a plurality of hard disks configured on the target device according to a preset time interval, and performing fault detection on the hard disk corresponding to the drive letter to determine the first hard disk;
the hard disk fault state information is obtained through a second thread configured on the target device, wherein the second thread is used for analyzing system log information of the target device to determine the first hard disk;
the hard disk failure state information is obtained through a third thread configured on the target device, wherein the third thread is used for detecting whether the drive letters of the plurality of hard disks configured on the target device change, so as to determine the first hard disk.
3. The method of claim 1, wherein performing the first positioning operation in accordance with the first positioning control instruction comprises:
parsing the first positioning control instruction to obtain first identification information and first indicator light control information of the first hard disk, wherein the first indicator light control information is used for controlling an indicator light corresponding to the first hard disk to enter an on state;
acquiring first positioning information of the first hard disk based on the first identification information;
and positioning the first hard disk by using the first positioning information, and turning on an indicator light corresponding to the first hard disk.
4. The method of claim 1, wherein the method further comprises:
detecting the hard disk mounting state of the target equipment;
in response to the hard disk mounting state changing from a first mounting state to a second mounting state, uploading the second mounting state to the target server, wherein the first mounting state is the mounting state of the first hard disk, and the second mounting state is the mounting state of the second hard disk;
receiving a second positioning control instruction issued by the target server based on the second mounting state;
and executing a second positioning operation according to the second positioning control instruction so as to cancel the indication of the second hard disk.
5. The method of claim 4, wherein performing the second positioning operation in accordance with the second positioning control instruction comprises:
analyzing the second positioning control instruction to obtain second identification information and second indicator light control information of the second hard disk, wherein the second indicator light control information is used for controlling an indicator light corresponding to the second hard disk to enter a closed state;
acquiring second positioning information of the second hard disk based on the second identification information;
and positioning the second hard disk by using the second positioning information, and turning off an indicator lamp corresponding to the second hard disk.
6. The method according to claim 4, wherein the first hard disk is replaced by a second hard disk when a preset condition is met, wherein the preset condition is that a storage space occupied by the hard disk failure state information in a preset storage area of the target server is larger than a first preset threshold, or the preset condition is that the number of the first hard disks reaches a second preset threshold based on the hard disk failure state information.
7. An apparatus for locating hard disk failure, comprising:
an acquisition module, configured to acquire hard disk fault state information of a target device, wherein the hard disk fault state information indicates that a hard disk fault which has not been automatically repaired exists on the target device;
the sending module is used for uploading the hard disk fault state information to a target server;
the receiving module is used for receiving a first positioning control instruction issued by the target server based on the hard disk fault state information;
and the positioning module is used for executing a first positioning operation according to the first positioning control instruction so as to position a first hard disk with the hard disk fault on the target equipment.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
9. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.
CN202111294917.3A 2021-11-03 2021-11-03 Method and device for positioning hard disk fault, electronic equipment and storage medium Pending CN114064401A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111294917.3A CN114064401A (en) 2021-11-03 2021-11-03 Method and device for positioning hard disk fault, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114064401A true CN114064401A (en) 2022-02-18

Family

ID=80273722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111294917.3A Pending CN114064401A (en) 2021-11-03 2021-11-03 Method and device for positioning hard disk fault, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114064401A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775353A (en) * 2023-05-19 2023-09-19 Beijing Baidu Netcom Science and Technology Co Ltd Method and device for repairing failed disk, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
US7213179B2 (en) Automated and embedded software reliability measurement and classification in network elements
CN102761439B (en) Device and method for detecting and recording abnormity on basis of watchdog in PON (Passive Optical Network) access system
CN111796959B (en) Self-healing method, device and system for host container
US20240103961A1 (en) PCIe Fault Auto-Repair Method, Apparatus and Device, and Readable Storage Medium
US20170132102A1 (en) Computer readable non-transitory recording medium storing pseudo failure generation program, generation method, and generation apparatus
CN112529223A (en) Equipment fault repair method and device, server and storage medium
WO2021056913A1 (en) Fault locating method, apparatus and system based on i2c communication
CN114064401A (en) Method and device for positioning hard disk fault, electronic equipment and storage medium
US20160197994A1 (en) Storage array confirmation of use of a path
US10938623B2 (en) Computing element failure identification mechanism
CN106411643B (en) BMC detection method and device
CN116483613B (en) Processing method and device of fault memory bank, electronic equipment and storage medium
WO2024124862A1 (en) Server-based memory processing method and apparatus, processor and an electronic device
CN113992501A (en) Fault positioning system, method and computing device
JPWO2011051999A1 (en) Information processing apparatus and information processing apparatus control method
CN111078454A (en) Cloud platform configuration recovery method and device
CN113608959B (en) Method, system, terminal and storage medium for positioning fault hard disk
CN114860494A (en) SAS expander configuration self-adaptive system
CN115543707A (en) Hard disk fault detection method, system and device, storage medium and electronic device
CN111625185B (en) Method, system and related assembly for monitoring disk fault
TWI698741B (en) Method for remotely clearing abnormal status of racks applied in data center
CN111414267A (en) Far-end eliminating method for abnormal state of cabinet applied to data center
CN111416721A (en) Far-end eliminating method for abnormal state of cabinet applied to data center
CN111414274A (en) Far-end eliminating method for abnormal state of cabinet applied to data center
CN114513398B (en) Network equipment alarm processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination