WO2024139423A1 - Fault detection method and computer device - Google Patents

Fault detection method and computer device

Info

Publication number
WO2024139423A1
Authority
WO
WIPO (PCT)
Prior art keywords
hardware
fault information
information table
fault
register
Prior art date
Application number
PCT/CN2023/118911
Other languages
English (en)
French (fr)
Inventor
陈刚
Original Assignee
超聚变数字技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 超聚变数字技术有限公司
Publication of WO2024139423A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2273 Test methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • BIOS basic input output system
  • the embodiments of the present application provide a fault detection method and a computer device, which solve the problem of how to successfully detect a server fault.
  • a method for fault detection comprising: obtaining a fault information table, the fault information table being used to indicate the correspondence between multiple hardware and registers, the register corresponding to each hardware being associated with fault information of at least one hardware; according to the fault information table, obtaining fault information of the first hardware fed back by the register corresponding to the first hardware, the fault information of the first hardware being stored in the register corresponding to the first hardware, the first hardware being any one of the multiple hardware.
  • the fault information table obtained by the processor from the management controller is generated according to the user's instructions.
  • the fault information table includes the correspondence between multiple hardware and registers of the server, and the register corresponding to each hardware is associated with the fault information of at least one hardware. Therefore, when performing fault detection, the register corresponding to the faulty hardware can be determined according to the fault information table, and all fault information of the faulty hardware can be obtained through the corresponding register, which can effectively improve the efficiency of fault detection.
  • the embodiment of the present application can obtain the fault diagnosis requirements increased by the user through the fault information table, which can ensure the quality of fault detection, ensure that all faults that need to be detected are detected, and effectively improve the efficiency of fault detection.
  • a fault information table and a flag bit are obtained, and the flag bit is used to verify the fault information table; when the fault information table verification succeeds, the fault information of the first hardware fed back by the register corresponding to the first hardware is obtained.
  • the flag bit of the fault information table can be used to check whether the fault information table obtained by the processor from the management controller has been tampered with.
  • when the fault information table verification succeeds, it means that the obtained fault information table has not been tampered with, and the fault information table can be used to obtain the fault information of the first hardware.
  • when the fault information table verification fails, it means that the obtained fault information table has been tampered with, and the fault information table must not be used to obtain the fault information of the first hardware fed back by the register corresponding to the first hardware. This avoids detecting faults of the first hardware with a tampered fault information table and producing an erroneous detection result.
  • when the fault information of the first hardware fed back by the register corresponding to the first hardware in the fault information table is consistent with the fault information stored in the register corresponding to the first hardware, there is no need to update the fault information from the fault information table to the register corresponding to the first hardware, which simplifies the detection process and improves the detection efficiency.
  • the fault information table further includes register information, and the register information includes register type, register bit width, and register parameters.
  • the fault information table includes different registers. Different registers store fault information of different hardware. When storing fault information of different hardware in different registers, it is necessary to consider the register type, register bit width, and register parameters of the register, so as to distinguish different registers to store different fault information.
  • a method for fault detection, where the computer device includes a management controller and a processor.
  • the method is executed by the management controller.
  • the method includes: generating a correspondence between multiple hardware and registers to form a fault information table, in which the register corresponding to each hardware is associated with the fault information of at least one hardware; and sending the fault information table to the processor.
  • because the fault information table is generated by the management controller according to user instructions, it can be dynamically configured according to user needs, so that the registers included in the fault information table are associated with all fault information of the hardware, thereby improving the efficiency of fault detection and shortening the time of fault detection.
  • according to the fault information of the first hardware indicated by the user, the fault information associated with the register corresponding to the first hardware is updated to obtain an updated correspondence, and the first hardware is any one of the multiple hardware; the updated correspondence is sent to the processor.
  • a fault detection device includes an acquisition module.
  • the acquisition module is also used to acquire the fault information of the first hardware fed back by the register corresponding to the first hardware according to the fault information table.
  • the fault information of the first hardware is stored in the register corresponding to the first hardware.
  • the first hardware is any one of the multiple hardware.
  • the acquisition module is specifically used to obtain a fault information table and a flag bit, and the flag bit is used to verify the fault information table; when the fault information table is verified successfully, the fault information of the first hardware fed back by the register corresponding to the first hardware is obtained.
  • the acquisition module is further used to determine whether the fault information table is consistent with the first fault information table stored in the computer device; when inconsistent, the fault information table is updated to the computer device.
  • the configuration module is used to generate a correspondence between multiple hardware and registers to form a fault information table, and the register corresponding to each hardware is associated with the fault information of at least one hardware.
  • the configuration module is also used to update the corresponding register-associated fault information of the first hardware according to the fault information of the first hardware indicated by the user, obtain the updated corresponding relationship, and the first hardware is any one of the multiple hardware; and send the updated corresponding relationship to the processor.
  • a server comprising a management controller, a processor and a memory.
  • the management controller is used to generate a correspondence between multiple hardware and registers, form a fault information table, associate the register corresponding to each hardware with the fault information of at least one hardware, and configure the correspondence between the multiple hardware and registers to the processor; when the management controller executes a set of computer instructions, the functions of each module of the method in the second aspect or any possible implementation of the second aspect are executed.
  • a computer-readable storage medium comprising computer software instructions; when the computer software instructions are executed in a computer, the computer executes a method as described in any one of the first aspect or any possible implementation of the first aspect.
  • a computer-readable storage medium comprising computer software instructions; when the computer software instructions are executed in a computer, the computer executes a method as described in any one of the second aspect or possible implementations of the second aspect.
  • a computer program product comprising instructions, which, when executed on a computer, enables the computer to execute the method described in the second aspect or any one of the implementations of the second aspect.
  • FIG2 is a schematic diagram of a flow chart of a fault detection method provided in an embodiment of the present application.
  • the processor 110 may run the processor firmware, that is, obtain a fault information table from the management controller 120.
  • the fault information table is used to indicate the correspondence between hardware and registers. Different registers indicate different fault information.
  • at least one register recording the fault information of the first hardware is determined, and the fault information of the first hardware is obtained from at least one register.
  • the fault information of the first hardware is sent to the management controller 120 for the user to locate the hardware fault.
  • the first hardware may be a central processing unit (CPU), a memory, or a high-speed serial computer expansion bus (PCIE) device.
  • the processor 110 runs the processor firmware and obtains a fault information table from the management controller 120.
  • the fault information table indicates the registers corresponding to the processor 110, the memory 140, and the PCIE device 150.
  • when the processor 110, the memory 140, or the PCIE device 150 fails, an interrupt signal is sent to the corresponding register so that the register outputs the fault information; the processor 110 collects the fault information and sends the collected fault information to the management controller 120.
  • BIOS and BMC will communicate through EDMA, which is an important technology for fast data exchange in digital signal processors. It has the ability to transfer background batch data independently of the CPU.
  • B2H BMC to Host
  • H2B Host to BMC.
  • B2H refers to the block used when BMC transfers data (i.e., fault information table) to BIOS
  • H2B refers to the block used when BIOS transfers data (i.e., fault information) to BMC.
  • when the BIOS detects that new fault information has appeared in the hardware of the computer device, the user can update the fault information table in the BMC interface, or the BMC can adaptively adjust the fault information table.
  • the fault information can be stored in the register corresponding to the memory, or in the register corresponding to other hardware.
  • the register parameter of the register in the fault information table is updated.
  • the register corresponding to the memory in the fault information table is updated.
  • Step 220 The processor obtains a fault information table.
  • after the BIOS obtains the fault information table from the BMC, it needs to verify the fault information table. The BIOS verifies the validity and version of the fault information table based on the flag bits of the fault information table. The validity indicates whether the fault information table is incorrect. As shown in Figure 4, the BIOS obtains the fault information table and the corresponding first flag bit and second flag bit from the BMC, and verifies whether the fault information table is incorrect based on the first flag bit (i.e., step 410 is executed).
  • when the first flag bit is not the first preset value, it indicates that the fault information table is incorrect; the fault information table is discarded and the fault detection is stopped (i.e., step 420 is executed). When the first flag bit is the first preset value, it indicates that the fault information table is correct, and the second flag bit of the fault information table is verified (i.e., step 430 is executed).
  • when the second flag bit of the fault information table is inconsistent with the second flag bit of the fault information table stored in the BIOS, it indicates that the user has updated the fault information table; the updated fault information table is stored in the BIOS (i.e., step 440 is executed), and at least one register recording the fault information of the first hardware is determined from the updated fault information table.
  • the fault information table stored in the BIOS can be used to determine at least one register recording the fault information of the hardware. Then, when the hardware performs a fault self-check and generates an interrupt signal (which may include the fault type sent by the hardware), the interrupt signal is sent directly to the determined register, and the register records the fault type from the interrupt signal.
  • the BIOS instructs the hardware to perform a power-on self-test to detect whether the computer device's hardware is faulty to ensure that the computer device's hardware can be used normally.
  • the hardware includes the CPU, memory, motherboard, and PCIE devices.
  • the self-test program will trigger the hardware to send an interrupt signal.
  • An interrupt signal refers to an alarm signal generated by a computing device after detecting a hardware failure error, which is used to indicate that the computer device is abnormal.
  • Interrupt signals may include traditional interrupts (Interrupt, INT), system management interrupts (System Management Interrupt, SMI), message signaled interrupts (Message Signaled Interrupt, MSI), non-maskable interrupts (Non Maskable Interrupt, NMI) or other interrupt signals used to indicate a hardware failure error of a computer device, etc., which are not specifically limited in the embodiments of the present application.
  • a machine-check architecture is used to perform a self-check on the server hardware, and an interrupt signal is issued when a hardware fault is detected.
  • the hardware faults detected by the machine-check architecture may be system bus faults, memory faults, parity errors, cache faults, translation lookaside buffer (TLB) faults, etc. These hardware faults can damage the stability of the computer device and cannot be recovered from. However, they are inevitable in a large server environment, such as a server cluster or a cloud computing environment.
  • an interrupt signal is generated, and the interrupt signal is sent down to the corresponding register, and then the fault information of the faulty hardware is obtained from the corresponding register, so that the user can repair the hardware according to the fault information.
  • Step 230 The processor obtains the fault information of the first hardware fed back by the register corresponding to the first hardware according to the fault information table.
  • the register corresponding to the hardware will set the corresponding bit according to the interrupt signal sent by the hardware, and each bit is used to indicate different fault information.
  • when a bit of the register is set, it means that the first hardware has the fault corresponding to that bit, and the BIOS obtains the fault information indicated by the bit.
  • the BIOS sends the obtained fault information to the BMC and displays it on the BMC interface, so that the user can intuitively know the fault of the computer device.
  • FIG5 is a schematic diagram of a fault detection method provided by an embodiment of the present application.
  • the registers corresponding to the CPU are determined to be A0, A1, A2, A3, and A4.
  • A0, A1, A2, A3, and A4 include all the fault information of the CPU.
  • the registers corresponding to the memory are B0, B1, B2, and B3.
  • the registers corresponding to the PCIE device are C0, C1, and C2.
  • when there is no fault, the BIOS directly instructs the memory to execute the self-test program to obtain the corresponding fault information, and instructs the PCIE device to execute the self-test program to obtain the corresponding fault information. The BIOS reports the fault information to the BMC so that the user can repair the hardware fault according to the fault information.
  • FIG6 is a specific flow chart of a fault detection method provided by an embodiment of the present application.
  • the corresponding relationship between hardware and registers is pre-configured in BIOS.
  • when the hardware performs fault self-checking and generates an interrupt signal (i.e., step 610 is executed), fault information can be statically collected according to user needs (i.e., step 620 is executed).
  • the computer device displays a selection interface, and the selection interface includes two options, namely, static detection and dynamic detection.
  • when static detection is selected, it indicates that the pre-configured registers can meet the user's needs for detecting hardware faults, and the register corresponding to the hardware can be triggered directly by the interrupt signal, thereby obtaining the corresponding fault information.
  • BIOS obtains a fault information table from BMC (i.e., executing step 630), verifies the fault information table (i.e., executing step 640), obtains the fault information fed back by the register corresponding to the first hardware, and reports the fault information (i.e., executing step 650), and ends the fault detection (i.e., executing step 660).
  • the configuration module 801 is used to generate a correspondence between multiple hardware and registers to form a fault information table, and the register corresponding to each hardware is associated with the fault information of at least one hardware.
  • the sending module 802 is used to send the fault information table to the processor.
  • the configuration module 801 is also used to: update the corresponding register-associated fault information of the first hardware according to the fault information of the first hardware indicated by the user, obtain the updated corresponding relationship, and the first hardware is any one of the multiple hardwares; and send the updated corresponding relationship to the processor.
  • the fault detection device 800 further includes an acquisition module 803.
  • the acquisition module 803 is used to: acquire a fault information table, where the fault information table is used to indicate the correspondence between multiple hardware and registers, and the register corresponding to each hardware is associated with at least one piece of hardware fault information.
  • the acquisition module 803 is further used to: acquire the fault information of the first hardware fed back by the register corresponding to the first hardware, the fault information of the first hardware is determined by the register corresponding to the first hardware according to the interrupt signal of the first hardware, and the first hardware is any one of the multiple hardware.
  • the acquisition module 803 is further used to: acquire a fault information table and a flag bit, the flag bit being used to verify the fault information table; and when the fault information table verification succeeds, acquire the fault information of the first hardware fed back by the register corresponding to the first hardware.
  • the acquisition module 803 is further used to determine whether the fault information table is consistent with the first fault information table stored in the computer device; when inconsistent, the fault information table is updated to the computer device.
  • the fault detection device 800 further includes a storage module 804.
  • the storage module 804 is used to store a fault information table.
  • for details of the configuration module 801, the sending module 802, the acquisition module 803 and the storage module 804, refer to the relevant description in the method embodiment shown in FIG. 2, which is not repeated here.
  • Fig. 9 provides a computer device.
  • the computer device 900 shown in Fig. 9 can be specifically used to implement the functions of the fault detection device 800 in the embodiment shown in Fig. 8 above.
  • Computer device 900 includes bus 901, processor 902, management controller 903, communication interface 904 and memory 905.
  • Processor 902, management controller 903, memory 905 and communication interface 904 communicate with each other through bus 901.
  • Bus 901 can be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus.
  • the bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is used in FIG9, but it does not mean that there is only one bus or one type of bus.
  • Communication interface 904 is used to communicate with the outside, such as receiving user instructions.
  • the processor 902 may be a central processing unit (CPU), and the processor 902 is used to obtain a fault information table; instruct multiple hardware to perform fault self-checking, and registers corresponding to each hardware in the multiple hardware; obtain fault information of the first hardware fed back by the register corresponding to the first hardware.
  • the management controller 903 is used to generate a correspondence between multiple hardware and registers, and the register corresponding to each hardware is associated with the fault information of at least one hardware; and configure the correspondence between multiple hardware and registers to the processor.
  • the management controller 903 may include a monitoring management unit outside the computer device, a management system in a management chip outside the processor, a computer device baseboard management unit (baseboard management controller, BMC), and a system management module (system management mode, SMM).
  • the memory 905 may include a volatile memory, such as a random access memory (RAM). The memory 905 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, an HDD, or an SSD.
  • the memory 905 stores executable codes, and the processor 902 and the management controller 903 execute the executable codes to perform the aforementioned fault detection method.
  • the memory 905 stores the software or program code required to execute the functions of the configuration module 801, the sending module 802, and the acquisition module 803 in Figure 8, and the processor 902 and the management controller 903 are used to execute the instructions in the memory 905 and execute the fault detection method applied to the fault detection device 800.
  • the embodiment of the present application further provides a computer-readable storage medium, comprising instructions, which, when executed on a computer, enables the computer to execute the above-mentioned fault detection method applied to the fault detection device 800.
  • the embodiment of the present application also provides a computer program product, when the computer program product is executed by a computer, the computer executes any of the aforementioned methods.
  • the computer program product may be a software installation package, and when any of the aforementioned methods is needed, the computer program product may be downloaded and executed on a computer.
  • the device embodiments described above are merely schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • the technical solution of the embodiments of the present application is essentially or the part that contributes to the prior art can be embodied in the form of a software product, which is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, disk or CD, etc., including a number of instructions for a computer device (which can be a personal computer, training equipment, or network equipment, etc.) to execute the methods described in each embodiment of the embodiments of the present application.
  • all or part of the embodiments may be implemented by software, hardware, firmware or any combination thereof.
  • all or part of the embodiments may be implemented in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

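Embodiments of the present application provide a fault detection method and a computer device, relating to the field of computer technology. The method includes: obtaining a fault information table, where the fault information table indicates the correspondence between multiple pieces of hardware and registers, and the register corresponding to each piece of hardware is associated with fault information of at least one piece of hardware; and obtaining, according to the fault information table, the fault information of first hardware fed back by the register corresponding to the first hardware, where the fault information of the first hardware is stored in the register corresponding to the first hardware, and the first hardware is any one of the multiple pieces of hardware. Because the fault information table is generated according to a user instruction, it includes the correspondence between the multiple pieces of hardware of the server and the registers, and the register corresponding to each piece of hardware is associated with fault information of at least one piece of hardware. Therefore, during fault detection, the register corresponding to a piece of hardware can be determined from the fault information table and all of its fault information can be obtained through that register, which effectively improves the efficiency of fault detection.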

Description

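Fault detection method and computer device
This application claims priority to Chinese patent application No. 202211715921.7, entitled "Fault detection method and computer device" and filed with the China National Intellectual Property Administration on December 29, 2022, which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of computer technology, and in particular to a fault detection method and a computer device.
Background
Currently, a server performs fault detection on its own during startup. For example, the basic input output system (BIOS) pre-configures registers, and the registers store fault information of the hardware in the server. However, as users' fault diagnosis requirements grow, when the server detects a fault, the pre-configured registers cannot identify faults newly added to the server, so detection of some hardware faults fails. How to successfully detect server faults is therefore a problem that urgently needs to be solved.
Summary
Embodiments of the present application provide a fault detection method and a computer device, which solve the problem of how to successfully detect a server fault.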
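In a first aspect, a fault detection method is provided. The method includes: obtaining a fault information table, where the fault information table indicates the correspondence between multiple pieces of hardware and registers, and the register corresponding to each piece of hardware is associated with fault information of at least one piece of hardware; and obtaining, according to the fault information table, the fault information of first hardware fed back by the register corresponding to the first hardware, where the fault information of the first hardware is stored in the register corresponding to the first hardware, and the first hardware is any one of the multiple pieces of hardware.
The fault information table that the processor obtains from the management controller is generated according to a user instruction. It includes the correspondence between the multiple pieces of hardware of the server and the registers, and the register corresponding to each piece of hardware is associated with fault information of at least one piece of hardware. Therefore, during fault detection, the register corresponding to the faulty hardware can be determined from the fault information table, and all fault information of the faulty hardware can be obtained through that register, which effectively improves the efficiency of fault detection. In other words, the embodiments of the present application can pick up fault diagnosis requirements added by the user through the fault information table, which guarantees the quality of fault detection, ensures that every fault that needs to be detected is detected, and effectively improves fault detection efficiency.
With reference to the first aspect, in a possible implementation, a fault information table and a flag bit are obtained, where the flag bit is used to verify the fault information table; when verification of the fault information table succeeds, the fault information of the first hardware fed back by the register corresponding to the first hardware is obtained.
The flag bit of the fault information table can be used to check whether the fault information table that the processor obtained from the management controller has been tampered with. When verification succeeds, the obtained fault information table has not been tampered with and can be used to obtain the fault information of the first hardware. When verification fails, the obtained fault information table has been tampered with and must not be used to obtain the fault information of the first hardware fed back by the register corresponding to the first hardware. This avoids detecting faults of the first hardware with a tampered fault information table and producing an erroneous detection result.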
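The verification described above can be sketched in Python as follows. This is a minimal illustration, assuming the two-flag scheme of the embodiment shown in Figure 4 (a first flag compared against a preset value for validity, and a second flag that differs when the user has updated the table). The names `VALID_FLAG`, `FaultTable` and `verify_table`, and the concrete flag values, are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Optional

VALID_FLAG = 0xA5  # hypothetical "first preset value" marking a correct table


@dataclass
class FaultTable:
    valid_flag: int    # first flag: validity of the table
    version_flag: int  # second flag: differs once the user updates the table
    mapping: dict      # hardware name -> list of register names


def verify_table(received: FaultTable,
                 stored: Optional[FaultTable]) -> Optional[FaultTable]:
    """Return the table the firmware should use, or None to stop detection."""
    # Steps 410/420: an unexpected first flag means the table is discarded.
    if received.valid_flag != VALID_FLAG:
        return None
    # Steps 430/440: a differing second flag means the table was updated,
    # so the received table replaces the stored one.
    if stored is None or received.version_flag != stored.version_flag:
        return received
    # Flags match: keep using the table already stored in the firmware.
    return stored
```

Returning `None` models "discard the table and stop fault detection"; a real firmware would more likely report an error code to the management controller.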
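With reference to the first aspect, in another possible implementation, it is determined whether the fault information table is consistent with a first fault information table stored in the computer device; when they are inconsistent, the fault information table is updated to the computer device.
When the fault information of the first hardware fed back by the register corresponding to the first hardware in the fault information table is consistent with the fault information stored in the register corresponding to the first hardware, there is no need to update the fault information from the fault information table to the register corresponding to the first hardware, which simplifies the detection process and improves detection efficiency.
With reference to the first aspect, in another possible implementation, the fault information table further includes register information, and the register information includes the register type, register bit width, and register parameters.
The fault information table includes different registers, and different registers store fault information of different hardware. When the fault information of different hardware is stored in different registers, the register type, register bit width, and register parameters need to be considered so that the different registers storing different fault information can be distinguished.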
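One way to picture a fault information table that carries register information is the Python sketch below. It is only an illustration under stated assumptions: the application names the three attributes (type, bit width, parameters) but not their encoding, so the `RegisterType` members, the `address` parameter and the class names are hypothetical. The register names A0, B0 and C0 follow the example of Figure 5.

```python
from dataclasses import dataclass, field
from enum import Enum


class RegisterType(Enum):
    # Hypothetical categories; the application does not enumerate types.
    MSR = "msr"
    PCI_CONFIG = "pci_config"
    MMIO = "mmio"


@dataclass
class RegisterInfo:
    name: str
    reg_type: RegisterType
    bit_width: int                               # e.g. 32 or 64
    params: dict = field(default_factory=dict)   # e.g. an address (assumed)


@dataclass
class FaultInfoTable:
    entries: dict = field(default_factory=dict)  # hardware -> [RegisterInfo]

    def add(self, hardware: str, reg: RegisterInfo) -> None:
        """Associate one more register with a piece of hardware."""
        self.entries.setdefault(hardware, []).append(reg)

    def registers_for(self, hardware: str) -> list:
        """Return the registers recorded for a piece of hardware."""
        return list(self.entries.get(hardware, []))


table = FaultInfoTable()
table.add("CPU", RegisterInfo("A0", RegisterType.MSR, 64, {"address": 0x401}))
table.add("memory", RegisterInfo("B0", RegisterType.MMIO, 32))
table.add("PCIE", RegisterInfo("C0", RegisterType.PCI_CONFIG, 32))
```

Keeping type and bit width next to each register name lets the reader of the table decide how to access a register and how many bits to interpret, which is the distinction the paragraph above calls for.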
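In a second aspect, a fault detection method is provided. The computer device includes a management controller and a processor, and the method is performed by the management controller. The method includes: generating a correspondence between multiple pieces of hardware and registers to form a fault information table, where the register corresponding to each piece of hardware is associated with fault information of at least one piece of hardware; and sending the fault information table to the processor.
Because the fault information table is generated by the management controller according to a user instruction, it can be dynamically configured according to user requirements so that the registers in the table are associated with all the fault information of the hardware, which improves fault detection efficiency and shortens fault detection time.
With reference to the second aspect, in a possible implementation, according to the fault information of the first hardware indicated by the user, the fault information associated with the register corresponding to the first hardware is updated to obtain an updated correspondence, where the first hardware is any one of the multiple pieces of hardware; and the updated correspondence is sent to the processor.
The user can configure the fault information table in the management controller according to fault diagnosis requirements and dynamically add to the fault information stored in the registers. Because the management controller is completely independent of the operating system of the computer device, updating the fault information table in the management controller does not affect the running of the operating system and does not require restarting the computer device, which shortens fault detection time and improves fault detection efficiency.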
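The dynamic update on the management controller side might look like the Python sketch below. It assumes the table is a plain dictionary with a `version` marker and a `mapping` from hardware to registers to fault descriptions; this layout, the function name `update_fault_info` and the idea of incrementing the version on every update are all assumptions made for illustration, not details given in the application.

```python
import copy


def update_fault_info(table: dict, hardware: str, register: str,
                      new_faults: list) -> dict:
    """Return an updated copy of the table; the input table is not modified."""
    updated = copy.deepcopy(table)
    registers = updated["mapping"].setdefault(hardware, {})
    faults = registers.setdefault(register, [])
    for fault in new_faults:
        if fault not in faults:      # avoid recording the same fault twice
            faults.append(fault)
    # Assumption: the version marker changes so the processor firmware
    # can tell that the user has updated the table.
    updated["version"] = table["version"] + 1
    return updated


current = {"version": 1, "mapping": {"memory": {"B0": ["memory fault"]}}}
updated = update_fault_info(current, "memory", "B0", ["parity error"])
```

Working on a copy keeps the previously sent table intact until the updated correspondence has actually been delivered to the processor.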
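In a third aspect, a fault detection apparatus is provided, and the fault detection apparatus includes an acquisition module.
The acquisition module is configured to obtain a fault information table, where the fault information table indicates the correspondence between multiple pieces of hardware and registers, and the register corresponding to each piece of hardware is associated with fault information of at least one piece of hardware.
The acquisition module is further configured to obtain, according to the fault information table, the fault information of first hardware fed back by the register corresponding to the first hardware, where the fault information of the first hardware is stored in the register corresponding to the first hardware, and the first hardware is any one of the multiple pieces of hardware.
With reference to the third aspect, in a possible implementation, the acquisition module is specifically configured to obtain a fault information table and a flag bit, where the flag bit is used to verify the fault information table; and when verification of the fault information table succeeds, to obtain the fault information of the first hardware fed back by the register corresponding to the first hardware.
With reference to the third aspect, in another possible implementation, the acquisition module is further configured to determine whether the fault information table is consistent with a first fault information table stored in the computer device; when they are inconsistent, the fault information table is updated to the computer device.
In a fourth aspect, a fault detection apparatus is provided, and the fault detection apparatus includes a configuration module and a sending module.
The configuration module is configured to generate a correspondence between multiple pieces of hardware and registers to form a fault information table, where the register corresponding to each piece of hardware is associated with fault information of at least one piece of hardware.
The sending module is configured to send the fault information table to the processor.
With reference to the fourth aspect, in a possible implementation, the configuration module is further configured to update, according to the fault information of the first hardware indicated by the user, the fault information associated with the register corresponding to the first hardware to obtain an updated correspondence, where the first hardware is any one of the multiple pieces of hardware; and to send the updated correspondence to the processor.
In a fifth aspect, a server is provided, and the server includes a management controller, a processor, and a memory. The management controller is configured to generate a correspondence between multiple pieces of hardware and registers to form a fault information table, where the register corresponding to each piece of hardware is associated with fault information of at least one piece of hardware, and to configure the correspondence between the multiple pieces of hardware and the registers to the processor; when the management controller executes a set of computer instructions, it performs the functions of each module of the method in the second aspect or any possible implementation of the second aspect. The processor is configured to obtain the fault information table and, according to the fault information table, obtain the fault information of first hardware fed back by the register corresponding to the first hardware, where the fault information of the first hardware is stored in the register corresponding to the first hardware, and the first hardware is any one of the multiple pieces of hardware; when the processor executes a set of computer instructions, it performs the functions of each module of the method in the first aspect or any possible implementation of the first aspect.
In a sixth aspect, a computer-readable storage medium is provided, including computer software instructions; when the computer software instructions run in a computer, the computer is caused to execute the method according to the first aspect or any possible implementation of the first aspect.
In a seventh aspect, a computer-readable storage medium is provided, including computer software instructions; when the computer software instructions run in a computer, the computer is caused to execute the method according to the second aspect or any possible implementation of the second aspect.
In an eighth aspect, a computer program product containing instructions is provided; when it runs on a computer, the computer is caused to execute the method according to the first aspect or any implementation of the first aspect.
In a ninth aspect, a computer program product containing instructions is provided; when it runs on a computer, the computer is caused to execute the method according to the second aspect or any implementation of the second aspect.
On the basis of the implementations provided in the above aspects, the embodiments of the present application may be further combined to provide more implementations.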
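Brief Description of the Drawings
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of this application;
FIG. 2 is a schematic flowchart of a fault detection method according to an embodiment of this application;
FIG. 3 is a schematic diagram of a BMC interface according to an embodiment of this application;
FIG. 4 is a schematic flowchart of verifying a fault information table according to an embodiment of this application;
FIG. 5 is a schematic diagram of a fault detection method according to an embodiment of this application;
FIG. 6 is a detailed schematic flowchart of a fault detection method according to an embodiment of this application;
FIG. 7 is a schematic diagram of a selection interface according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a fault detection apparatus according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of this application.
Detailed Description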
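The embodiments of this application provide a fault detection method. A fault information table is obtained; the table indicates a correspondence between a plurality of hardware components and registers, and the register corresponding to each hardware component is associated with at least one piece of fault information of that hardware component. According to the fault information table, the fault information of the first hardware fed back by the register corresponding to the first hardware is obtained, where the fault information of the first hardware is stored in the register corresponding to the first hardware, and the first hardware is any one of the plurality of hardware components. The fault information table that the processor obtains from the management controller is generated according to a user instruction and contains the correspondence between the server's hardware components and registers. Therefore, during fault detection, the register corresponding to the faulty hardware can be determined from the table, and all fault information of the faulty hardware can be obtained through that register, which effectively improves the efficiency of fault detection.
The implementations of the embodiments of this application are described in detail below with reference to the accompanying drawings.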
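FIG. 1 is a schematic diagram of a system architecture according to an embodiment of this application. The architecture diagram illustrates a computer device by way of example. Referring to FIG. 1, the computer device 100 may include a plurality of processors 110, a management controller 120, a plurality of registers 130, a plurality of memories 140, a Peripheral Component Interconnect Express (PCIE) device 150, an integrated south bridge (Platform Controller Hub, PCH) 160, and a storage 170. The processors 110 are connected to one another through an Ultra Path Interconnect (UPI) bus. A processor 110 accesses the memory 140 through a memory channel, connects to the PCIE device 150 through a PCIE interface, and connects to the integrated south bridge 160 through a Direct Media Interface (DMI) bus, which is the bus that links the processor and the south bridge. The integrated south bridge 160 connects to the storage 170 through a Serial Peripheral Interface (SPI) bus, a full-duplex synchronous serial bus used for communication between a microcontroller unit and peripheral devices. The storage 170 is connected to the management controller 120 through an interaction protocol.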
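The storage 170 may include volatile memory, for example random access memory (RAM). The storage 170 may also include non-volatile memory, for example read-only memory (ROM), flash memory, an HDD, or an SSD. The storage 170 stores processor firmware and executable code, and the processor 110 and the management controller 120 execute the executable code to perform the foregoing fault detection method.
The processor firmware (also called the processor firmware program) may be firmware such as the basic input output system (BIOS), a management engine (ME), microcode, or an intelligent management unit (IMU). The embodiments of this application do not limit the specific form of the processor firmware; the above are only examples. In the following embodiments, the description uses the BIOS as the processor firmware.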
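The processor 110 may run the processor firmware, that is: obtain the fault information table from the management controller 120, where the table indicates the correspondence between hardware and registers and different registers indicate different fault information; determine, according to the correspondence in the table, at least one register that records the fault information of the first hardware; obtain the fault information of the first hardware from the at least one register; and send that fault information to the management controller 120 so that the user can locate the hardware fault. The first hardware may be a central processing unit (CPU), a memory, or a PCIE device.
For example, the processor 110 runs the processor firmware and obtains the fault information table from the management controller 120. The table indicates the registers corresponding to the processor 110, the memory 140, and the PCIE device 150. When the processor 110, the memory 140, or the PCIE device 150 develops a fault, it sends an interrupt signal to the corresponding register, which causes that register to output the fault information. The fault information is collected and sent to the management controller 120.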
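The management controller 120 includes an out-of-band management module 121. The out-of-band management module may be a management unit that is not a service module. For example, the out-of-band management module can remotely maintain and manage the computer device through a dedicated data channel. It is completely independent of the operating system of the computer device and can communicate with the basic input output system and the operating system (OS) through the out-of-band management interface of the computer device.
By way of example, the out-of-band management module may include a monitoring and management unit external to the computer device, a management system in a management chip outside the processor, a baseboard management controller (BMC) of the computer device, a system management module (system management mode, SMM), and the like. Note that the embodiments of this application do not limit the specific form of the out-of-band management module; the above are only examples. In the following embodiments, the description uses the BMC as the out-of-band management module.
The BMC is an out-of-band management module that is completely independent of the operating system of the computer device and can communicate with the BIOS and the operating system through the out-of-band management interface of the computer device.
Note that different vendors use different names for the BMC: some call it BMC, some call it iLO, and another calls it iDRAC. Whether it is called BMC, iLO, or iDRAC, it can be understood as the BMC in the embodiments of this application.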
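The out-of-band management module 121 can generate the correspondence between a plurality of hardware components and registers according to a user instruction to form the fault information table, and configure that correspondence into the processor 110. It can also present the fault information obtained by the processor 110 to the user, so that the user can locate hardware faults intuitively.
When the first hardware has new fault information, the new fault information can, according to a user instruction, be updated into the register corresponding to the first hardware, or it can be updated into another register. In the latter case, the correspondence between the first hardware and registers in the fault information table must also be updated. The correspondence between the first hardware and registers is stored in the management controller 120, and the management controller 120 is completely independent of the operating system of the computer device. Therefore, in the embodiments of this application, when fault information is added, the management controller 120 can directly update the correspondence between the first hardware and registers in the fault information table and send the updated table to the processor 110. The processor 110 can then determine, from the updated table, the registers that record the fault information of the first hardware and obtain the complete fault information, without restarting the computer device and without interrupting the services it is running.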
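For example, when the following embodiments describe the out-of-band management module as performing a step (such as step 210 below), this can be understood as: the management controller invokes the out-of-band management module to perform that step.
The BIOS and the BMC communicate through EDMA. EDMA is an important technology for fast data exchange in digital signal processors and can transfer bulk data in the background independently of the CPU. In the embodiments of this application, EDMA contains two regions, B2H (BMC to Host) and H2B (Host to BMC). B2H is the block used when the BMC transfers data (namely the fault information table) to the BIOS, and H2B is the block used when the BIOS transfers data (namely the fault information) to the BMC.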
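The two-region exchange can be pictured with a small Python model. This is only an illustrative sketch: the class name `EdmaMailbox` and its methods are hypothetical, and a real EDMA region is shared memory accessed by firmware rather than a Python object.

```python
class EdmaMailbox:
    """Toy model of the two EDMA regions shared by the BMC and the BIOS."""

    def __init__(self):
        self.b2h = None  # BMC to Host: carries the fault information table
        self.h2b = []    # Host to BMC: carries the collected fault information

    def bmc_write_table(self, table):
        # The BMC places a copy of the fault information table in B2H.
        self.b2h = dict(table)

    def bios_read_table(self):
        # The BIOS reads the table from B2H; fail loudly if nothing was sent.
        if self.b2h is None:
            raise LookupError("no fault information table in B2H")
        return dict(self.b2h)

    def bios_write_faults(self, faults):
        # The BIOS places the collected fault information in H2B.
        self.h2b = list(faults)

    def bmc_read_faults(self):
        # The BMC reads the fault information from H2B.
        return list(self.h2b)
```

Keeping one region per direction means neither side overwrites data the other side has not yet read from its own region.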
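The register 130 stores the fault information of the first hardware and feeds it back: on receiving an interrupt signal sent by the first hardware, it triggers the corresponding bit and outputs the corresponding fault information. The register 130 may be a Machine Specific Register (MSR), a Configuration Space Register (CSR), or memory-mapped I/O (MMIO). Note that the embodiments of this application do not limit the specific form of the register; the above are only examples.
The memory 140 is an important component of the computer system: it is the bridge through which external storage (also called auxiliary storage) communicates with the CPU. The memory temporarily holds the data the CPU operates on and the data the CPU exchanges with external storage such as hard disks. For example, when the computer starts running, the data to be processed is loaded from the memory into the CPU; after the operation completes, the CPU stores the result back into the memory.
The PCIE device 150 is any expansion device that can be connected through the PCIE interface, such as a graphics processing unit (GPU). PCIE devices can enhance the data processing capability of the computer device.
The integrated south bridge 160 is responsible for controlling I/O and other peripheral interfaces, controlling PCIE devices, and providing additional functions.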
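The system architecture and application scenarios described in the embodiments of this application are intended to explain the technical solutions of the embodiments more clearly and do not limit them. A person of ordinary skill in the art will appreciate that, as system architectures evolve and new service scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
Next, the fault detection method is described in detail with reference to the accompanying drawings. FIG. 2 is a schematic flowchart of a fault detection method according to an embodiment of this application. The description uses the processor 110 and the management controller 120 shown in FIG. 1 as an example.
Step 210: The management controller generates a correspondence between a plurality of hardware components and registers to form a fault information table.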
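The BMC interface is shown in FIG. 3. On the BMC interface, the user can dynamically configure the fault information table as needed; the table indicates the correspondence between a plurality of hardware components and registers. The fault information table includes different registers, and different registers store the fault information of different hardware. When the fault information of different hardware is stored in different registers, the register type, register bit width, and register parameters must be taken into account, so that the registers storing different fault information can be told apart.
By way of example, Table 1 is a fault information table configured by the user on the BMC interface. Register types are divided into Machine Specific Registers (MSR), Configuration Space Registers (CSR), and memory-mapped I/O (MMIO). MSRs can indicate some CPU faults; CSRs can indicate some CPU faults and some memory faults; MMIO registers can indicate faults of Peripheral Component Interconnect Express (PCIE) devices and some memory faults. Registers of different types, parameters, and bit widths store different fault information of different hardware. The register bit width may be 8, 16, 32, or 64 bits.
Table 1
For example, the registers that record CPU fault information are A0, A1, A2, A3, and A4. Together they contain all CPU fault information, and different registers indicate different CPU faults. CPU faults may include poor CPU pin contact, CPU temperature sensor failure, CPU power supply fault, and CPU frequency reduction fault. Register A0 indicates poor CPU pin contact and CPU temperature sensor failure, register A1 indicates the CPU power supply fault, and register A3 indicates the CPU frequency reduction fault. The other fault information contained in the registers is not listed here.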
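To make the table structure concrete, the following Python sketch models a fault information table with the three register attributes named above (type, bit width, parameter). It is an assumption-laden illustration: the class names, the meaning of `parameter` (taken here to be an address or index that locates the register), and every register type, width, and bit position in the sample data are invented for the example and are not taken from Table 1.

```python
from dataclasses import dataclass, field
from enum import Enum


class RegisterType(Enum):
    MSR = "MSR"    # Machine Specific Register
    CSR = "CSR"    # Configuration Space Register
    MMIO = "MMIO"  # Memory-mapped I/O


@dataclass
class RegisterEntry:
    name: str               # register label, e.g. "A0"
    reg_type: RegisterType  # register type
    width_bits: int         # register bit width: 8, 16, 32 or 64
    parameter: int          # hypothetical: address or index locating the register
    faults: dict = field(default_factory=dict)  # bit position -> fault description

    def __post_init__(self):
        if self.width_bits not in (8, 16, 32, 64):
            raise ValueError(f"unsupported register width: {self.width_bits}")
        if any(bit < 0 or bit >= self.width_bits for bit in self.faults):
            raise ValueError("fault bit outside register width")


@dataclass
class FaultInfoTable:
    hardware: dict = field(default_factory=dict)  # hardware name -> list of RegisterEntry

    def registers_for(self, hardware_name):
        """Return the registers that record faults for one hardware component."""
        return self.hardware.get(hardware_name, [])


# Sample data loosely following the CPU example; all values are illustrative.
table = FaultInfoTable(hardware={
    "CPU": [
        RegisterEntry("A0", RegisterType.MSR, 64, 0x100,
                      {0: "poor CPU pin contact", 1: "CPU temperature sensor failure"}),
        RegisterEntry("A1", RegisterType.MSR, 64, 0x101, {0: "CPU power supply fault"}),
        RegisterEntry("A3", RegisterType.CSR, 32, 0x200, {0: "CPU frequency reduction fault"}),
    ],
})
```

Calling `table.registers_for("CPU")` returns the three sample entries in the order they were configured, and an unknown hardware name yields an empty list.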
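In another embodiment, after the BIOS collects the fault information of the hardware of the computer device and sends it to the BMC, the BMC can adaptively adjust the fault information stored in the registers and update the correspondence between hardware and registers, that is, the fault information table.
In one implementation, when the BIOS detects new fault information for hardware in the computer device, the user can update the fault information table on the BMC interface, or the BMC can adjust the table adaptively. For example, when the BIOS detects new fault information for the memory, that fault information can be stored in a register corresponding to the memory or in a register corresponding to other hardware. When it is stored in a register corresponding to the memory, the register parameters of that register in the fault information table are updated. When it is stored in a register corresponding to other hardware, the registers associated with the memory in the fault information table are updated.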
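The two update cases can be sketched as follows. This is a minimal Python illustration under stated assumptions: the dictionary layout of `table`, the function name `record_new_fault`, and the use of an integer `version` field to signal a change are all hypothetical choices for the example, not details given in the application.

```python
def record_new_fault(table, hardware, register, bit, description):
    """Add a new fault to `register` and keep the hardware mapping in sync.

    Assumed layout of `table` for this sketch:
      {"version": int,
       "registers": {register name: {bit: description}},
       "hardware": {hardware name: [register names]}}
    """
    bits = table["registers"].setdefault(register, {})
    if bit in bits:
        raise ValueError(f"bit {bit} of {register} is already assigned")
    bits[bit] = description

    mapped = table["hardware"].setdefault(hardware, [])
    if register in mapped:
        # The fault went into a register already tied to this hardware.
        change = "register_updated"
    else:
        # The fault went into another register, so extend the mapping.
        mapped.append(register)
        change = "mapping_updated"

    table["version"] += 1  # lets a reader of the table notice the change
    return change
```

For instance, adding a memory fault to a register already mapped to the memory only changes that register's contents, while adding it to a register mapped to a PCIE device also appends that register to the memory's list.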
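The user can configure the fault information table in the BMC according to fault diagnosis needs and dynamically add to the fault information stored in the registers. Because the BMC is completely independent of the operating system of the computer device, updating the fault information table in the BMC does not affect the running operating system and does not require restarting the computer device, which shortens fault detection time and improves fault detection efficiency.
Step 220: The processor obtains the fault information table.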
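After obtaining the fault information table from the BMC, the BIOS must verify it. The BIOS uses the flag bits of the fault information table to verify the table's validity and version, where validity indicates whether the table contains errors. As shown in FIG. 4, the BIOS obtains the fault information table and the corresponding first flag bit and second flag bit from the BMC, and uses the first flag bit to check whether the table contains errors (step 410). When the first flag bit is not the first preset value, the table contains errors; the BIOS discards it and stops fault detection (step 420). When the first flag bit is the first preset value, the table is correct, and the BIOS verifies the second flag bit of the table (step 430).
When the second flag bit of the obtained table differs from the second flag bit of the table stored in the BIOS, the user has updated the fault information table. The BIOS stores the updated table (step 440) and determines from it at least one register that records the fault information of the first hardware. When the two second flag bits are the same, the user has not updated the fault information table, and the table in the BIOS does not need to be updated (step 450); the BIOS can use its stored table to determine at least one register that records the hardware's fault information. Then, when hardware generates an interrupt signal during its fault self-check, the interrupt signal, which may include the fault type sent by the hardware, is delivered directly to the determined register, and the register records the fault type from the interrupt signal.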
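The verification flow of FIG. 4 can be summarized in a short Python sketch. The value `0xA5` for the first preset value, the dictionary keys, and the handling of the case where the BIOS has no stored table yet are assumptions made for illustration; the application does not specify them.

```python
FIRST_FLAG_EXPECTED = 0xA5  # hypothetical "first preset value"


def check_table(received, stored):
    """Decide which fault information table the BIOS should use.

    Both tables are dicts carrying "first_flag" and "second_flag" keys
    (an assumed layout). Returns (table_to_use, action).
    """
    # Steps 410/420: a wrong first flag means the table is erroneous.
    if received["first_flag"] != FIRST_FLAG_EXPECTED:
        return None, "discard"
    # Steps 430/440: a differing second flag means the user updated the table.
    # Treating "nothing stored yet" as an update is an assumption of this sketch.
    if stored is None or received["second_flag"] != stored["second_flag"]:
        return received, "update"
    # Step 450: flags match, keep using the table already stored in the BIOS.
    return stored, "keep"
```

Checking validity before version means a corrupted table can never replace a good stored copy, whatever its second flag says.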
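After the computer device is powered on, the BIOS instructs the hardware to run the power-on self-test, which checks whether the hardware of the computer device has faults so as to ensure that it can be used normally. The hardware includes the CPU, memory, motherboard, and PCIE devices. When a hardware component of the computer device is found to be faulty, the self-test program triggers that component to issue an interrupt signal.
An interrupt signal is an alarm signal that the computing device generates after detecting a hardware fault or error, and it indicates that the computer device is in an abnormal state. Interrupt signals may include the legacy interrupt (Interrupt, INT), System Management Interrupt (SMI), Message Signaled Interrupt (MSI), Non Maskable Interrupt (NMI), or other interrupt signals that indicate a hardware fault or error in the computer device; the embodiments of this application do not specifically limit this.
In one example, the Machine-Check Architecture (MCA) is used to self-check the server hardware and issue an interrupt signal when a hardware fault is found. Hardware faults detected by the Machine-Check Architecture may include system bus faults, memory faults, parity errors, cache faults, translation lookaside buffer faults, and so on. These hardware faults can harm the stability of the computer device and cannot be recovered from, yet they are unavoidable in a large server environment such as a server cluster or a cloud computing environment. Therefore, in the embodiments of this application, when hardware detects a fault, it generates an interrupt signal and delivers it to the corresponding register; the fault information of the faulty hardware is then obtained from that register so that the user can repair the hardware accordingly.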
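Step 230: According to the fault information table, the processor obtains the fault information of the first hardware fed back by the register corresponding to the first hardware.
The register corresponding to a hardware component sets the appropriate bits according to the interrupt signal delivered by that component, and each bit indicates a different piece of fault information. When a bit of the register is set, the first hardware has the fault corresponding to that bit, and the BIOS obtains the fault information that the bit indicates. The BIOS sends the obtained fault information to the BMC, where it is displayed on the BMC interface so that the user can see the faults of the computer device directly.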
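Decoding set bits into fault information can be sketched in a few lines of Python. The function name `decode_faults`, the default width of 32 bits, and the sample bit assignments are assumptions for illustration only.

```python
def decode_faults(value, bit_faults, width_bits=32):
    """Return the fault descriptions whose bits are set in a register value.

    `bit_faults` maps a bit position to the fault information that bit
    indicates; positions and descriptions here are illustrative.
    """
    if value < 0 or value >= (1 << width_bits):
        raise ValueError("register value does not fit the register width")
    return [
        description
        for bit, description in sorted(bit_faults.items())
        if (value >> bit) & 1  # keep only bits that are set
    ]
```

With bit 0 meaning poor CPU pin contact and bit 1 meaning CPU temperature sensor failure, a register value of `0b10` decodes to the temperature sensor failure only, and `0b11` decodes to both faults in bit order.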
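FIG. 5 is a schematic diagram of a fault detection method according to an embodiment of this application. According to the fault information table, the registers corresponding to the CPU are A0, A1, A2, A3, and A4, which together contain all CPU fault information; the registers corresponding to the memory are B0, B1, B2, and B3; and the registers corresponding to the PCIE device are C0, C1, and C2. After the BIOS obtains the fault information table, the CPU runs its self-check program. If the CPU has a fault, it delivers an interrupt signal to the corresponding register, which yields the corresponding fault information; the BIOS collects the CPU fault information and then instructs the memory to run its self-check program. If the CPU has no fault, the BIOS directly instructs the memory to run its self-check program and obtain the corresponding fault information, and then instructs the PCIE device to run its self-check program and obtain the corresponding fault information. The BIOS reports the fault information to the BMC so that the user can repair the hardware faults accordingly.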
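The hardware-by-hardware collection in FIG. 5 can be sketched as a loop over the table. In this Python illustration the table layout, the `read_register` callback, and the fixed check order are assumptions; real firmware would read MSR, CSR, or MMIO registers rather than call a Python function.

```python
def collect_faults(table, read_register, order=("CPU", "memory", "PCIE")):
    """Check each hardware component in turn and gather its fault information.

    `table` maps hardware name -> {register name -> {bit: description}}
    (an assumed layout); `read_register` is a caller-supplied function that
    returns the current integer value of a named register.
    """
    report = {}
    for hardware in order:
        faults = []
        for register, bit_faults in table.get(hardware, {}).items():
            value = read_register(register)
            faults.extend(
                description
                for bit, description in sorted(bit_faults.items())
                if (value >> bit) & 1
            )
        if faults:
            # Only hardware with at least one set fault bit is reported.
            report[hardware] = faults
    return report
```

A component with no set bits simply contributes nothing to the report, which mirrors the flow moving on to the next component when the current one has no fault.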
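FIG. 6 is a detailed schematic flowchart of a fault detection method according to an embodiment of this application. In another implementation, the correspondence between hardware and registers is preconfigured in the BIOS. When hardware generates an interrupt signal during its fault self-check (step 610), fault information can be collected statically according to user needs (step 620). As shown in FIG. 7, the computer device displays a selection interface with two options: static detection and dynamic detection. When the user selects static detection, the preconfigured registers meet the user's needs for detecting hardware faults, and the register corresponding to the hardware can be triggered directly by the interrupt signal to obtain the corresponding fault information. When the user selects dynamic detection, the preconfigured registers do not meet the user's needs for detecting hardware faults; the BIOS obtains the fault information table from the BMC (step 630), verifies the fault information table (step 640), obtains the fault information fed back by the register corresponding to the first hardware and reports it (step 650), and ends fault detection (step 660).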
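The choice between static and dynamic detection can be expressed as a small dispatcher. This Python sketch is illustrative only: the function name, the mode strings, and the callback parameters are hypothetical, and returning `None` on failed verification is an assumption about how "stop detection" would surface to a caller.

```python
def run_detection(mode, preset_table, fetch_table, verify_table, collect):
    """Run static or dynamic fault detection.

    `preset_table` is the correspondence preconfigured in the BIOS;
    `fetch_table`, `verify_table` and `collect` are caller-supplied
    callables standing in for steps 630, 640 and 650 respectively.
    """
    if mode == "static":
        # The preconfigured registers are enough: collect from them directly.
        return collect(preset_table)
    if mode == "dynamic":
        table = fetch_table()          # step 630: obtain the table from the BMC
        if not verify_table(table):    # step 640: verify the table
            return None                # verification failed, stop detection
        return collect(table)          # step 650: obtain and report fault information
    raise ValueError(f"unknown detection mode: {mode!r}")
```

Passing the steps in as callables keeps the mode selection separate from how tables are fetched, verified, and read.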
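It can be understood that, to implement the functions in the above embodiments, the computer includes hardware structures and/or software modules corresponding to each function. A person skilled in the art will readily appreciate that, in combination with the units and method steps of the examples described in the embodiments disclosed herein, the embodiments of this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application scenario and design constraints of the technical solution.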
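FIG. 8 is a schematic structural diagram of a fault detection apparatus according to an embodiment of this application. Such a fault detection apparatus can implement the functions of the computer device in the above method embodiments and can therefore also achieve the beneficial effects of those embodiments. In the embodiments of this application, the fault detection apparatus may be the computer device 100 shown in FIG. 1.
As shown in FIG. 8, the fault detection apparatus 800 includes a configuration module 801 and a sending module 802. The fault detection apparatus 800 is used to implement the functions of the computer device 100 in the method embodiment shown in FIG. 2.
When the fault detection apparatus 800 is used to implement the functions of the computer device 100 in the method embodiment shown in FIG. 2:
The configuration module 801 is configured to generate a correspondence between a plurality of hardware components and registers to form a fault information table, where the register corresponding to each hardware component is associated with at least one piece of fault information of that hardware component.
The sending module 802 is configured to send the fault information table to the processor.
The configuration module 801 is further configured to update, according to the fault information of the first hardware indicated by the user, the fault information associated with the register corresponding to the first hardware, to obtain an updated correspondence, where the first hardware is any one of the plurality of hardware components; and to send the updated correspondence to the processor.
The fault detection apparatus 800 further includes an obtaining module 803.
The obtaining module 803 is configured to obtain a fault information table, where the table indicates a correspondence between a plurality of hardware components and registers, and the register corresponding to each hardware component is associated with at least one piece of fault information of that hardware component.
The obtaining module 803 is further configured to obtain the fault information of the first hardware fed back by the register corresponding to the first hardware, where the fault information of the first hardware is determined by the register corresponding to the first hardware based on the interrupt signal of the first hardware, and the first hardware is any one of the plurality of hardware components.
The obtaining module 803 is further configured to obtain the fault information table and a flag bit, where the flag bit is used to verify the fault information table; and, when the fault information table is verified successfully, to obtain the fault information of the first hardware fed back by the register corresponding to the first hardware.
The obtaining module 803 is further configured to determine whether the fault information table is consistent with a first fault information table stored in the computer device, and, when they are inconsistent, to update the fault information table into the computer device.
The fault detection apparatus 800 further includes a storage module 804. The storage module 804 is configured to store the fault information table.
More detailed descriptions of the configuration module 801, the sending module 802, the obtaining module 803, and the storage module 804 can be obtained directly from the related descriptions in the method embodiment shown in FIG. 2 and are not repeated here.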
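FIG. 9 provides a computer device. The computer device 900 shown in FIG. 9 can be used to implement the functions of the fault detection apparatus 800 in the embodiment shown in FIG. 8.
The computer device 900 includes a bus 901, a processor 902, a management controller 903, a communication interface 904, and a memory 905. The processor 902, the management controller 903, the memory 905, and the communication interface 904 communicate through the bus 901. The bus 901 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is drawn in FIG. 9, but this does not mean that there is only one bus or one type of bus. The communication interface 904 is used to communicate with the outside, for example to receive user instructions.
The processor 902 may be a central processing unit (CPU). The processor 902 is configured to obtain the fault information table; to instruct the plurality of hardware components to perform a fault self-check, together with the register corresponding to each hardware component; and to obtain the fault information of the first hardware fed back by the register corresponding to the first hardware. The management controller 903 is configured to generate a correspondence between a plurality of hardware components and registers, where the register corresponding to each hardware component is associated with at least one piece of fault information of that hardware component, and to configure the correspondence into the processor. The management controller 903 may include a monitoring and management unit external to the computer device, a management system in a management chip outside the processor, a baseboard management controller (BMC) of the computer device, or a system management module (system management mode, SMM). The memory 905 may include volatile memory, for example random access memory (RAM). The memory 905 may also include non-volatile memory, for example read-only memory (ROM), flash memory, an HDD, or an SSD.
The memory 905 stores executable code, and the processor 902 and the management controller 903 execute the executable code to perform the foregoing fault detection method.
Specifically, when the embodiment shown in FIG. 8 is implemented and the modules described in that embodiment are implemented in software, the memory 905 stores the software or program code needed to perform the functions of the configuration module 801, the sending module 802, and the obtaining module 803 in FIG. 8, and the processor 902 and the management controller 903 execute the instructions in the memory 905 to perform the fault detection method applied to the fault detection apparatus 800.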
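The embodiments of this application further provide a computer-readable storage medium including instructions which, when run on a computer, cause the computer to perform the above fault detection method applied to the fault detection apparatus 800.
The embodiments of this application further provide a computer program product; when the computer program product is executed by a computer, the computer performs any one of the foregoing methods. The computer program product may be a software installation package, which can be downloaded and executed on a computer whenever any one of the foregoing methods is needed.
It should also be noted that the apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided in this application, a connection between modules indicates that they have a communication connection, which may be implemented as one or more communication buses or signal lines.
From the description of the above implementations, a person skilled in the art can clearly understand that the embodiments of this application can be implemented by software plus the necessary general-purpose hardware, and of course also by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and so on. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can vary, for example analog circuits, digital circuits, or dedicated circuits. For the embodiments of this application, however, a software implementation is the better choice in most cases. Based on this understanding, the technical solutions of the embodiments of this application, or the part that contributes to the prior art, can in essence be embodied as a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc of a computer, and includes several instructions that cause a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods described in the embodiments of this application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part as a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (for example coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a training device or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example a floppy disk, hard disk, or magnetic tape), an optical medium (for example a DVD), or a semiconductor medium (for example a solid state disk (SSD)).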

Claims (10)

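  1. A fault detection method, characterized by comprising:
    obtaining a fault information table, wherein the fault information table indicates a correspondence between a plurality of hardware components and registers, and the register corresponding to each hardware component is associated with at least one piece of fault information of the hardware component;
    obtaining, according to the fault information table, fault information of a first hardware fed back by the register corresponding to the first hardware, wherein the fault information of the first hardware is stored in the register corresponding to the first hardware, and the first hardware is any one of the plurality of hardware components.
  2. The method according to claim 1, characterized in that obtaining the fault information table comprises:
    obtaining the fault information table and a flag bit, wherein the flag bit is used to verify the fault information table;
    when the fault information table is verified successfully, obtaining the fault information of the first hardware fed back by the register corresponding to the first hardware.
  3. The method according to claim 1 or 2, characterized in that the method is applied to a computer device and further comprises:
    determining whether the fault information table is consistent with a first fault information table stored in the computer device;
    when they are inconsistent, updating the fault information table into the computer device.
  4. The method according to any one of claims 1 to 3, characterized in that the fault information table further comprises information of the registers, and the information of the registers comprises a register type, a register bit width, and register parameters.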
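  5. A fault detection method, characterized in that a computer device comprises a management controller and a processor, the method is performed by the management controller, and the method comprises:
    generating a correspondence between a plurality of hardware components and registers to form a fault information table, wherein the register corresponding to each hardware component is associated with at least one piece of fault information of the hardware component;
    sending the fault information table to the processor.
  6. The method according to claim 5, characterized in that the method further comprises:
    updating, according to fault information of a first hardware indicated by a user, the fault information associated with the register corresponding to the first hardware, to obtain an updated correspondence, wherein the first hardware is any one of the plurality of hardware components;
    sending the updated correspondence to the processor.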
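  7. A computer device, characterized in that the computer device comprises a management controller and a processor, and the management controller is configured to:
    generate a correspondence between a plurality of hardware components and registers to form a fault information table, wherein the register corresponding to each hardware component is associated with at least one piece of fault information of the hardware component;
    send the correspondence between the plurality of hardware components and registers to the processor;
    the processor is configured to: obtain the fault information table from the management controller;
    obtain, according to the fault information table, fault information of a first hardware fed back by the register corresponding to the first hardware, wherein the fault information of the first hardware is stored in the register corresponding to the first hardware, and the first hardware is any one of the plurality of hardware components.
  8. The computer device according to claim 7, characterized in that the management controller is further configured to:
    update, according to fault information of the first hardware indicated by a user, the fault information associated with the register corresponding to the first hardware, to obtain an updated correspondence;
    configure the updated correspondence into the processor.
  9. The computer device according to claim 7 or 8, characterized in that the processor is further configured to:
    obtain the fault information table and a flag bit, wherein the flag bit is used to verify the fault information table;
    when the fault information table is verified successfully, obtain the fault information of the first hardware fed back by the register corresponding to the first hardware.
  10. The computer device according to any one of claims 7 to 9, characterized in that the processor is further configured to:
    determine whether the fault information table is consistent with a first fault information table stored in the computer device;
    when they are inconsistent, update the fault information table into the computer device.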
PCT/CN2023/118911 2022-12-29 2023-09-14 故障检测方法及计算机设备 WO2024139423A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211715921.7 2022-12-29
CN202211715921.7A CN116048896A (zh) 2022-12-29 2022-12-29 故障检测方法及计算机设备

Publications (1)

Publication Number Publication Date
WO2024139423A1 true WO2024139423A1 (zh) 2024-07-04

Family

ID=86123102

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/118911 WO2024139423A1 (zh) 2022-12-29 2023-09-14 故障检测方法及计算机设备

Country Status (2)

Country Link
CN (1) CN116048896A (zh)
WO (1) WO2024139423A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116048896A (zh) * 2022-12-29 2023-05-02 超聚变数字技术有限公司 故障检测方法及计算机设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100037044A1 (en) * 2008-08-11 2010-02-11 Chih-Cheng Yang Method and system for using a server management program for an error configuration table
US20120229155A1 (en) * 2011-03-08 2012-09-13 Kabushiki Kaisha Toshiba Semiconductor integrated circuit, failure diagnosis system and failure diagnosis method
CN111767184A (zh) * 2020-09-01 2020-10-13 苏州浪潮智能科技有限公司 一种故障诊断方法、装置及电子设备和存储介质
CN111901683A (zh) * 2020-07-24 2020-11-06 海信视像科技股份有限公司 一种故障告警信息的显示方法及显示设备
CN115221015A (zh) * 2022-07-15 2022-10-21 苏州浪潮智能科技有限公司 硬盘故障预警方法、***、终端及存储介质
CN116048896A (zh) * 2022-12-29 2023-05-02 超聚变数字技术有限公司 故障检测方法及计算机设备

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "S7-200-SMART-Hardware Diagnostics", 3 September 2022 (2022-09-03), XP093187478, Retrieved from the Internet <URL:https://www.doc88.com/p-58039609934624.html> *
RONGJIU TECHNOLOGY: "How to determine whether Siemens S7-200SMART has hardware failure", 10 May 2019 (2019-05-10), XP093187480, Retrieved from the Internet <URL:https://baijiahao.***.com/s?id=1633138015788883298&wfr=spider&for=pc> *

Also Published As

Publication number Publication date
CN116048896A (zh) 2023-05-02
