CN116991612A - System management interrupt storm restraining method and related device

Info

Publication number: CN116991612A
Application number: CN202211200205.5A
Authority: CN
Inventors: 严鑫
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-09-29
Filing date: 2022-09-29
Publication date: 2023-11-03

Abstract

The application discloses a method and a related device for suppressing a system management interrupt storm, used for remote fault operation and maintenance of computer equipment. When target hardware generates a correctable error due to a hardware fault, a first system management interrupt signal is triggered; the first system management interrupt signal is generated based on the correctable error, and each correctable error generates one first system management interrupt signal. If a system management interrupt storm is determined to have occurred according to the generation frequency of the first system management interrupt signal, triggering of the first system management interrupt signal is prohibited; a second system management interrupt signal is then triggered according to a preset period, and the basic input/output system is triggered by the second system management interrupt signal to scan for hardware faults. If the scanning result shows that no hardware fault exists on the target hardware, the second system management interrupt signal is switched back to the first system management interrupt signal, thereby balancing the performance of the computer equipment against its operability and maintainability.

Description

System management interrupt storm restraining method and related device

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for suppressing a system management interrupt storm.

Background

Many computer devices may generate correctable errors (Correctable Error, CE) after a hardware failure. A correctable error is one that can be corrected, so the computer device on which it occurred can continue to operate.

When a hardware failure generates a CE, a system management interrupt (System Management Interrupt, SMI) is triggered to handle the hardware failure; an SMI generated in this way may be referred to as a CE SMI. Specifically, after receiving the CE SMI signal, the basic input/output system (Basic Input/Output System, BIOS) determines the fault information through a handler and reports the hardware fault information so that the fault can be recovered.

However, if the failed hardware is used very frequently while the computer device is running, a large number of CE SMIs are likely to be triggered in a short time, forming an SMI storm, which causes the computer device to stall and degrades its performance.

Disclosure of Invention

To solve the above technical problem, the application provides a method and a related device for suppressing a system management interrupt storm. To a certain extent, they avoid the stalling and performance degradation of computer equipment caused by a large number of SMIs in an SMI storm, while ensuring continuous reporting of fault information, so that the computer equipment remains continuously operable and maintainable and a balance between its performance and its operability is achieved.

The embodiment of the application discloses the following technical scheme:

in one aspect, an embodiment of the present application provides a method for suppressing a system management interrupt storm, where the method includes:

when the target hardware generates a correctable error based on hardware faults, triggering a first system management interrupt signal, wherein the first system management interrupt signal is a system management interrupt signal generated based on the correctable error and is used for triggering a basic input and output system to report fault information, and generating a first system management interrupt signal when one correctable error occurs;

if the occurrence of the system management interrupt storm is determined according to the generation frequency of the first system management interrupt signal, the triggering of the first system management interrupt signal is forbidden;

triggering a second system management interrupt signal according to a preset period, triggering the basic input/output system to scan hardware faults through the second system management interrupt signal, wherein the generation frequency of the second system management interrupt signal is smaller than that of the first system management interrupt signal, and the generation frequency of the second system management interrupt signal is determined based on the preset period;

and if the target hardware is determined to have no hardware fault according to the scanning result, switching the second system management interrupt signal back to the first system management interrupt signal.

scanning hardware faults according to a preset period through a fault management system;

and if the fact that the hardware fault does not exist on the target hardware is determined according to the scanning result, enabling the first system management interrupt signal.

In one aspect, an embodiment of the present application provides a device for suppressing a system management interrupt storm, where the device includes a triggering unit, a prohibiting unit, a scanning unit, and a switching unit:

the triggering unit is used for triggering a first system management interrupt signal when the target hardware generates a correctable error based on hardware faults, wherein the first system management interrupt signal is a system management interrupt signal generated based on the correctable error and is used for triggering a basic input and output system to report fault information, and a first system management interrupt signal is generated when one correctable error occurs;

the prohibiting unit is used for prohibiting triggering of the first system management interrupt signal if it is determined that a system management interrupt storm occurs according to the generation frequency of the first system management interrupt signal;

the scanning unit is used for triggering a second system management interrupt signal according to a preset period, triggering the basic input/output system to scan hardware faults through the second system management interrupt signal, wherein the generation frequency of the second system management interrupt signal is smaller than that of the first system management interrupt signal, and the generation frequency of the second system management interrupt signal is determined based on the preset period;

and the switching unit is used for switching the second system management interrupt signal back to the first system management interrupt signal if the fact that the target hardware has no hardware fault is determined according to the scanning result.

In one aspect, an embodiment of the present application provides a device for suppressing a system management interrupt storm, where the device includes a triggering unit, a prohibiting unit, a scanning unit, and an enabling unit:

the scanning unit is used for scanning hardware faults according to a preset period through the fault management system;

and the enabling unit is used for enabling the first system management interrupt signal if the target hardware is determined to have no hardware fault according to the scanning result.

In one aspect, an embodiment of the present application provides a computer device including a processor and a memory:

the memory is used for storing program codes and transmitting the program codes to the processor;

the processor is configured to perform the method of any of the preceding aspects according to instructions in the program code.

In one aspect, embodiments of the present application provide a computer readable storage medium for storing program code which, when executed by a processor, causes the processor to perform the method of any one of the preceding aspects.

In one aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the preceding aspects.

According to the above technical scheme, when the target hardware generates a correctable error due to a hardware fault, the first system management interrupt signal is triggered. The first system management interrupt signal is generated based on the correctable error and is used for triggering the basic input/output system to report fault information, and one first system management interrupt signal is generated each time a correctable error occurs. If a system management interrupt storm is determined to have occurred according to the generation frequency of the first system management interrupt signal, triggering of the first system management interrupt signal is prohibited. After the first system management interrupt signal is disabled, the hardware fault may continue to occur. So that fault information can still be reported to the fault management system, the second system management interrupt signal is triggered according to a preset period, and the basic input/output system is triggered by the second system management interrupt signal to scan for hardware faults. Because the second system management interrupt signal is a periodic low-frequency system management interrupt signal whose generation frequency is lower than that of the first system management interrupt signal, very few second system management interrupt signals are generated even if a large number of correctable errors continue to occur, so the system management interrupt storm is suppressed while operability and maintainability are preserved. If the scanning result shows that no hardware fault exists on the target hardware, the second system management interrupt signal is switched back to the first system management interrupt signal and normal fault management continues. By switching between the triggering modes of the first and second system management interrupt signals while performing fault management through the SMI, the application avoids, to a certain extent, the stalling and performance degradation of computer equipment caused by a large number of SMIs in an SMI storm, while ensuring continuous reporting of fault information, so that the computer equipment remains continuously operable and maintainable and a balance between its performance and its operability is achieved.

Drawings

In order to illustrate the embodiments of the application or the technical solutions of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the application, and a person skilled in the art may obtain other drawings from them without inventive effort.

FIG. 1 is an application scenario architecture diagram of a method for suppressing a system management interrupt storm according to an embodiment of the present application;

FIG. 2 is a flowchart of a method for suppressing a system management interrupt storm according to an embodiment of the present application;

FIG. 3 is a schematic diagram of storm suppression effect according to an embodiment of the present application;

FIG. 4 is a detailed flowchart of a method for suppressing a system management interrupt storm according to an embodiment of the present application;

FIG. 5 is a flowchart of another method for suppressing a system management interrupt storm according to an embodiment of the application;

FIG. 6 is a block diagram of a suppression device for a system management interrupt storm according to an embodiment of the present application;

FIG. 7 is a block diagram of another suppression device for a system management interrupt storm according to an embodiment of the application;

FIG. 8 is a block diagram of a terminal according to an embodiment of the present application;

FIG. 9 is a block diagram of a server according to an embodiment of the present application.

Detailed Description

Embodiments of the present application are described below with reference to the accompanying drawings.

When a hardware fault generates a CE, an SMI signal is triggered to handle the hardware fault. After the BIOS receives the SMI signal, it determines the fault information through its handler and reports the hardware fault information to a fault management system, so that remote fault operation and maintenance can be performed on the computer device to recover from the fault.

However, each CE triggers one CE SMI. If the failed hardware is used very frequently while the computer device is running, a large number of CEs are likely to be generated in a short time, triggering a large number of CE SMIs and forming an SMI storm. Each CE SMI may cause the failed hardware to enter system management mode (System Management Mode, SMM), and the hardware cannot process services while it is in SMM. An SMI storm may therefore cause problems such as system hangs and service interruption, degrading the performance of the computer device.

To solve this technical problem, the related art suppresses the SMI storm by disabling the CE SMI. However, once the CE SMI is disabled, the handler of the BIOS can no longer be triggered even if CEs caused by the hardware fault continue to occur. Fault information is then no longer reported to the fault management system, which can neither continue to perceive the fault information nor continue to detect the hardware fault, so remote operation and maintenance fails.

Therefore, the embodiment of the application provides a method for suppressing a system management interrupt storm. After an SMI storm occurs, the method still disables the first system management interrupt signal (namely, the CE SMI), but after the CE SMI is disabled it continues to trigger a periodic low-frequency SMI (namely, the second system management interrupt signal) to perform fault detection. When the periodic low-frequency SMI no longer detects a hardware fault, the periodic SMI is disabled and the CE SMI is switched back on. By dynamically adjusting and switching between the two SMI modes, a balance between service impact and system operability is achieved: to a certain extent, the stalling and performance degradation of computer equipment caused by a large number of SMIs in an SMI storm are avoided, while continuous reporting of fault information is ensured, so that the computer equipment remains continuously operable and maintainable.

It should be noted that the method for suppressing the system management interrupt storm provided by the embodiment of the present application is applicable to SMI storms caused by various hardware of a computer device, and the method may be executed by the computer device. The hardware on the computer device may include, for example, memory, the high-speed serial computer expansion bus standard (Peripheral Component Interconnect Express, PCIe), a central processing unit (Central Processing Unit, CPU), and the like. The computer device may be, for example, a server or a terminal. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. Terminals include, but are not limited to, smart phones, computers, intelligent voice interaction devices, intelligent appliances, vehicle terminals, aircraft, and the like.

FIG. 1 shows an application scenario architecture diagram of the method for suppressing a system management interrupt storm. The application scenario, and the method provided by the embodiment of the application, are described below by taking a server as an example of the computer device.

The application scenario may include a server 100. When certain hardware of the server 100, such as target hardware, generates a correctable error due to a hardware fault, a first system management interrupt signal may be triggered. The first system management interrupt signal is a system management interrupt signal generated based on a correctable error and may be used to trigger the basic input/output system (BIOS) to report fault information. One first system management interrupt signal is generated each time a correctable error occurs, so the first system management interrupt signal may be referred to as a CE SMI.

If the server 100 determines, according to the generation frequency of the first system management interrupt signal, that a system management interrupt storm (i.e., an SMI storm) has occurred, it prohibits triggering of the first system management interrupt signal (i.e., disables the CE SMI). After the first system management interrupt signal is disabled, the hardware fault may continue to occur. So that fault information can still be reported to the fault management system, the server 100 may trigger the second system management interrupt signal according to a preset period and use it to trigger the basic input/output system to scan for hardware faults, ensuring that fault information continues to be reported.

The second system management interrupt signal is also an SMI, but its generation frequency is lower than that of the first system management interrupt signal; it is a periodic low-frequency system management interrupt signal (i.e., a periodic low-frequency SMI). Even if the target hardware continues to generate a large number of correctable errors, very few second system management interrupt signals are generated, so the system management interrupt storm is suppressed while operability and maintainability are preserved.

If the server 100 determines that there is no hardware fault on the target hardware according to the scan result, the second system management interrupt signal may be switched back to the first system management interrupt signal, and normal fault management may be continued.

By switching between the triggering modes of the first and second system management interrupt signals while performing fault management through the SMI, the application avoids, to a certain extent, the stalling and performance degradation of computer equipment caused by a large number of SMIs in an SMI storm, while ensuring continuous reporting of fault information, so that the computer equipment remains continuously operable and maintainable and a balance between its performance and its operability is achieved.
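
The switching between the two SMI modes described for the server 100 can be sketched as a small state machine. The following Python sketch is only an illustration under stated assumptions: the names `Mode`, `SmiController`, `on_ce_smi_window` and `on_periodic_smi` are hypothetical, the storm decision is reduced to a count per counting window, and a real implementation would live in BIOS firmware rather than in Python.

```python
from enum import Enum


class Mode(Enum):
    INTERRUPT = 1  # mode 1: every CE triggers one CE SMI
    PATROL = 2     # mode 2: one periodic low-frequency SMI per preset period


class SmiController:
    """Hypothetical controller that switches between the two SMI modes."""

    def __init__(self, storm_threshold=5):
        self.mode = Mode.INTERRUPT
        self.storm_threshold = storm_threshold  # assumed CE SMI count per window
        self.reports = []  # stand-in for reports sent to the fault management system

    def on_ce_smi_window(self, ce_smi_count):
        """Called once per counting window while CE SMIs are enabled."""
        if self.mode is not Mode.INTERRUPT:
            return
        self.reports.append(("ce_smi", ce_smi_count))
        if ce_smi_count >= self.storm_threshold:
            # Storm detected: disable the CE SMI and start the periodic SMI.
            self.mode = Mode.PATROL

    def on_periodic_smi(self, fault_found):
        """Called once per preset period while the periodic SMI is enabled."""
        if self.mode is not Mode.PATROL:
            return
        self.reports.append(("patrol_scan", fault_found))
        if not fault_found:
            # No hardware fault in the scan result: switch back to the CE SMI.
            self.mode = Mode.INTERRUPT


controller = SmiController()
controller.on_ce_smi_window(7)                 # storm: switch to patrol mode
controller.on_periodic_smi(fault_found=True)   # fault persists: stay in patrol mode
controller.on_periodic_smi(fault_found=False)  # fault gone: back to interrupt mode
```

In this sketch fault information is recorded in both modes, which reflects the point that reporting continues while the storm is suppressed.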

It should be noted that, the method provided by the embodiment of the present application mainly relates to artificial intelligence, where the artificial intelligence (Artificial Intelligence, AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, sense an environment, acquire knowledge and use knowledge to obtain an optimal result. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

With the research and progress of artificial intelligence technology, artificial intelligence is being studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare, smart customer service, the Internet of Vehicles, and smart transportation. It is believed that, as the technology develops, artificial intelligence will be applied in more fields and will become increasingly important. The embodiment of the application mainly uses artificial intelligence technology to realize intelligent dynamic adjustment and switching of the two SMI modes.

The embodiment of the application also relates to the field of cloud technology (Cloud technology). Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like that are applied based on the cloud computing business model; it can form a resource pool that is used on demand and is flexible and convenient. Cloud computing technology will become an important support. The background services of technical network systems, such as video websites, image websites, and other portal websites, require large amounts of computing and storage resources. With the rapid development and application of the internet industry, every item may in the future have its own identifier, which needs to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data require strong back-end system support, which can only be achieved through cloud computing. The embodiment of the application realizes remote fault operation and maintenance of the computer equipment through cloud technology.

Next, a method for suppressing the system management interrupt storm by using the server as an example will be described in detail with reference to the accompanying drawings. Referring to fig. 2, fig. 2 shows a flowchart of a method for suppressing a system management interrupt storm, the method comprising:

S201, when the target hardware generates a correctable error based on hardware faults, triggering a first system management interrupt signal.

The server may include various hardware, such as memory, PCIe, CPU, etc., and generally, CEs generated by hardware faults of these three types of hardware are relatively common on the server.

The server may trigger the first system management interrupt signal when some hardware on the server, such as the target hardware, generates a correctable error based on the hardware failure. The target hardware may be at least one of memory, PCIe, and CPU. The first system management interrupt signal may be a system management interrupt signal generated based on a correctable error, and one first system management interrupt signal is generated every time a correctable error occurs, so the first system management interrupt signal may be referred to as a CE SMI. The first system management interrupt signal can be used for triggering the basic input and output system to report fault information, and the fault information can be reported to the fault management system under normal conditions, so that the fault management system can conveniently carry out remote fault operation and processing on the server.

Specifically, when the target hardware generates a CE due to a hardware fault, the CE SMI is triggered to handle the hardware fault: the handler of the BIOS obtains the fault information by scanning for the hardware fault and reports the corresponding fault information to the fault management system. The fault information may include, for example, the fault source and the fault type. The fault management system may be a small operating system used to facilitate remote management, installation, restarting, and similar operations on the server; it may be at least one of a baseboard management controller (Baseboard Management Controller, BMC) or an operating system (Operating System, OS).

It should be noted that, the mode of triggering the fault handling by the CE SMI may be an SMI mode 1, i.e. an interrupt mode, in which each CE triggers a CE SMI, and if the target hardware generates a large number of CEs, the target hardware triggers a large number of CE SMIs.

In one possible implementation manner, triggering of the CE SMI may be implemented by configuring an interrupt register corresponding to a hardware fault, for example, the interrupt register corresponding to the hardware fault may be configured during BIOS initialization (server start-up), and an interrupt mode of the interrupt register is set to an enabled state, that is, the hardware fault is set to a mode 1 (interrupt mode), where each CE generates a CE SMI.

Since multiple hardware may be included in the server, different registers may be configured as interrupt registers for different hardware to trigger the first system management interrupt signal (i.e., CE SMI). For example, the CPU or memory may trigger the CE SMI by configuring a machine check architecture (Machine Check Architecture, MCA) register, and the PCIe may trigger the CE SMI by configuring an advanced error report (Advanced Error Reporting, AER) register of the PCIe. In this case, when the triggering of the CE SMI is implemented by configuring the interrupt register, the interrupt register corresponding to the target hardware may be found first, and then the interrupt mode of the interrupt register corresponding to the target hardware may be set to an enabled state.
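
The register selection described above can be illustrated with a short Python sketch. It is a simulation only: real MCA and AER registers are programmed by firmware, and the names `INTERRUPT_REGISTER_FOR`, `SimulatedRegisters`, `set_interrupt_mode` and `bios_init` are hypothetical stand-ins introduced for this example.

```python
from dataclasses import dataclass, field

# Hypothetical lookup from hardware type to the register family named in the
# text: MCA for the CPU and memory, AER for PCIe.
INTERRUPT_REGISTER_FOR = {"cpu": "MCA", "memory": "MCA", "pcie": "AER"}


@dataclass
class SimulatedRegisters:
    """Stand-in for hardware registers: one interrupt-mode flag per register."""

    interrupt_mode: dict = field(default_factory=dict)

    def set_interrupt_mode(self, hardware, enabled):
        # First find the interrupt register corresponding to the target
        # hardware, then set its interrupt mode (mode 1) on or off.
        register = INTERRUPT_REGISTER_FOR[hardware]
        self.interrupt_mode[(hardware, register)] = enabled
        return register


def bios_init(registers, hardware_list):
    """Enable mode 1 (interrupt mode) for each hardware at BIOS initialization."""
    return {hw: registers.set_interrupt_mode(hw, True) for hw in hardware_list}


regs = SimulatedRegisters()
configured = bios_init(regs, ["cpu", "memory", "pcie"])
```

Calling `regs.set_interrupt_mode("cpu", False)` in the same simulation corresponds to switching the interrupt mode of the CPU's register to a disabled state.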

The first system management interrupt signal is triggered by the interrupt register, so that the triggering mode of the first system management interrupt signal can be simplified, and the control and management are convenient.

S202, if the occurrence of the system management interrupt storm is determined according to the generation frequency of the first system management interrupt signal, the triggering of the first system management interrupt signal is forbidden.

After triggering the first system management interrupt signal, the server may detect whether the first system management interrupt signal has reached a system management interrupt storm, and in general, the server may determine whether a system management interrupt storm occurs according to the generation frequency of the first system management interrupt signal, and if the generation frequency of the first system management interrupt signal reaches a preset threshold, may determine that a system management interrupt storm occurs.

In one possible implementation, the server may detect in the handler of the BIOS whether the first system management interrupt signal has reached a system management interrupt storm. In this case, the handler of the BIOS may count the number of CE SMIs received within a preset period of time (e.g., 1 s) to determine the generation frequency of the CE SMI and thus whether an SMI storm has been reached. If the number of CE SMIs received within 1 s reaches a number threshold, it may be determined that an SMI storm has been reached. The number threshold may be set according to service requirements; in the embodiment of the present application, the number threshold may be set to 5.
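
The counting described above can be sketched in Python as follows. The sketch assumes a sliding window ending at the most recent CE SMI; the text does not say whether the preset period of time is counted as a sliding or a fixed window, so this is one possible reading. The names `StormDetector` and `record_ce_smi` are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
from collections import deque


class StormDetector:
    """Counts CE SMIs in a sliding time window (an assumed windowing scheme)."""

    def __init__(self, window_s=1.0, threshold=5):
        self.window_s = window_s    # preset period of time, e.g. 1 s
        self.threshold = threshold  # number threshold, e.g. 5
        self._timestamps = deque()

    def record_ce_smi(self, now):
        """Record one CE SMI at time `now` (seconds); return True on a storm."""
        self._timestamps.append(now)
        # Drop CE SMIs that are older than the window ending at `now`.
        while now - self._timestamps[0] >= self.window_s:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


detector = StormDetector()
burst = [detector.record_ce_smi(t) for t in (0.0, 0.1, 0.2, 0.3, 0.4)]
# Five CE SMIs within 1 s: only the fifth call reports a storm.
```

With CE SMIs spaced 0.5 s apart, the same detector never holds more than two timestamps and therefore never reports a storm.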

If the server determines that the SMI storm is reached, the server needs to suppress the SMI storm in order to reduce the influence of the SMI storm, so in the embodiment of the application, the server may prohibit triggering the first system management interrupt signal. That is, even if the CE continues to be generated on the target hardware of the server, the first system management interrupt signal is not triggered any more.

In some cases, the CE SMI may be triggered by configuring an interrupt mode of the interrupt register, so in one possible implementation, the manner of prohibiting the triggering of the first system management interrupt signal may be to determine an interrupt register corresponding to the target hardware, and further switch the interrupt mode of the interrupt register corresponding to the target hardware to a disabled state.

Taking the CPU as the target hardware as an example, if the handler of the BIOS confirms that an SMI storm has occurred, the interrupt register corresponding to the CPU may be determined first, and the interrupt mode of that register is then configured to a disabled state. That is, mode 1 (interrupt mode) is disabled and the interrupt mode is exited, so that the CE SMI is disabled; even if the CPU continues to generate CEs, the server no longer triggers the CE SMI.

S203, triggering a second system management interrupt signal according to a preset period, and triggering the basic input/output system to scan hardware faults through the second system management interrupt signal.

After the first system management interrupt signal is disabled, the hardware fault may continue to occur. So that fault information can still be reported to the fault management system, the server may trigger the second system management interrupt signal according to a preset period and use it to trigger the basic input/output system to scan for hardware faults, ensuring that fault information continues to be reported and that the server remains operable and maintainable. The preset period may be a length of time set according to service requirements and may be the same as the preset period of time used for storm detection, for example 1 s.

The generation frequency of the second system management interrupt signal is determined by the preset period and is lower than that of the first system management interrupt signal, so the second system management interrupt signal is a periodic low-frequency system management interrupt signal (i.e., a periodic low-frequency SMI). Even if the target hardware continues to generate a large number of correctable errors, very few second system management interrupt signals are generated, so the system management interrupt storm is suppressed while operability and maintainability are preserved.

In the embodiment of the present application, the periodic low-frequency SMI may be triggered in various ways. For example, a periodic low-frequency SMI specific to the target hardware may be used, or the periodic low-frequency SMI may be triggered through a general-purpose input/output port (General Purpose Input/Output Port, GPIO); this is not limited in the embodiment of the present application.

It should be noted that the periodic low-frequency SMI may trigger fault handling in SMI mode 2, i.e., the patrol mode. In this mode, one periodic low-frequency SMI is triggered in each preset period; even if the target hardware generates a large number of CEs, only one periodic low-frequency SMI is triggered within a preset period.
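The difference between the two modes can be illustrated with a small counting sketch. The function names and the sample load are hypothetical and not part of the embodiment; the sketch only assumes, as stated above, that the interrupt mode raises one SMI per correctable error while the patrol mode raises one SMI per preset period.

```python
import math


def smis_in_interrupt_mode(ce_count: int) -> int:
    """Mode 1 (interrupt mode): every correctable error raises one CE SMI."""
    return ce_count


def smis_in_patrol_mode(duration_s: float, preset_period_s: float = 1.0) -> int:
    """Mode 2 (patrol mode): one periodic low-frequency SMI per preset period,
    regardless of how many correctable errors occur within that period."""
    return math.floor(duration_s / preset_period_s)


# Hypothetical load: 5000 correctable errors spread over 10 seconds.
print(smis_in_interrupt_mode(5000))  # 5000
print(smis_in_patrol_mode(10.0))     # 10
```

Under this assumed load the patrol mode generates a number of SMIs that depends only on elapsed time, not on the error rate.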

In one possible implementation, triggering of the periodic low-frequency SMI may be implemented by configuring a hardware register corresponding to the hardware. Typically, a separate register, referred to as a hardware register, may be configured for each piece of hardware in order to trigger the second system management interrupt signal. In this case, triggering the second system management interrupt signal according to the preset period may consist of first determining the hardware register corresponding to the target hardware, and then switching the patrol mode of that hardware register to the enabled state, that is, setting it to mode 2 (patrol mode).

Triggering the second system management interrupt signal through the hardware register simplifies the triggering mechanism and makes it easy to control and manage.
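A minimal sketch of this register-based switch is given below. The `SmiRegisters` class, its field names, and the `dimm0`/`dimm1` identifiers are hypothetical stand-ins for platform registers, which the embodiment does not specify; the sketch only shows the two steps described above (look up the register for the target hardware, then enable its patrol mode).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SmiRegisters:
    """Hypothetical per-hardware register model (not a real register layout)."""
    interrupt_mode_enabled: bool = True   # interrupt register: CE SMI (mode 1)
    patrol_mode_enabled: bool = False     # hardware register: periodic SMI (mode 2)


def enable_patrol_mode(target_hardware: str, bank: Dict[str, SmiRegisters]) -> SmiRegisters:
    """Determine the hardware register of the target, then enable patrol mode."""
    regs = bank[target_hardware]          # step 1: find the register for the target
    regs.patrol_mode_enabled = True       # step 2: switch patrol mode to enabled
    return regs


# Hypothetical register bank with one register set per piece of hardware.
register_bank = {"dimm0": SmiRegisters(), "dimm1": SmiRegisters()}
enable_patrol_mode("dimm0", register_bank)
print(register_bank["dimm0"].patrol_mode_enabled)  # True
print(register_bank["dimm1"].patrol_mode_enabled)  # False
```

Only the register of the target hardware is touched, so other hardware keeps its current mode.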

It should be noted that, in the embodiments of the present application, the second system management interrupt signal may trigger the basic input/output system to scan for hardware faults in various ways. One way is for the periodic low-frequency SMI in the patrol mode to trigger a scan of all hardware faults.

In some cases, the server includes multiple types of hardware, and the scanned hardware faults may come from different hardware. Therefore, to scan hardware faults precisely, another way is to determine the fault register corresponding to the target hardware, and then trigger the basic input/output system, through the second system management interrupt signal, to scan that fault register so as to obtain a scanning result.

Scanning only the fault register corresponding to the target hardware on which the SMI storm occurred means that the scanning result follows directly from whether a hardware fault is found. This saves resources, improves scanning efficiency, and speeds up the subsequent determination.
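The two scanning approaches can be contrasted in a short sketch. The dictionary of fault registers, the hardware identifiers, and both function names are assumptions made for illustration; real fault registers are platform-specific and are not described in the embodiment.

```python
from typing import Dict, List

# Hypothetical fault registers: hardware identifier -> pending correctable-error count.
FAULT_REGISTERS: Dict[str, int] = {"dimm0": 3, "dimm1": 0, "pcie0": 1}


def scan_all_hardware(registers: Dict[str, int]) -> List[Dict[str, object]]:
    """First approach: scan every fault register and return one record per fault."""
    return [
        {"hardware_id": hw_id, "ce_count": count}
        for hw_id, count in registers.items()
        if count > 0
    ]


def scan_target_hardware(registers: Dict[str, int], target_id: str) -> bool:
    """Second approach: scan only the fault register of the target hardware."""
    return registers.get(target_id, 0) > 0


print(scan_all_hardware(FAULT_REGISTERS))
print(scan_target_hardware(FAULT_REGISTERS, "dimm0"))  # True
```

The targeted scan returns a direct yes/no answer, whereas the full scan returns records that still have to be matched against the target hardware.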

S204, if the target hardware is determined to have no hardware fault according to the scanning result, switching the second system management interrupt signal back to the first system management interrupt signal.

The server may determine, according to the scanning result, whether a hardware fault exists on the target hardware. If no hardware fault exists on the target hardware, the server may switch the second system management interrupt signal back to the first system management interrupt signal. If a hardware fault does exist on the target hardware, the server keeps triggering the second system management interrupt signal according to the preset period and reports the fault information corresponding to the scanned hardware fault to the fault management system.

When triggering of the CE SMI is implemented by configuring the interrupt register corresponding to the hardware fault, and triggering of the periodic low-frequency SMI is implemented by configuring the hardware register corresponding to the hardware, the second system management interrupt signal may be switched back to the first system management interrupt signal as follows: the patrol mode of the hardware register corresponding to the target hardware is switched from the enabled state to the disabled state, thereby exiting the patrol mode, and the interrupt mode of the interrupt register corresponding to the target hardware is switched from the disabled state to the enabled state, thereby returning to the interrupt mode.

Specifically, when the periodic low-frequency SMI in the patrol mode triggers a scan for hardware faults, if a hardware fault is still found on the target hardware, the hardware fault of the target hardware has not recovered; the current patrol mode of the periodic low-frequency SMI is maintained, and the fault information is reported to the BMC and the OS. If no hardware fault is found on the target hardware, the hardware fault of the target hardware has recovered; the patrol mode (i.e., mode 2) is exited, and the interrupt mode (i.e., mode 1) is re-enabled.
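The stay-or-switch-back decision taken on each periodic low-frequency SMI can be sketched as follows. The `TargetHardwareState` class, its fields, and the fault string are hypothetical; reporting to the BMC and the OS is modelled simply as appending to a list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TargetHardwareState:
    """Hypothetical per-hardware state; field names are illustrative only."""
    patrol_mode_enabled: bool = True       # mode 2 is active after the storm
    interrupt_mode_enabled: bool = False   # mode 1 was disabled in S202
    reported_faults: List[str] = field(default_factory=list)


def handle_periodic_smi(state: TargetHardwareState, scanned_fault: str = "") -> str:
    """Process one patrol-mode scan; an empty string means no fault was found."""
    if scanned_fault:
        # Fault not recovered: stay in patrol mode and report the fault information.
        state.reported_faults.append(scanned_fault)
        return "patrol"
    # Fault recovered: exit patrol mode and return to interrupt mode.
    state.patrol_mode_enabled = False
    state.interrupt_mode_enabled = True
    return "interrupt"


state = TargetHardwareState()
print(handle_periodic_smi(state, "dimm0: correctable ECC error"))  # patrol
print(handle_periodic_smi(state))                                   # interrupt
```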

It should be noted that the way of determining, according to the scanning result, whether a hardware fault exists on the target hardware may differ depending on how the basic input/output system scans for hardware faults. If the periodic low-frequency SMI triggers the basic input/output system to scan the fault register corresponding to the target hardware, then a hardware fault is determined to exist on the target hardware whenever a hardware fault is scanned, and no hardware fault is determined to exist on the target hardware whenever none is scanned.

If the periodic low-frequency SMI triggers the basic input/output system to scan all hardware faults, whether a hardware fault exists on the target hardware may be determined from the scanning result as follows. If the scanning result indicates that no hardware fault was scanned, it is determined that no hardware fault exists on the target hardware. If the scanning result indicates that a hardware fault was scanned, the hardware identifier included in the scanning result is further obtained; if that hardware identifier is inconsistent with the hardware identifier of the target hardware, the scanned hardware fault was not generated by the target hardware, and it is therefore determined that no hardware fault exists on the target hardware. If the scanning result indicates that a hardware fault was scanned and the hardware identifier in the scanning result is consistent with the hardware identifier of the target hardware, a hardware fault exists on the target hardware.
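The determination for the scan-all case reduces to matching hardware identifiers, as in the sketch below. The record layout (a list of dictionaries with a `hardware_id` key) and the identifiers are assumptions for illustration; the embodiment only requires that the scanning result carries a hardware identifier.

```python
from typing import Dict, List


def target_has_fault(scan_result: List[Dict[str, str]], target_id: str) -> bool:
    """Return True only if some scanned fault carries the target's hardware identifier.

    An empty scan result (no fault scanned) and a result containing only faults
    from other hardware both mean that the target hardware has no fault.
    """
    return any(record["hardware_id"] == target_id for record in scan_result)


print(target_has_fault([], "dimm0"))                                   # False
print(target_has_fault([{"hardware_id": "pcie0"}], "dimm0"))           # False
print(target_has_fault([{"hardware_id": "pcie0"},
                        {"hardware_id": "dimm0"}], "dimm0"))           # True
```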

In the process of periodically scanning for hardware faults, distinguishing the scanned hardware faults improves the accuracy of determining whether a hardware fault exists on the target hardware, so that switching between the two SMIs can be controlled more precisely.

Referring to fig. 3, which shows a schematic diagram of the storm suppression effect: in seconds 1-3, the server generates an SMI storm due to CEs caused by a hardware fault (fig. 3 takes 20 SMIs generated within 1 s as an example of reaching an SMI storm), and the server decides to disable the CE SMI once the SMI storm occurs. In seconds 4-7, the periodic low-frequency SMI is triggered, and the periodic low-frequency SMI triggers the BIOS to scan for hardware faults; the preset period of the periodic low-frequency SMI in fig. 3 is 1 s, that is, one periodic low-frequency SMI is generated every 1 s. In seconds 8-9, the periodic low-frequency SMI no longer finds a hardware fault, indicating that the hardware fault has recovered; the BIOS disables the periodic low-frequency SMI and re-enables the CE SMI, and because the hardware has returned to a healthy state, the server as a whole generates 0 SMIs at this point. Disabling the CE SMI when an SMI storm occurs therefore greatly reduces the number of SMIs generated afterwards, which suppresses the SMI storm and improves server performance. Meanwhile, the periodic low-frequency SMI can still trigger the BIOS to scan for hardware faults and continuously report fault information, so that the server remains continuously operable and maintainable.
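The scale of the reduction in the fig. 3 walk-through can be worked out with simple arithmetic. The per-second figures below are taken from the description of fig. 3; the "without suppression" baseline is an assumption added here for comparison (the CE SMI stays enabled and the fault keeps producing 20 correctable errors per second until it recovers after second 7).

```python
# Figures from the fig. 3 walk-through.
STORM_SMIS_PER_SECOND = 20   # example storm rate used in fig. 3
storm_seconds = 3            # seconds 1-3: CE SMI storm
patrol_seconds = 4           # seconds 4-7: one periodic low-frequency SMI per second
recovered_seconds = 2        # seconds 8-9: fault recovered, 0 SMIs

with_suppression = (
    storm_seconds * STORM_SMIS_PER_SECOND
    + patrol_seconds * 1
    + recovered_seconds * 0
)

# Assumed baseline (not from the source): no suppression during seconds 1-7.
without_suppression = (storm_seconds + patrol_seconds) * STORM_SMIS_PER_SECOND

print(with_suppression, without_suppression)  # 64 140
```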

Based on the description of the above embodiments, to balance the performance and the operability and maintainability of the computer device, the embodiments of the present application dynamically switch between the two modes while the computer device is running. For details, see fig. 4:

S401, generating, by the target hardware, a correctable error based on a hardware fault.

S402, enabling CE SMIs and entering an interrupt mode.

S403, determining whether an SMI storm is generated, if yes, executing S404, and if not, executing S402.

S404, disabling CE SMI and exiting the interrupt mode.

S405, triggering a periodic low-frequency SMI, and entering a patrol mode.

S406, triggering, by the periodic low-frequency SMI, a scan for hardware faults on the target hardware.

S407, reporting the fault information to a fault management system.

S408, determining whether the hardware fault on the target hardware is recovered, if so, executing S402, and if not, executing S406.
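The S401-S408 flow can be sketched as a two-state loop evaluated once per preset period. This is an illustrative simulation, not the patented implementation: the threshold of 20, the per-period error counts, and the simplification that a storm is detected within a single period are all assumptions.

```python
from typing import List, Tuple

STORM_THRESHOLD = 20  # hypothetical: CE SMIs per period that count as a storm


def simulate(ce_per_period: List[int],
             threshold: int = STORM_THRESHOLD) -> List[Tuple[str, int, bool]]:
    """Return (mode, smi_count, fault_reported) for each preset period."""
    mode = "interrupt"                       # S402: CE SMI enabled
    timeline = []
    for ce_count in ce_per_period:           # S401: correctable errors occur
        if mode == "interrupt":
            smis = ce_count                  # one CE SMI per correctable error
            timeline.append((mode, smis, ce_count > 0))
            if smis >= threshold:            # S403: SMI storm detected
                mode = "patrol"              # S404 and S405: switch modes
        else:
            fault_present = ce_count > 0     # S406: one periodic SMI triggers a scan
            timeline.append((mode, 1, fault_present))  # S407: report if a fault is found
            if not fault_present:            # S408: fault has recovered
                mode = "interrupt"           # back to S402
    return timeline


for entry in simulate([25, 30, 30, 0, 0]):
    print(entry)
```

With the assumed input, the first period runs in the interrupt mode with 25 SMIs, the next three periods run in the patrol mode with one SMI each, and the last period is back in the interrupt mode with no SMI.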

The embodiments of the present application balance the performance and the operability and maintainability of the computer device: system stalls or performance degradation caused by SMI storms are avoided so that the device keeps running, while the periodic low-frequency SMI triggers scans for hardware faults and the fault information is reported to the BMC and the OS so that the device remains continuously operable and maintainable.

The embodiment corresponding to fig. 2 implements storm suppression mainly by switching between the CE SMI and the periodic low-frequency SMI. In one possible implementation, the periodic low-frequency SMI may be replaced by periodic scanning actively performed by the fault management system. To this end, an embodiment of the present application further provides another method for suppressing a system management interrupt storm. Referring to fig. 5, the method includes:

S501, when the target hardware generates a correctable error based on hardware faults, triggering a first system management interrupt signal.

The first system management interrupt signal is a system management interrupt signal generated based on a correctable error and is used for triggering the basic input output system to report fault information, and each time a correctable error occurs, the first system management interrupt signal is generated.

S502, if the occurrence of the system management interrupt storm is determined according to the generation frequency of the first system management interrupt signal, the triggering of the first system management interrupt signal is forbidden.

For S501-S502, refer to the description of S201-S202; details are not repeated here.

S503, scanning hardware faults according to a preset period through a fault management system.

After a hardware fault occurs, the CE SMI triggers fault handling and the BIOS reports the fault information to the fault management system. To suppress the SMI storm, once the CE SMI is disabled the fault management system can instead actively scan for hardware faults. The fault management system thus obtains the fault information without any CE SMI being generated, which preserves operability and maintainability while achieving the storm suppression effect.

The fault management system may be a small operating system used to facilitate remote management, installation, restarting, and similar operations on the server; for example, it may be at least one of the BMC or the OS.

S504, if the fact that the hardware fault does not exist on the target hardware is determined according to the scanning result, enabling the first system management interrupt signal.

The computer device may determine, according to the scanning result, whether a hardware fault exists on the target hardware. If no hardware fault exists on the target hardware, the computer device may enable the first system management interrupt signal. If a hardware fault does exist on the target hardware, the fault management system continues to actively scan for hardware faults so as to obtain the fault information.
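The S503-S504 loop of this variant can be sketched as active polling by the fault management system. The `poll_faults` function, the `scan` callback, and the fault strings are hypothetical; waiting for one preset period between scans is omitted so that the sketch runs instantly.

```python
from typing import Callable, List, Tuple


def poll_faults(scan: Callable[[], List[str]],
                max_periods: int) -> Tuple[bool, List[str]]:
    """Scan once per preset period; return (ce_smi_enabled, collected_fault_info)."""
    collected: List[str] = []
    for _ in range(max_periods):
        faults = scan()                  # S503: active scan, no SMI is generated
        if not faults:
            return True, collected       # S504: no fault left, re-enable the CE SMI
        collected.extend(faults)         # keep the fault information
        # A real implementation would wait one preset period here.
    return False, collected              # fault still present: CE SMI stays disabled


# Hypothetical scan results: the fault is seen twice, then has recovered.
results = iter([["dimm0: CE"], ["dimm0: CE"], []])
enabled, info = poll_faults(lambda: next(results), max_periods=5)
print(enabled, info)  # True ['dimm0: CE', 'dimm0: CE']
```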

It should be noted that the implementations provided in the above aspects may be further combined to provide additional implementations.

Based on the method for suppressing the system management interrupt storm provided in the corresponding embodiment of fig. 2, the embodiment of the application further provides a device 600 for suppressing the system management interrupt storm. Referring to fig. 6, the system management interrupt storm suppressing apparatus 600 includes a triggering unit 601, a prohibiting unit 602, a scanning unit 603, and a switching unit 604:

The triggering unit 601 is configured to trigger a first system management interrupt signal when the target hardware generates a correctable error based on a hardware fault, where the first system management interrupt signal is a system management interrupt signal generated based on the correctable error and is used to trigger the basic input/output system to report fault information, and one first system management interrupt signal is generated each time a correctable error occurs;

the prohibiting unit 602 is configured to prohibit triggering the first system management interrupt signal if it is determined that a system management interrupt storm occurs according to the generation frequency of the first system management interrupt signal;

the scanning unit 603 is configured to trigger a second system management interrupt signal according to a preset period, and trigger the bios to scan for a hardware fault through the second system management interrupt signal, where a frequency of generating the second system management interrupt signal is less than a frequency of generating the first system management interrupt signal, and the frequency of generating the second system management interrupt signal is determined based on the preset period;

the switching unit 604 is configured to switch the second system management interrupt signal back to the first system management interrupt signal if it is determined that there is no hardware failure on the target hardware according to the scan result.

In one possible implementation manner, the prohibiting unit 602 is specifically configured to:

determining an interrupt register corresponding to the target hardware;

and switching the interrupt mode of the interrupt register corresponding to the target hardware into a disabled state.

In a possible implementation manner, the scanning unit 603 is specifically configured to:

determining a hardware register corresponding to the target hardware;

and switching the patrol mode of the hardware register corresponding to the target hardware into an enabling state.

In one possible implementation manner, the switching unit 604 is specifically configured to:

switching the patrol mode of the hardware register corresponding to the target hardware from an enabling state to a disabling state;

and switching the interrupt mode of the interrupt register corresponding to the target hardware from the disabled state to the enabled state.

In one possible implementation manner, the switching unit 604 is specifically configured to:

if the scanning result indicates that the hardware fault is not scanned, determining that the hardware fault does not exist on the target hardware;

or if the scanning result indicates that the hardware fault is scanned, acquiring a hardware identifier included in the scanning result;

and if the hardware identification in the scanning result is inconsistent with the hardware identification of the target hardware, determining that no hardware fault exists on the target hardware.

In one possible implementation manner, the scanning unit 603 is specifically configured to:

determining a fault register corresponding to the target hardware;

and triggering the basic input/output system to scan the fault register corresponding to the target hardware through the second system management interrupt signal so as to obtain the scanning result.

In one possible implementation manner, the apparatus further includes a reporting unit:

the reporting unit is configured to, if it is determined according to the scanning result that a hardware fault exists on the target hardware, keep triggering the second system management interrupt signal according to the preset period and report the fault information corresponding to the scanned hardware fault to the fault management system.

Based on the method for suppressing the system management interrupt storm provided in the corresponding embodiment of fig. 5, the embodiment of the application further provides a device 700 for suppressing the system management interrupt storm. Referring to fig. 7, the suppression device 700 for the system management interrupt storm includes a triggering unit 701, a prohibiting unit 702, a scanning unit 703, and an enabling unit 704:

the triggering unit 701 is configured to trigger a first system management interrupt signal when the target hardware generates a correctable error based on a hardware fault, where the first system management interrupt signal is a system management interrupt signal generated based on the correctable error and is used to trigger the basic input/output system to report fault information, and one first system management interrupt signal is generated each time a correctable error occurs;

the prohibiting unit 702 is configured to prohibit triggering the first system management interrupt signal if it is determined that a system management interrupt storm occurs according to the generation frequency of the first system management interrupt signal;

the scanning unit 703 is configured to scan, by using the fault management system, for hardware faults according to a preset period;

the enabling unit 704 is configured to enable the first system management interrupt signal if it is determined that there is no hardware failure on the target hardware according to the scan result.

The embodiment of the application also provides computer equipment which can execute the method for suppressing the system management interrupt storm. The computer device may be, for example, a terminal, taking a smart phone as an example:

Fig. 8 is a block diagram illustrating part of the structure of a smartphone according to an embodiment of the present application. Referring to fig. 8, the smartphone includes a radio frequency (RF) circuit 810, a memory 820, an input unit 830, a display unit 840, a sensor 850, an audio circuit 860, a wireless fidelity (WiFi) module 870, a processor 880, and a power supply 890. The input unit 830 may include a touch panel 831 and other input devices 832, the display unit 840 may include a display panel 841, and the audio circuit 860 may include a speaker 861 and a microphone 862. It will be appreciated that the structure shown in fig. 8 does not limit the smartphone, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

The memory 820 may be used to store software programs and modules, and the processor 880 performs various functional applications and data processing of the smart phone by running the software programs and modules stored in the memory 820. The memory 820 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and a storage data area; the storage data area may store data (such as audio data, phonebooks, etc.) created according to the use of the smart phone, etc. In addition, memory 820 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

The processor 880 is the control center of the smartphone. It connects the various parts of the entire smartphone using various interfaces and lines, and performs the various functions of the smartphone and processes data by running or executing the software programs and/or modules stored in the memory 820 and calling the data stored in the memory 820. Optionally, the processor 880 may include one or more processing units; preferably, the processor 880 may integrate an application processor, which primarily handles the operating system, user interfaces, applications, and the like, with a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor may also not be integrated into the processor 880.

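In this embodiment, the processor 880 in the smartphone may perform the steps of the method for suppressing a system management interrupt storm provided in the embodiment corresponding to fig. 2 or, alternatively, the steps of the method provided in the embodiment corresponding to fig. 5.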

The computer device provided in the embodiments of the present application may also be a server. Fig. 9 is a block diagram of a server 900 provided in an embodiment of the present application. The server 900 may vary considerably depending on its configuration or performance, and may include one or more processors, such as central processing units (CPUs) 922, a memory 932, and one or more storage media 930 (such as one or more mass storage devices) storing application programs 942 or data 944. The memory 932 and the storage medium 930 may provide transitory or persistent storage. The program stored in the storage medium 930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processor 922 may be configured to communicate with the storage medium 930 and to execute, on the server 900, the series of instruction operations in the storage medium 930.

The server 900 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input/output interfaces 958, and/or one or more operating systems 941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

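In this embodiment, the central processor 922 in the server 900 may perform the steps of the method for suppressing a system management interrupt storm provided in the embodiment corresponding to fig. 2 or, alternatively, the steps of the method provided in the embodiment corresponding to fig. 5.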

According to an aspect of the present application, there is provided a computer readable storage medium for storing a program code for executing the system management interrupt storm suppressing method according to the foregoing embodiments.

According to one aspect of the present application, there is provided a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device performs the methods provided in the various alternative implementations of the above embodiments.

The descriptions of the processes or structures corresponding to the drawings each have their own emphasis; for parts of a process or structure that are not described in detail, refer to the related descriptions of other processes or structures.

The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a computer, a server, or a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing a computer program.

The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A method for suppressing a system management interrupt storm, the method comprising:

2. The method of claim 1, wherein the disabling the triggering of the first system management interrupt signal comprises:

determining an interrupt register corresponding to the target hardware;

3. The method of claim 2, wherein triggering the second system management interrupt signal according to the preset period comprises:

determining a hardware register corresponding to the target hardware;

4. A method according to claim 3, wherein said switching said second system management interrupt signal back to said first system management interrupt signal comprises:

5. The method of claim 1, wherein determining that there is no hardware failure on the target hardware based on the scan result comprises:

6. The method of claim 1, wherein triggering the basic input/output system to scan for hardware faults through the second system management interrupt signal comprises:

determining a fault register corresponding to the target hardware;

7. The method according to any one of claims 1-6, further comprising:

if the hardware fault exists on the target hardware according to the scanning result, triggering the second system management interrupt signal according to the preset period, and reporting the fault information corresponding to the scanned hardware fault to a fault management system.

8. A method for suppressing a system management interrupt storm, the method comprising:

9. A suppression device for a system management interrupt storm, which is characterized by comprising a triggering unit, a prohibiting unit, a scanning unit and a switching unit:

10. The apparatus according to claim 9, wherein the disabling unit is specifically configured to:

determining an interrupt register corresponding to the target hardware;

11. The apparatus according to claim 10, wherein the scanning unit is specifically configured to:

determining a hardware register corresponding to the target hardware;

12. A suppression device for a system management interrupt storm, which is characterized by comprising a triggering unit, a prohibiting unit, a scanning unit and an enabling unit:

13. A computer device, the computer device comprising a processor and a memory:

the processor is configured to perform the method of any of claims 1-8 according to instructions in the program code.

14. A computer readable storage medium for storing program code which, when executed by a processor, causes the processor to perform the method of any of claims 1-8.

15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any of claims 1-8.