CN115858211A - Method and device for processing machine check errors - Google Patents

Publication number
CN115858211A
Authority
CN
China
Prior art keywords
machine check
cpu
register
error
check error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211469570.6A
Other languages
Chinese (zh)
Inventor
崔毕轩
毛文安
薛帅
曾勇
王志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202211469570.6A
Publication of CN115858211A

Landscapes

  • Detection And Correction Of Errors (AREA)

Abstract

One or more embodiments of the present specification provide a method and an apparatus for processing a machine check error, applied to an operating system running on an electronic device, where a target CPU mounted on the electronic device supports a machine check architecture. The method comprises: reading data stored in a machine check register in response to a machine check exception, wherein the machine check exception is triggered by an uncorrectable machine check error detected by the target CPU; determining whether the machine check error is a recoverable machine check error based on a preset machine check error classification rule and the read data; if the machine check error is determined to be a recoverable machine check error, further determining whether the recovery policy corresponding to the machine check error is to isolate the CPU; and if the recovery policy corresponding to the machine check error is to isolate the CPU, performing isolation processing on the target CPU.

Description

Method and device for processing machine check errors
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method and an apparatus for processing a machine check error.
Background
Cloud computing underpins fields such as the Internet, the Internet of Things, artificial intelligence, and big data. As these fields develop, new cloud computing applications and services keep emerging and changing rapidly. As a result, cloud computing platforms keep growing in scale, more and more servers are needed to host them, and the operational stability of those servers becomes increasingly important.
Beyond servers in cloud computing scenarios, the operational stability of electronic devices such as servers and personal computers in non-cloud scenarios is also very important. At the hardware level these electronic devices are usually equipped with a CPU (Central Processing Unit), and at the software level they run an operating system (OS); the operating system may be translated into executable instructions for execution by the CPU and may manage memory space, networks, and the like.
During the operation of an electronic device, hardware errors are one of the major causes of downtime. How to handle hardware errors in the electronic device so as to improve its operational stability is therefore a problem that urgently needs to be solved.
Disclosure of Invention
One or more embodiments of the present disclosure provide the following:
The present specification provides a method for processing a machine check error, applied to an operating system running on an electronic device, where a target CPU mounted on the electronic device supports a machine check architecture; the method comprises:
reading data stored in a machine check register in response to a machine check exception; wherein the machine check exception is triggered by an uncorrectable machine check error detected by the target CPU;
determining whether the machine check error is a recoverable machine check error based on a preset machine check error classification rule and the read data;
if the machine check error is determined to be a recoverable machine check error, further determining whether the recovery policy corresponding to the machine check error is to isolate the CPU;
and if the recovery policy corresponding to the machine check error is to isolate the CPU, performing isolation processing on the target CPU.
The present specification also provides an apparatus for processing a machine check error, applied to an operating system running on an electronic device, where a target CPU mounted on the electronic device supports a machine check architecture; the apparatus comprises:
a reading module, configured to read data stored in a machine check register in response to a machine check exception; wherein the machine check exception is triggered by an uncorrectable machine check error detected by the target CPU;
a first determination module, configured to determine whether the machine check error is a recoverable machine check error based on a preset machine check error classification rule and the read data;
a second determination module, configured to further determine whether the recovery policy corresponding to the machine check error is to isolate the CPU if the machine check error is determined to be a recoverable machine check error;
and a processing module, configured to perform isolation processing on the target CPU if the recovery policy corresponding to the machine check error is to isolate the CPU.
The present specification also provides an electronic device comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the steps of the method as described in any one of the above by executing the executable instructions.
The present specification also provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, carry out the steps of the method according to any one of the preceding claims.
In the above technical solution, a machine check exception may be triggered when a CPU supporting a machine check architecture detects an uncorrectable machine check error. In response to the machine check exception, the operating system may read the data stored in the machine check register and determine, based on the read data, whether the machine check error is a recoverable machine check error. If so, the operating system may further determine whether the recovery policy corresponding to the machine check error is to isolate the CPU, and if so, may perform isolation processing on the CPU.
In this way, for machine check errors that are uncorrectable but recoverable, a recovery policy of isolating the CPU is added on top of the general recovery policy. That is, besides performing the general recovery operation for one subset of specific machine check errors in order to keep running, the operating system can also perform a CPU isolation operation for another subset of specific machine check errors in order to keep running. This increases the types of machine check errors for which the operating system can keep running by performing a recovery operation, reduces the probability of downtime, and thus improves the operational stability of the electronic device.
Drawings
FIG. 1 is a schematic diagram of a machine check register, as shown in an exemplary embodiment of the present description.
Fig. 2 is a schematic diagram of an IA32_MCi_STATUS register shown in an exemplary embodiment of the present description.
FIG. 3 is a flow chart illustrating a method for processing a machine check error in accordance with an exemplary embodiment of the present disclosure.
FIG. 4 is a flowchart illustrating a method for determining a recovery policy in accordance with an exemplary embodiment of the present description.
Fig. 5 is a hardware configuration diagram of an electronic device in which an apparatus for processing machine check errors is located, according to an exemplary embodiment of the present disclosure.
Fig. 6 is a block diagram of an apparatus for processing machine check errors according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
In practical applications, a machine check architecture and a machine check exception mechanism may be employed to handle hardware errors in an electronic device.
A machine-check architecture provides mechanisms to detect and report hardware (i.e., machine) errors, such as system bus errors, ECC errors, parity errors, cache errors, and TLB errors. The machine check architecture includes a set of model-specific registers (MSRs) used to set up machine checking, and an additional set of MSRs used to record detected errors.
The processor signals the detection of an uncorrected machine-check error by generating a machine-check exception (#MC), which is an abort-class exception. Implementations of the machine check architecture generally do not allow the processor to be reliably restarted after a machine check exception is generated. However, a machine check exception handler (typically a program in the operating system) may collect information related to the machine check error from the machine check registers.
Taking an electronic device with a Pentium 4, Intel Atom, Intel Xeon, or P6 family processor as an example, as shown in fig. 1, the machine check registers in the electronic device include a set of global control registers (global control MSRs) and several error-reporting register banks. Each error-reporting register bank is associated with a particular hardware unit or group of hardware units in the processor.
The global control registers may include an IA32_MCG_CAP register, an IA32_MCG_STATUS register, an IA32_MCG_CTL register, and an IA32_MCG_EXT_CTL register. Wherein:
The IA32_MCG_CAP register is a read-only register that provides information related to the machine check architecture.
The IA32_MCG_STATUS register describes the current state of the processor after the machine check exception occurs.
If the capability flag MCG_CTL_P is set in the IA32_MCG_CAP register, the IA32_MCG_CTL register exists. The IA32_MCG_CTL register controls the reporting of machine check exceptions; if it exists, writing all 1s to the register enables the machine check function and writing all 0s disables the machine check function.
If the capability flag MCG_LMCE_P is set in the IA32_MCG_CAP register, the IA32_MCG_EXT_CTL register exists. The LMCE_EN flag in IA32_MCG_EXT_CTL allows a processor to send some machine check error signals to only a single logical processor. Any attempt to write or read the IA32_MCG_EXT_CTL register results in #GP (a general-protection exception) if MCG_LMCE_P is not set in the IA32_MCG_CAP register, or if LMCE has not been enabled by setting the LMCE_ENABLED flag in the IA32_FEATURE_CONTROL register.
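The existence rules for the optional global control registers can be illustrated with a small decoder. This is only a sketch: the function name `decode_mcg_cap` is hypothetical, and the bit positions for the bank count (bits 7:0), MCG_CTL_P (bit 8), and MCG_LMCE_P (bit 27) are assumed from the Intel SDM layout rather than stated in this specification.

```python
# Assumed IA32_MCG_CAP layout (from the Intel SDM, not from this specification).
MCG_COUNT_MASK = 0xFF   # bits 7:0, number of error-reporting register banks
MCG_CTL_P = 1 << 8      # set: IA32_MCG_CTL exists
MCG_LMCE_P = 1 << 27    # set: IA32_MCG_EXT_CTL exists


def decode_mcg_cap(cap):
    """Report which optional global control registers exist (hypothetical helper)."""
    return {
        "bank_count": cap & MCG_COUNT_MASK,
        "has_mcg_ctl": bool(cap & MCG_CTL_P),
        "has_mcg_ext_ctl": bool(cap & MCG_LMCE_P),
    }


# Example value with both capability flags set and 10 banks.
example = decode_mcg_cap(MCG_LMCE_P | MCG_CTL_P | 10)
```

A handler would consult such a result before touching IA32_MCG_CTL or IA32_MCG_EXT_CTL, since accessing a register that does not exist raises #GP.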
Each error-reporting register bank may include an IA32_MCi_CTL register, an IA32_MCi_STATUS register, an IA32_MCi_ADDR register, and an IA32_MCi_MISC register. Wherein:
The IA32_MCi_CTL register controls the #MC corresponding to errors generated by a particular hardware unit or group of hardware units. Each of its 64 flags represents a potential error. Setting a flag enables the #MC corresponding to the error represented by the flag, and clearing the flag disables that #MC.
If the VAL flag in the IA32_MCi_STATUS register is set, the IA32_MCi_STATUS register is considered valid, i.e., the IA32_MCi_STATUS register contains information related to a machine check error. The flags in the IA32_MCi_STATUS register are shown in fig. 2.
If the ADDRV flag in the IA32_MCi_STATUS register is set, the IA32_MCi_ADDR register contains the address of the code or data storage unit that generated the machine check error. If the ADDRV flag in the IA32_MCi_STATUS register is cleared, the IA32_MCi_ADDR register is either not implemented or contains no such address.
If the MISCV flag in the IA32_MCi_STATUS register is set, the IA32_MCi_MISC register contains additional information describing the machine check error. If the MISCV flag in the IA32_MCi_STATUS register is cleared, the IA32_MCi_MISC register is either not implemented or contains no such additional information.
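The validity flags described above decide which registers of a bank carry meaningful data. The following sketch shows one way to express that; the helper name `usable_bank_registers` is hypothetical, and the positions of MISCV (bit 59) and ADDRV (bit 58) are assumed from the Intel SDM, since fig. 2 is not reproduced here.

```python
VAL = 1 << 63    # IA32_MCi_STATUS holds valid error information
MISCV = 1 << 59  # assumed position: IA32_MCi_MISC holds additional information
ADDRV = 1 << 58  # assumed position: IA32_MCi_ADDR holds the error address


def usable_bank_registers(status):
    """Return the registers of a bank that hold meaningful data (hypothetical helper)."""
    if not status & VAL:
        return []  # nothing logged in this bank
    registers = ["STATUS"]
    if status & ADDRV:
        registers.append("ADDR")
    if status & MISCV:
        registers.append("MISC")
    return registers
```

Checking VAL first mirrors the text: the other flags are only interpreted once the status register itself is known to be valid.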
Among uncorrectable machine check errors, a subset can be further classified as recoverable machine check errors. Such machine check errors are usually referred to as UCR errors (uncorrected recoverable machine-check errors).
Recovery from UCR errors is an enhancement to the machine check architecture that allows the operating system to perform recovery operations and continue running for certain specific uncorrectable machine check errors.
The MCG_SER_P flag (bit 24) in the IA32_MCG_CAP register is typically used to detect whether the processor supports error recovery by the operating system. If the MCG_SER_P flag in the IA32_MCG_CAP register is set, the processor supports error recovery by the operating system. If the MCG_SER_P flag in the IA32_MCG_CAP register is cleared, the processor does not support error recovery by the operating system, and the main responsibility of the machine check handler is to record machine check error information and shut down the system. In general, shutting down the system also means downtime.
UCR errors are uncorrectable machine check errors that have been detected and signaled but have not corrupted the processor context. For certain UCR errors, the operating system may continue to run on the processor once it has performed a recovery operation. The machine check handler uses the error log information in the error-reporting registers to analyze and carry out the specific error recovery operation corresponding to the UCR error.
The IA32_MCi_STATUS MSR is used to report UCR errors as well as the existing corrected and uncorrected errors.
With the MCG_SER_P flag in the IA32_MCG_CAP register set, the following flag values in the IA32_MCi_STATUS register indicate a UCR error: VAL flag (bit 63) = 1, UC flag (bit 61) = 1, PCC flag (bit 57) = 0.
When the ADDRV flag and the MISCV flag in the IA32_MCi_STATUS register are set, additional information about the UCR error stored in the IA32_MCi_MISC register and the IA32_MCi_ADDR register is available. The machine check architecture error code field stored in the IA32_MCi_STATUS register indicates the type of the UCR error. The operating system may parse the machine check architecture error code field to analyze and identify the necessary recovery operation corresponding to a given UCR error.
In addition, the S flag (bit 56) and the AR flag (bit 55) in the IA32_MCi_STATUS register may provide additional information to help the operating system correctly identify the necessary recovery operation corresponding to a UCR error.
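The UCR test just described is a pure bit check, which the following minimal sketch expresses using the bit positions given in the text. The function name `is_ucr_error` is hypothetical.

```python
MCG_SER_P = 1 << 24  # IA32_MCG_CAP: software error recovery supported
VAL = 1 << 63        # IA32_MCi_STATUS: register contents are valid
UC = 1 << 61         # IA32_MCi_STATUS: error was not corrected
PCC = 1 << 57        # IA32_MCi_STATUS: processor context corrupt


def is_ucr_error(mcg_cap, status):
    """Return True if the bank reports an uncorrected recoverable (UCR) error."""
    if not mcg_cap & MCG_SER_P:
        return False  # processor does not support software error recovery
    return bool(status & VAL) and bool(status & UC) and not (status & PCC)
```

For instance, `is_ucr_error(MCG_SER_P, VAL | UC)` holds, while setting PCC in the status value or clearing MCG_SER_P in the capability value makes the check fail.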
The types of UCR errors are shown in table 1 below:
[Table image: Figure BDA0003957990760000051]
TABLE 1
Wherein: UC denotes an uncorrected error. CE denotes a corrected error. UCNA denotes uncorrected no action required, a UCR error that is not signaled by a machine check exception but is reported to the operating system as a corrected machine check error. SRAO denotes software recoverable action optional, a UCR error that is signaled by a machine check exception or a CMCI (corrected machine-check error interrupt). SRAR denotes software recoverable action required, a UCR error that requires the operating system to perform a recovery operation on the processor before another execution stream is scheduled on the processor.
Taking an SRAR-type UCR error as an example, the machine check error address (i.e., the address of the code or data storage unit that generated the machine check error) must be obtained before the operating system can perform the recovery operation corresponding to the UCR error.
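As a rough illustration of how the S and AR flags can separate these types, the sketch below classifies a valid IA32_MCi_STATUS value. The flag combinations are assumptions based on a reading of the Intel SDM (SRAR: S = 1 and AR = 1; SRAO: S = 1 and AR = 0; UCNA: S = 0 and AR = 0), because Table 1 is only available as an image; the function name `classify` is hypothetical, and cases that additionally depend on the machine check architecture error code field are not covered.

```python
UC = 1 << 61   # error was not corrected
PCC = 1 << 57  # processor context corrupt
S = 1 << 56    # error was signaled by a machine check exception
AR = 1 << 55   # recovery action required


def classify(status):
    """Classify a valid IA32_MCi_STATUS value (assumed flag combinations)."""
    if not status & UC:
        return "CE"    # corrected error
    if status & PCC:
        return "UC"    # uncorrected and not recoverable
    if (status & S) and (status & AR):
        return "SRAR"  # software recoverable action required
    if status & S:
        return "SRAO"  # software recoverable action optional
    return "UCNA"      # uncorrected no action required
```

The order of the checks matters: PCC is tested before S and AR because those two flags are only meaningful for errors that left the processor context intact.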
The present specification aims to provide a technical solution for processing machine check errors that optimizes the handling of UCR errors. In this technical solution, a machine check exception may be triggered when a CPU supporting a machine check architecture detects an uncorrectable machine check error. In response to the machine check exception, the operating system may read the data stored in the machine check register and determine, based on the read data, whether the machine check error is a recoverable machine check error. If so, the operating system may further determine whether the recovery policy corresponding to the machine check error is to isolate the CPU, and if so, may perform isolation processing on the CPU.
In a specific implementation, in an electronic device that is equipped with a target CPU supporting a machine check architecture and runs an operating system, the target CPU may generate a machine check exception corresponding to a detected uncorrectable machine check error and report the machine check exception to the operating system. The operating system may run a machine check exception handler corresponding to the machine check exception in response to the machine check exception. Specifically, the operating system may first read the data stored in the machine check register.
The operating system may then determine whether the machine check error is a recoverable machine check error, i.e., whether the machine check error is a UCR error, based on predetermined machine check error classification rules and data read from the machine check register. Accordingly, the machine check error classification rule may be a rule for determining whether an uncorrectable machine check error is a UCR error.
If the machine check error is a recoverable machine check error, the operating system may further determine whether the recovery policy corresponding to the machine check error is to isolate the CPU.
When the recovery policy corresponding to the machine check error is to isolate the CPU, the CPU that reported the machine check error is the target CPU, so the operating system can perform isolation processing on the target CPU.
In this way, for machine check errors that are uncorrectable but recoverable (i.e., UCR errors), a recovery policy of isolating the CPU is added on top of the general recovery policy. That is, besides performing the general recovery operation for one subset of specific machine check errors in order to keep running, the operating system can also perform a CPU isolation operation for another subset of specific machine check errors in order to keep running. This increases the types of machine check errors for which the operating system can keep running by performing a recovery operation, reduces the probability of downtime, and thus improves the operational stability of the electronic device.
Referring to fig. 3, fig. 3 is a flowchart illustrating a method for processing a machine check error according to an exemplary embodiment of the present disclosure.
In this embodiment, the method for processing the machine check error may be applied to an operating system running on the electronic device; the electronic device may be equipped with a CPU (which may be referred to as a target CPU) supporting a machine check architecture.
In practical applications, the electronic device may be a server in a cloud computing scenario (e.g., a cloud server), a server in a non-cloud scenario, a personal computer, or other electronic devices on which a CPU is installed and an operating system is running.
The method for processing a machine check error may include the following steps:
Step 302: reading data stored in a machine check register in response to a machine check exception; wherein the machine check exception is triggered by an uncorrectable machine check error detected by the target CPU.
In this embodiment, when the target CPU detects an uncorrectable machine check error, it may generate the corresponding machine check exception and report the machine check exception to the operating system. In response to the machine check exception, the operating system may run the machine check exception handler corresponding to the machine check exception. Specifically, the operating system may first read the data stored in the machine check register.
Step 304: determining whether the machine check error is a recoverable machine check error based on a preset machine check error classification rule and the read data.
In this embodiment, the operating system may then determine whether the machine check error is a recoverable machine check error, that is, whether the machine check error is a UCR error, based on a preset machine check error classification rule and the data read from the machine check register. Accordingly, the machine check error classification rule may be a rule for determining whether an uncorrectable machine check error is a UCR error.
It should be noted that, for the manner of determining whether an uncorrectable machine check error is a UCR error, reference may be made to the foregoing description of UCR errors, which is not repeated here.
In practical applications, if the machine check error is not a UCR error, downtime may result; if the machine check error is a UCR error, the operating system can perform the recovery operation corresponding to the machine check error without causing downtime.
Step 306: if the machine check error is determined to be a recoverable machine check error, it is further determined whether a recovery policy corresponding to the machine check error is to isolate the CPU.
In this embodiment, when the machine check error is a recoverable machine check error, the operating system may further determine whether a recovery policy corresponding to the machine check error is to isolate the CPU.
The recovery policy corresponding to the machine check error may indicate which recovery operation the operating system needs to perform for the machine check error.
Step 308: if the recovery policy corresponding to the machine check error is to isolate the CPU, performing isolation processing on the target CPU.
In this embodiment, when the recovery policy corresponding to the machine check error is to isolate the CPU, the CPU that reported the machine check error is the target CPU, so isolation processing can be performed on the target CPU.
In the above technical solution, a machine check exception may be triggered when a CPU supporting a machine check architecture detects an uncorrectable machine check error. In response to the machine check exception, the operating system may read the data stored in the machine check register and determine, based on the read data, whether the machine check error is a recoverable machine check error. If so, the operating system may further determine whether the recovery policy corresponding to the machine check error is to isolate the CPU, and if so, may perform isolation processing on the CPU.
In this way, for machine check errors that are uncorrectable but recoverable (i.e., UCR errors), a recovery policy of isolating the CPU is added on top of the general recovery policy. That is, besides performing the general recovery operation for one subset of specific machine check errors in order to keep running, the operating system can also perform a CPU isolation operation for another subset of specific machine check errors in order to keep running. This increases the types of machine check errors for which the operating system can keep running by performing a recovery operation, reduces the probability of downtime, and thus improves the operational stability of the electronic device.
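Steps 302 to 308 can be summarized as a short control flow. The sketch below is illustrative only: `handle_machine_check` and its three callback parameters are hypothetical names, the returned strings are placeholders for the actual operations, and treating a non-recoverable error as a shutdown follows the remark above that such errors may lead to downtime.

```python
def handle_machine_check(read_registers, is_recoverable, recovery_policy):
    """Sketch of steps 302 to 308; every name here is hypothetical."""
    data = read_registers()                      # step 302: read the machine check registers
    if not is_recoverable(data):                 # step 304: apply the classification rule
        return "shutdown"                        # not a UCR error
    if recovery_policy(data) == "isolate_cpu":   # step 306: look up the recovery policy
        return "isolate_target_cpu"              # step 308: isolate the reporting CPU
    return "general_recovery"                    # other recoverable errors


# Usage with stand-in callbacks for a recoverable error whose policy is isolation.
outcome = handle_machine_check(
    read_registers=lambda: {"recoverable": True, "policy": "isolate_cpu"},
    is_recoverable=lambda data: data["recoverable"],
    recovery_policy=lambda data: data["policy"],
)
```

Passing the three stages as callbacks keeps the flow independent of how the registers are read or how the classification rule is stored.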
The following describes a specific implementation of determining whether the recovery policy corresponding to the machine check error is to isolate the CPU.
In some embodiments, the machine check registers may include a register (which may be referred to as a first register) for storing information related to a machine check error, and a register (which may be referred to as a second register) for storing a machine check error address.
The machine check error address may be an address of a code or a data storage unit that generates the machine check error.
When determining whether the recovery policy corresponding to the machine check error is to isolate the CPU, the operating system may specifically determine whether the type of the machine check error is SRAR, and if so, further determine whether the data stored in the first register and the second register satisfies any one of the following conditions:
(1) The data stored in the first register indicates that the data stored in the second register is invalid;
(2) The data stored in the first register indicates that the data stored in the second register is valid, but reading the second register fails;
(3) The data stored in the first register indicates that the data stored in the second register is valid and reading the second register succeeds, but reading the machine check architecture error code field in the first register fails.
Taking an electronic device with a Pentium 4, Intel Atom, Intel Xeon, or P6 family processor as an example, the first register may be the IA32_MCi_STATUS register, and the second register may be the IA32_MCi_ADDR register. If the ADDRV flag in the IA32_MCi_STATUS register is 0, the data stored in the first register may be considered to indicate that the data stored in the second register is invalid. If the ADDRV flag in the IA32_MCi_STATUS register is 1, the data stored in the first register may be considered to indicate that the data stored in the second register is valid.
If the operating system determines that the data stored in the first register and the second register satisfies condition (1), condition (2), or condition (3), it may determine that the recovery policy corresponding to the machine check error is to isolate the CPU.
Conditions (1), (2), and (3) all indicate that the machine check error address cannot be obtained.
That is, for an SRAR-type UCR error, if the machine check error address is successfully obtained, the operating system may perform the general recovery operation corresponding to the UCR error. If the machine check error address cannot be obtained, the operating system can avoid downtime by isolating the CPU. In this way, an SRAR-type UCR error becomes a type of machine check error for which the operating system can keep running by performing a recovery operation, regardless of whether the machine check error address is successfully obtained; this reduces the probability of downtime and improves the operational stability of the electronic device.
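Conditions (1) to (3) can be written as a single predicate over the outcome of the register reads. This is a sketch under assumptions: the `SrarReadout` record and the function `policy_is_isolate_cpu` are hypothetical names, and how a failed register read is detected is left outside the example.

```python
from dataclasses import dataclass


@dataclass
class SrarReadout:
    """Outcome of reading the two registers for an SRAR error (hypothetical record)."""
    addrv: bool         # first register says the second register is valid
    addr_read_ok: bool  # reading the second register succeeded
    code_read_ok: bool  # reading the error code field of the first register succeeded


def policy_is_isolate_cpu(readout):
    """Return True if any of conditions (1), (2), (3) holds."""
    cond1 = not readout.addrv
    cond2 = readout.addrv and not readout.addr_read_ok
    cond3 = readout.addrv and readout.addr_read_ok and not readout.code_read_ok
    return cond1 or cond2 or cond3
```

The predicate is false only when the address is flagged valid and both reads succeed, which is exactly the case in which the general recovery operation applies.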
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a process of determining a recovery policy according to an exemplary embodiment of the present disclosure.
To reduce the amount of computation needed to determine the recovery policy, the process of determining the recovery policy may include the following steps:
Step 306a: determining whether the type of the machine check error is SRAR. If so, step 306b is performed.
Step 306b: determining whether the data stored in the first register indicates that the data stored in the second register is valid. If so, step 306c is performed; otherwise step 306f is performed.
Step 306c: determining whether reading the second register succeeds. If so, step 306d is performed; otherwise step 306f is performed.
Step 306d: determining whether reading the machine check architecture error code field in the first register succeeds. If so, step 306e is performed; otherwise step 306f is performed.
Step 306e: performing a general recovery operation.
Step 306f: isolating the CPU.
That is, in the case where the operating system determines that the type of the machine check error is SRAR, first, it may be determined whether the data stored in the first register indicates that the data stored in the second register is valid.
Secondly, if the data stored in the first register indicates that the data stored in the second register is valid, it may be further determined whether the second register is successfully read. The CPU may be isolated if the data stored in the first register indicates that the data stored in the second register is invalid.
Thirdly, if the second register is read successfully, it may be further determined whether the machine check architecture error code field in the first register is read successfully. If reading the second register fails, the CPU may be isolated.
Finally, if the machine check architecture error code field in the first register is read successfully, a general recovery operation corresponding to the machine check error may be performed. If the machine check architecture error code field in the first register fails to read, the CPU may be isolated.
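The sequential flow of fig. 4 can be sketched with reads that are only attempted when the previous check passes, so that later reads are skipped as soon as isolation is decided. The names `determine_recovery`, `read_addr`, and `read_error_code` are hypothetical, a failed read is modeled as returning `None`, and the `"not_srar"` result stands for cases the figure does not cover.

```python
def determine_recovery(is_srar, addrv, read_addr, read_error_code):
    """Walk steps 306a to 306f; reads happen lazily, one check at a time."""
    if not is_srar:                    # step 306a
        return "not_srar"              # outside the flow shown in fig. 4
    if not addrv:                      # step 306b: second register flagged invalid
        return "isolate_cpu"           # step 306f
    if read_addr() is None:            # step 306c: reading the second register failed
        return "isolate_cpu"           # step 306f
    if read_error_code() is None:      # step 306d: reading the error code field failed
        return "isolate_cpu"           # step 306f
    return "general_recovery"          # step 306e


# Usage: the address read fails, so the error code field is never read.
calls = []


def failing_addr_read():
    calls.append("addr")
    return None


def error_code_read():
    calls.append("code")
    return 0x134


result = determine_recovery(True, True, failing_addr_read, error_code_read)
```

Ordering the checks from cheapest to most expensive is one plausible reading of why the flow is arranged this way.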
Next, a specific implementation of the isolation processing performed by the target CPU will be described.
In some embodiments, when the CPU detects an uncorrectable machine check error and the address of the code or data storage unit that generated the error is in kernel mode, the error may affect the operation of system processes, and the system generally needs to be shut down.
Accordingly, when the operating system performs isolation processing on the target CPU, it may first determine whether the process context of the target CPU is in user mode. If it is, the address of the code or data storage unit that caused the machine check error can be considered to be in user mode for the target CPU, so the isolation processing can be performed on the target CPU.
In practical applications, a downtime may be triggered if the operating system cannot acquire the target CPU, if it detects that the process context of the target CPU is in kernel mode, or if its isolation processing of the target CPU fails.
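The gating described above can be sketched as a small decision function. This is a hedged illustration only; the `Outcome` enum, the function name, and the `isolate` callback are hypothetical stand-ins for operating system internals that the specification does not detail.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    ISOLATED = "isolated"
    DOWNTIME = "downtime"


def handle_isolation_request(cpu_acquired: bool,
                             in_user_mode: bool,
                             isolate: Callable[[], bool]) -> Outcome:
    """Decide between isolating the target CPU and triggering a downtime.

    cpu_acquired -- whether the operating system could acquire the target CPU
    in_user_mode -- whether the target CPU's process context is in user mode
    isolate      -- callback that performs the isolation, True on success
    """
    # Without the target CPU, or with a kernel-mode context, isolation is
    # not attempted at all.
    if not cpu_acquired or not in_user_mode:
        return Outcome.DOWNTIME
    # Isolation is attempted only for a user-mode context; a failed attempt
    # also ends in downtime.
    return Outcome.ISOLATED if isolate() else Outcome.DOWNTIME
```

The callback is invoked only after both preconditions hold, so a kernel-mode context never reaches the isolation step.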
In some embodiments, a CPU group including the target CPU may be mounted on the electronic device. There may be an association between different CPUs in the CPU group. If any of the associated multiple CPUs detect a machine check error, other ones of the multiple CPUs may also be affected, resulting in error propagation. Therefore, if it is determined that the isolation processing needs to be performed on any one of the associated plurality of CPUs, the isolation processing may be performed on the other CPUs of the plurality of CPUs so as to prevent error diffusion.
In the above case, when the process context of the target CPU is in the user mode, the operating system may first query the associated CPU associated with the target CPU, and then perform isolation processing on the target CPU and the associated CPU.
In some embodiments, the CPU group may include a single multi-core CPU or a plurality of single-core CPUs. In practical applications, each core in a single multi-core CPU can be considered as one CPU, so that a single multi-core CPU and multiple single-core CPUs actually contain multiple CPUs.
In some embodiments, the associated plurality of CPUs may be CPUs that share a portion of the hardware. For example, multiple CPUs sharing the same cache (cache) may be considered as associated multiple CPUs; alternatively, multiple logical CPUs sharing the same physical CPU may be considered as associated multiple CPUs.
In practical applications, the target CPU detects an uncorrectable machine check error because of one or more hardware errors, so an associated CPU may be a CPU that shares with the target CPU the hardware corresponding to the detected uncorrectable machine check error.
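One way to picture the query for associated CPUs is shown below. This is a sketch under assumptions: each piece of shared hardware (for example a cache or a physical core) is described by a CPU list string in the "0-3,8" format that Linux uses for CPU lists, and the caller passes only the groups for hardware implicated in the error. Both function names are hypothetical.

```python
from typing import Iterable, Set


def parse_cpu_list(text: str) -> Set[int]:
    """Parse a CPU list such as "0-3,8" into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        # A single id such as "8" has no upper bound, so reuse the lower one.
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def associated_cpus(target: int, faulty_hw_groups: Iterable[str]) -> Set[int]:
    """Return the CPUs that share implicated hardware with the target CPU.

    faulty_hw_groups -- one CPU list string per piece of shared hardware
                        that corresponds to the machine check error
    """
    result: Set[int] = set()
    for group in faulty_hw_groups:
        members = parse_cpu_list(group)
        if target in members:
            result |= members
    result.discard(target)  # the target itself is not its own associate
    return result
```

For instance, `associated_cpus(1, ["0-1", "0-3", "4-7"])` collects the members of the first two groups, because CPU 1 belongs to both, and ignores the third.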
In some embodiments, when the operating system performs isolation processing on the target CPU and the associated CPU associated with the target CPU, any one or a combination of the following processes may be specifically performed on the target CPU and the associated CPU:
(1) Closing the watchdog of the target CPU and the associated CPU;
(2) Clearing the process context of the target CPU and the associated CPU;
(3) Taking the target CPU and the associated CPU offline, for example by setting the online state of each CPU to false and its idle state to true;
(4) Deleting the target CPU and the associated CPU from the CPU topology;
(5) Deleting the target CPU and the associated CPU from the Non-Uniform Memory Access (NUMA) node;
(6) Shutting down the timers (timer) and high-precision timers (hrtimer) of the target CPU and the associated CPU;
(7) Closing the RCU (Read-Copy Update) synchronization function on the target CPU and the associated CPU;
(8) Closing the Inter-Processor Interrupt (IPI) function on the target CPU and the associated CPU;
(9) Executing an infinite-loop kernel-mode process on the target CPU and the associated CPU.
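Because the processes above may be applied singly or in combination, isolation can be pictured as running a selected sequence of steps over each CPU. The sketch below is a simulation under assumptions, not kernel code: `CpuState` and its fields are hypothetical stand-ins for per-CPU kernel state, and only processes (1), (3), (4), and (6) are modeled.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence


@dataclass
class CpuState:
    """Simulated per-CPU state; every field is a hypothetical stand-in."""
    cpu_id: int
    watchdog_enabled: bool = True
    online: bool = True
    idle: bool = False
    in_topology: bool = True
    timers_enabled: bool = True


def stop_watchdog(cpu: CpuState) -> None:
    cpu.watchdog_enabled = False   # process (1)


def take_offline(cpu: CpuState) -> None:
    cpu.online = False             # process (3): online state becomes false
    cpu.idle = True                # and idle state becomes true


def remove_from_topology(cpu: CpuState) -> None:
    cpu.in_topology = False        # process (4)


def stop_timers(cpu: CpuState) -> None:
    cpu.timers_enabled = False     # process (6)


Step = Callable[[CpuState], None]
DEFAULT_STEPS: Sequence[Step] = (
    stop_watchdog, take_offline, remove_from_topology, stop_timers)


def isolate(cpus: Iterable[CpuState],
            steps: Sequence[Step] = DEFAULT_STEPS) -> bool:
    """Apply each selected step to each CPU; return False if any step raises."""
    try:
        for cpu in cpus:
            for step in steps:
                step(cpu)
    except Exception:
        return False
    return True
```

Passing the target CPU together with its associated CPUs as `cpus` applies the same steps to all of them, and the boolean result gives a caller something to test when deciding whether the isolation succeeded.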
In some embodiments, the operating system may determine whether the isolation process is successful after performing the isolation process on the target CPU (or the target CPU and an associated CPU associated with the target CPU).
If the isolation processing of the target CPU succeeds, the operating system may output the information about the machine check error together with information indicating that the target CPU was isolated successfully, so that a user can learn that a hardware error occurred, what the error was, and that the recovery policy applied was CPU isolation.
If the isolation processing of the target CPU fails, the operating system may record the information about the machine check error together with information indicating that isolation of the target CPU failed, and then trigger a downtime. After the restart, the operating system may output the recorded information, so that the user can learn that a hardware error occurred, what the error was, and that the recovery policy applied was CPU isolation.
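The two reporting paths can be sketched as follows. This is an illustration under assumptions: `output` stands in for whatever channel the operating system reports through, `pending` stands in for storage that survives a restart, and both function names are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def finish_isolation(error_info: Record,
                     isolated: bool,
                     output: List[Record],
                     pending: List[Record],
                     trigger_downtime: Callable[[], None]) -> None:
    """Report a successful isolation now, or record a failure and go down."""
    status = "isolation succeeded" if isolated else "isolation failed"
    record: Record = {"error": error_info, "status": status}
    if isolated:
        output.append(record)   # success: output immediately
        return
    pending.append(record)      # failure: record first ...
    trigger_downtime()          # ... then trigger the downtime


def replay_after_restart(output: List[Record], pending: List[Record]) -> None:
    """After a restart, output whatever was recorded before the downtime."""
    output.extend(pending)
    pending.clear()
```

Recording before triggering the downtime matters in this sketch: if the order were reversed, the failure record would never be written.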
Corresponding to the embodiments of the processing method for machine check errors, the present specification also provides embodiments of a processing apparatus for machine check errors.
FIG. 5 is a schematic block diagram of a device provided in an exemplary embodiment. Referring to FIG. 5, at the hardware level, the device includes a processor 502, an internal bus 504, a network interface 506, a memory 508, and a non-volatile storage 510, and may also include other required hardware. One or more embodiments of the present specification may be implemented in software, for example by the processor 502 reading the corresponding computer program from the non-volatile storage 510 into the memory 508 and then running it. Of course, besides a software implementation, one or more embodiments of the present specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic modules and may also be hardware or logic devices.
Referring to FIG. 6, FIG. 6 is a block diagram of an apparatus for processing machine check errors according to an exemplary embodiment of the present specification.
The apparatus for processing machine check errors can be applied to the device shown in FIG. 5 to implement the technical solution of the present specification. The apparatus can be applied to an operating system running on the device; the target CPU mounted on the device supports a machine check architecture.
The apparatus for processing machine check errors may include:
a reading module 602, configured to read data stored in a machine check register in response to a machine check exception; wherein the machine check exception is triggered by an uncorrectable machine check error detected by the target CPU;
a first determining module 604, configured to determine whether the machine check error is a recoverable machine check error based on a preset machine check error classification rule and the read data;
a second determining module 606, configured to, in a case that the machine check error is determined to be a recoverable machine check error, further determine whether a recovery policy corresponding to the machine check error is to isolate the CPU;
a processing module 608, configured to perform isolation processing on the target CPU if the recovery policy corresponding to the machine check error is to isolate the CPU.
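The way the four modules hand off to one another can be sketched as a single function that takes each module as a callback. This is a hedged sketch only; the function name, the callback signatures, and the returned labels are hypothetical, and the handling of errors that are not recoverable or that call for a different recovery policy is left out.

```python
from typing import Callable, Dict

RegisterData = Dict[str, int]


def process_machine_check(read_registers: Callable[[], RegisterData],
                          is_recoverable: Callable[[RegisterData], bool],
                          policy_is_isolate: Callable[[RegisterData], bool],
                          isolate_cpu: Callable[[], None]) -> str:
    """Chain the four modules in the order the apparatus describes."""
    data = read_registers()          # reading module 602
    if not is_recoverable(data):     # first determining module 604
        return "not_recoverable"
    if not policy_is_isolate(data):  # second determining module 606
        return "other_policy"
    isolate_cpu()                    # processing module 608
    return "isolated"
```

Each later callback runs only if the earlier check passed, so the isolation step is never reached for an error that the classification rule marks as not recoverable.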
Optionally, the machine check register comprises a first register for storing information relating to machine check errors, and a second register for storing machine check error addresses;
the second determining module 606 is specifically configured to:
determining whether the type of the machine check error is an SRAR;
if the type of the machine check error is SRAR, further determining whether the data stored in the first register and the second register satisfies any one of the following conditions:
the data stored in the first register indicates that the data stored in the second register is invalid;
the data stored in the first register indicates that the data stored in the second register is valid, and reading the second register fails;
the data stored in the first register indicates that the data stored in the second register is valid, the second register is read successfully, and reading the machine check architecture error code field in the first register fails.
Optionally, the processing module 608 is specifically configured to:
determining whether the process context of the target CPU is in a user state;
and if the process context of the target CPU is in the user state, carrying out isolation processing on the target CPU.
Optionally, a CPU group including the target CPU is mounted on the electronic device;
the performing isolation processing on the target CPU if the process context of the target CPU is in the user state comprises:
if the process context of the target CPU is in the user state, querying an associated CPU associated with the target CPU;
and performing isolation processing on the target CPU and the associated CPU.
Optionally, the CPU group includes a single multi-core CPU or a plurality of single-core CPUs.
Optionally, the associated CPU and the target CPU share hardware corresponding to the machine check error.
Optionally, the processing module 608 is specifically configured to:
performing any one or combination of the following processes on the target CPU and the associated CPU:
closing the watchdog of the target CPU and the associated CPU;
clearing the process context of the target CPU and the associated CPU;
taking the target CPU and the associated CPU off line;
deleting the target CPU and the associated CPU from the CPU topology;
deleting the target CPU and the associated CPU from the non-uniform memory access node;
closing timers and high-precision timers of the target CPU and the associated CPU;
closing RCU synchronization functions on the target CPU and the associated CPU;
closing inter-processor interrupt functionality on the target CPU and the associated CPU;
executing an infinite-loop kernel-mode process on the target CPU and the associated CPU.
Optionally, the processing module 608 is further configured to:
if the isolation processing of the target CPU is successful, outputting the information of the machine check error and the information indicating the successful isolation of the target CPU;
and if the isolation processing of the target CPU fails, recording the information of the machine check error and the information indicating the isolation failure of the target CPU, and triggering downtime.
The implementation process of the functions and actions of each module in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, they substantially correspond to the method embodiments, and so reference may be made to some of the descriptions of the method embodiments for their relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
The systems, apparatuses, modules or units described in the above embodiments may be specifically implemented by a computer chip or an entity, or implemented by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus comprising the element.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if" as used herein may be interpreted as "at the time of ...", "when ...", or "in response to a determination", depending on the context.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (11)

1. A method for processing machine check errors, applied to an operating system running on an electronic device, wherein a target CPU mounted on the electronic device supports a machine check architecture; the method comprising:
reading data stored in a machine check register in response to a machine check exception; wherein the machine check exception is triggered by an uncorrectable machine check error detected by the target CPU;
determining whether the machine check error is a recoverable machine check error based on a preset machine check error classification rule and the read data;
if the machine check error is determined to be a recoverable machine check error, further determining whether a recovery policy corresponding to the machine check error is to isolate the CPU;
and if the recovery policy corresponding to the machine check error is to isolate the CPU, performing isolation processing on the target CPU.
2. The method of claim 1, the machine check register comprising a first register for storing information related to a machine check error, and a second register for storing a machine check error address;
the determining whether the recovery policy corresponding to the machine check error is to isolate the CPU comprises:
determining whether the type of the machine check error is an SRAR;
if the type of the machine check error is SRAR, further determining whether the data stored in the first register and the second register satisfies any one of the following conditions:
the data stored in the first register indicates that the data stored in the second register is invalid;
the data stored in the first register indicates that the data stored in the second register is valid, and reading the second register fails;
the data stored in the first register indicates that the data stored in the second register is valid, the second register is read successfully, and reading the machine check architecture error code field in the first register fails.
3. The method of claim 1, the performing isolation processing on the target CPU comprising:
determining whether the process context of the target CPU is in a user state;
and if the process context of the target CPU is in the user state, performing isolation processing on the target CPU.
4. The method of claim 3, wherein a CPU group including the target CPU is mounted on the electronic device;
the performing isolation processing on the target CPU if the process context of the target CPU is in the user state comprises:
if the process context of the target CPU is in the user state, querying an associated CPU associated with the target CPU;
and performing isolation processing on the target CPU and the associated CPU.
5. The method of claim 4, the CPU group comprising a single multi-core CPU or a plurality of single-core CPUs.
6. The method of claim 4, the associated CPU sharing hardware corresponding to the machine check error with the target CPU.
7. The method of claim 4, the performing isolation processing on the target CPU and the associated CPU comprising:
performing any one or combination of the following processes on the target CPU and the associated CPU:
closing the watchdog of the target CPU and the associated CPU;
clearing the process context of the target CPU and the associated CPU;
taking the target CPU and the associated CPU off line;
deleting the target CPU and the associated CPU from the CPU topology;
deleting the target CPU and the associated CPU from the non-uniform memory access node;
closing timers and high-precision timers of the target CPU and the associated CPU;
closing RCU synchronization functions on the target CPU and the associated CPU;
closing inter-processor interrupt functionality on the target CPU and the associated CPU;
executing an infinite-loop kernel-mode process on the target CPU and the associated CPU.
8. The method of claim 1, further comprising:
if the isolation processing of the target CPU is successful, outputting the information of the machine check error and the information indicating that the isolation of the target CPU is successful;
and if the isolation processing of the target CPU fails, recording the information of the machine check error and the information indicating the isolation failure of the target CPU, and triggering downtime.
9. An apparatus for processing machine check errors, applied to an operating system running on an electronic device, wherein a target CPU mounted on the electronic device supports a machine check architecture; the apparatus comprising:
a reading module, configured to read data stored in a machine check register in response to a machine check exception; wherein the machine check exception is triggered by an uncorrectable machine check error detected by the target CPU;
a first determining module, configured to determine whether the machine check error is a recoverable machine check error based on a preset machine check error classification rule and the read data;
a second determining module, configured to further determine whether a recovery policy corresponding to the machine check error is to isolate the CPU if the machine check error is determined to be a recoverable machine check error;
and a processing module, configured to perform isolation processing on the target CPU if the recovery policy corresponding to the machine check error is to isolate the CPU.
10. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the method of any one of claims 1 to 8 by executing the executable instructions.
11. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 8.
CN202211469570.6A 2022-11-22 2022-11-22 Method and device for processing machine check errors Pending CN115858211A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211469570.6A CN115858211A (en) 2022-11-22 2022-11-22 Method and device for processing machine check errors

Publications (1)

Publication Number Publication Date
CN115858211A true CN115858211A (en) 2023-03-28

Family

ID=85665132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211469570.6A Pending CN115858211A (en) 2022-11-22 2022-11-22 Method and device for processing machine check errors

Country Status (1)

Country Link
CN (1) CN115858211A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117608910A (en) * 2024-01-24 2024-02-27 苏州元脑智能科技有限公司 Determination method, device and system for machine inspection exception error type of processor
CN117608910B (en) * 2024-01-24 2024-04-12 苏州元脑智能科技有限公司 Determination method, device and system for machine inspection exception error type of processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination