CN116775351A

CN116775351A - Memory detection method and computing device

Info

Publication number: CN116775351A
Application number: CN202310327909.7A
Authority: CN
Inventors: 李胜
Original assignee: XFusion Digital Technologies Co Ltd
Current assignee: XFusion Digital Technologies Co Ltd
Priority date: 2023-03-29
Filing date: 2023-03-29
Publication date: 2023-09-19

Abstract

The embodiments of the present application provide a memory detection method and a computing device. The method comprises: determining a first memory location and a second memory location in a memory, wherein the fault severity of the first memory location differs from the fault severity of the second memory location; and determining a memory detection policy based on the fault severity of the first memory location and the second memory location, the memory detection policy indicating that fault detection is performed on the first memory location in a first detection mode and on the second memory location in a second detection mode. This method makes the first detection mode and the second detection mode more practical for memory detection.

Description

Memory detection method and computing device

Technical Field

The embodiments of the present application relate to the technical field of computing devices, and in particular to a memory detection method and a computing device.

Background

To improve the reliability of the memory in a computing device, the computing device supports memory patrol techniques. For example, the memory patrol technique may be a patrol scrub (Patrol Scrub) technique or an error check and scrub (Error Check and Scrub, ECS) technique. Single-bit faults in the memory can be detected by the ECS technique, and multi-bit faults and uncorrectable faults in the memory can be detected by the Patrol Scrub technique.

At present, after a computing device starts, a patrol scrub engine and an ECS engine are enabled by default to patrol all memory in the computing device at a fixed period. However, different memory regions in a computing device have different patrol requirements. For example, a region with single-bit faults may have a higher need for patrol by the ECS engine, and a region with multi-bit faults may have a higher need for patrol by the patrol scrub engine. The fixed patrol mode therefore cannot make good use of the detection capability of each patrol engine, so memory detection by the patrol engines is of limited practical value and detection resources are wasted.

Disclosure of Invention

The embodiments of the present application provide a memory detection method and a computing device. The method can dynamically formulate a detection scheme according to the detection characteristics of different patrol engines and make better use of the patrol scrub engine and the ECS engine, so that memory detection by the patrol scrub engine and the ECS engine is more practical and detection resources are saved.

In a first aspect, an embodiment of the present application provides a memory detection method, where the method includes:

determining a first memory location and a second memory location in a memory, wherein the fault severity of the first memory location and the fault severity of the second memory location are different;

determining a memory detection policy based on the fault severity of the first memory location and the second memory location, the memory detection policy indicating that fault detection is performed on the first memory location in a first detection mode and on the second memory location in a second detection mode.

In the above scheme, a first memory location and a second memory location can be determined in the memory, where the fault severity of the first memory location differs from that of the second memory location, and a memory detection policy can be determined based on the fault severity of the first memory location and the second memory location. The memory detection policy indicates that fault detection is performed on the first memory location in the first detection mode and on the second memory location in the second detection mode. In this way, the first memory location, whose fault severity is lower, is fault-detected in the first detection mode, and the second memory location, whose fault severity is higher, is fault-detected in the second detection mode. The first detection mode and the second detection mode are each put to better use, so memory detection in both modes is more practical and detection resources are saved.

In a possible implementation manner, determining a memory detection policy based on severity of failure of the first memory location and the second memory location includes:

when the fault severity of the first memory location is less than or equal to a first threshold value, determining that a memory detection strategy corresponding to the first memory location is a first detection mode;

and when the fault severity of the second memory location is greater than or equal to a second threshold value, determining that the memory detection strategy corresponding to the second memory location is a second detection mode.

In the above scheme, the memory detection policy corresponding to the first memory location is determined to be the first detection mode and the memory detection policy corresponding to the second memory location is determined to be the second detection mode according to the severity of the fault of the memory location to be detected, so that the purpose of determining the memory detection policy according to the severity of the fault is achieved.

In one possible implementation, determining a first memory location and a second memory location in a memory includes:

determining at least one memory location to be detected and the fault severity of each memory location to be detected in the memory;

and determining the first memory location and the second memory location in the memory according to the fault severity of each memory location to be detected.

In the above scheme, the first memory location and the second memory location can be determined according to the severity of the fault of the memory location to be detected, thereby achieving the purpose of determining the first memory location and the second memory location.

In a possible implementation manner, determining the at least one memory location to be detected according to the at least one memory failure location includes:

determining a storage array in which each memory fault position is located in the memory to obtain at least one storage array;

and determining the at least one storage array as the at least one memory location to be detected.

In the scheme, the memory location to be detected can be determined according to the memory fault location, and fault detection processing of all memories is avoided.
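As a rough illustration of the mapping above, the following Python sketch groups fault locations by the storage array that contains them. The `FaultLocation` fields and the choice of a single DRAM bank as the "storage array" are assumptions made for this example; the application does not fix either.

```python
from typing import NamedTuple


class FaultLocation(NamedTuple):
    """Hypothetical decoded address of one memory fault."""
    rank: int
    bank_group: int
    bank: int
    row: int
    column: int


class StorageArray(NamedTuple):
    """Hypothetical storage array identifier (assumed here to be one bank)."""
    rank: int
    bank_group: int
    bank: int


def locations_to_detect(faults):
    """Return the distinct storage arrays that contain the given faults."""
    arrays = (StorageArray(f.rank, f.bank_group, f.bank) for f in faults)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(arrays))


faults = [
    FaultLocation(0, 1, 2, row=10, column=3),
    FaultLocation(0, 1, 2, row=99, column=7),  # same array as the first fault
    FaultLocation(1, 0, 3, row=5, column=0),
]
to_detect = locations_to_detect(faults)
```

Only the two arrays that actually contain faults end up in `to_detect`, which is what lets later detection skip the rest of the memory.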

In a possible implementation manner, determining at least one memory location to be detected and a fault severity of each memory location to be detected in the memory includes:

obtaining a fault log of a first memory fault, wherein the first memory fault is a fault generated when the computing equipment performs service operation on the memory, and the fault log comprises at least one memory fault position and fault information of each memory fault position;

determining a memory location to be detected corresponding to each memory fault location, and obtaining at least one memory location to be detected;

and aiming at each memory fault position, determining the fault severity of the memory position to be detected corresponding to the memory fault position according to the fault information of the memory fault position.

In the scheme, the fault severity of the memory location to be detected can be determined according to the fault information of the memory fault location, so that the purpose of determining the fault severity of the memory location to be detected is achieved.
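The passage above does not state how fault severity is computed from the fault information. The sketch below is one possible reading, under two loudly stated assumptions: each fault log entry is a `(fault_location, fault_count)` pair, and severity is simply the total fault count that falls inside a memory location to be detected. The `area_of` mapping is a hypothetical helper supplied by the caller.

```python
from collections import defaultdict


def severity_by_location(fault_log, area_of):
    """Aggregate a fault log into one severity value per location to be detected.

    fault_log: iterable of (fault_location, fault_count) pairs (assumed layout).
    area_of:   function mapping a fault location to its location to be detected.
    Severity is assumed to be the summed fault count, purely for illustration.
    """
    severity = defaultdict(int)
    for fault_location, fault_count in fault_log:
        severity[area_of(fault_location)] += fault_count
    return dict(severity)


# Fault locations are (bank, row) pairs; the bank is the location to be detected.
log = [((0, 10), 2), ((0, 99), 1), ((3, 5), 7)]
severities = severity_by_location(log, area_of=lambda loc: loc[0])
```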

In a possible implementation manner, determining the first memory location and the second memory location in the memory according to the severity of the fault of each memory location to be detected includes:

acquiring resource occupation information of the computing equipment;

determining the first threshold and the second threshold according to the resource occupation information;

and determining the memory location to be detected with the fault severity smaller than or equal to the first threshold as the first memory location, and determining the memory location to be detected with the fault severity larger than or equal to the second threshold as the second memory location.

In the above scheme, the first memory location may be determined according to the first threshold, and the second memory location may be determined according to the second threshold. The purpose of determining the first memory location and the second memory location according to the severity of the fault is achieved.
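The threshold comparison above can be sketched as follows. The function name `split_locations` and the dictionary layout are assumptions for this example, not part of the application.

```python
def split_locations(severities, first_threshold, second_threshold):
    """Split locations to be detected into first and second memory locations.

    severities maps each location to be detected to its fault severity.
    A location may land in both lists when first_threshold >= second_threshold.
    """
    first = [loc for loc, s in severities.items() if s <= first_threshold]
    second = [loc for loc, s in severities.items() if s >= second_threshold]
    return first, second


severities = {"bank0": 1, "bank1": 5, "bank2": 9}
first, second = split_locations(severities, first_threshold=5, second_threshold=5)
```

With both thresholds set to 5, `bank1` satisfies both comparisons and is therefore checked in both detection modes; moving the thresholds apart removes that overlap.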

In a possible implementation, the resource occupancy information includes a first CPU occupancy of the computing device; determining the first threshold and the second threshold according to the resource occupation information comprises the following steps:

acquiring an initial CPU occupancy rate, a first initial threshold value and a second initial threshold value;

determining the difference value between the first CPU occupancy rate and the initial CPU occupancy rate as a CPU occupancy rate difference value;

updating the first initial threshold according to the CPU occupancy rate difference value to obtain the first threshold;

and updating the second initial threshold according to the CPU occupancy rate difference value to obtain the second threshold.

In the above scheme, the first threshold and the second threshold can be determined according to the first CPU occupancy rate. When the first CPU occupancy rate is higher, the first threshold and the second threshold are adjusted so that the overlap between the M first memory locations and the N second memory locations becomes smaller; when the first CPU occupancy rate is lower, they are adjusted so that the overlap becomes larger.
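The application does not give the update formula, so the sketch below assumes a simple linear adjustment: the CPU occupancy difference lowers the first threshold and raises the second by the same amount. The `gain` parameter and the function name are hypothetical.

```python
def update_thresholds(first_cpu, initial_cpu, first_initial, second_initial, gain=1.0):
    """Return (first_threshold, second_threshold) for the current CPU occupancy.

    Assumed rule: a higher occupancy lowers the first threshold and raises the
    second, shrinking the severity range that both detection modes would cover.
    """
    diff = first_cpu - initial_cpu          # CPU occupancy rate difference
    first_threshold = first_initial - gain * diff
    second_threshold = second_initial + gain * diff
    return first_threshold, second_threshold


# Occupancy in percent; initial thresholds overlap on severities 40..60.
busy = update_thresholds(80, 50, first_initial=60, second_initial=40)
idle = update_thresholds(20, 50, first_initial=60, second_initial=40)
```

Under this assumed rule, the busy system ends with thresholds (30, 70) and no overlap, while the idle system ends with (90, 10), so more locations are checked in both modes.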

In a possible implementation manner, after determining the memory detection policy based on the severity of the fault of the first memory location and the second memory location, the method further includes:

performing fault detection on the second memory location in the second detection mode, and instructing the memory to perform fault detection on the first memory location in the first detection mode.

In the above scheme, the fault detection can be performed on the second memory location by the second detection mode, and the fault detection can be performed on the first memory location by the first detection mode. The purpose of detecting different memory positions through different detection modes is achieved.

In one possible implementation manner, the number of the first memory locations is M, where M is an integer greater than or equal to 0; performing fault detection on the first memory location in the first detection manner, including:

determining first detection information of the M first memory locations according to the fault severity of the M first memory locations, wherein the first detection information comprises the detection rate and the detection times of each first memory location;

detecting the M first memory locations according to the first detection information, subtracting 1 from the detection times of each first memory location in the first detection information, and deleting each first memory location whose detection times are 0, to obtain first updated detection information;

generating an (i+1)-th detection task according to the i-th updated detection information, wherein the (i+1)-th detection task comprises each first memory location in the i-th updated detection information and the detection rate of each first memory location;

executing the (i+1)-th detection task, subtracting 1 from the detection times of each first memory location in the i-th updated detection information, and deleting each first memory location whose detection times are 0, to obtain the (i+1)-th updated detection information;

wherein i takes the values 1, 2, ..., K-1 in sequence, K is the maximum of the detection times corresponding to the M first memory locations, and K is an integer greater than 1.

In the above scheme, the first detection information of the M first memory locations may be determined according to the severity of the failure of the M first memory locations. According to the first detection information, the M first memory locations can be subjected to fault detection processing in the first detection mode, so that the fault detection processing of all memory areas in the first detection mode is avoided, and the power consumption of the detection processing is saved.
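The round-by-round procedure above can be sketched as a loop. The dictionary layout (`location -> (rate, times)`) and the `execute_task` callback are assumptions for this example; the actual detection would be carried out by the hardware engine.

```python
def run_detection(detection_info, execute_task):
    """Run detection rounds until every location has used up its detection times.

    detection_info maps location -> (detection_rate, detection_times).
    execute_task receives one task: a dict of location -> detection_rate.
    Returns the number of rounds executed.
    """
    # Locations that need no detection at all are dropped up front.
    pending = {loc: (rate, times)
               for loc, (rate, times) in detection_info.items() if times > 0}
    rounds = 0
    while pending:
        # Build and execute the task for this round.
        task = {loc: rate for loc, (rate, _) in pending.items()}
        execute_task(task)
        rounds += 1
        # Subtract 1 from each detection times value and delete exhausted entries.
        pending = {loc: (rate, times - 1)
                   for loc, (rate, times) in pending.items() if times - 1 > 0}
    return rounds


executed = []
rounds = run_detection({"bank0": (1.0, 3), "bank1": (2.0, 1)}, executed.append)
```

Here the largest detection times value is 3, so three rounds run; `bank1` appears only in the first task because its single detection is used up after one round.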

In a possible implementation manner, determining the first detection information of the M first memory locations according to the severity of the fault of the M first memory locations includes:

for any one first memory location, if the fault severity of the first memory location is less than or equal to a preset severity, determining that the detection rate of the first memory location is a preset rate, and determining that the detection times of the first memory location are preset times;

if the fault severity of the first memory location is greater than the preset severity, determining the detection rate of the first memory location according to the preset rate and the fault severity, and determining the detection times of the first memory location according to the preset times and the fault severity.

In the above scheme, the detection rate and the detection times of the first memory location are determined according to the fault severity of the first memory location. By the method, the purpose of dynamically adjusting the first detection information can be achieved, so that the practicability of fault detection processing of the first detection mode is higher.
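The application states only that the rate and the detection times depend on the preset values and the fault severity once the preset severity is exceeded. The sketch below assumes proportional scaling, which is one plausible choice among many; the function name is hypothetical.

```python
import math


def detection_parameters(severity, preset_severity, preset_rate, preset_times):
    """Return (detection_rate, detection_times) for one memory location.

    At or below the preset severity the preset values are used unchanged.
    Above it, both values are scaled by severity / preset_severity, which is
    an assumed rule; preset_severity must be greater than 0.
    """
    if severity <= preset_severity:
        return preset_rate, preset_times
    scale = severity / preset_severity
    return preset_rate * scale, math.ceil(preset_times * scale)


mild = detection_parameters(1, preset_severity=2, preset_rate=1.0, preset_times=2)
severe = detection_parameters(4, preset_severity=2, preset_rate=1.0, preset_times=2)
```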

In a possible implementation manner, the detecting the M first memory locations according to the first detection information includes:

generating a first detection task according to the first detection information, wherein the first detection task comprises the M first memory positions and the detection rate of each first memory position;

and executing the first detection task to detect the M first memory locations.

In the above scheme, the detection task can be generated according to the first detection information, and the detection processing is performed on the M first memory locations according to the detection task in the first detection mode, so that the purpose of performing fault detection processing on the M first memory locations is achieved.

In one possible implementation manner, the number of the second memory locations is N; performing fault detection on the second memory location in the second detection manner, including:

determining second detection information of the N second memory locations according to the fault severity of the N second memory locations, wherein the second detection information comprises the detection rate and the detection times of each second memory location;

detecting the N second memory locations according to the second detection information, subtracting 1 from the detection times of each second memory location in the second detection information, and deleting each second memory location whose detection times are 0, to obtain first updated detection information;

generating an (i+1)-th detection task according to the i-th updated detection information, wherein the (i+1)-th detection task comprises each second memory location in the i-th updated detection information and the detection rate of each second memory location;

executing the (i+1)-th detection task, subtracting 1 from the detection times of each second memory location in the i-th updated detection information, and deleting each second memory location whose detection times are 0, to obtain the (i+1)-th updated detection information;

wherein i takes the values 1, 2, ..., K-1 in sequence, K is the maximum of the detection times corresponding to the N second memory locations, and K is an integer greater than 1.

In the above scheme, the second detection information of the N second memory locations may be determined according to the severity of the failure of the N second memory locations. According to the second detection information, the N second memory locations can be subjected to fault detection processing in the second detection mode, so that the fault detection processing of all the memory areas in the second detection mode is avoided, and the power consumption of the detection processing is saved.

In a possible implementation manner, determining the second detection information of the N second memory locations according to the severity of the failure of the N second memory locations includes:

for any one second memory location, if the fault severity of the second memory location is less than or equal to a preset severity, determining that the detection rate of the second memory location is a preset rate, and determining that the detection times of the second memory location are preset times;

if the fault severity of the second memory location is greater than the preset severity, determining the detection rate of the second memory location according to the preset rate and the fault severity, and determining the detection times of the second memory location according to the preset times and the fault severity.

In the above scheme, the detection rate and the detection times of the second memory location are determined according to the fault severity of the second memory location. In this way, the second detection information can be adjusted dynamically, so that fault detection processing in the second detection mode is more practical.

In a possible implementation manner, the detecting processing for the N second memory locations according to the second detection information includes:

generating a first detection task according to the second detection information, wherein the first detection task comprises the N second memory positions and the detection rate of each second memory position;

and executing the first detection task to detect the N second memory locations.

In the above scheme, the detection task can be generated according to the second detection information, and the N second memory locations are detected according to the detection task in the second detection mode, so that the purpose of performing fault detection processing on the N second memory locations is achieved.

In a possible implementation manner, the method further includes:

performing repair processing on the at least one memory location to be detected according to the fault information to obtain a repair result, wherein the repair result comprises a target memory location which is successfully repaired;

deleting the target memory location, and the detection rate and detection times corresponding to the target memory location from the first detection information or the second detection information.

In the method, the at least one memory location to be detected can be repaired according to the fault information, so that the purpose of repairing the relevant memory area according to the service operation memory fault is achieved.

In a possible implementation manner, the fault information includes the number of faults that occurred at the at least one memory fault location; performing repair processing on the at least one memory location to be detected according to the fault information to obtain a repair result includes:

determining the number of faults corresponding to the at least one memory location to be detected according to the number of faults of the at least one memory fault location;

determining a first memory location to be detected in the at least one memory location to be detected according to the number of faults corresponding to the at least one memory location to be detected, wherein the number of faults corresponding to the first memory location to be detected is greater than or equal to a preset threshold value;

and repairing the first memory location to be detected to obtain the repair result.
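The selection and repair steps, together with the earlier step of deleting successfully repaired locations from the detection information, can be sketched as below. The `repair` callback is a hypothetical stand-in for whatever repair mechanism the platform provides, and the data layouts are assumptions for this example.

```python
def repair_and_prune(fault_counts, preset_threshold, repair, detection_info):
    """Repair heavily faulted locations and stop detecting the repaired ones.

    fault_counts:   location to be detected -> number of faults.
    repair:         callback returning True when a location is repaired.
    detection_info: location -> (detection_rate, detection_times); modified
                    in place so repaired locations are no longer detected.
    Returns the list of successfully repaired (target) locations.
    """
    repaired = []
    for location, count in fault_counts.items():
        # Only locations at or above the preset threshold are repair candidates.
        if count >= preset_threshold and repair(location):
            repaired.append(location)
            detection_info.pop(location, None)
    return repaired


detection_info = {"bank0": (1.0, 2), "bank1": (2.0, 4), "bank2": (1.0, 1)}
fault_counts = {"bank0": 1, "bank1": 8, "bank2": 6}
# Pretend the repair of bank2 fails, so it stays in the detection information.
repaired = repair_and_prune(fault_counts, 5, lambda loc: loc != "bank2", detection_info)
```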

In a possible implementation manner, the method further includes:

performing repair processing on at least one first memory location according to a fault detection result of a first detection mode to obtain a repair result, wherein the repair result comprises a target memory location which is successfully repaired, and the fault detection result of the first detection mode is a result of performing fault detection processing on the M first memory locations through the first detection mode;

In the method, the at least one first memory location can be repaired according to the fault detection result of the first detection mode, so that the purpose of repairing the related memory area according to the fault detection result of the first detection mode is achieved.

In a possible implementation manner, the fault detection result of the first detection mode includes the number of faults that occurred at the at least one first memory location; performing repair processing on the at least one first memory location according to the fault detection result of the first detection mode to obtain a repair result includes:

determining a first memory location to be detected in the at least one first memory location according to the failure times of the at least one first memory location, wherein the failure times corresponding to the first memory location to be detected are greater than or equal to a preset threshold value;

In a possible implementation manner, the method further includes:

repairing at least one second memory location according to a fault detection result of a second detection mode to obtain a repairing result, wherein the repairing result comprises a target memory location which is successfully repaired, and the fault detection result of the second detection mode is a result of performing fault detection processing on the N second memory locations through the second detection mode;

In the method, the at least one second memory location can be repaired according to the fault detection result of the second detection mode, so that the purpose of repairing the related memory area according to the fault detection result of the second detection mode is achieved.

In a possible implementation manner, the fault detection result of the second detection mode includes the number of faults that occurred at the at least one second memory location; performing repair processing on the at least one second memory location according to the fault detection result of the second detection mode to obtain a repair result includes:

determining a first memory location to be detected in the at least one second memory location according to the failure times of the at least one second memory location, wherein the failure times corresponding to the first memory location to be detected are greater than or equal to a preset threshold value;

In a second aspect, embodiments of the present application provide a computing device comprising a processor and a memory, wherein,

the processor is used for determining a first memory location and a second memory location in a memory, wherein the severity of faults of the first memory location and the second memory location are different;

the processor is further configured to determine a memory detection policy based on the severity of the failure of the first memory location and the second memory location, the memory detection policy indicating that the first memory location is failure-detected based on a first detection mode and indicating that the second memory location is failure-detected based on a second detection mode.

In the above scheme, a first memory location and a second memory location can be determined in the memory, where the fault severity of the first memory location differs from that of the second memory location, and a memory detection policy can be determined based on the fault severity of the first memory location and the second memory location. The memory detection policy indicates that fault detection is performed on the first memory location in the first detection mode and on the second memory location in the second detection mode. In this way, the first memory location, whose fault severity is lower, is fault-detected in the first detection mode, and the second memory location, whose fault severity is higher, is fault-detected in the second detection mode. The first detection mode and the second detection mode are each put to better use, so memory detection in both modes is more practical and detection resources are saved.

In one possible implementation, the processor is specifically configured to,

acquiring resource occupation information of the computing equipment;

In one possible implementation, the resource occupancy information includes a first CPU occupancy of the computing device; the processor is specifically configured to perform,

In one possible implementation, the processor is specifically configured to,

In one possible implementation manner, the number of the first memory locations is M, where M is an integer greater than or equal to 0; the processor is specifically configured to perform,

determining first detection information of the M first memory locations according to the fault severity of the M first memory locations, wherein the first detection information comprises the detection rate and/or the detection times of each first memory location;

In one possible implementation, the processor is specifically configured to,

determining second detection information of the N second memory locations according to the fault severity of the N second memory locations, wherein the second detection information comprises the detection rate and the detection times of each second memory location;

In a third aspect, an embodiment of the present application provides a memory detection apparatus, where the apparatus includes a determining module, where,

the determining module is used for determining a first memory location and a second memory location in a memory, wherein the severity of faults of the first memory location and the second memory location are different;

the determining module is further configured to determine a memory detection policy based on severity of failures of the first memory location and the second memory location, where the memory detection policy indicates that failure detection is performed on the first memory location based on a first detection mode, and indicates that failure detection is performed on the second memory location based on a second detection mode.

In one possible implementation, the determining module is specifically configured to,

obtaining a fault log of a first memory fault, wherein the first memory fault is a fault generated when a computing device performs service operation on the memory, and the fault log comprises at least one memory fault position and fault information of each memory fault position;

acquiring resource occupation information of computing equipment;

determining a first threshold and a second threshold according to the resource occupation information;

In one possible implementation, the apparatus further comprises a detection module, the detection module being configured to,

In one possible implementation manner, the number of the first memory locations is M, where M is an integer greater than or equal to 0; the detection module is specifically configured to,

In one possible implementation manner, the number of the second memory locations is N, where N is an integer greater than or equal to 0; the detection module is specifically configured to,

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored therein a computer program which, when executed by a computer, implements the method of any of the first aspects.

In the above scheme, the first memory location, whose fault severity is lower, can be fault-detected in the first detection mode, and the second memory location, whose fault severity is higher, can be fault-detected in the second detection mode. The first detection mode and the second detection mode are each put to better use, so memory detection in both modes is more practical and detection resources are saved.

In a fifth aspect, an embodiment of the present application provides a computer program product comprising a computer program which, when executed by a computer, implements the method of any of the first aspects.

Drawings

In order to illustrate the embodiments of the application or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below show only some embodiments of the application, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of a computing device according to an embodiment of the present application;

FIG. 2 is a schematic diagram of DDR5 fault handling according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a memory detection flow according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a memory detection method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a method for determining a detection scheme according to an embodiment of the present application;

FIG. 6 is a flowchart of a memory detection method according to an embodiment of the present application;

FIG. 7 is a flowchart illustrating another memory detection method according to an embodiment of the present application;

FIG. 8 is a schematic flow chart of a fault detection process for a first memory location by a first detection method according to an embodiment of the present application;

FIG. 9 is a schematic flow chart of fault detection processing for a second memory location in a second detection mode according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a computing device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

For ease of understanding, a system architecture of a computing device according to an embodiment of the present application will be described first with reference to fig. 1.

Fig. 1 is a schematic architecture diagram of a computing device according to an embodiment of the present application. Referring to fig. 1, the hardware portion of the computing device may include memory, interconnect cabling, a processor, and a baseboard management controller (Baseboard Management Controller, BMC). The memory is connected with the processor through an interconnection cable.

The software portion of the computing device may include a basic input/output system (BIOS) and an operating system (OS). The BIOS and the OS may run on the processor.

Memory, also referred to as internal memory or main memory, is installed in memory slots on a motherboard of a computing device. The memory may be used to store operational data for the processor. For example, the memory may be dynamic random access memory (DRAM), random access memory (RAM), DDR memory, or the like. In the following embodiments, the memory is described taking DDR5 as an example.

The BMC may be a management unit that is not a service module and is independent of the operating system of the computing device. For example, the BMC may remotely maintain and manage the computing device via a dedicated data channel and may communicate with the BIOS and OS via an out-of-band management interface of the computing device.

It should be noted that the BMC has different names in different computing devices. For example, it is called BMC in some computing devices, iLO in others, and iDRAC in still others. In the embodiments of the present application, BMC, iLO, and iDRAC may all be understood as the BMC.

It should be noted that, the system architecture described in the embodiment of the present application is for more clearly describing the technical solution of the embodiment of the present application, and does not constitute a limitation on the technical solution provided in the embodiment of the present application, and those skilled in the art can know that, with the evolution of the system architecture and the appearance of a new service scenario, the technical solution provided in the embodiment of the present application is applicable to similar technical problems.

In the above computing device, if the memory fails, the BIOS may collect the failure information and may report the failure information to the failure diagnosis module. The fault diagnosis module can run on the BMC; alternatively, the fault diagnosis module may run on the OS; alternatively, the fault diagnosis module may run in other system monitoring units.

The fault diagnosis module can perform fault diagnosis according to the fault information and can send diagnosis results to the BIOS so that the BIOS can perform fault repair according to the diagnosis results.

For easy understanding, the following describes a memory failure processing procedure according to an embodiment of the present application with reference to fig. 2.

Fig. 2 is a schematic diagram of DDR5 fault handling according to an embodiment of the present application. Please refer to fig. 2, which includes a processor and DDR5.

In one aspect, DDR5 may include a plurality of memory granules and an On-Die-ECC module, where ECC stands for error correction code. The processor may include an On-Module-ECC module. The BIOS may run on the processor.

When the processor reads data from the memory granules, the On-Die-ECC module and the On-Module-ECC module may check the read data in succession to determine whether a memory failure exists. Memory failures may include single-bit failures, multi-bit failures, and uncorrectable failures.

The On-Die-ECC module may correct the single-bit failures it detects.

The On-Module-ECC module can detect multi-bit failures and uncorrectable failures, and can correct the multi-bit failures among them. The On-Module-ECC module may also store all detected fault information in the fault log and may report an interrupt notification to the processor. After the processor receives the interrupt notification, the BIOS running on the processor can read the fault information from the fault log, so that the computing device can perform memory fault diagnosis according to the fault information.

On the other hand, the ECS engine may run on the DDR5. The ECS engine may scan the memory granules on the DDR5 at a specified period to determine whether a memory failure exists on them. The ECS engine can correct the detected single-bit faults, write the corrected data back to the memory granules, and store fault information in the ECS log. The BIOS may read the fault information from the ECS log so that the computing device can perform memory fault diagnosis based on the fault information.

In addition, a polling check engine may run on the processor. The polling check engine may scan the memory granules at a specified period. The polling check engine may detect multi-bit faults and uncorrectable faults, may correct the detected multi-bit faults, and may write the corrected data back to the memory. The polling check engine may also store the detected fault information in a fault log and may report an interrupt notification to the processor. After the processor receives the interrupt notification, the BIOS running on the processor can read the fault information from the fault log, so that the computing device can perform memory fault diagnosis according to the fault information.

In the related art, after the computing device is started, the ECS engine and the polling check engine are enabled by default to detect. As shown in fig. 3, fig. 3 is a schematic diagram of a memory detection flow according to an embodiment of the present application. Referring to fig. 3, after the computing device is started, the ECS engine is enabled by default, all memories in the computing device are detected in a fixed period, and after one period is detected, the detection of the next period is continuously executed; meanwhile, the polling check engine is enabled by default to detect all memories in the computing device in a fixed period, and detection of the next period is continuously executed after detection of one period is completed. The BIOS may configure the detection period of the ECS engine, as well as the detection period of the poll check engine, among other things. For example, the detection period of the ECS engine and the detection period of the polling check engine each defaults to 24 hours.

It should be noted that the ECS engine and the polling check engine operate independently of each other, and the detection logic and the detection information record of both are not associated. Thus, for the same memory location, the ECS engine and the poll check engine may detect separately.

However, when the computing device has just started, the memory may have no faults, or may have only single-bit faults, which the polling check engine does not detect. In this case, having the polling check engine scan all the memories may waste detection resources.

As services run, the severity of memory failures increases and their distribution range may grow. For example, memory failures may be spread across multiple memory cells or across a storage array. After scanning the memory granules, the ECS engine can record only the one row with the highest fault severity and cannot collect the memory failures in a larger memory area. In this case, having the ECS engine scan all the memories may waste detection resources.

Therefore, the detection method cannot better exert the detection functions of the polling check engine and the ECS engine, so that the memory detection practicability of the polling check engine and the ECS engine is lower, and the waste of detection resources is caused.

In view of this, the embodiment of the application provides a memory detection method that can dynamically formulate a detection scheme for the patrol engines according to the memory faults generated when a service accesses the memory and according to the detection characteristics of the polling check engine and the ECS engine. This makes better use of the detection functions of the polling check engine and the ECS engine, improves the practicality of their memory detection, and saves detection resources. Next, a method for detecting a memory according to an embodiment of the present application will be described with reference to fig. 4.

Fig. 4 is a schematic diagram of a memory detection method according to an embodiment of the present application. Referring to fig. 4, the memory granules of the DDR5 may be accessed when the processor performs a service operation. When the processor accesses the memory granules, the On-Die-ECC module and the On-Module-ECC module can perform memory fault checking in sequence to determine the memory fault information generated when the processor performs the service operation.

After the On-Module-ECC Module verifies the memory fault, the memory fault information can be stored in a fault log, and an interrupt notification can be reported to the processor.

After the processor receives the interrupt notification reported by the On-Module-ECC Module, the BIOS running On the processor can read the memory fault information from the fault log and can send the memory fault information to the fault diagnosis Module. The fault diagnosis module can run on the BMC; alternatively, the fault diagnosis module may run on the OS; alternatively, the fault diagnosis module may run in other system monitoring units. In the following embodiments, description will be made taking an example in which a failure diagnosis module runs on an OS.

The fault diagnosis module can determine the fault grade at the memory fault position according to the memory fault information, and can determine the detection scheme of the inspection engine according to the fault grade at the memory fault position, so that the polling inspection engine and the ECS engine can inspect the memory according to the detection scheme of the inspection engine.

Next, a method for determining a detection scheme will be described with reference to fig. 5.

Fig. 5 is a schematic diagram of a method for determining a detection scheme according to an embodiment of the present application. Referring to fig. 5, when the computing device has just been started, or has not been running for long, the memory may have no failure (no error, NE). As the run time increases, the memory may experience single-bit failures (SBE), multi-bit failures (MBE), uncorrectable failures (UE), or transfer failures (TE). If the computing device does not detect and correct a correctable fault in time, a single-bit fault may evolve into a multi-bit fault, and even into an uncorrectable fault.

As shown in fig. 5, a severity threshold A and a severity threshold B may be set according to the severity of the memory failure. When the severity of the memory fault is less than the severity threshold A, the memory has more single-bit faults and fewer multi-bit faults; when the severity of the memory fault is greater than the severity threshold B, the memory has more multi-bit faults and fewer single-bit faults.

In the embodiment of the application, when the severity of the memory fault is smaller than the severity threshold A, the ECS engine can detect the memory, so that the ECS engine can find and correct the single-bit fault in the memory. The memory may be detected by the poll check engine when the severity of the memory failure is greater than the severity threshold B, such that the poll check engine may discover and correct multi-bit failures in the memory.

By the method, the detection scheme can be dynamically formulated according to the detection characteristics of different inspection engines, the detection functions of the polling inspection engine and the ECS engine are better exerted, the memory detection practicability of the polling inspection engine and the ECS engine is higher, and the detection resources are saved.

The technical scheme of the embodiment of the application is described in detail below by using specific embodiments. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.

Fig. 6 is a flowchart of a memory detection method according to an embodiment of the present application. The method may be executed by a memory detection device or by a computing device integrated with the memory detection device; the following description takes the computing device integrated with the memory detection device as the execution subject. Referring to fig. 6, the method may include:

S601, determining a first memory location and a second memory location in a memory.

The first memory location and the second memory location have different severity of failures.

Optionally, the first memory location may correspond to a memory interval in the memory, and the second memory location may correspond to a memory interval in the memory.

In this embodiment, the severity of the fault in the first memory location is less than or equal to the first threshold, and the severity of the fault in the second memory location is greater than or equal to the second threshold.

Optionally, the second threshold is less than or equal to the first threshold.

The fault severity may represent how serious a memory fault is, or the probability that the memory fault evolves into an uncorrectable memory fault.

The severity of the failure of the first memory location may be a probability that the memory failure of the first memory location evolves into an uncorrectable memory failure.

The severity of the failure of the second memory location may be a probability that the memory failure of the second memory location evolves into an uncorrectable memory failure.

The first threshold may be a fault severity threshold; the second threshold may be a fault severity threshold.

It should be noted that, specific values of the first threshold and the second threshold may be set according to actual needs, which is not limited in the embodiment of the present application.

In this embodiment, at least one memory location to be detected and a fault severity of each memory location to be detected may be determined in the memory; and determining a first memory location and a second memory location in the memory according to the fault severity of each memory location to be detected.

The memory location to be detected may be a memory location in which a memory failure occurs in the memory; the memory fault may be a fault generated when the computing device performs a service operation on the memory.

When a memory failure occurs in the memory, the severity of the failure may be determined according to the temperature at the memory failure location, the number of times the memory failure occurs at the memory failure location, and the like. The method for determining the severity of the fault may refer to the related art, and this embodiment is not described herein. For example, the temperature at the memory failure location, the number of memory failures occurred at the memory failure location, etc. may be calculated by an artificial intelligence (Artificial Intelligence, AI) algorithm to obtain the severity of the failure at the memory failure location.

Illustratively, assume that the first threshold is 50% and the second threshold is 40%. When the computing device performs a service operation on the memory, a memory fault is generated, and the memory areas with the memory fault are a memory location A to be detected and a memory location B to be detected. After calculation by AI, the severity of the failure of the memory location A to be detected is determined to be 30%, and the severity of the failure of the memory location B to be detected is determined to be 60%. Since the severity of the failure of the memory location A to be detected is less than the first threshold and the severity of the failure of the memory location B to be detected is greater than the second threshold, it can be determined that the memory location A to be detected may be a first memory location, and the memory location B to be detected may be a second memory location.

It should be noted that, if the severity of the fault in the memory location to be detected is smaller than the first threshold and larger than the second threshold, the memory location to be detected may be the first memory location and the second memory location at the same time.

Illustratively, assume that the first threshold is 50% and the second threshold is 40%. The severity of the failure of the memory location C to be detected is 45%. The severity of the failure of the memory location C to be detected is less than the first threshold while the severity of the failure of the memory location C to be detected is greater than the second threshold. Thus, it can be determined that the memory location C to be detected may be the first memory location, and at the same time, the memory location C to be detected may be the second memory location.
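The two examples above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `classify`, the role labels, and the use of integer percentages are assumptions made here and are not part of the embodiment.

```python
def classify(severity, first_threshold, second_threshold):
    """Return the roles of a to-be-detected memory location.

    A location is a "first" memory location if its fault severity is
    less than or equal to the first threshold, and a "second" memory
    location if its fault severity is greater than or equal to the
    second threshold. Both can hold at once.
    """
    roles = set()
    if severity <= first_threshold:
        roles.add("first")
    if severity >= second_threshold:
        roles.add("second")
    return roles


# Values from the examples: thresholds and severities in percent.
FIRST_THRESHOLD = 50
SECOND_THRESHOLD = 40
severities = {"A": 30, "B": 60, "C": 45}

roles = {
    name: classify(value, FIRST_THRESHOLD, SECOND_THRESHOLD)
    for name, value in severities.items()
}
```

With these inputs, location A is only a first memory location, location B is only a second memory location, and location C is both, which matches the examples.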

Optionally, in some embodiments, the second threshold may also be greater than the first threshold.

S602, determining a memory detection strategy based on the fault severity degree of the first memory location and the second memory location.

The memory detection policy indicates that the first memory location is to be failure-detected based on the first detection scheme, and indicates that the second memory location is to be failure-detected based on the second detection scheme.

In this embodiment, when the severity of the fault at the first memory location is less than or equal to a first threshold, the memory detection policy corresponding to the first memory location is determined to be a first detection mode; and when the severity of the fault at the second memory location is greater than or equal to a second threshold, the memory detection policy corresponding to the second memory location is determined to be a second detection mode.

The first detection mode may be a mode of detecting the memory by the first detection engine. The first detection engine is used for detecting single-bit faults of the memory. For example, the first detection engine may be an ECS engine.

The second detection mode may be a mode of detecting the memory by the second detection engine. The second detection engine is used for detecting multi-bit faults and uncorrectable faults of the memory. For example, the second detection engine may be a polling check engine.

In this embodiment, the computing device includes a processor and a memory, where a first detection engine is disposed in the memory, and a second detection engine is disposed in the processor.

Illustratively, assume that the first detection engine is an ECS engine and the second detection engine is a polling check engine. The first memory location is the memory location A to be detected, and the second memory location is the memory location B to be detected. The ECS detection process may be performed on the memory location A to be detected by the ECS engine, and the polling check detection process may be performed on the memory location B to be detected by the polling check engine.
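One possible way to represent such a memory detection policy is a mapping from each memory location to the engines that should scan it. The sketch below is an assumption for illustration only: the `Engine` enumeration and the `build_policy` helper are hypothetical names, and a real implementation would configure the engines through BIOS or firmware interfaces that are not described here.

```python
from enum import Enum


class Engine(Enum):
    ECS = "ECS engine (in the memory)"
    POLLING_CHECK = "polling check engine (in the processor)"


def build_policy(first_locations, second_locations):
    """Map each memory location to the engine(s) that should scan it."""
    policy = {}
    for location in first_locations:
        policy.setdefault(location, []).append(Engine.ECS)
    for location in second_locations:
        policy.setdefault(location, []).append(Engine.POLLING_CHECK)
    return policy


# First memory location: A. Second memory location: B.
policy = build_policy(["A"], ["B"])
```

A location that is both a first and a second memory location would simply be listed under both engines.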

It should be noted that, the specific implementation manner of performing the fault detection processing on the first memory location by using the first detection engine may refer to the embodiment of fig. 8, and the specific implementation manner of performing the fault detection processing on the second memory location by using the second detection engine may refer to the embodiment of fig. 9, which is not described herein.

According to the memory detection method provided by the embodiment, the first memory location and the second memory location can be determined in the memory, where the fault severity of the first memory location is different from that of the second memory location; and a memory detection policy can be determined based on the fault severity of the first memory location and the second memory location. The memory detection policy indicates that the first memory location is to be fault-detected based on the first detection mode, and indicates that the second memory location is to be fault-detected based on the second detection mode. In this way, the first memory location, whose fault severity is lower, is fault-detected in the first detection mode, and the second memory location, whose fault severity is higher, is fault-detected in the second detection mode. This makes better use of the first detection mode and the second detection mode, improves the practicality of their memory detection, and saves detection resources.

Based on the embodiment of fig. 6, the embodiment of the application can determine the first memory location and the second memory location according to a fault generated when the computing device performs service operation on the memory. The memory detection method provided by the embodiment of the application is further described in detail below with reference to fig. 7.

Fig. 7 is a flowchart of another memory detection method according to an embodiment of the present application. The method may be executed by a memory detection device or by a computing device integrated with the memory detection device; the following description takes the computing device integrated with the memory detection device as the execution subject. Referring to fig. 7, the method may include:

S701, obtaining a fault log of a first memory fault, wherein the fault log comprises at least one memory fault position and fault information of each memory fault position.

In this embodiment, the first memory failure is a failure generated when the computing device performs a service operation on the memory.

By way of example, a computing device may include a processor and memory. The first memory failure may be a memory failure generated when the processor reads data from the memory.

The failure type of the first memory failure may be a single bit failure, a multiple bit failure, or an uncorrectable failure, etc. This embodiment is not limited thereto.

In this embodiment, the first memory failure may correspond to at least one memory failure location, and the at least one memory failure location and the failure information of each memory failure location may be stored in a failure log.

The fault information may be a temperature at a corresponding memory fault location, a number of times a memory fault occurs at the memory fault location, and so on.

S702, determining a memory location to be detected corresponding to each memory fault location, and obtaining at least one memory location to be detected.

In this embodiment, a storage array in which each memory failure location is located may be determined in a memory, so as to obtain at least one storage array; and determining the at least one storage array as at least one memory location to be detected.

In the actual implementation process, each memory fault location included in the fault log may be located in the same storage array; alternatively, each memory failure location included in the failure log may be located in a different storage array. According to the above situation, determining at least one memory location to be detected includes at least the following 2 cases:

Case 1: each memory failure location in the failure log is located in a different storage array.

In this case, a storage array may be determined according to each memory failure location, where the storage array is the memory location to be detected. It should be appreciated that in this case the number of memory failure locations is the same as the number of memory locations to be detected.

Illustratively, assume that the fault log includes memory fault location 1, memory fault location 2, and memory fault location 3. Memory failure location 1 is located in storage array A, memory failure location 2 is located in storage array C, and memory failure location 3 is located in storage array F. Then 3 memory locations to be detected may be determined from the 3 memory failure locations. The 3 memory locations to be detected are a memory array A, a memory array C and a memory array F respectively.

Case 2: at least 2 memory failure locations in the failure log are located in the same storage array.

In this case, one storage array may be determined based on at least 2 memory failure locations located in the same storage array. It should be appreciated that in this case, the number of memory failure locations is greater than the number of memory locations to be detected.

Illustratively, assume that the fault log includes memory fault location 1, memory fault location 2, and memory fault location 3. Memory failure location 1 is located in storage array A, memory failure location 2 is located in storage array C, and memory failure location 3 is located in storage array A. Then 2 storage arrays may be determined from the 3 memory failure locations, i.e., 2 memory locations to be detected may be determined from the 3 memory failure locations. The 2 memory locations to be detected are a memory array A and a memory array C respectively.
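Both cases reduce to collecting the distinct storage arrays that contain a memory fault location. The following is a minimal sketch under that reading; the dictionary layout (fault location number mapped to storage array name) and the function name `locations_to_detect` are assumptions for illustration.

```python
def locations_to_detect(fault_to_array):
    """Return the distinct storage arrays, in first-seen order.

    fault_to_array maps each memory fault location to the storage
    array that contains it. Each distinct storage array becomes one
    memory location to be detected.
    """
    return list(dict.fromkeys(fault_to_array.values()))


# Case 1: every fault location lies in a different storage array.
case1 = {1: "A", 2: "C", 3: "F"}
# Case 2: fault locations 1 and 3 share storage array A.
case2 = {1: "A", 2: "C", 3: "A"}

case1_result = locations_to_detect(case1)
case2_result = locations_to_detect(case2)
```

In case 1 the three fault locations yield three memory locations to be detected; in case 2 they yield only two.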

S703, determining the fault severity of the memory location to be detected corresponding to the memory fault location according to the fault information of the memory fault location for each memory fault location.

The severity of the fault in the memory location to be detected may be the severity of the fault in the memory fault location corresponding to the memory location to be detected.

Optionally, if the memory location to be detected corresponds to a plurality of memory failure locations, a maximum value of the failure severity of the plurality of memory failure locations may be determined as the failure severity of the memory location to be detected.

In this embodiment, the severity of the failure at the memory failure location may be calculated by the content included in the failure information. The method for calculating the severity of the fault may be referred to in the related art, and will not be described herein.

For example, assuming that the to-be-detected memory location a corresponds to a memory failure location with a failure severity of 40%, it may be determined that the to-be-detected memory location a has a failure severity of 40%. Assuming that the to-be-detected memory location B corresponds to 2 memory failure locations, and the failure severity of the 2 memory failure locations is 40% and 45%, respectively, it may be determined that the failure severity of the to-be-detected memory location B is 45%.
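The optional maximum rule can be sketched as follows. The record format, pairs of a to-be-detected memory location and a fault severity in percent, is an assumption for illustration; how each individual severity is computed is left to the related art, as stated above.

```python
def severity_per_location(fault_records):
    """Take the maximum fault severity seen for each location.

    fault_records is an iterable of (location, severity_percent)
    pairs, one pair per memory fault location.
    """
    result = {}
    for location, severity in fault_records:
        previous = result.get(location, severity)
        result[location] = max(previous, severity)
    return result


# Location A has one fault location (40%); B has two (40% and 45%).
records = [("A", 40), ("B", 40), ("B", 45)]
location_severity = severity_per_location(records)
```

For these records, location A keeps 40% and location B takes the larger value, 45%.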

S704, determining a first threshold and a second threshold.

In this embodiment, resource occupation information of the computing device may be acquired, and the first threshold and the second threshold may be determined according to the resource occupation information.

The resource occupancy information may be a first CPU occupancy rate; that is, the resource occupancy information may be a current CPU occupancy of the computing device.

In this embodiment, an initial CPU occupancy rate, a first initial threshold value, and a second initial threshold value may be obtained; determining the difference value between the first CPU occupancy rate and the initial CPU occupancy rate as a CPU occupancy rate difference value; updating a first initial threshold according to the CPU occupancy rate difference value to obtain a first threshold; and updating the second initial threshold according to the CPU occupancy rate difference value to obtain a second threshold.

The first initial threshold may be a preset first threshold, and the second initial threshold may be a preset second threshold.

The first initial threshold and the second initial threshold may be stored in the computing device. Alternatively, the first initial threshold and the second initial threshold may be stored in a processor or BMC of the computing device.

The magnitudes of the first initial threshold and the second initial threshold may be set according to actual needs, which is not limited in this embodiment. For example, the first initial threshold may be 50% and the second initial threshold may be 40%.

The initial CPU occupancy may be a CPU occupancy of the computing device upon start-up. For example, the initial CPU occupancy may be 50%.

In this embodiment, if the absolute value of the CPU occupancy rate difference is less than 10%, the first threshold may be determined to be the first initial threshold and the second threshold may be determined to be the second initial threshold. If the first CPU occupancy rate is higher than the initial CPU occupancy rate by n × 10%, the first initial threshold may be reduced by n × 5% to obtain a first updated initial threshold, and the first threshold is determined to be the first updated initial threshold; the second initial threshold may be increased by n × 5% to obtain a second updated initial threshold, and the second threshold is determined to be the second updated initial threshold. If the first CPU occupancy rate is lower than the initial CPU occupancy rate by n × 10%, the first initial threshold may be increased by n × 5% to obtain a first updated initial threshold, and the first threshold is determined to be the first updated initial threshold; the second initial threshold may be reduced by n × 5% to obtain a second updated initial threshold, and the second threshold is determined to be the second updated initial threshold. Here n is an integer greater than or equal to 1.

By way of example, assume an initial CPU occupancy of 50%, a first initial threshold of 50%, a second initial threshold of 40%, and a first CPU occupancy of 55%. The CPU occupancy difference is 5% and the first threshold may be determined to be 50% (i.e., the first threshold is equal to the first initial threshold) and the second threshold may be determined to be 40% (i.e., the second threshold is equal to the second initial threshold).

Assume that the initial CPU occupancy is 50%, the first initial threshold is 50%, the second initial threshold is 40%, and the first CPU occupancy is 65%. The CPU occupancy rate difference is an increase of 15%, so n is 1. The first updated initial threshold is 45% and the second updated initial threshold is 45%. From this, it can be determined that the first threshold is 45% and the second threshold is 45%.

Assume that the initial CPU occupancy is 50%, the first initial threshold is 50%, the second initial threshold is 40%, and the first CPU occupancy is 30%. The CPU occupancy rate difference is a decrease of 20%, so n is 2. The first updated initial threshold is 60% and the second updated initial threshold is 30%. From this, it can be determined that the first threshold is 60% and the second threshold is 30%.
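The threshold update in S704 can be sketched as below. The sketch assumes that n counts the whole 10% steps contained in the CPU occupancy rate difference, which is inferred from the worked examples (a 15% increase gives n = 1) rather than stated explicitly. The function name `update_thresholds` and the use of integer percentages are also assumptions, and no clamping of the results is applied because the embodiment does not describe any.

```python
def update_thresholds(current_cpu, initial_cpu, first_initial, second_initial):
    """Return (first_threshold, second_threshold) in percent.

    Assumption: n is the number of whole 10% steps in the difference
    between the current and the initial CPU occupancy.
    """
    diff = current_cpu - initial_cpu
    n = abs(diff) // 10
    if n == 0:
        # |difference| < 10%: keep the initial thresholds.
        return first_initial, second_initial
    step = 5 * n
    if diff > 0:
        # Higher occupancy: lower the first, raise the second threshold.
        return first_initial - step, second_initial + step
    # Lower occupancy: raise the first, lower the second threshold.
    return first_initial + step, second_initial - step


# The three worked examples (initial CPU 50%, thresholds 50% and 40%).
unchanged = update_thresholds(55, 50, 50, 40)
busier = update_thresholds(65, 50, 50, 40)
idler = update_thresholds(30, 50, 50, 40)
```

The three calls reproduce the thresholds of the three examples: 50% and 40%, 45% and 45%, and 60% and 30%.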

S705, determining a first memory location and a second memory location in the memory according to the fault severity of each memory location to be detected, the first threshold and the second threshold.

In this embodiment, the memory location to be detected with the fault severity less than or equal to the first threshold may be determined as the first memory location, and the memory location to be detected with the fault severity greater than or equal to the second threshold may be determined as the second memory location.

Illustratively, assume that the first threshold is 50% and the second threshold is 40%. Three memory locations to be detected are determined according to the fault log of the first memory fault: memory location A to be detected, memory location B to be detected, and memory location C to be detected. The fault severity of memory location A is 30%, the fault severity of memory location B is 45%, and the fault severity of memory location C is 60%. The first memory location may then include memory location A and memory location B (both at or below 50%), and the second memory location may include memory location B and memory location C (both at or above 40%).
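The selection rule of S705 can be sketched as below. The function name `classify_locations` and the dictionary representation are assumptions made for illustration only; severities and thresholds are integer percentages.

```python
def classify_locations(severity_by_location, first_threshold, second_threshold):
    """Split memory locations to be detected into first and second memory locations.

    A location is a first memory location if its fault severity is less than
    or equal to the first threshold, and a second memory location if its
    fault severity is greater than or equal to the second threshold.
    """
    first = [loc for loc, sev in severity_by_location.items()
             if sev <= first_threshold]
    second = [loc for loc, sev in severity_by_location.items()
              if sev >= second_threshold]
    return first, second
```

Note that when the second threshold is lower than the first threshold, a location whose severity lies between the two thresholds satisfies both conditions and is therefore covered by both detection modes.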

S706, performing fault detection on the second memory location by a second detection mode, and performing fault detection on the first memory location by a first detection mode.

It should be noted that, the specific implementation manner of performing the fault detection on the first memory location by using the first detection manner may refer to the embodiment of fig. 8, and the specific implementation manner of performing the fault detection on the second memory location by using the second detection manner may refer to the embodiment of fig. 9, which is not described herein.

S707, repairing at least one memory location to be detected according to the fault information.

In this embodiment, after obtaining the fault log of the first memory fault, the computing device may determine at least one memory location to be detected according to at least one memory fault location in the fault log, and may repair the at least one memory location to be detected according to the fault information in the fault log.

The fault log of the first memory fault includes at least one memory failure location and the fault information of each memory failure location; the fault information includes the number of times each memory failure location has failed.

The computing device may determine the number of failures corresponding to the at least one memory location to be detected according to the number of failures of the at least one memory failure location; determine, according to the number of failures corresponding to the at least one memory location to be detected, a first memory location to be detected among the at least one memory location to be detected, where the number of failures corresponding to the first memory location to be detected is greater than or equal to a preset threshold; and repair the first memory location to be detected to obtain a repair result. The repair result includes the target memory location that is successfully repaired.

For any memory location to be detected, the number of faults corresponding to the memory location to be detected may be the maximum number of faults of one or more memory fault locations included in the memory location to be detected.

Illustratively, assume that a memory location to be detected includes 2 memory failure locations. The number of faults occurring in the 2 memory fault positions is 2000 times and 1500 times respectively, and the number of faults corresponding to the memory position to be detected may be 2000 times.

The first memory location to be detected may be a memory location to be detected that needs to be repaired.

In this embodiment, the memory location to be detected, where the number of failures is greater than or equal to a preset threshold, may be determined as the first memory location to be detected. It should be noted that, the preset threshold may be set according to actual needs, and the size of the preset threshold is not limited in the embodiment of the present application. For example, the preset threshold may be 6000 times.

For example, assuming that the preset threshold is 6000 times, the to-be-detected memory location with the failure number greater than or equal to 6000 may be the first to-be-detected memory location.
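The selection of memory locations that need repair can be sketched as follows. The names `fault_count` and `select_for_repair` are hypothetical, and the per-location lists of fault counts are an assumed representation of the fault log.

```python
def fault_count(position_counts):
    """Fault count of a memory location to be detected.

    Taken as the maximum fault count among the memory failure locations
    that the memory location to be detected contains.
    """
    return max(position_counts)


def select_for_repair(counts_by_location, preset_threshold=6000):
    """Return the locations whose fault count reaches the preset threshold."""
    return [loc for loc, counts in counts_by_location.items()
            if fault_count(counts) >= preset_threshold]
```

For a location containing two memory failure locations with 2000 and 1500 failures, `fault_count` gives 2000, which is below a preset threshold of 6000, so that location is not selected for repair.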

In this embodiment, a redundant memory area may be used to replace the first memory location to be detected, so as to repair the first memory location to be detected.

The target memory location may be a first to-be-detected memory location that is successfully repaired.

Illustratively, assume that the first to-be-detected memory location includes a storage array A, a storage array C, and a storage array E. After the storage array A, the storage array C and the storage array E are repaired, the storage array A is successfully repaired, and the storage array C and the storage array E are failed to be repaired. The target memory location may be storage array a.

In this embodiment, after determining the target memory location, the target memory location may be deleted in the first memory location or the second memory location.
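Removing successfully repaired locations from the detection sets can be sketched as below. The function name `remove_repaired` and the list representation are assumptions for illustration; which set a given location belongs to depends on the classification in S705.

```python
def remove_repaired(first_locations, second_locations, repaired):
    """Drop successfully repaired target memory locations from both sets."""
    repaired = set(repaired)
    first = [loc for loc in first_locations if loc not in repaired]
    second = [loc for loc in second_locations if loc not in repaired]
    return first, second
```

In the storage array example above, only storage array A is repaired successfully, so only A is removed; storage arrays C and E remain subject to detection.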

S707 may be performed before S703; alternatively, S707 may be implemented after S703; alternatively, S707 may be implemented simultaneously with S703. The embodiment of the present application is not limited thereto.

S708, repairing at least one first memory location according to the fault detection result of the first detection mode.

The failure detection result of the first detection method may be a result of performing failure detection processing on the first memory location by the first detection method. The failure detection result of the first detection mode may include the number of failures of at least one first memory location.

For any one first memory location, the number of times of faults of the first memory location may be the number of times of faults detected when the first memory location is detected by the first detection method.

In this embodiment, after the detection processing is performed on the at least one first memory location, the computing device may further perform repair processing on the at least one first memory location according to the failure detection result.

The computing device may determine a first memory location to be detected in the at least one first memory location according to a number of failures of the at least one first memory location; and repairing the first memory location to be detected to obtain a repairing result. The repairing result comprises the target memory location which is successfully repaired.

In this embodiment, the first memory location with the failure number greater than or equal to the preset threshold may be determined as the first memory location to be detected. It should be noted that, the preset threshold may be set according to actual needs, and the size of the preset threshold is not limited in the embodiment of the present application. For example, the preset threshold may be 5000 times.

For example, if the preset threshold is 5000 times, when the first detection engine performs detection processing on each first memory location, the first memory location with the number of detected failures greater than or equal to 5000 may be the first memory location to be detected.

S709, repairing at least one second memory location according to the fault detection result of the second detection mode.

The failure detection result of the second detection method may be a result of performing failure detection processing on the second memory location by the second detection method. The failure detection result of the second detection mode may include the number of failures of at least one second memory location.

For any one second memory location, the number of times of faults of the second memory location may be the number of times of faults detected when the second memory location is detected by a second detection method.

In this embodiment, after the detection processing is performed on the at least one second memory location, the computing device may further perform repair processing on the at least one second memory location according to the failure detection result.

The computing device may determine a first memory location to be detected in the at least one second memory location according to a number of failures of the at least one second memory location; and repairing the first memory location to be detected to obtain a repairing result. The repairing result comprises the target memory location which is successfully repaired.

In this embodiment, the second memory location with the failure number greater than or equal to the preset threshold may be determined as the first memory location to be detected. It should be noted that, the preset threshold may be set according to actual needs, and the size of the preset threshold is not limited in the embodiment of the present application. For example, the preset threshold may be 5000 times.

For example, if the preset threshold is 5000 times, the second memory location with the number of detected failures greater than or equal to 5000 may be the first memory location to be detected when the second detection engine performs detection processing on each second memory location.

Note that S708 may be implemented before S709; alternatively, S708 may be implemented after S709; alternatively, S708 may be implemented simultaneously with S709. The embodiment of the present application is not limited thereto.

The memory detection method provided by the embodiment can obtain a fault log of the first memory fault, wherein the fault log comprises at least one memory fault position and fault information of each memory fault position; determining a memory location to be detected corresponding to each memory fault location to obtain at least one memory location to be detected; for each memory fault position, determining the fault severity of the memory position to be detected corresponding to the memory fault position according to the fault information of the memory fault position; determining a first threshold and a second threshold; determining a first memory location and a second memory location in a memory according to the fault severity of each memory location to be detected, a first threshold value and a second threshold value; and performing fault detection processing on the first memory location by a first detection mode, and performing fault detection processing on the second memory location by a second detection mode. The first detection mode may be a mode of detecting the memory based on the first detection engine, and the second detection mode may be a mode of detecting the memory based on the second detection engine. Through the method, the memory position to be detected with smaller fault severity can be subjected to fault detection processing through the first detection engine, the memory position to be detected with larger fault severity can be subjected to fault detection processing through the second detection engine, the detection functions of the first detection engine and the second detection engine are better played, the memory detection practicability of the first detection engine and the second detection engine is higher, and detection resources are saved.

On the basis of any of the above embodiments, a method of performing fault detection processing on the first memory location by the first detection mode is described in detail below with reference to fig. 8.

Fig. 8 is a schematic flow chart of fault detection processing for a first memory location by a first detection method according to an embodiment of the present application. Referring to fig. 8, the method may include:

S801, determining first detection information of a first memory location according to the fault severity of the first memory location.

In this embodiment, the number of first memory locations is M, where M is an integer greater than or equal to 0.

The first detection information comprises detection rate and detection times of each first memory location.

In this embodiment, the detection rate and the detection times of each first memory location may be determined according to the severity of the fault of the first memory location.

Specifically, for any one first memory location, if the fault severity of the first memory location is less than or equal to a preset severity, the detection rate of the first memory location is determined to be a preset rate, and the detection times of the first memory location are determined to be a preset number of times; if the fault severity of the first memory location is greater than the preset severity, the detection rate of the first memory location is determined according to the preset rate and the fault severity, and the detection times of the first memory location are determined according to the preset number of times and the fault severity.

The preset severity may be any fault severity value. For example, the preset severity may be 30%.

The preset rate may be a predetermined one of the detection rates. For example, the preset rate may be 1TB/24h = 0.042TB/h; wherein 1TB may be the size of memory in the computing device.

The preset number of times may be a predetermined number of times of detection. For example, the preset number of times may be 1 time.

It should be noted that, the embodiment of the present application does not specifically limit the preset severity, the preset rate and the preset times. In the implementation process, the preset severity, the preset rate and the preset times can be set according to actual needs.

According to the difference of the magnitude relation between the fault severity degree of the first memory location and the preset severity degree, the determination of the first detection information at least comprises the following two conditions:

Case 1: the fault severity of the first memory location is less than or equal to the preset severity.

In this case, the detection rate of the first memory location may be equal to a preset rate, and the detection times may be equal to a preset times.

For example, assume a preset severity of 30%, a preset rate of 0.042TB/h, a preset number of times of 1, and a fault severity of 25% for the first memory location. The detection rate and the detection times corresponding to the first memory location are then 0.042TB/h and 1 time, respectively.

Case 2: the fault severity of the first memory location is greater than the preset severity.

In this case, the detection rate may be calculated according to the fault severity of the first memory location, the preset severity, and the preset rate; the detection times may be calculated according to the fault severity of the first memory location, the preset severity, and the preset number of times.

In a possible implementation, an integer n is first determined from the fault severity of the first memory location and the preset severity (the expression for n is not reproduced in this text). Then detection rate = preset rate × (1 + n), and detection times = preset times + n.

By way of example, assume a preset severity of 30%, a preset rate of 0.042TB/h, a preset number of times of 1, and a fault severity of 42% for the first memory location. Then n = 1, detection rate = 0.042TB/h × (1 + 1) = 0.084TB/h, and detection times = 1 + 1 = 2.

In another possible implementation, n is determined by a different expression (likewise not reproduced in this text). Then detection rate = preset rate × (1 + n), and detection times = preset times + n.

By way of example, assume a preset severity of 30%, a preset rate of 0.042TB/h, a preset number of times of 1, and a fault severity of 42% for the first memory location. Then n = 2, detection rate = 0.042TB/h × (1 + 2) = 0.126TB/h, and detection times = 1 + 2 = 3.
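The two cases above can be sketched as follows. Because the expression that produces n is not available in this text, the sketch takes n as an input rather than guessing how it is derived. The function name `first_detection_info` and the default values are assumptions taken from the examples; `Decimal` is used only to keep the rate arithmetic exact.

```python
from decimal import Decimal


def first_detection_info(severity, n, preset_severity=30,
                         preset_rate=Decimal("0.042"), preset_times=1):
    """Return (detection_rate_TB_per_h, detection_times) for one location.

    severity and preset_severity are integer percentages. n is supplied by
    the caller (assumption: the source formula for n is not reproduced).
    """
    if severity <= preset_severity:
        # Case 1: use the preset rate and the preset number of times.
        return preset_rate, preset_times
    # Case 2: scale the rate and add to the count according to n.
    return preset_rate * (1 + n), preset_times + n
```

For a fault severity of 42%, n = 1 gives 0.084TB/h and 2 detections, and n = 2 gives 0.126TB/h and 3 detections, matching the two examples.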

Optionally, if there are consecutive first memory locations in the first detection information, the consecutive at least two first memory locations may be combined.

Specifically, if there are 2 or more consecutive first memory locations in the first detection information, the 2 or more consecutive first memory locations may be merged. The detection rate corresponding to the first memory location after merging may be the maximum value of the detection rates corresponding to the 2 or more continuous first memory locations. The number of times of detection corresponding to the first memory location after merging may be a maximum value of the number of times of detection corresponding to the 2 or more consecutive first memory locations.

Illustratively, assume that the first detection information is as shown in table 11:

Table 11

First memory location	Detection rate	Detection times
Storage array A	0.126TB/h	3 times
Storage array B	0.084TB/h	2 times
Storage array D	0.042TB/h	1 time

As shown in Table 11, the first detection information includes 3 first memory locations: storage array A, storage array B, and storage array D. Storage array A has a detection rate of 0.126TB/h and a detection count of 3 times; storage array B has a detection rate of 0.084TB/h and a detection count of 2 times; storage array D has a detection rate of 0.042TB/h and a detection count of 1 time. If storage array A and storage array B are two consecutive storage arrays, they may be merged to obtain storage array AB. The detection rate corresponding to storage array AB is 0.126TB/h, and the corresponding detection count is 3 times. The merged first detection information may be as shown in Table 12:

Table 12

First memory location	Detection rate	Detection times
Storage array AB	0.126TB/h	3 times
Storage array D	0.042TB/h	1 time
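The merging of consecutive first memory locations can be sketched as follows. The `Entry` class, the `start` and `end` address fields, and the adjacency test (`end + 1 == start`) are assumptions introduced for illustration; the source only states that consecutive locations may be merged and that the merged entry takes the maximum rate and the maximum detection count.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Entry:
    name: str
    start: int      # hypothetical inclusive start address
    end: int        # hypothetical inclusive end address
    rate: Decimal   # detection rate in TB/h
    times: int      # detection times


def merge_consecutive(entries):
    """Merge address-adjacent entries, keeping the max rate and max times."""
    merged = []
    for e in sorted(entries, key=lambda x: x.start):
        if merged and merged[-1].end + 1 == e.start:
            last = merged[-1]
            merged[-1] = Entry(last.name + e.name, last.start, e.end,
                               max(last.rate, e.rate),
                               max(last.times, e.times))
        else:
            merged.append(e)
    return merged
```

If storage arrays A and B are given adjacent address ranges and storage array D a separate range, the result contains a merged entry AB at 0.126TB/h with 3 detections, followed by D unchanged.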

S802, generating a first detection task according to the first detection information.

In this embodiment, the first detection task includes M first memory locations and a detection rate of each first memory location.

Illustratively, assume that the first detection information is as shown in Table 21:

Table 21

The first detection task may be as shown in Table 22:

Table 22

First memory location	Detection rate
Storage array A	0.126TB/h
Storage array B	0.084TB/h
Storage array D	0.042TB/h

That is, the first detection task includes: each first memory location in the first detection information, and a detection rate of each first memory location.

S803, executing a first detection task to detect M first memory locations.

In this embodiment, the detection processing may be performed on the M first memory locations by the first detection engine.

In this embodiment, the computing device may set, according to the first detection task, a start position, an end position, and a detection rate of the M first memory locations in the configuration file, and may enable the first detection engine to operate, so as to achieve the purpose of performing fault detection processing on the M first memory locations.

Alternatively, the computing device may set the start and end positions of the M first memory locations in a Mode Register (MR). Wherein the mode register may be located in the memory. It should be appreciated that a configuration file may be stored in the mode register for setting the start and end positions of the first memory location.

Optionally, the computing device may also configure the start positions and end positions of the M first memory locations through the partial array self-refresh (Partial Array Self Refresh, PASR) function of the memory, to generate a first memory location list.

Alternatively, the computing device may configure the detection rates of the M first memory locations in registers in the processor. The register can have the online modification capability during service operation so as to realize the regulation and control of the detection rate. It should be appreciated that the register may store therein a configuration file for setting the first memory location detection rate.

S804, the detection times of all the first memory locations in the first detection information are reduced by 1, and the first memory locations with the detection times of 0 are deleted, so as to obtain first updated detection information.

Specifically, the number of detection times of all the first memory locations in the first detection information is reduced by 1, and then the first memory locations with the number of detection times of 0 after the reduction of 1 are deleted, so as to obtain the first updated detection information.

In this embodiment, after the computing device performs the first detection task, the detection information may be updated to obtain the first updated detection information.

Illustratively, assume that the first detection information is as shown in Table 31:

Table 31

The first updated detection information may be as shown in Table 32:

Table 32

First memory location	Detection rate	Detection times
Storage array A	0.126TB/h	2 times
Storage array B	0.084TB/h	1 time

S805, generating an (i+1)-th detection task according to the i-th updated detection information.

The (i+1)-th detection task includes each first memory location in the i-th updated detection information and the detection rate of each first memory location.

Here, i takes the values 1, 2, ..., K-1 in sequence, where K is the maximum of the detection times corresponding to the M first memory locations, and K is an integer greater than 1.

It should be noted that, the specific implementation of S805 may refer to S802, which is not described herein.

S806, executing the (i+1)-th detection task, subtracting 1 from the detection times of each first memory location in the i-th updated detection information, and deleting the first memory locations whose detection times are 0, to obtain the (i+1)-th updated detection information.

It should be noted that, the specific implementation of S806 may refer to S803, which is not described herein.
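The iteration over S802 to S806 can be sketched as a loop that keeps generating and executing detection tasks until no location has detections remaining. This is an illustrative sketch only: `run_detection_rounds` and the `execute_task` callback are hypothetical names, and the callback stands in for whatever mechanism actually programs and starts the detection engine.

```python
def run_detection_rounds(detection_info, execute_task):
    """Run detection tasks until every location has used up its count.

    detection_info maps a location name to (rate, remaining_times).
    execute_task is a caller-supplied callable that receives a task,
    i.e. a mapping from location name to detection rate.
    Returns the number of tasks executed.
    """
    info = dict(detection_info)
    rounds = 0
    while info:
        # S802 / S805: a task holds each remaining location and its rate.
        task = {loc: rate for loc, (rate, _) in info.items()}
        # S803 / S806: execute the task.
        execute_task(task)
        # S804 / S806: subtract 1 and delete locations that reach 0.
        info = {loc: (rate, times - 1)
                for loc, (rate, times) in info.items() if times - 1 > 0}
        rounds += 1
    return rounds
```

Starting from storage arrays A, B, and D with 3, 2, and 1 detections, the loop executes three tasks covering {A, B, D}, then {A, B}, then {A}, so the number of tasks equals K, the largest detection count.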

According to the method for performing fault detection processing on the M first memory locations by the first detection mode provided in this embodiment, first detection information of the M first memory locations can be determined according to the fault severity of the M first memory locations; a first detection task is generated according to the first detection information; the first detection task is executed to detect the M first memory locations; 1 is subtracted from the detection times of each first memory location in the first detection information, and the first memory locations whose detection times are 0 are deleted, to obtain first updated detection information; an (i+1)-th detection task is generated according to the i-th updated detection information; and the (i+1)-th detection task is executed, 1 is subtracted from the detection times of each first memory location in the i-th updated detection information, and the first memory locations whose detection times are 0 are deleted, to obtain the (i+1)-th updated detection information. In this way, the detection rate and the detection times of each first memory location are determined by its fault severity: a first memory location with higher fault severity is given a higher detection rate and more detections, and a first memory location with lower fault severity is given a lower detection rate and fewer detections. The detection function of the first detection engine is therefore put to better use, and the detection practicability of the first detection engine is higher.

In addition to any of the above embodiments, a method of performing fault detection processing on the second memory location by the second detection method will be described in detail below with reference to fig. 9.

Fig. 9 is a schematic flow chart of fault detection processing for a second memory location in a second detection manner according to an embodiment of the present application. Referring to fig. 9, the method may include:

S901, determining second detection information of a second memory location according to the fault severity of the second memory location.

In this embodiment, the number of the second memory locations is N, where N is an integer greater than or equal to 0.

The second detection information comprises the detection rate and the detection times of each second memory location.

S902, generating a first detection task according to the second detection information.

The first detection task comprises N second memory locations and a detection rate of each second memory location.

It should be noted that, the implementation of S901-S902 is similar to the implementation of S801-S802, and specific reference may be made to S801-S802, which are not repeated here.

S903, executing a first detection task to detect N second memory locations.

In this embodiment, the detection processing may be performed on the N second memory locations by the second detection engine.

In this embodiment, the computing device may set, according to the first detection task, a start position, an end position, and a detection rate of the N second memory locations in the configuration file, and may enable the second detection engine to operate, so as to achieve the purpose of performing fault detection processing on the N second memory locations.

Optionally, the computing device may set the start positions and the end positions of the N second memory locations in a detection address register. The detection address register may be located in the memory. For example, the detection address registers may be a scrubaddresslo register and a scrubaddresshi register. It should be appreciated that the detection address register may store a configuration file for setting the start position and the end position of the second memory location.

Alternatively, the computing device may configure the detection rate of the N second memory locations via the power control unit (power control unit, PCU).

S904, subtracting 1 from the detection times of all the second memory locations in the second detection information, and deleting the second memory locations whose detection times are 0, to obtain first updated detection information.

S905, generating an (i+1)-th detection task according to the i-th updated detection information.

The (i+1)-th detection task includes each second memory location in the i-th updated detection information and the detection rate of each second memory location.

Here, i takes the values 1, 2, ..., K-1 in sequence, where K is the maximum of the detection times corresponding to the N second memory locations, and K is an integer greater than 1.

S906, executing the (i+1)-th detection task, subtracting 1 from the detection times of each second memory location in the i-th updated detection information, and deleting the second memory locations whose detection times are 0, to obtain the (i+1)-th updated detection information.

It should be noted that, the implementation of S904-S906 is similar to the implementation of S804-S806, and specific reference may be made to S804-S806, which are not repeated here.

According to the method for performing fault detection processing on the N second memory locations by the second detection mode provided in this embodiment, second detection information of the N second memory locations can be determined according to the fault severity of the N second memory locations; a first detection task is generated according to the second detection information; the first detection task is executed to detect the N second memory locations; 1 is subtracted from the detection times of each second memory location in the second detection information, and the second memory locations whose detection times are 0 are deleted, to obtain first updated detection information; an (i+1)-th detection task is generated according to the i-th updated detection information; and the (i+1)-th detection task is executed, 1 is subtracted from the detection times of each second memory location in the i-th updated detection information, and the second memory locations whose detection times are 0 are deleted, to obtain the (i+1)-th updated detection information. In this way, the detection rate and the detection times of each second memory location are determined by its fault severity: a second memory location with higher fault severity is given a higher detection rate and more detections, and a second memory location with lower fault severity is given a lower detection rate and fewer detections. The detection function of the second detection engine is therefore put to better use, and the detection practicability of the second detection engine is higher.

Fig. 10 is a schematic structural diagram of a computing device according to an embodiment of the present application. Referring to fig. 10, the computing device 10 includes a processor 11 and a memory 12; wherein,

the processor 11 is configured to determine a first memory location and a second memory location in the memory, where the fault severity of the first memory location is different from the fault severity of the second memory location;

the processor 11 is further configured to determine a memory detection policy based on the severity of the failure of the first memory location and the second memory location, where the memory detection policy indicates that the first memory location is failed to detect based on a first detection mode and indicates that the second memory location is failed to detect based on a second detection mode.

The computing device provided in this embodiment may be used to execute the technical solution shown in any of the foregoing method embodiments, and its implementation principle and technical effects are similar, and are not described herein.

In one possible embodiment, the processor 11 is specifically configured to,

acquiring resource occupation information of the computing equipment;

In one possible implementation, the resource occupancy information includes a first CPU occupancy of the computing device; the processor 11 is specifically configured to,

In one possible embodiment, the processor 11 is specifically configured to,

In one possible implementation manner, the number of the first memory locations is M, where M is an integer greater than or equal to 0; the processor 11 is specifically configured to,

In one possible embodiment, the processor 11 is specifically configured to,

The embodiment of the present application further provides a computer readable storage medium, where the computer readable storage medium stores a computer program, where when the computer program is executed by a computer, the memory detection method executed by any one of the above method embodiments is implemented, and the implementation principle and technical effects are similar, and are not repeated herein.

The embodiment of the present application further provides a computer program product, which includes a computer program, where the computer program when executed by a computer implements a memory detection method implemented by any of the above method embodiments, and the implementation principle and technical effects are similar, and are not described herein.

All or part of the steps for implementing the method embodiments described above may be performed by hardware associated with program instructions. The foregoing program may be stored in a readable memory. The program, when executed, performs steps including the method embodiments described above; and the aforementioned memory (storage medium) includes: read-only memory (ROM), RAM, flash memory, hard disk, solid state disk, magnetic tape, floppy disk, optical disk, and any combination thereof.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, embedded processor, or other programmable terminal device to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable terminal device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable terminal device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer implemented process such that the instructions which execute on the computer or other programmable device provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims and the equivalents thereof, the present application is also intended to include such modifications and variations.

In the present disclosure, the term "include" and variations thereof may refer to non-limiting inclusion; the term "or" and variations thereof may refer to "and/or". The terms "first," "second," and the like are used herein to distinguish between similar objects and do not necessarily describe a particular sequence or chronological order. In the present application, "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may indicate that A exists alone, that A and B both exist, or that B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.

Claims

1. A memory detection method, characterized by comprising the following steps:

2. The method of claim 1, wherein determining a memory detection policy based on a severity of a failure of the first memory location and the second memory location comprises:

3. The method of any of claims 1-2, wherein determining a first memory location and a second memory location in a memory comprises:

4. The method of claim 3, wherein determining at least one memory location to be inspected in the memory and a severity of failure for each memory location to be inspected comprises:

5. The method of claim 3 or 4, wherein determining the first memory location and the second memory location in the memory based on the severity of the failure for each memory location to be inspected comprises:

acquiring resource occupancy information of the computing device;

6. The method of claim 5, wherein the resource occupancy information comprises a first CPU occupancy of the computing device; determining the first threshold and the second threshold according to the resource occupancy information comprises:

7. The method of any of claims 1-6, wherein after determining a memory detection policy based on the severity of the failure of the first memory location and the second memory location, further comprising:

8. The method of claim 7, wherein the number of first memory locations is M, M being an integer greater than or equal to 0; performing fault detection on the first memory location in the first detection manner, including:

determining first detection information of M first memory locations according to the fault severity of the M first memory locations, wherein the first detection information comprises the detection rate and/or the detection times of each first memory location;

9. The method of any of claims 7-8, wherein the number of second memory locations is N, N being an integer greater than or equal to 0; performing fault detection on the second memory location in the second detection manner, including:

determining second detection information of the N second memory locations according to the fault severity of the N second memory locations, wherein the second detection information comprises the detection rate and the detection times of each second memory location;

10. A computing device, the computing device comprising a processor and a memory; wherein,