CN114780270A - Memory fault processing method and device, electronic equipment and computer readable storage medium - Google Patents

Info

Publication number
CN114780270A
Authority
CN
China
Prior art keywords
memory
fault
computing
computing process
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210273519.1A
Other languages
Chinese (zh)
Inventor
马旭华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210273519.1A priority Critical patent/CN114780270A/en
Publication of CN114780270A publication Critical patent/CN114780270A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units, using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU], to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resources being hardware resources other than CPUs, Servers and Terminals, the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Hardware Redundancy (AREA)

Abstract

The application discloses a memory fault processing method and device, electronic equipment, and a computer-readable storage medium. The method comprises the following steps: determining the computing process corresponding to a fault memory according to the page identifier of the page corresponding to the fault memory; calculating damaged parameters of the computing process according to the fault data of the fault memory; determining a processing mode for the fault memory according to the damaged parameters; and processing the fault memory using that processing mode. The method and device thereby handle a fault according to the impact of the memory fault, which greatly reduces the effect of memory faults on users of the cloud computing service and addresses the prior-art problem of user losses caused when a key process becomes unusable because the fault memory was isolated.

Description

Memory fault processing method and device, electronic equipment and computer readable storage medium
Technical Field
The application relates to the technical field of cloud computing, in particular to a cloud computing platform.
Background
As cloud computing technology evolves, more and more users perform a variety of complex computing tasks using instances created based on cloud computing resources. Particularly, the cloud computing resources can be integrated into a cloud computing resource pool by using massive physical server resources based on network connection of the internet to provide flexible and efficient cloud computing services for users according to various requirements of the users.
As the performance of physical computing resources improves and the computing tasks that users submit to cloud computing service providers grow more complex, more and more physical resources are allocated to each user; this is also known as large-scale cloud computing service. In a large-scale cloud computing service, memory is one of the key resources affecting the execution of a user's cloud computing tasks. As the memory capacity allocated to a user increases, so does the probability of a memory fault. When memory fails, the computing service using that memory becomes unavailable, and when that computing service plays an important role in the user's overall computing task, the fault can disrupt execution of the whole task or even cause it to fail. Therefore, a solution capable of handling memory faults is needed.
Disclosure of Invention
Embodiments of the present application provide a memory fault processing method and apparatus, an electronic device, and a computer-readable storage medium, so as to solve a defect that a computing task of a user is affected when a memory fault is processed in the prior art.
To achieve the above object, an embodiment of the present application provides a memory fault handling method, where the memory is allocated to at least one computing process, and the method includes:
determining a computing process corresponding to a fault memory according to a page identifier of a page corresponding to the fault memory with the fault;
calculating damaged parameters of the computing process according to fault data of the fault memory, wherein the damaged parameters identify the risk of damage to the computing process;
determining a processing mode aiming at the fault memory according to the damaged parameters;
and executing processing on the fault memory by using the processing mode.
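The four steps above can be sketched as a small pipeline. This is only an illustrative sketch, not the claimed implementation: the `FaultRecord` fields, the `PAGE_OWNERS` table, the scoring formula, the 0.5 threshold, and the mode names are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class FaultRecord:
    page_id: int      # identifier of the page backed by the fault memory
    error_count: int  # hypothetical fault datum: errors observed on that page


# Hypothetical page table: page identifier -> computing process using the page.
PAGE_OWNERS: Dict[int, str] = {2: "process-1", 3: "vm1-user-process", 6: "process-4"}


def find_process(fault: FaultRecord) -> str:
    """Step 1: locate the computing process from the page identifier."""
    return PAGE_OWNERS[fault.page_id]


def damaged_parameter(fault: FaultRecord) -> float:
    """Step 2: derive a risk score in [0, 1] from the fault data (toy formula)."""
    return min(1.0, fault.error_count / 10)


def choose_mode(score: float) -> str:
    """Step 3: map the score to a processing mode (threshold is an assumption)."""
    return "migrate_then_offline" if score >= 0.5 else "page_offline"


def handle_fault(fault: FaultRecord) -> Tuple[str, str]:
    """Step 4: a real system would execute the mode; here it is only reported."""
    process = find_process(fault)
    mode = choose_mode(damaged_parameter(fault))
    return process, mode
```

For instance, `handle_fault(FaultRecord(page_id=3, error_count=8))` selects the migration-based mode for the process that owns page 3, while a page with few observed errors is simply taken offline.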
An embodiment of the present application further provides a memory fault processing apparatus, including:
the first determining module is used for determining a computing process corresponding to a fault memory according to a page identifier of a page corresponding to the fault memory with the fault;
the first calculation module is used for calculating damaged parameters of the calculation process according to the fault data of the fault memory, wherein the damaged parameters identify the risk of damage of the calculation process;
the second determining module is used for determining a processing mode aiming at the fault memory according to the damaged parameters of the computing process;
and the processing module is used for executing processing on the fault memory by using the processing mode.
An embodiment of the present application further provides an electronic device, including:
a memory for storing a program;
and a processor for running the program stored in the memory, wherein the memory fault processing method provided by the embodiments of the present application is executed when the program runs.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program executable by a processor is stored, where the program, when executed by the processor, implements the memory failure processing method provided in the embodiment of the present application.
According to the memory fault processing method and device, the electronic device, and the computer-readable storage medium provided by the embodiments of the present application, the computing process corresponding to a fault memory is determined according to the page identifier of the page corresponding to the fault memory, the damaged parameters of the computing process are calculated according to the fault data of the fault memory, and a processing mode for the fault memory is determined according to those damaged parameters. The computing process using a page can thus be located from the page corresponding to the fault memory, and the risk of damage to that process can be evaluated from its damaged parameters, so that the fault memory is handled in the processing mode that matches the damaged parameters of the computing process. When a memory fault occurs, a processing mode that is more reasonable for the computing process can therefore be adopted. This realizes fault processing based on the impact of the memory fault, greatly reduces the effect of memory faults on users of the cloud computing service, and addresses the prior-art problem of user losses caused when a key process becomes unusable because the fault memory was isolated.
The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the description, and so that the above and other objects, features, and advantages of the present application are more readily apparent, specific embodiments of the present application are described below.
Drawings
Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1a is a schematic view of an application scenario of a memory fault handling scheme according to an embodiment of the present application;
fig. 1b is a schematic diagram of a system architecture of a memory fault handling method according to an embodiment of the present application;
fig. 2 is a flowchart of an embodiment of a memory fault handling method provided in the present application;
fig. 3 is a flowchart of another embodiment of a memory fault handling method provided in the present application;
fig. 4 is a schematic structural diagram of an embodiment of a memory fault handling apparatus provided in the present application;
fig. 5 is a schematic structural diagram of an embodiment of an electronic device provided in the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Example one
The scheme provided by the embodiments of the present application can be applied to any system with cloud resource management capability, such as a server system comprising a chip with a cloud resource management function and related components. Fig. 1a is a schematic view of an application scenario of a memory fault handling scheme provided in an embodiment of the present application, and the scenario shown in fig. 1a is only one example to which the technical scheme of the present application is applicable.
Today, with the development of the internet, a large amount of physical computing resources can be connected together through the internet, and virtualization technology is used to provide computing services to multiple users based on those physical computing resources. In particular, a cloud computing service provider may allocate physical computing resources to a user and construct a virtual machine instance according to the user's needs, so that the user can perform computing tasks using the resources of the instance built on the cloud as if using an offline server.
As the performance of physical computing resources improves and the computing tasks that users submit to cloud computing service providers grow more complex, more and more physical resources are allocated to each user; this is also known as large-scale cloud computing service. In a large-scale cloud computing service, memory is one of the key resources affecting the execution of a user's cloud computing tasks. As the memory capacity allocated to a user increases, so does the probability of a memory fault. When memory fails, the computing service using that memory becomes unavailable, and when that computing service plays an important role in the user's overall computing task, the fault can disrupt execution of the whole task or even cause it to fail.
In the prior art, a fault memory is usually isolated from the system by taking the faulty memory page offline, so that the fault is contained by no longer allowing the system to access the page. However, in the large-scale cloud computing described above, the computing tasks executed for the user are complex and require a large memory capacity. Because taking a page offline directly reduces the total memory capacity available to the user's current computing task, a growing number of fault memories affects the normal execution of the user's current task and may even reduce the computing performance of the instance used by the user. Moreover, the memory resources used in a cloud computing service actually fall into two types: one is the memory allocated to a user at the user's request, that is, the memory allocated to an instance created by the user; the other is the memory used by the resource allocation service of the cloud computing server to serve each user's computing tasks. In other words, among the memory resources of the cloud computing server, one part is used by users, and another part is used by a dedicated management module in the server to provide various services for user instances, such as a network forwarding service or a service that allocates computing resources according to user requests.
Therefore, if a fault memory is simply taken offline when it fails, considering only the memory itself, the offlined memory may well be memory that the server's management module is using. Taking it offline may then affect the services that the management module provides to users. In particular, when a large amount of the memory allocated to the management module is faulty, the management module may be left with no available memory, which not only affects the stability of its management services but may also disrupt the operation of user instances because the management module cannot be used.
Therefore, the prior-art approach of simply isolating a fault memory, considering only the fault memory itself, cannot handle a memory fault in light of how the memory is being used. Although it reduces the risk of problems caused by memory faults to a certain extent, it does not really resolve the impact of a memory fault on the user, and simply isolating all faulty memory pages may itself affect the user, or even lead to the serious consequence of a damaged task.
For example, fig. 1a is a schematic diagram of an application scenario of a memory fault management scheme according to an embodiment of the present application. In the scenario shown in fig. 1a, a computing service system supporting elastic cloud computing may include one or more virtual machine instances created for a user, a management module for managing the virtual machine instances, and one or more system modules providing system services for the user, such as a network interface module. As described above, the elastic cloud computing system may include a physical resource pool composed of a plurality of physical servers, so that the management module can create a virtual machine instance according to the user's computing task requirements and allocate corresponding physical computing resources from the pool to perform the user's computing task. The user can thus perform computing tasks in the virtual machine instance using the physical computing resources allocated to it. In general, only processor resources, memory resources, and data storage space are allocated to the user's virtual machine instance, and other system resources the user needs are provided by common components of the computing service system. In this case, the user's virtual machine instance sends the data to be processed by these common components to the management module or to the corresponding component, and the management module controls the system components that process it. For example, when the user's virtual machine instance needs to send processed data to the user via the internet, the management module may control the network interface module to provide network forwarding for the user.
In a cloud computing system, both the virtual machine instances created for the user and the management modules of the system need to use memory to perform their respective processing tasks. Thus, in a cloud computing system, physical memory may be divided into two portions, one portion being allocated to a user's virtual machine instance for use, and the other portion being used by the management module to handle various processes in the system. In other words, in the memory of the cloud computing system, a part of the memory is used by each computing process in the virtual machine instance of the user, and another part of the memory is used by the system computing process of the management module.
For example, in the scenario shown in fig. 1a, three virtual machine instances VM1-VM3 may have been created for a user. The management module manages each of the three instances, for example by controlling system processes such as restart and migration of the virtual machine instances, and may also control system components, such as network interfaces, to perform network forwarding on data from the user's virtual machine instances. Thus, in the scenario shown in fig. 1a, the management module uses process 1 to schedule resources for the VM1 instance and uses process 4 to forward data for the VM2 instance. To execute these processes, pages in the memory resources of the cloud computing system are allocated to hold their data. For example, process 1 currently uses page 2 to perform resource scheduling, and process 4 currently uses page 6 to perform network data forwarding. In addition, each user's virtual machine instance also needs memory resources to execute its own processes. These differ from the system processes that the management module executes in the computing system; they are user-directed computing processes executed within the virtual machine instance. In the scenario shown in fig. 1a, for example, page 3 may be allocated to virtual machine instance VM1 to perform an internal computing process.
In actual execution, both the system processes used by the management module and the internal computing processes that a virtual machine instance creates to execute the user's instructions require actual physical memory resources. In other words, each page in the memory resources in the scenario shown in fig. 1a corresponds to an actual physical memory address. As described above, the memory space available to the cloud computing system can be built from multiple memories in multiple servers, which provides better computing performance for users but also brings a higher probability of failure. When the memory bank holding a memory page used by a computing process fails, the computing process may execute incorrectly or even fail. In the prior art, as described above, when physical memory fails, the computing system usually takes the pages related to that memory offline (page offline). For example, in the scenario shown in fig. 1a, suppose pages 2 and 3 are both located in the same physical memory and that physical memory fails. If the fault memory is simply taken offline so that the computing system can no longer use it, pages 2 and 3 can no longer be used by their processes. Page 3 is used by a user-instructed process in virtual machine instance VM1; if page 3 is taken offline directly, the user can issue instructions again to restart the process, so the impact on the user is relatively small. Page 2, however, is used by process 1, with which the management module allocates resources for VM1. If page 2 is taken offline directly, the management module can no longer allocate computing resources for the virtual machine instance, and the user's normal use of VM1 is likely to be compromised.
In addition, suppose the page corresponding to the fault memory is page 6, which process 4 of the management module is currently using to provide data forwarding to the network interface for all virtual machine instances. In the prior art, if the fault memory is taken offline directly so that the system can no longer use it, process 4 of the management module cannot use page 6 to perform forwarding, which affects all virtual machine instances in the system.
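The reach of a single physical memory fault in this scenario can be illustrated with a short sketch. The table contents follow the example pages and processes in the text; the bank names, the placement of page 6 in a second bank, and the function name are assumptions made for illustration.

```python
# Hypothetical layout based on the example scenario: pages 2 and 3 share one
# physical memory; the bank holding page 6 is an assumption.
PAGE_TO_BANK = {2: "bank-A", 3: "bank-A", 6: "bank-B"}
PAGE_TO_PROCESS = {2: "process 1", 3: "VM1 internal process", 6: "process 4"}
# Virtual machine instances that depend on each process.
PROCESS_TO_VMS = {
    "process 1": {"VM1"},
    "VM1 internal process": {"VM1"},
    "process 4": {"VM1", "VM2", "VM3"},
}


def affected_by_bank_failure(bank):
    """Return the pages, processes, and VM instances hit if `bank` goes offline."""
    pages = sorted(p for p, b in PAGE_TO_BANK.items() if b == bank)
    processes = [PAGE_TO_PROCESS[p] for p in pages]
    vms = set().union(*(PROCESS_TO_VMS[proc] for proc in processes))
    return pages, processes, vms
```

Under these assumptions, offlining the memory behind page 6 touches only one page yet reaches every virtual machine instance, which is the situation a purely memory-centric policy cannot see.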
Thus, the memory failure handling scheme of the prior art is actually handled only from the viewpoint of the failed memory. That is, in the prior art, only the physical memory with the fault is considered when processing the fault memory, and the current use condition of the memory with the fault is not considered. Such prior solutions therefore lead to the above-mentioned problems affecting the user's use.
Therefore, in the embodiment of the present application, when a memory failure is detected, the computing process using a page may be queried according to the page identifier of the page corresponding to the failed physical memory. For example, the computing process mapped to the page may be looked up by page identifier, so as to determine the computing process corresponding to the failed memory. An appropriate processing method can then be selected according to that computing process. For example, in the scenario shown in fig. 1a, if the page corresponding to the failed memory is page 3, the computing process using page 3 runs inside the virtual machine instance, so the virtual machine instance may be live-migrated to a healthy server, after which the failed memory is taken offline. In particular, in the embodiment of the present application, when the failed memory is processed, the page corresponding to it may be marked, for example as offline, so that the management module avoids the page when allocating memory resources and does not allocate it to subsequent processes. Alternatively, in the embodiment of the present application, an online repair scheme may be used, so that after a successful repair the management module can again allocate the page to subsequent processes.
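The page-marking behavior described above can be sketched as a tiny allocator that skips pages marked offline and accepts them again after repair. The `PagePool` class and its method names are hypothetical and stand in for whatever allocator the management module actually uses.

```python
class PagePool:
    """Toy free-page pool that never hands out pages marked offline."""

    def __init__(self, page_ids):
        self.free = list(page_ids)  # pages not currently allocated
        self.offline = set()        # pages backed by faulty memory

    def mark_offline(self, page_id):
        """Flag a page so later allocations avoid it."""
        self.offline.add(page_id)

    def mark_repaired(self, page_id):
        """Clear the flag after a successful online repair."""
        self.offline.discard(page_id)

    def allocate(self):
        """Return the first free page that is not offline, or None."""
        for i, page_id in enumerate(self.free):
            if page_id not in self.offline:
                return self.free.pop(i)
        return None
```

Marking rather than deleting the page keeps the repair path simple: clearing the flag is enough to return the page to circulation.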
By contrast, in the scenario shown in fig. 1a, if the page corresponding to the failed memory is page 2, the computing process using page 2 is process 1, which the management module uses to allocate resources for the virtual machine instance. Process 1 is a system process of high importance, and if the memory corresponding to the page were taken offline directly, as in the prior art, the normal operation of virtual machine instance VM1 would be affected. Therefore, in the embodiment of the present application, after the corresponding process is determined from the page identifier of the page corresponding to the failed memory, the importance of the process as a system process is taken into account: other memory pages are reallocated to the process, the process is restarted, and only then is the failed memory isolated. In this manner, normal use of process 1 is not affected.
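The two cases just described differ mainly in the order of actions taken before the failed page is isolated. A minimal sketch of that ordering follows; the action names and the `"user"`/`"system"` labels are illustrative assumptions, not terms from the application.

```python
def plan_actions(page_id, owner_kind):
    """Return the ordered actions for a failed page, by kind of owning process."""
    if owner_kind == "user":
        # Process inside a VM instance: move the instance first, then offline.
        return ["live_migrate_instance", f"offline_page:{page_id}"]
    if owner_kind == "system":
        # Management-module process: give it healthy pages and restart it
        # before the failed memory is isolated.
        return [
            "allocate_replacement_pages",
            "restart_process",
            f"isolate_page:{page_id}",
        ]
    raise ValueError(f"unknown owner kind: {owner_kind!r}")
```

In both branches the isolation step comes last, so the owning process is never left pointing at memory that has already been withdrawn.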
In addition, fig. 1b shows a schematic diagram of a system architecture of a memory fault handling scheme according to an embodiment of the present application. As shown in fig. 1b, the hardware may be a system computing resource such as memory. When the hardware fails, in the embodiment of the present application, the memory fault reporting module may collect the system execution log, which records various information from before and at the time of the memory hardware failure, and may also collect detection data from dedicated memory fault detection software. For example, the memory used by current servers usually has an ECC (error-correcting code) function, so dedicated ECC detection tool software can be used to detect the fault state of the memory, and the memory fault reporting module can collect the detection data of such a tool and report it together with the log file.
For example, the fault-affected process mapping module may determine, based on the log file, the page corresponding to the faulty memory and the process that uses that page. The system can determine the page identifier of the allocated page from the address of the faulty memory, so the fault-affected process mapping module can look up the corresponding computing process in the log file by page identifier and establish a mapping relationship, from which the attribute information of the computing process is determined. For example, it may be determined whether the process is a system process used by the management module or a user computing process used by a virtual machine instance. After the attribute of the computing process corresponding to the failed memory is determined from the log file, the memory failure mode may be further computed from all the memory data of the process. In particular, because the system processes used by the management module are complex, in the embodiment of the present application the processing manner need not be determined from the attribute of the process alone: a specific failure mode may be calculated by also using the fault detection data corresponding to the process, and the failure mode may be used to determine the level of risk caused by the fault. The processes involved in the fault can then be distinguished more finely by this risk, and a more reasonable processing manner can be determined. For example, if the memory failure mode indicates a risk that the user's virtual machine instance is likely to go down, it may be decided to hot-migrate the virtual machine process involved in the fault so that the virtual machine can use healthy memory resources.
In other words, in the embodiment of the present application, the process involved in the fault, and the risk that the fault poses to that process, may be determined based on the log and the detection data reported by the memory fault reporting module. Faulty memories can thus be classified more finely, and a processing manner better suited to the process and the specific situation can be selected.
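The mapping step can be sketched as follows. The log record layout (`page_id`, `process`, `kind`), the 4096-byte page size, and the use of integer division to turn a fault address into a page identifier are all assumptions for the example; the application does not specify how the page identifier is derived from the address.

```python
def map_faults_to_processes(fault_addresses, log_records, page_size=4096):
    """Map each fault address to (process, kind) using allocation log records.

    `log_records` is a hypothetical list of dicts such as
    {"page_id": 2, "process": "process 1", "kind": "system"}.
    """
    owners = {rec["page_id"]: rec for rec in log_records}
    mapping = {}
    for addr in fault_addresses:
        page_id = addr // page_size  # assumed address-to-page rule
        rec = owners.get(page_id)
        if rec is not None:          # skip pages with no logged owner
            mapping[page_id] = (rec["process"], rec["kind"])
    return mapping
```

The `kind` field carries the attribute information mentioned above, that is, whether the owner is a system process of the management module or a user computing process inside a virtual machine instance.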
In the embodiment of the present application, when calculating the risk, a list of correspondences between failure modes and risks may, for example, be established in advance, so that once the failure mode is determined it can be looked up in the list to find the corresponding risk. Alternatively, a machine learning model may be trained on historical data, so that once the failure mode is determined it can be input to the trained model to calculate the correlation between the failure mode and each risk. The risks may then be ranked by correlation, and the one with the highest correlation selected as the risk corresponding to the failure mode.
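Both ways of obtaining a risk can be sketched briefly. The failure mode names, risk labels, and table contents are invented for illustration, and `risk_from_scores` takes precomputed correlation scores as input rather than calling a trained model, which is outside the scope of this sketch.

```python
# Hypothetical pre-built correspondence list: failure mode -> risk.
RISK_TABLE = {
    "single_bit_correctable": "low",
    "multi_bit_uncorrectable": "instance_crash",
}


def risk_from_table(failure_mode):
    """Look the failure mode up in the list; None if it is not listed."""
    return RISK_TABLE.get(failure_mode)


def risk_from_scores(scores):
    """Rank risks by correlation score and return (top risk, full ranking).

    `scores` maps each candidate risk to its correlation with the failure
    mode, as a trained model might output.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[0][0], ranked
```

A deployment could try the table first and fall back to the scored ranking when the failure mode is not listed.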
Further, the risk corresponding to a failure may be calculated locally, in a database on the server. For example, as shown in fig. 1b, the memory failures encountered by the server and the risk determination results may be stored in a local failure process damage risk database. As the running time of the server increases, more failure data and risk determination results accumulate, and because the determination is computed locally, a faster calculation speed can be achieved, which reduces the processing time after a failure occurs and the possible loss to the user.
In addition, the risk corresponding to the failure may be calculated at a remote server, for example, a central server of the cloud computing platform. For example, as shown in fig. 1b, after collecting the failure data, the local failure process damage risk database may aggregate the data to the central server. Because the central server generally has higher computational performance, the computational performance requirement on the local server can be reduced, and the overhead and construction cost of the local server can be saved. In addition, because a plurality of servers of the cloud computing platform aggregate their respective fault data to the central server, the central server can accumulate more historical data more quickly, so that better results can be achieved whether a rule table is established or a machine learning model is trained. Of course, in the embodiments of the present application, a combination of the above two determination manners may also be used, and the present application is not limited thereto.
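A combination of the local and central determination manners could look like the sketch below, which uses Python's standard `sqlite3` module as a stand-in for the local database. The table name `fault_risk`, its columns and the `ask_central` callable (representing a request to the central server) are assumptions for illustration only.

```python
import sqlite3
from typing import Callable


def open_local_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the local failure/risk database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fault_risk ("
        "failure_mode TEXT PRIMARY KEY, risk TEXT NOT NULL)"
    )
    return conn


def record_result(conn: sqlite3.Connection, failure_mode: str, risk: str) -> None:
    """Accumulate a risk determination result locally."""
    conn.execute(
        "INSERT OR REPLACE INTO fault_risk (failure_mode, risk) VALUES (?, ?)",
        (failure_mode, risk),
    )
    conn.commit()


def lookup_risk(conn: sqlite3.Connection, failure_mode: str,
                ask_central: Callable[[str], str]) -> str:
    """Answer from the local database when possible, else ask the central server."""
    row = conn.execute(
        "SELECT risk FROM fault_risk WHERE failure_mode = ?", (failure_mode,)
    ).fetchone()
    if row is not None:
        return row[0]
    risk = ask_central(failure_mode)          # remote determination (hypothetical)
    record_result(conn, failure_mode, risk)   # cache it for later local lookups
    return risk
```

With this arrangement the first occurrence of a failure mode pays for a remote query, while repeated occurrences are resolved locally.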
After calculating the attribute information of the computing process and the risk corresponding to the failure, in the embodiment of the present application, the corresponding processing manner may be determined in the database of the server as described above. In the embodiment of the present application, the processing manner may include, for example, a process restart, a page offline, and a hot migration and a cold migration. In addition, in the case of performing computation using the central server, the method may further include performing repair processing of the computation process using the central server.
Therefore, compared with the single processing mode of the prior art, in which the failed memory is simply taken offline, in the embodiment of the present application the attribute information of the process, for example whether it is a key system process or a computing process of a virtual machine instance used by a user, may be determined according to the identifier of the page used by the process. On this basis, the risk that the process incurs by using the failed memory may be further calculated according to the data of the failed memory. The processes using the failed memory can therefore be distinguished more finely by combining the risk with the attribute information of the process, so that a more reasonable processing mode can be selected for the process and the failed memory. With this fault processing method based on the influence of the memory fault, the influence of the memory fault on the user's use of the cloud computing service is greatly reduced, which solves the prior-art problem of user loss caused by a key process becoming unusable once the failed memory is isolated.
The above embodiments are illustrations of technical principles and exemplary application frameworks of the embodiments of the present application, and specific technical solutions of the embodiments of the present application are further described in detail below through a plurality of embodiments.
Example two
Fig. 2 is a flowchart of an embodiment of a memory fault processing method provided in the present application, where an execution subject of the method may be various terminal or server devices with resource configuration capability, or may also be a device or chip integrated on these devices. As shown in fig. 2, the memory failure processing method includes the following steps:
S201, determining a computing process corresponding to the fault memory according to the page identification of the page corresponding to the fault memory with the fault.
In step S201, when a memory failure is detected, a computing process using a physical memory may be determined according to a page identifier of a page corresponding to the failed physical memory. In the cloud computing service system, a management module may allocate physical resources such as a memory to each process for use, and particularly when allocating the memory, a page is usually used as an allocation unit of the memory to allocate a physical memory space for the process based on correspondence between an address of the physical memory and the page. For example, a predetermined length after a certain address of the physical memory may be allocated to the process as a page, and a page identifier may be generated for the page to distinguish each page, so that when the physical memory fails in step S201, the corresponding page may be found according to the address of the failed memory, and the computing process using the physical memory space is determined according to the correspondence between the page identifier and the computing process.
S202, calculating damaged parameters of the calculation process according to the fault data of the fault memory.
S203, determining a processing mode for the fault memory according to the damaged parameters and the attribute information of the computing process.
In step S202, after the calculation process is determined in step S201, the damage parameters of the process may be calculated according to the fault data in the fault memory. In the embodiment of the present application, the damage parameter may indicate a risk of the process using the faulty memory, for example. For example, in step S202, the risk of the process using the faulty memory may be calculated according to the faulty data of the faulty memory of the process determined in step S201. The processes involved in the fault can therefore be further refined by the determined risk and a more rational approach can therefore be determined.
S204, processing the fault memory using the processing mode.
In step S204, the faulty memory may be processed using the processing manner determined in step S203. In particular, in this embodiment, processing the failed memory may also include processing a process involved in the failed memory. For example, as described above, if it is determined in step S203 that the risk of damage to the computing process determined in step S201 is high, then in step S204 the process may be restarted first and allocated other memory pages, and, at the same time or later, the faulty memory may be taken offline and isolated.
The memory fault processing method provided in the embodiment of the present application determines the computing process corresponding to the faulty memory according to the page identifier of the page corresponding to the faulty memory, calculates the damaged parameter of the computing process according to the fault data of the faulty memory, and determines the processing mode for the faulty memory according to the damaged parameter of the computing process. In the case of a memory fault, the computing process using the page can therefore be located from the page corresponding to the faulty memory, and the damage risk of the computing process can be evaluated according to its damaged parameter, so that the faulty memory is processed in the processing mode corresponding to that damaged parameter. A processing mode that is more reasonable for the computing process can thus be adopted when a memory fault occurs, fault processing based on the influence of the memory fault is implemented, and the influence of the memory fault on the cloud computing service used by the user is greatly reduced. This solves the prior-art problem of user loss caused by a key process becoming unusable once the faulty memory is isolated.
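Step S201 of this embodiment can be illustrated with a short Python sketch. It assumes that the "predetermined length" of a page is 4096 bytes and that the page identifier is simply the page index of the physical address. Both are assumptions of the sketch, as are the names `page_id_for_address` and `process_for_fault` and the process names in the table.

```python
from typing import Dict, Optional

PAGE_SIZE = 4096  # assumed "predetermined length" of a page, in bytes


def page_id_for_address(phys_addr: int, page_size: int = PAGE_SIZE) -> int:
    """Derive a page identifier from a physical address (assumed: the page index)."""
    return phys_addr // page_size


def process_for_fault(phys_addr: int, page_owner: Dict[int, str]) -> Optional[str]:
    """Return the computing process that uses the page containing phys_addr, if any."""
    return page_owner.get(page_id_for_address(phys_addr))


# Hypothetical correspondence between page identifiers and computing processes.
page_owner = {2: "vm-instance-17", 3: "mgmt-agent"}

# 0x2ABC lies in the third 4096-byte page (index 2).
print(process_for_fault(0x2ABC, page_owner))
```

A result of `None` means that no computing process currently uses the page, in which case taking the page offline affects no process.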
Example three
Fig. 3 is a flowchart of another embodiment of the memory fault processing method provided in the present application, where an execution subject of the method may be various terminal or server devices with cloud computing resource configuration capability, or may also be a device or chip integrated on these devices. As shown in fig. 3, on the basis of the embodiment shown in fig. 2, the method for processing a memory fault provided in the embodiment of the present application may include the following steps:
S301, determining a computing process corresponding to the fault memory according to the page identification of the page corresponding to the fault memory with the fault.
In step S301, when a memory failure is detected, a computing process using the physical memory may be determined according to a page identifier of a page corresponding to the failed physical memory. In the cloud computing service system, a management module may allocate physical resources such as memory to each process for use. When allocating memory in particular, a page is usually used as the allocation unit, and a physical memory space is allocated to the process based on the correspondence between addresses of the physical memory and pages. For example, a predetermined length after a certain address of the physical memory may be allocated to a process as a page, and a page identifier may be generated for the page to distinguish the pages. When the physical memory fails, the corresponding page may therefore be found in step S301 according to the address of the failed memory, and the computing process using the physical memory space is determined according to the correspondence between the page identifier and the computing process.
S302, calculating damaged parameters of the calculation process according to the fault data of the fault memory.
S303, determining a processing mode for the fault memory according to the damaged parameters and the type or level of the computing process.
In step S302, after the calculation process is determined in step S301, the damage parameters of the process may be calculated according to the fault data of the fault memory. In the embodiment of the present application, the damage parameter may indicate a risk of the process using the faulty memory, for example. For example, in step S302, the risk of the process using the faulty memory may be calculated according to the faulty data of the faulty memory of the process determined in step S301. The processes involved in the fault can therefore be further refined by the determined risk and a more rational approach can therefore be determined.
In addition, in step S302, a memory failure mode may also be calculated based on all the memory data of the computing process determined in step S301. In particular, in the embodiment of the present application, because a process such as a system process used by the management module is complex, the processing manner need not be determined from the attribute of the process alone: a specific failure mode may be further calculated in combination with the fault detection data corresponding to the process, so that the risk caused by the fault can be determined based on the failure mode. For example, if it is determined in step S302, according to the memory failure mode, that the fault is likely to cause a virtual machine instance of a user to go down, it may be further determined in step S303 that a hot migration is to be performed on the virtual machine process related to the fault, so that the virtual machine can use healthy memory resources. In other words, in the embodiment of the present application, the attribute information of the process may be determined based on the log, and the risk that using the failed memory poses to the process may be determined in combination with the fault data. The failed memory can therefore be classified more finely, and a processing manner better suited to the process and the specific situation can be selected.
In addition, in the embodiment of the present application, when the damaged parameter is determined in step S302, for example, a corresponding relationship list between the failure mode and the damaged parameter may be established in advance, so that after the failure mode is determined by using the failure data, the failure mode may be queried in the list, and then the corresponding damaged parameter may be determined. Further, the machine learning model may be trained based on the historical data, so that after the failure mode is determined by using the failure data in step S302, the failure mode may be input to the trained machine learning model to calculate the correlation between the failure mode and the risk, and the risks may be ranked according to the magnitude of the correlation, so that the result with the highest correlation may be selected as the risk corresponding to the failure mode.
In addition, in step S303, a page corresponding to the failed memory and a corresponding process for calling the page may also be determined based on the log file, so that the system may determine the page identifier of the allocated page according to the address of the failed memory, and therefore the failure-affected process mapping module may query the corresponding computing process from the log file according to the page identifier information and establish a mapping relationship, so as to determine the attribute information of the computing process. For example, it may be determined whether the process is a system process used by a management module or a user computing process used by a virtual machine instance. Therefore, in step S303, after determining the attribute of the computing process corresponding to the failed memory based on the log file, the processing manner may be selected in combination with the damage parameter determined in step S302.
Further, when the impairment parameters of the process are calculated in step S302, the calculation may be performed based on data in the database. For example, memory failure and damaged parameter determination results encountered by the server may be stored in the local database, so that as the running time of the server increases, more failure data and damaged parameter determination results may be accumulated, and as determination calculation is performed locally, faster calculation speed may be achieved, processing time after failure occurrence may be reduced, and possible loss of a user may be reduced.
In addition, the damaged parameters of the process may also be calculated at a remote server, such as a central server of a cloud computing platform. For example, after collecting the failure data, the local server may aggregate the data to the central server. Because the central server generally has higher computational performance, the computational performance requirement on the local server can be reduced, and the overhead and construction cost of the local server can be saved. In addition, because a plurality of servers of the cloud computing platform aggregate their respective fault data to the central server, the central server can accumulate more historical data more quickly, so that better results can be achieved whether a rule table is established or a machine learning model is trained. Of course, in the embodiments of the present application, a combination of the above two determination manners may also be used, and the present application is not limited thereto.
After calculating the attribute information of the process and the damage parameter of the process in step S302, in step S303, the corresponding processing manner may be determined using the database of the remote center server. In the embodiment of the present application, the processing manner may include, for example, process restart, page offline, and hot migration and cold migration. In addition, in the case of using the central server to perform the calculation, the method may further include using the central server to perform a repair process of the calculation process.
S304, processing the fault memory using the processing mode.
In step S304, the faulty memory may be processed using the processing manner determined in step S303. In particular, in the embodiment of the present application, processing the failed memory may also include processing a process related to the failed memory. For example, as described above, if it is determined in step S303 that the computing process determined in step S301 is a system process, and that the damaged parameter of the process indicates that using the failed memory may cause a virtual machine instance to go down, then in step S304 the process may be restarted first and allocated other memory pages, and, at the same time or later, the failed memory may be taken offline and isolated.
In the memory fault processing method provided by the embodiment of the present application, the computing process corresponding to the fault memory is determined according to the page identifier of the page corresponding to the fault memory, the damaged parameter of the computing process is calculated according to the fault data of the fault memory, and the processing mode for the fault memory is determined according to the damaged parameter of the computing process. In the case of a memory fault, the computing process using the page can therefore be located from the page corresponding to the fault memory, and the damage risk of the process can be evaluated according to its damaged parameter, so that the fault memory is processed in the processing mode corresponding to that damaged parameter. A processing mode that is more reasonable for the computing process can thus be adopted when a memory fault occurs, fault processing based on the influence of the memory fault is implemented, and the influence of the memory fault on the cloud computing service used by the user is greatly reduced. This solves the prior-art problem of user loss caused by a key process becoming unusable once the fault memory is isolated.
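The determination in step S303, which combines the attribute of the process with its damaged parameter, can be sketched as a small decision function. The four processing manners are those named in this embodiment. The attribute labels, the risk label and the specific policy are assumptions that merely follow the examples given in the text, and the application does not fix any particular rule set.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    PROCESS_RESTART = "process restart"
    PAGE_OFFLINE = "page offline"
    HOT_MIGRATION = "hot migration"
    COLD_MIGRATION = "cold migration"  # listed in the text; unused by this sketch


def choose_modes(attribute: str, risk: str) -> List[Mode]:
    """Return the processing manners to apply, in order (hypothetical policy)."""
    if risk == "vm_instance_down":
        if attribute == "user_vm":
            # Move the virtual machine to healthy memory, then isolate the page.
            return [Mode.HOT_MIGRATION, Mode.PAGE_OFFLINE]
        if attribute == "system":
            # Restart the system process on other pages, then isolate the page.
            return [Mode.PROCESS_RESTART, Mode.PAGE_OFFLINE]
    # Otherwise only take the faulty page offline.
    return [Mode.PAGE_OFFLINE]


print([m.value for m in choose_modes("system", "vm_instance_down")])
```

Returning an ordered list reflects that the process may be handled first and the faulty memory isolated at the same time or afterwards.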
Example four
Fig. 4 is a schematic structural diagram of an embodiment of a memory failure processing apparatus provided in the present application, which may be used to execute the memory failure processing method shown in fig. 2 or fig. 3. As shown in fig. 4, the memory failure processing apparatus may include: a first determination module 41, a first calculation module 42, a second determination module 43 and an execution module 44.
The first determining module 41 may be configured to determine, according to the page identifier of the page corresponding to the failed memory that has failed, the computing process corresponding to the failed memory.
When a memory failure is detected, the first determining module 41 may determine, according to a page identifier of a page corresponding to the failed physical memory, a computing process using the physical memory. In the cloud computing service system, a management module may allocate physical resources such as a memory to each process for use, and particularly when allocating the memory, a page is usually used as an allocation unit of the memory to allocate a physical memory space for the process based on correspondence between an address of the physical memory and the page. For example, a predetermined length after a certain address of the physical memory may be allocated to a process as a page, and a page identifier may be generated for the page to distinguish each page, so when the physical memory fails, the first determining module 41 may find the corresponding page according to the address of the failed memory, and determine a computing process using the physical memory space according to a correspondence between the page identifier and the computing process.
The first calculation module 42 may be configured to calculate the damage parameter of the calculation process according to the fault data of the fault memory.
The second determining module 43 may be configured to determine a processing manner for the failed memory according to the damaged parameter of the computing process.
After the computing process is determined, the first calculation module 42 may calculate the damaged parameters of the process based on the fault data of the failed memory. In the embodiment of the present application, the damaged parameter may indicate, for example, the risk of the process using the faulty memory. For example, the first calculation module 42 may calculate the risk of the process using the faulty memory according to the fault data of the faulty memory of the process determined by the first determination module 41. The processes involved in the fault can therefore be distinguished more finely by the determined risk, and a more reasonable processing manner can be determined.
The second determination module 43 may select different processing modes according to the damaged parameter calculated by the first calculation module 42. For example, the first calculation module 42 may calculate the memory failure mode based on all the memory data of the computing process determined by the first determination module 41. In particular, in this embodiment of the present application, because a process such as a system process used by the management module is complex, a specific failure mode may be calculated based on the fault detection data corresponding to the process, so that the risk caused by the fault can be determined based on the failure mode. For example, if the first calculation module 42 determines, according to the memory failure mode, that the fault is likely to cause a virtual machine instance of a user to go down, the second determination module 43 may further determine that a hot migration is to be performed on the virtual machine process related to the fault, so that the virtual machine can use healthy memory resources.
In this embodiment of the present application, the first calculation module 42 may further determine attribute information of the process based on the log and, in combination with the attribute information, determine the risk that using the failed memory poses to the process. The failed memory can therefore be classified more finely, and a processing manner better suited to the process and the specific situation can be selected.
In addition, in the embodiment of the present application, when the first calculation module 42 determines the damaged parameters, for example, a corresponding relationship list between the failure mode and the damaged parameters may be established in advance, so that after the failure mode is determined by using the failure data, the failure mode may be queried in the list, and then the corresponding damaged parameters may be determined. In addition, the machine learning model may also be trained based on historical data, so that after the failure mode is determined by the first calculation module 42 using the failure data, the failure mode may be input into the trained machine learning model to calculate the correlation between the failure mode and the risk, and the risks may be ranked according to the magnitude of the correlation, so that the result with the highest correlation may be selected as the risk corresponding to the failure mode.
In addition, the first calculation module 42 may further determine, based on the log file, the page corresponding to the failed memory and the process that invokes that page. The system may determine the page identifier of the allocated page according to the address of the failed memory, and the failure-affecting process mapping module may then query the corresponding computing process from the log file according to the page identifier and establish a mapping relationship, so as to determine the attribute information of the computing process. For example, it may be determined whether the process is a system process used by the management module or a user computing process used by the virtual machine instance. Therefore, after the attribute of the computing process corresponding to the failed memory is determined based on the log file, the processing manner is selected in combination with the damaged parameter calculated by the first calculation module 42.
In addition, the first calculation module 42 may perform calculations based on data in the database when calculating the impairment parameters of the process. For example, memory failure and damaged parameter determination results encountered by the server may be stored in the local database, so that as the running time of the server increases, more failure data and damaged parameter determination results may be accumulated, and as determination calculation is performed locally, faster calculation speed may be achieved, processing time after failure occurrence may be reduced, and possible loss of a user may be reduced.
In addition, the damaged parameters of the process may also be calculated at a remote server, such as a central server of a cloud computing platform. For example, after collecting the failure data, the local server may aggregate the data to the central server. Because the central server generally has higher computational performance, the computational performance requirement on the local server can be reduced, and the overhead and construction cost of the local server can be saved. In addition, because a plurality of servers of the cloud computing platform aggregate their respective fault data to the central server, the central server can accumulate more historical data more quickly, so that better results can be achieved whether a rule table is established or a machine learning model is trained. Of course, in the embodiments of the present application, a combination of the above two determination manners may also be used, and the present application is not limited thereto.
After the first calculation module 42 calculates the impairment parameters of the calculation process and the type or level of the process, the second determination module 43 may use the database of the remote central server to determine the corresponding processing manner. In the embodiment of the present application, the processing manner may include, for example, process restart, page offline, and hot migration and cold migration. In addition, in the case of using the central server to perform the calculation, the method may further include using the central server to perform a repair process of the calculation process.
The execution module 44 may be configured to perform processing on the failed memory using the processing mode.
The execution module 44 may execute processing on the faulty memory using the processing manner determined by the second determination module 43. In particular, in the embodiment of the present application, processing the failed memory may include processing also a process related to the failed memory. For example, as described above, if the second determining module 43 determines that the computing process determined by the first determining module 41 is a system process, the executing module 44 may restart the process, allocate another memory page to the process, and perform offline isolation processing on the failed memory at the same time or later.
The memory fault processing device provided by the embodiment of the present application determines the computing process corresponding to the fault memory according to the page identifier of the page corresponding to the fault memory, calculates the damaged parameters of the computing process according to the fault data of the fault memory, and determines the processing mode for the fault memory according to the damaged parameters of the computing process. In the case of a memory fault, the computing process using the page can therefore be located from the page corresponding to the fault memory, and the damage risk of the process can be evaluated according to its damaged parameters, so that the fault memory is processed in the processing mode corresponding to those damaged parameters. A processing mode that is more reasonable for the computing process can thus be adopted when a memory fault occurs, fault processing based on the influence of the memory fault is implemented, and the influence of the memory fault on the user's use of the cloud computing service is greatly reduced. This solves the prior-art problem of user loss caused by a key process becoming unusable once the fault memory is isolated.
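The cooperation of the four modules of this device can be sketched in Python as a small class whose fields stand for modules 41 to 44. The class name `MemoryFaultHandler`, the callable signatures and the stub behaviours below are assumptions for illustration. The application describes the modules functionally and does not prescribe an interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MemoryFaultHandler:
    find_process: Callable[[int], str]            # first determination module 41
    damage_parameter: Callable[[str, dict], str]  # first calculation module 42
    choose_mode: Callable[[str], str]             # second determination module 43
    execute: Callable[[str, str, int], None]      # execution module 44

    def handle(self, page_id: int, fault_data: dict) -> str:
        """Run the four modules in order for one faulty page."""
        process = self.find_process(page_id)
        damage = self.damage_parameter(process, fault_data)
        mode = self.choose_mode(damage)
        self.execute(mode, process, page_id)
        return mode


# Stub modules with hypothetical behaviour, recording what would be executed.
actions: List[Tuple[str, str, int]] = []
handler = MemoryFaultHandler(
    find_process=lambda page_id: {7: "mgmt-agent"}[page_id],
    damage_parameter=lambda proc, data: "high" if data.get("uncorrectable") else "low",
    choose_mode=lambda damage: "restart_then_offline" if damage == "high" else "page_offline",
    execute=lambda mode, proc, page_id: actions.append((mode, proc, page_id)),
)
result = handler.handle(7, {"uncorrectable": True})
print(result, actions)
```

Passing the modules in as callables keeps each one replaceable, for example by swapping a table-based damage calculation for a model-based one without touching the other modules.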
Example five
The internal functions and structure of the memory failure processing apparatus, which can be implemented as an electronic device, are described above. Fig. 5 is a schematic structural diagram of an embodiment of an electronic device provided in the present application. As shown in fig. 5, the electronic device includes a memory 51 and a processor 52.
The memory 51 stores programs. In addition to the above-described programs, the memory 51 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 51 may be implemented by any type or combination of volatile and non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The processor 52 is not limited to a central processing unit (CPU), but may also be a graphics processing unit (GPU), a field programmable gate array (FPGA), an embedded neural network processor (NPU), or an artificial intelligence (AI) chip. The processor 52 is coupled to the memory 51 and executes the program stored in the memory 51 to perform the memory failure processing method of the second or third embodiment.
Further, as shown in fig. 5, the electronic device may further include: communication components 53, power components 54, audio components 55, display 56, and other components. Only some of the components are schematically shown in fig. 5, and the electronic device is not meant to include only the components shown in fig. 5.
The communication component 53 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component 53 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 53 further comprises a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
A power supply component 54 provides power to the various components of the electronic device. The power components 54 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for an electronic device.
The audio component 55 is configured to output and/or input an audio signal. For example, the audio component 55 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 51 or transmitted via the communication component 53. In some embodiments, audio assembly 55 also includes a speaker for outputting audio signals.
The display 56 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.
Those of ordinary skill in the art will understand that all or part of the steps of the above-described method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A memory fault handling method, wherein the memory is allocated to at least one computing process, and the method comprises:
determining a computing process corresponding to a faulty memory according to a page identifier of a page corresponding to the faulty memory in which a fault has occurred;
calculating a damage parameter of the computing process according to fault data of the faulty memory, wherein the damage parameter identifies a risk of the computing process being damaged;
determining a processing mode for the faulty memory according to the damage parameter;
and performing processing on the faulty memory using the processing mode.
2. The memory fault handling method according to claim 1, wherein the determining a processing mode for the faulty memory according to the damage parameter comprises:
determining the processing mode for the faulty memory according to the damage parameter and the type or level of the computing process.
3. The memory fault handling method according to claim 1, wherein the calculating a damage parameter of the computing process comprises:
calculating a memory failure mode of the computing process from the memory data of the computing process;
and calculating the damage parameter of the computing process based on the memory failure mode.
4. The memory fault handling method according to claim 3, wherein the calculating a memory failure mode of the computing process from all the memory data of the computing process comprises:
converting the log data and the register data of the computing process into standard-format data;
and calculating the memory failure mode of the computing process according to the standard-format data.
5. The memory fault handling method according to claim 3, wherein the determining a processing mode for the faulty memory according to the damage parameter comprises:
storing the memory failure mode of the computing process in a risk database;
and determining, according to a preset policy, a processing mode for the computing process corresponding to the memory failure mode stored in the risk database.
6. The memory fault handling method according to claim 1, wherein the processing mode includes at least: VM migration, online repair of the computing process, restart of the computing process, offlining of a memory page, cold migration, and hot migration.
7. The memory fault handling method according to claim 1, wherein the calculating a damage parameter of the computing process according to fault data of the faulty memory comprises:
aggregating the fault data of the faulty memory at a first server, wherein the first server is the server currently executing the computing process;
sending, by the first server, the aggregated fault data to a second server;
and calculating, by the second server, the damage parameter of the computing process from the aggregated fault data.
8. A memory fault handling device, comprising:
a first determining module, configured to determine a computing process corresponding to a faulty memory according to a page identifier of a page corresponding to the faulty memory in which a fault has occurred;
a first calculation module, configured to calculate a damage parameter of the computing process according to fault data of the faulty memory, wherein the damage parameter identifies a risk of the computing process being damaged;
a second determining module, configured to determine a processing mode for the faulty memory according to the damage parameter of the computing process;
and a processing module, configured to perform processing on the faulty memory using the processing mode.
9. An electronic device, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory to perform the memory fault handling method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program executable by a processor is stored, wherein the program, when executed by the processor, implements the memory fault handling method according to any one of claims 1 to 7.
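
For illustration only, the steps recited in claims 1, 2 and 6 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the names `FaultData`, `find_process`, `damage_parameter`, `choose_mode` and `handle_fault`, the error-count fields, the thresholds, and the "high"/"low" process levels are all hypothetical and do not appear in the application. The sketch returns the selected processing mode rather than executing it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class Mode(Enum):
    """Processing modes named in claim 6."""
    VM_MIGRATION = "VM migration"
    ONLINE_REPAIR = "online repair of the computing process"
    RESTART = "restart of the computing process"
    PAGE_OFFLINE = "offlining of a memory page"
    COLD_MIGRATION = "cold migration"
    HOT_MIGRATION = "hot migration"


@dataclass
class FaultData:
    page_id: int               # page identifier of the faulty page
    correctable_errors: int    # hypothetical field, not from the application
    uncorrectable_errors: int  # hypothetical field, not from the application


def find_process(page_id: int, page_owner: Dict[int, str]) -> Optional[str]:
    """Step 1: map the page identifier to the computing process that owns it."""
    return page_owner.get(page_id)


def damage_parameter(fault: FaultData) -> float:
    """Step 2: hypothetical risk score in [0, 1]; the real formula is not given."""
    if fault.uncorrectable_errors > 0:
        return 1.0
    return min(fault.correctable_errors / 100.0, 0.99)


def choose_mode(damage: float, level: str) -> Mode:
    """Step 3 (with claim 2): pick a mode from the damage parameter and level.

    The thresholds 0.8 and 0.3 are assumed example values.
    """
    if damage >= 0.8:
        return Mode.HOT_MIGRATION if level == "high" else Mode.RESTART
    if damage >= 0.3:
        return Mode.PAGE_OFFLINE
    return Mode.ONLINE_REPAIR


def handle_fault(
    fault: FaultData,
    page_owner: Dict[int, str],
    levels: Dict[str, str],
) -> Optional[Tuple[str, Mode]]:
    """Run steps 1 to 3 and return (process, mode), or None if no owner exists."""
    process = find_process(fault.page_id, page_owner)
    if process is None:
        return None
    damage = damage_parameter(fault)
    return process, choose_mode(damage, levels.get(process, "low"))


if __name__ == "__main__":
    owners = {0x1000: "vm-a", 0x2000: "vm-b"}
    levels = {"vm-a": "high", "vm-b": "low"}
    print(handle_fault(FaultData(0x1000, 0, 1), owners, levels))
    print(handle_fault(FaultData(0x2000, 50, 0), owners, levels))
```

In this sketch the process level only changes the outcome at high damage values, which is one possible reading of claim 2; a preset policy as in claim 5 could replace `choose_mode` with a lookup against stored memory failure modes.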
CN202210273519.1A 2022-03-18 2022-03-18 Memory fault processing method and device, electronic equipment and computer readable storage medium Pending CN114780270A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210273519.1A CN114780270A (en) 2022-03-18 2022-03-18 Memory fault processing method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210273519.1A CN114780270A (en) 2022-03-18 2022-03-18 Memory fault processing method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114780270A true CN114780270A (en) 2022-07-22

Family

ID=82425282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210273519.1A Pending CN114780270A (en) 2022-03-18 2022-03-18 Memory fault processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114780270A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024124862A1 (en) * 2022-12-14 2024-06-20 苏州元脑智能科技有限公司 Server-based memory processing method and apparatus, processor and an electronic device


Similar Documents

Publication Publication Date Title
US10999139B2 (en) Online upgrade method, apparatus, and system
CN103201724B (en) Providing application high availability in highly-available virtual machine environments
US9934105B2 (en) Fault tolerance for complex distributed computing operations
CN103324582A (en) Memory migration method, memory migration device and equipment
CN115277566B (en) Load balancing method and device for data access, computer equipment and medium
CN107861691B (en) Load balancing method and device of multi-control storage system
CN111352806A (en) Log data monitoring method and device
US11662803B2 (en) Control method, apparatus, and electronic device
CN114780270A (en) Memory fault processing method and device, electronic equipment and computer readable storage medium
CN112631994A (en) Data migration method and system
CN109284169B (en) Big data platform process management method based on process virtualization and computer equipment
CN115686831A (en) Task processing method and device based on distributed system, equipment and medium
CN109067611B (en) Method, device, storage medium and processor for detecting communication state between systems
CN113242302A (en) Data access request processing method and device, computer equipment and medium
CN114968505A (en) Task processing system, method, device, apparatus, storage medium, and program product
CN102662702B (en) Equipment management system, device, substrate management devices and method
KR102575524B1 (en) Distributed information processing device for virtualization based combat system and method for allocating resource thereof
CN111083719A (en) Method, device and storage medium for flexibly adjusting network element capacity
CN110022220A (en) Routing Activiation method and system in business card recognition
CN112114972B (en) Data inclination prediction method and device
CN112084827B (en) Data processing method and device
US20240028388A1 (en) Application usage and auto maintenance driven migration of applications and their dependencies
CN117632600A (en) Fault management method and device and electronic equipment
CN117891563A (en) Control method and device of virtual machine, storage medium and electronic device
CN112632033A (en) Cluster data migration method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination