CN117149474A - Heterogeneous acceleration resource exception processing method and device, storage medium and electronic device

Info

Publication number
CN117149474A
Authority
CN
China
Prior art keywords
resource
hardware
resources
heterogeneous acceleration
heterogeneous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210563855.XA
Other languages
Chinese (zh)
Inventor
陈克
朱荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN202210563855.XA
Priority to PCT/CN2023/086292 (WO2023226601A1)
Publication of CN117149474A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G06F11/0706 Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, the processing taking place in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F11/0793 Remedial or corrective actions
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a heterogeneous acceleration resource exception handling method and device, a storage medium and an electronic device. The method comprises: determining, by performing hardware health monitoring on the heterogeneous acceleration resources of a cloud computing platform, whether each heterogeneous acceleration resource is a hardware-healthy resource or a hardware-unhealthy resource; determining, by performing device usage health monitoring on the heterogeneous acceleration resources, whether each heterogeneous acceleration resource is a usage-healthy resource or an allocation-failure resource; performing hardware exception handling on the hardware-unhealthy resources; and performing allocation exception handling on the allocation-failure resources. The method addresses the problem in the related art that detection focuses only on the common hardware resources of a traditional server and cannot identify inconsistencies between the registration and the actual use of virtualized heterogeneous acceleration resources managed by the cloud computing platform, which causes losses to the cloud computing platform and its users. It thereby helps ensure the reliability, stability and timeliness of the cloud computing platform's management of heterogeneous acceleration resources.

Description

Heterogeneous acceleration resource exception processing method and device, storage medium and electronic device
Technical Field
Embodiments of the present application relate to the field of cloud computing, and in particular to a heterogeneous acceleration resource exception handling method and device, a storage medium and an electronic device.
Background
With the development of deep learning and other AI technologies, users' demands for computing power and performance are increasingly urgent. More and more users hope to obtain heterogeneous computing capability through a cloud computing platform to accelerate their services, and heterogeneous computing services have become an indispensable function of cloud computing platforms.
The heterogeneous acceleration resources of a cloud computing platform generally include graphics processors (Graphics Processing Unit, GPU for short), AI accelerator cards (Neural-Network Processing Unit, NPU for short), programmable accelerator cards (Field Programmable Gate Array, FPGA for short) and smart network cards (Smart NIC). Compared with traditional hardware, they are characterized by many resource types, easy plugging and unplugging, multiple virtualization modes, unified allocation and reclamation, frequent use, and the special-purpose services they carry.
When heterogeneous acceleration hardware becomes abnormal and the abnormality is not identified, reported and recovered in time, the customer services carried on the cloud computing platform suffer serious losses. In particular, heterogeneous acceleration resources such as GPUs, NPUs and FPGAs that are allocated in a virtualized manner are allocated and reclaimed frequently. During this process, communication abnormalities can cause reclamation information to be lost or resources to be reclaimed late. The registration of a heterogeneous acceleration resource then easily becomes inconsistent with its actual use, which leads to abnormal resource allocation on the cloud platform and causes losses to the cloud computing platform and its customers.
At present, most traditional hardware detection relies on the server's own system for detection and judgment. The judgment is inaccurate, management becomes difficult as the number of resource types grows, and, most importantly, such detection cannot identify inconsistencies between the registration and the actual use of virtualized heterogeneous acceleration resources managed by the cloud computing platform.
The related art provides no method for detecting and handling abnormalities of the heterogeneous acceleration resources of a cloud computing platform. In particular, registration abnormalities of virtualized acceleration hardware (GPU, NPU), abnormal virtualized allocation while an administrator maintains acceleration devices, abnormal device health, and device misoperation cannot be sensed and handled in time, which affects normal use of the cloud computing platform and causes losses to the platform and its users.
No solution has yet been proposed for the problem that the related art focuses only on detecting the common hardware resources of a traditional server and cannot identify inconsistencies between the registration and the actual use of virtualized heterogeneous acceleration resources managed by the cloud computing platform, thereby causing losses to the cloud computing platform and its users.
Disclosure of Invention
Embodiments of the present application provide a heterogeneous acceleration resource exception handling method and device, a storage medium and an electronic device, which at least solve the problem that the related art focuses only on detecting the common hardware resources of a traditional server and cannot identify inconsistencies between the registration and the actual use of virtualized heterogeneous acceleration resources managed by a cloud computing platform, thereby causing losses to the cloud computing platform and its users. When a heterogeneous acceleration resource becomes abnormal, its unhealthy state can be sensed quickly, and alarms and recovery can be carried out in time, which helps ensure the reliability, stability and timeliness of the cloud platform's management of heterogeneous acceleration resources.
According to an embodiment of the present application, there is provided a heterogeneous acceleration resource exception handling method, including:
determining, by performing hardware health monitoring on heterogeneous acceleration resources of a cloud computing platform, whether each heterogeneous acceleration resource is a hardware-healthy resource or a hardware-unhealthy resource;
determining, by performing device usage health monitoring on the heterogeneous acceleration resources, whether each heterogeneous acceleration resource is a usage-healthy resource or an allocation-failure resource;
performing hardware exception handling on the hardware-unhealthy resources;
and performing allocation exception handling on the allocation-failure resources.
According to another embodiment of the present application, there is also provided a heterogeneous acceleration resource exception handling device, including:
a first monitoring module, configured to determine, by performing hardware health monitoring on heterogeneous acceleration resources of a cloud computing platform, whether each heterogeneous acceleration resource is a hardware-healthy resource or a hardware-unhealthy resource;
a second monitoring module, configured to determine, by performing device usage health monitoring on the heterogeneous acceleration resources, whether each heterogeneous acceleration resource is a usage-healthy resource or an allocation-failure resource;
a first response module, configured to perform hardware exception handling on the hardware-unhealthy resources;
and a second response module, configured to perform allocation exception handling on the allocation-failure resources.
According to a further embodiment of the application, there is also provided a computer-readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the application, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the embodiments of the present application, the hardware detection methods of different types of heterogeneous acceleration resources are managed in a unified way, the allocation registration and the actual use of virtualized heterogeneous acceleration resources are monitored periodically, and abnormal resources are reported in time. This solves the problem that the related art focuses only on detecting the common hardware resources of a traditional server and cannot identify inconsistencies between the registration and the actual use of virtualized heterogeneous acceleration resources managed by the cloud computing platform, thereby causing losses to the cloud computing platform and its users. When a heterogeneous acceleration resource becomes abnormal, its unhealthy state can be sensed quickly, and alarms and recovery can be carried out in time, which helps ensure the reliability, stability and timeliness of the cloud platform's management of heterogeneous acceleration resources.
Drawings
FIG. 1 is a hardware block diagram of a computer terminal for a heterogeneous acceleration resource exception handling method according to an embodiment of the present application;
FIG. 2 is a flowchart of a heterogeneous acceleration resource exception handling method according to an embodiment of the present application;
FIG. 3 is a flowchart of a heterogeneous acceleration resource hardware health monitoring method according to an embodiment of the present application;
FIG. 4 is a flowchart of a heterogeneous acceleration resource device usage health monitoring method according to an embodiment of the present application;
FIG. 5 is a timing diagram of device usage health monitoring and handling according to an alternative embodiment of the present application;
FIG. 6 is a timing diagram of heterogeneous acceleration resource exception recovery handling according to an alternative embodiment of the present application;
FIG. 7 is a block diagram of a heterogeneous acceleration resource exception handling device according to an embodiment of the present application;
FIG. 8 is an architecture diagram of heterogeneous acceleration resource health monitoring and exception handling according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the embodiments of the present application may be performed in a mobile terminal, a computer terminal or similar computing device. Taking a computer terminal as an example, fig. 1 is a block diagram of a hardware structure of a computer terminal of a heterogeneous accelerated resource exception handling method according to an embodiment of the present application, as shown in fig. 1, the computer terminal may include one or more (only one is shown in fig. 1) processors 102 (the processors 102 may include, but are not limited to, a microprocessor MCU or a programmable logic device FPGA, etc.) and a memory 104 for storing data, where the computer terminal may further include a transmission device 106 for a communication function and an input/output device 108. It will be appreciated by those skilled in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the computer terminal described above. For example, the computer terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of an application software and a module, such as a computer program corresponding to a heterogeneous accelerated resource exception handling method in an embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104, thereby performing various functional applications and a service chain address pool slicing process, that is, implementing the method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means 106 is arranged to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
This embodiment provides a heterogeneous acceleration resource exception handling method that runs on the computer terminal or network architecture described above. Fig. 2 is a flowchart of the heterogeneous acceleration resource exception handling method according to an embodiment of the present application. As shown in fig. 2, the flow includes the following steps:
step S202, determining, by performing hardware health monitoring on heterogeneous acceleration resources of a cloud computing platform, whether each heterogeneous acceleration resource is a hardware-healthy resource or a hardware-unhealthy resource;
step S204, determining, by performing device usage health monitoring on the heterogeneous acceleration resources, whether each heterogeneous acceleration resource is a usage-healthy resource or an allocation-failure resource;
step S206, performing hardware exception handling on the hardware-unhealthy resources;
and step S208, performing allocation exception handling on the allocation-failure resources.
In one embodiment, before step S202, the PCI slots are scanned to determine whether heterogeneous acceleration resources exist. If a heterogeneous acceleration resource exists, its resource information is acquired; specifically, the resource information is identified in combination with the configuration of the cloud computing platform. The heterogeneous acceleration resources include GPUs, NPUs, FPGAs and Smart NICs. The resource information may include the PCI address, vendor information, device model, device ID and the like, where the PCI address includes a slot number.
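The identification step can be sketched as follows. This is a minimal illustration, not the patented implementation: the `PLATFORM_CONFIG` table, the `ResourceInfo` fields, the placeholder vendor and device IDs, and the choice to read the slot number from the device field of a `domain:bus:device.function` address are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical cloud-platform configuration: (vendor ID, device ID) -> (type, vendor, model).
# The hexadecimal IDs and names are placeholders, not real hardware identifiers.
PLATFORM_CONFIG: Dict[Tuple[str, str], Tuple[str, str, str]] = {
    ("aaaa", "0001"): ("GPU", "VendorA", "ModelX"),
    ("bbbb", "0002"): ("FPGA", "VendorB", "ModelY"),
}


@dataclass(frozen=True)
class ResourceInfo:
    pci_address: str      # "domain:bus:device.function", e.g. "0000:3b:1f.0"
    slot: int             # parsed from the device field of the address (assumption)
    resource_type: str
    vendor: str
    model: str
    device_id: str


def identify(pci_address: str, vendor_id: str, device_id: str) -> Optional[ResourceInfo]:
    """Return resource information if the platform configuration knows the device."""
    entry = PLATFORM_CONFIG.get((vendor_id, device_id))
    if entry is None:
        return None  # not a configured heterogeneous acceleration resource
    device_field = pci_address.split(":")[2].split(".")[0]
    resource_type, vendor, model = entry
    return ResourceInfo(pci_address, int(device_field, 16), resource_type, vendor, model, device_id)


def scan(slots: List[Tuple[str, str, str]]) -> List[ResourceInfo]:
    """Keep only the scanned slots that hold a configured acceleration resource."""
    found = (identify(*slot) for slot in slots)
    return [info for info in found if info is not None]
```

Here `scan` takes already-enumerated `(pci_address, vendor_id, device_id)` tuples; how a real node enumerates its PCI slots is outside the sketch.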
In this embodiment, step S202 may specifically include: calling the corresponding hardware health detection interface according to the resource information of the heterogeneous acceleration resource; judging the hardware state of the heterogeneous acceleration resource through the hardware health detection interface; if the hardware state is healthy, determining that the heterogeneous acceleration resource is a hardware-healthy resource; and if the hardware state is unhealthy, determining that it is a hardware-unhealthy resource.
Specifically, the hardware health detection interfaces that have passed security authentication in the cloud computing platform can be called in a loop according to the type, vendor information and device model of each heterogeneous acceleration resource, and each interface judges the hardware state of the corresponding resource.
In another embodiment, the hardware health monitoring of step S202 may be performed on each heterogeneous acceleration resource at a preset hardware health detection period.
Fig. 3 is a flowchart of a heterogeneous acceleration resource hardware health monitoring method according to an embodiment of the present application. As shown in fig. 3, the method specifically includes the following steps:
step S302: scanning all heterogeneous acceleration resources in the PCI slots of a computing node to acquire the PCI address of each acceleration resource;
step S304: identifying the vendor and model of each specific acceleration resource (GPU, NPU, FPGA, Smart NIC) in combination with the cloud platform configuration;
step S306: using the PCI address, vendor and model as the core identification parameters, calling in a loop the hardware health detection interfaces approved by the cloud platform, and judging the hardware state of each heterogeneous acceleration resource;
step S308: judging whether the hardware state of the heterogeneous acceleration resource is healthy; if so, executing step S310a, and if not, executing step S310b;
step S310a: determining that the heterogeneous acceleration resource is a hardware-healthy resource;
step S310b: determining that the heterogeneous acceleration resource is a hardware-unhealthy resource;
step S312: judging whether the current node still has heterogeneous acceleration resources on which hardware health detection has not been performed;
step S314: outputting the hardware-healthy resources and the hardware-unhealthy resources.
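The detection loop of steps S306 to S314 can be sketched as a dispatch over detection interfaces keyed by resource type, vendor and model. The dictionary-based resource records, the `detectors` registry and the policy of treating a missing or failing detector as unhealthy are assumptions of this sketch; the source does not specify them.

```python
from typing import Callable, Dict, List, Tuple

Resource = Dict[str, str]            # keys: pci_address, type, vendor, model
Detector = Callable[[Resource], bool]  # returns True when the hardware is healthy


def monitor_hardware_health(
    resources: List[Resource],
    detectors: Dict[Tuple[str, str, str], Detector],
) -> Tuple[List[Resource], List[Resource]]:
    """Split resources into (hardware-healthy, hardware-unhealthy) lists."""
    healthy: List[Resource] = []
    unhealthy: List[Resource] = []
    for res in resources:
        detector = detectors.get((res["type"], res["vendor"], res["model"]))
        try:
            # Assumption: no approved detector, or a detector that raises,
            # counts as unhealthy so the resource is never silently trusted.
            ok = detector is not None and detector(res)
        except Exception:
            ok = False
        (healthy if ok else unhealthy).append(res)
    return healthy, unhealthy
```

A periodic scheduler could call `monitor_hardware_health` once per preset hardware health detection period.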
In this embodiment, step S302 may specifically include: scanning each PCI slot on the computing node to determine whether a physical acceleration resource is installed in the slot and, if so, acquiring the PCI address corresponding to that acceleration resource. Specifically, each PCI slot can hold only one physical acceleration resource, the PCI address includes the slot number, and the physical acceleration resources may include GPUs, NPUs, FPGAs, Smart NICs and the like.
The method in this embodiment addresses the problems in the related art that detection relies on the device's own system and covers only traditional hardware, so that detection results for the many types of heterogeneous acceleration resources are inaccurate and the resources are inconvenient to manage. By detecting the vendor information and device model of each heterogeneous acceleration resource and calling the corresponding interface, the method improves the accuracy of the detection results and enables unified management of the many types of heterogeneous acceleration resources.
In another embodiment, step S204 may specifically include: acquiring the allocation data of the heterogeneous acceleration resources; and determining the usage-healthy resources and the allocation-failure resources according to the allocation data.
In this embodiment, determining the usage-healthy resources or the allocation-failure resources according to the allocation data includes: determining the actual use data of the heterogeneous acceleration resources; and comparing, for each heterogeneous acceleration resource in turn, the allocation data with the actual use data. If the two are consistent, the heterogeneous acceleration resource is determined to be a usage-healthy resource; otherwise, it is determined to be an allocation-failure resource.
In an embodiment, the device usage health monitoring of step S204 may be performed at a preset device usage health detection period.
Specifically, each heterogeneous acceleration resource can be virtualized and allocated to multiple clients. The allocation data includes the allocated clients and the allocated quantity, and the actual use data includes the using clients and the used quantity.
Further, for each heterogeneous acceleration resource, the allocated clients are compared with the using clients and the allocated quantity is compared with the used quantity. If all the data are consistent, the heterogeneous acceleration resource is determined to be a usage-healthy resource; otherwise, it is determined to be an allocation-failure resource.
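The comparison of allocation data with actual use data can be sketched as below. The `Usage` record, the resource identifiers and the rule that a resource with no actual use data is compared against an empty record are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Usage:
    clients: FrozenSet[str]  # allocated clients, or clients actually using the resource
    quantity: int            # allocated quantity, or quantity actually in use


def classify_usage(
    allocated: Dict[str, Usage],
    actual: Dict[str, Usage],
) -> Tuple[List[str], List[str]]:
    """Return (usage-healthy IDs, allocation-failure IDs)."""
    healthy: List[str] = []
    failed: List[str] = []
    for res_id, alloc in allocated.items():
        # Assumption: a resource without actual use data is treated as unused.
        used = actual.get(res_id, Usage(frozenset(), 0))
        if alloc.clients == used.clients and alloc.quantity == used.quantity:
            healthy.append(res_id)
        else:
            failed.append(res_id)
    return healthy, failed
```

Both the client set and the quantity must match for a resource to count as usage-healthy, mirroring the two comparisons described above.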
Fig. 4 is a flowchart of a heterogeneous acceleration resource device usage health monitoring method according to an embodiment of the present application. As shown in fig. 4, the method specifically includes the following steps:
step S402: calling the cloud platform heterogeneous acceleration resource interface to acquire the detailed allocation data (including the allocated clients, the allocated quantity and the like) of the heterogeneous acceleration resources;
step S404: checking each allocated acceleration resource in turn;
step S406: judging whether the corresponding client exists; if so, directly executing step S410, and if not, executing step S408;
step S408: adding the heterogeneous acceleration resource to an allocation failure list and recording the abnormally allocated client;
step S410: judging whether any heterogeneous acceleration resources remain to be checked; if so, returning to step S404, and if not, executing step S412;
step S412: outputting the heterogeneous acceleration resources with allocation failures.
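Steps S404 to S412 can be sketched as a loop that checks whether every client recorded for a resource still exists. The mapping from resource ID to client IDs and the `existing_clients` set are assumed inputs; in practice they would come from the cloud platform interfaces, which the source does not detail.

```python
from typing import Dict, List, Set


def find_allocation_failures(
    allocations: Dict[str, List[str]],
    existing_clients: Set[str],
) -> Dict[str, List[str]]:
    """Map each allocation-failure resource to its abnormally allocated clients."""
    failures: Dict[str, List[str]] = {}
    for res_id, clients in allocations.items():        # S404: check each allocated resource
        missing = [c for c in clients if c not in existing_clients]  # S406
        if missing:
            failures[res_id] = missing                  # S408: record resource and clients
    return failures                                     # S412: output the failure list
```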
In this embodiment, each heterogeneous acceleration resource may be virtualized for use by multiple clients, and the client types typically include virtual machines, bare-metal machines, containers and the like.
Step S406 may specifically include judging whether the virtual machine, bare-metal machine or container to which the heterogeneous acceleration resource is allocated exists and, if so, determining that the client allocated to the heterogeneous acceleration resource is using it normally.
The method in this embodiment addresses the problem in the related art that, under virtualized allocation, the allocation registration of a heterogeneous acceleration resource easily becomes inconsistent with its actual use by clients. Heterogeneous acceleration resources with allocation failures can be identified in time, which prevents a virtualized heterogeneous acceleration resource from being repeatedly allocated to multiple clients and helps ensure the security and stability of the cloud computing platform.
In one embodiment, performing allocation exception handling on the allocation-failure resources includes updating the data of each allocation-failure resource according to the actual use data. Specifically, the allocated clients in the allocation data are updated with the using clients in the actual use data, and the allocated quantity in the allocation data is updated with the used quantity in the actual use data.
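A minimal sketch of this update, assuming allocation and use records are plain dictionaries with hypothetical `clients` and `quantity` keys, could look like this:

```python
from typing import Any, Dict


def reconcile(allocation: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, Any]:
    """Return a corrected copy of an allocation record based on actual use data."""
    updated = dict(allocation)                      # leave the original record untouched
    updated["clients"] = sorted(actual["clients"])  # allocated clients <- using clients
    updated["quantity"] = actual["quantity"]        # allocated quantity <- used quantity
    return updated
```

Returning a copy rather than mutating in place is a design choice of the sketch; it keeps the stale record available for alerting or auditing.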
FIG. 5 is a timing diagram of device usage health monitoring and handling according to an alternative embodiment of the present application. As shown in FIG. 5, the method for device usage health monitoring and handling of heterogeneous acceleration resources specifically includes the following steps:
step S502: outputting the allocation-failure resources according to the device usage health monitoring method;
step S504: calling the response module to perform allocation exception handling on the allocation-failure resources;
step S506: updating the heterogeneous acceleration resource information of the allocation-failure resources;
step S508: returning the update result;
step S510: returning.
In another embodiment, the heterogeneous acceleration resource exception handling method further includes raising an exception alarm for the hardware-unhealthy resources and the allocation-failure resources.
In one embodiment, performing hardware exception handling on a hardware-unhealthy resource specifically includes the following steps:
judging whether the use state of the hardware-unhealthy resource is unavailable; if the use state is not unavailable, setting the use state of the hardware-unhealthy resource to unavailable and setting its recovery state to recoverable;
and judging whether the hardware-unhealthy resource has been allocated to a client; if so, notifying the cloud computing platform to migrate the client to which the hardware-unhealthy resource has been allocated, and/or setting the recovery state of the hardware-unhealthy resource to unrecoverable.
Specifically, the usage state of a heterogeneous acceleration resource is either available or unavailable, and its recovery state is either recoverable or unrecoverable. When the usage state of a heterogeneous acceleration resource is set, the system automatically records the source of the setting. If the usage state is set by an administrator, the source is recorded as the administrator and the corresponding recovery state is unrecoverable; if the usage state is set automatically by the exception response module, the source is recorded as the response module and the corresponding recovery state is recoverable.
In this embodiment, notifying the cloud computing platform to migrate the client to which the hardware unhealthy resource has been allocated may specifically include notifying the relevant administrator of the cloud computing platform so that the usage of the hardware unhealthy resource can be assessed promptly, and performing live migration (that is, reallocating normal heterogeneous acceleration resources to the clients) or another action on all clients, such as virtual machines, bare-metal machines, and containers, that have used the hardware unhealthy resource.
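The two-step hardware exception handling described above can be sketched as follows. The sketch is illustrative only: the `Resource` type, the `state_set_by` field, the `notify_migration` callback, and the `mark_unrecoverable_if_allocated` flag are hypothetical names, and the flag is one possible reading of the "and/or" in the second step.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class UsageState(Enum):
    AVAILABLE = "available"
    UNAVAILABLE = "unavailable"


class RecoveryState(Enum):
    RECOVERABLE = "recoverable"
    UNRECOVERABLE = "unrecoverable"


@dataclass
class Resource:
    pci_address: str
    usage_state: UsageState = UsageState.AVAILABLE
    recovery_state: RecoveryState = RecoveryState.RECOVERABLE
    state_set_by: Optional[str] = None  # "administrator" or "response_module"
    client: Optional[str] = None        # client the resource is allocated to


def handle_hardware_exception(
    res: Resource,
    notify_migration: Callable[[str, str], None],
    mark_unrecoverable_if_allocated: bool = False,
) -> None:
    """Apply the two steps to one hardware unhealthy resource."""
    # Step 1: change the state only if the resource is not already
    # unavailable, so a state set earlier by an administrator is kept.
    if res.usage_state is not UsageState.UNAVAILABLE:
        res.usage_state = UsageState.UNAVAILABLE
        res.recovery_state = RecoveryState.RECOVERABLE
        res.state_set_by = "response_module"

    # Step 2: if a client holds the resource, ask the platform to migrate it.
    if res.client is not None:
        notify_migration(res.client, res.pci_address)
        if mark_unrecoverable_if_allocated:
            res.recovery_state = RecoveryState.UNRECOVERABLE
```

Recording who set the state is what later allows automatic recovery to skip resources that an administrator disabled deliberately.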
In an embodiment, abnormal resource information corresponding to the hardware unhealthy resources and the allocation failure resources may be obtained; the abnormal resource information is standardized to obtain standardized exception information; and the standardized exception information is reported to the cloud computing platform. In this way, the relevant personnel can be notified promptly to handle the exception, and the standardized exception information can be stored in the cloud computing platform to facilitate later lookup.
In another embodiment, the standardized exception information may also be obtained from the cloud computing platform; health resource information corresponding to the hardware healthy resources and the used healthy resources is obtained; the health resource information is standardized to obtain standardized health information; recoverable resources are determined from the standardized exception information based on the standardized health information; and if the recovery state of a recoverable resource is recoverable, recovery processing is performed on that resource. Specifically, the standardized exception information and the standardized health information each include at least the PCI address, vendor information, device model, device ID, and the like of the heterogeneous acceleration resource, where the PCI address includes a slot number.
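One way the standardization step might look is sketched below. The application does not specify a concrete format, so everything here is an assumption made for illustration: the `StandardInfo` record, the canonical `dddd:bb:ss.f` address form, the lower-casing rules, and the JSON report produced by `to_report` are all hypothetical.

```python
import json
import re
from dataclasses import asdict, dataclass

# Accepts "bb:ss.f" or "dddd:bb:ss.f" (hexadecimal, already lower-cased).
_PCI_RE = re.compile(
    r"^(?:(?P<domain>[0-9a-f]{4}):)?"
    r"(?P<bus>[0-9a-f]{2}):(?P<slot>[0-9a-f]{2})\.(?P<func>[0-7])$"
)


@dataclass(frozen=True)
class StandardInfo:
    pci_address: str  # canonical form dddd:bb:ss.f, slot number included
    vendor: str
    model: str
    device_id: str


def normalize_pci(addr: str) -> str:
    """Return the address in lower case with an explicit domain."""
    m = _PCI_RE.match(addr.strip().lower())
    if m is None:
        raise ValueError(f"not a PCI address: {addr!r}")
    domain = m.group("domain") or "0000"
    return f"{domain}:{m.group('bus')}:{m.group('slot')}.{m.group('func')}"


def standardize(raw: dict) -> StandardInfo:
    """Map raw, source-specific resource information to one common shape."""
    return StandardInfo(
        pci_address=normalize_pci(raw["pci_address"]),
        vendor=raw["vendor"].strip().lower(),
        model=raw["model"].strip(),
        device_id=raw["device_id"].strip().lower(),
    )


def to_report(info: StandardInfo) -> str:
    """Serialize one record for reporting (JSON is an assumed wire format)."""
    return json.dumps(asdict(info), sort_keys=True)
```

Normalizing both the exception information and the health information with the same function is what makes a later field-by-field comparison meaningful.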
In this embodiment, determining the recoverable resource from the standardized exception information based on the standardized health information includes: matching the standardized health information and the standardized exception information according to a preset matching rule, wherein the preset matching rule comprises matching at least one of the following resource information: PCI address, vendor information, model; and determining the heterogeneous acceleration resource corresponding to the successfully matched standardized exception information as a recoverable resource.
In this embodiment, performing recovery processing on the recoverable resource may specifically include: if an exception alarm corresponding to the recoverable resource exists, canceling the exception alarm; and setting the usage state of the recoverable resource to available.
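The matching and recovery steps can be sketched as follows. This is an illustration under assumptions: the `ResourceInfo` type, the string-valued states, the set of active alarms keyed by PCI address, and the choice to match on all three fields (the description requires at least one) are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass
class ResourceInfo:
    pci_address: str
    vendor: str
    model: str
    usage_state: str = "unavailable"     # "available" or "unavailable"
    recovery_state: str = "recoverable"  # "recoverable" or "unrecoverable"


# Preset matching rule; matching on all three fields is an assumption.
MATCH_FIELDS: Tuple[str, ...] = ("pci_address", "vendor", "model")


def find_recoverable(
    healthy: Iterable[ResourceInfo],
    abnormal: Iterable[ResourceInfo],
    fields: Tuple[str, ...] = MATCH_FIELDS,
) -> List[ResourceInfo]:
    """Return the abnormal records that match a currently healthy record."""
    healthy_keys = {tuple(getattr(h, f) for f in fields) for h in healthy}
    return [a for a in abnormal
            if tuple(getattr(a, f) for f in fields) in healthy_keys]


def recover(res: ResourceInfo, active_alarms: Set[str]) -> bool:
    """Cancel the alarm, if any, and make the resource available again.

    Resources whose recovery state is not recoverable are left untouched.
    """
    if res.recovery_state != "recoverable":
        return False
    active_alarms.discard(res.pci_address)  # no-op when no alarm exists
    res.usage_state = "available"
    return True
```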
FIG. 6 is a timing diagram of heterogeneous acceleration resource exception recovery processing according to an alternative embodiment of the present application. As shown in FIG. 6, the heterogeneous acceleration resource exception recovery processing method specifically includes the following steps:
Step S601: output the healthy heterogeneous acceleration resources according to the hardware health monitoring method;
Step S602: send the healthy heterogeneous acceleration resource information;
Step S603: obtain the reported unhealthy heterogeneous acceleration resource information;
Step S604: return the reported unhealthy heterogeneous acceleration resource information;
Step S605: identify the recoverable heterogeneous acceleration resources by a specific method;
Step S606: standardize the heterogeneous acceleration resource information and call the alarm recovery interface of the cloud computing platform;
Step S607: return;
Step S608: determine whether the heterogeneous acceleration resource needs to be restored to available;
Step S609: return.
In this embodiment, the specific method in step S605 may specifically include comparing data according to the PCI address, vendor information, device model, device ID, official interface, and the like, or performing identification by a specific algorithm.
In this embodiment, determining in step S608 whether the heterogeneous acceleration resource needs to be restored to available may specifically include making the determination according to the recovery state of the heterogeneous acceleration resource, and restoring the heterogeneous acceleration resource to available if the recovery state is recoverable.
In another embodiment, the heterogeneous accelerated resource exception recovery processing method in steps S601 to S609 described above may be performed according to a preset recovery period.
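Running the recovery flow at a preset period could be arranged as in the sketch below. It is a generic scheduling loop rather than anything specified by the application; `run_periodically`, its parameters, and the use of a `threading.Event` as a stop signal are assumptions chosen for illustration.

```python
import threading
from typing import Callable, Optional


def run_periodically(
    task: Callable[[], None],
    period_s: float,
    stop: threading.Event,
    max_runs: Optional[int] = None,
) -> int:
    """Call task once per period until stop is set; return the run count.

    task stands for one pass of the recovery flow (steps S601 to S609).
    max_runs bounds the loop, which is mainly useful in tests.
    """
    runs = 0
    while not stop.is_set():
        task()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Sleeps for one period but wakes early if stop is set meanwhile.
        stop.wait(period_s)
    return runs
```

Waiting on the event instead of sleeping lets a shutdown request interrupt the pause between two recovery passes.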
According to the heterogeneous acceleration resource exception recovery processing method of this embodiment, when an abnormal condition of a heterogeneous acceleration resource is detected, an alarm can be sent promptly to the clients and to the cloud computing platform administrator, avoiding serious losses. In addition, an abnormal heterogeneous acceleration resource may be restored to a healthy state through human intervention or automatic system processing; in that case, this embodiment can automatically restore the heterogeneous acceleration resource to the available state, responding promptly and processing quickly, thereby reducing the adverse effect on users and improving the reliability of the cloud computing platform.
According to another aspect of the embodiments of the present application, a heterogeneous acceleration resource exception handling device is further provided. FIG. 7 is a block diagram of the heterogeneous acceleration resource exception handling device according to an embodiment of the present application. As shown in FIG. 7, the device includes:
a first monitoring module 702, configured to determine whether a heterogeneous acceleration resource of the cloud computing platform is a hardware healthy resource or a hardware unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resource;
a second monitoring module 704, configured to determine whether the heterogeneous acceleration resource is a used healthy resource or an allocation failure resource by performing device usage health monitoring on the heterogeneous acceleration resource;
a first response module 706, configured to perform hardware exception handling on the hardware unhealthy resource;
and a second response module 708, configured to perform allocation exception handling on the allocation failure resource.
In an embodiment, the device further comprises:
the scanning module is used for determining whether the heterogeneous acceleration resource exists or not by scanning the PCI slot;
and the first acquisition module is used for acquiring the resource information of the heterogeneous acceleration resource if the heterogeneous acceleration resource exists.
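On a Linux host, the scanning and acquisition modules could be approximated by reading the PCI entries exposed under sysfs, as sketched below. The application does not name a scanning mechanism, so this is only one possible approach: the function `scan_pci` and the choice of PCI base classes treated as acceleration resources are assumptions.

```python
from pathlib import Path
from typing import Dict, List

# PCI base classes treated as heterogeneous acceleration resources in this
# sketch (an assumption): 0x03 display controllers, 0x12 processing accelerators.
ACCELERATOR_BASE_CLASSES = {0x03, 0x12}


def scan_pci(root: Path = Path("/sys/bus/pci/devices")) -> List[Dict[str, str]]:
    """List accelerator-like PCI devices found under a sysfs device tree.

    Each entry directory is named after the PCI address and holds the
    text files "class", "vendor" and "device" with hexadecimal values.
    """
    found: List[Dict[str, str]] = []
    if not root.is_dir():
        return found  # no PCI tree here, so no resources exist
    for dev in sorted(root.iterdir()):
        try:
            class_code = int((dev / "class").read_text().strip(), 16)
            vendor = (dev / "vendor").read_text().strip()
            device = (dev / "device").read_text().strip()
        except (OSError, ValueError):
            continue  # unreadable or malformed entry: skip it
        # The base class is the top byte of the 24-bit class code.
        if (class_code >> 16) & 0xFF in ACCELERATOR_BASE_CLASSES:
            found.append(
                {"pci_address": dev.name, "vendor": vendor, "device_id": device}
            )
    return found
```

An empty result corresponds to the case where no heterogeneous acceleration resource exists; otherwise each dictionary carries the resource information used by the later steps.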
In one embodiment, the first monitoring module 702 further comprises:
the calling unit is used for calling the corresponding hardware health detection interface according to the resource information of the heterogeneous acceleration resource;
the detection unit is used for judging the hardware state of the heterogeneous acceleration resource through the hardware health detection interface;
and the first judging unit is used for determining that the heterogeneous acceleration resource is the hardware healthy resource if the hardware state is healthy, and determining that the heterogeneous acceleration resource is the hardware unhealthy resource if the hardware state is unhealthy.
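The calling, detection, and judging units can be pictured as a small dispatch table that selects a health detection interface from the resource information. The sketch below is hypothetical: the registry keyed by vendor, the `HealthProbe` signature, and the rule that a failing probe counts as unhealthy are assumptions, and real vendor tools would sit behind the registered callables.

```python
from typing import Callable, Dict

# A probe takes a PCI address and returns True if the hardware is healthy.
HealthProbe = Callable[[str], bool]

_PROBES: Dict[str, HealthProbe] = {}


def register_probe(vendor: str, probe: HealthProbe) -> None:
    """Associate a hardware health detection interface with a vendor."""
    _PROBES[vendor.lower()] = probe


def classify_hardware(info: Dict[str, str]) -> str:
    """Return "hardware_healthy" or "hardware_unhealthy" for one resource."""
    probe = _PROBES.get(info["vendor"].lower())
    if probe is None:
        raise LookupError(f"no health probe registered for vendor {info['vendor']!r}")
    try:
        healthy = probe(info["pci_address"])
    except Exception:
        # Assumption: a probe that fails to run is treated as unhealthy.
        healthy = False
    return "hardware_healthy" if healthy else "hardware_unhealthy"
```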
In an embodiment, the device further comprises:
and the exception alarm module is used for raising an exception alarm for the hardware unhealthy resources and the allocation failure resources.
In an embodiment, the second monitoring module 704 further includes:
the first acquisition unit is used for acquiring the allocation data of the heterogeneous acceleration resources;
and the second judging unit is used for determining the used healthy resources and the allocation failure resources according to the allocation data.
In an embodiment, the second judging unit further includes:
the second acquisition unit is used for determining actual use data of the heterogeneous acceleration resources;
the data comparison unit is used for comparing the allocation data and the actual usage data of each heterogeneous acceleration resource in turn; if the allocation data is consistent with the actual usage data, the heterogeneous acceleration resource is determined to be a used healthy resource, and otherwise the heterogeneous acceleration resource is determined to be an allocation failure resource.
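The comparison performed by the data comparison unit can be sketched as follows. The data shapes are assumptions made for the example: both inputs are dictionaries keyed by PCI address whose values are hypothetical `(client, quantity)` tuples, and a resource that appears on only one side is treated as inconsistent.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int]  # (client, quantity): hypothetical record shape


def classify_usage(
    allocated: Dict[str, Record],
    actual: Dict[str, Record],
) -> Tuple[List[str], List[str]]:
    """Split resources into used healthy and allocation failure resources.

    Returns two lists of PCI addresses: (used_healthy, allocation_failure).
    """
    used_healthy: List[str] = []
    allocation_failure: List[str] = []
    # Walk every resource known to either side, one at a time.
    for addr in sorted(set(allocated) | set(actual)):
        if allocated.get(addr) == actual.get(addr):
            used_healthy.append(addr)
        else:
            allocation_failure.append(addr)
    return used_healthy, allocation_failure
```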
In one embodiment, the second response module 708 is further configured to:
and carrying out data updating on the allocation data of the allocation failure resources according to the actual use data.
In one embodiment, the first response module 706 further includes:
the setting unit is used for judging whether the use state of the hardware unhealthy resource is unavailable, if not, setting the use state of the hardware unhealthy resource to be unavailable, and setting the recovery state of the hardware unhealthy resource to be recoverable;
and the processing unit is used for judging whether the hardware unhealthy resources are allocated to the clients, if so, notifying the cloud computing platform to migrate the clients to which the hardware unhealthy resources are allocated, and/or setting the recovery state of the hardware unhealthy resources to be unrecoverable.
In an embodiment, the device further comprises:
the second acquisition module is used for acquiring abnormal resource information corresponding to the hardware unhealthy resources and the allocation failure resources;
the first standardized module is used for carrying out standardized processing on the abnormal resource information to obtain standardized abnormal information;
and the reporting module is used for reporting the standardized abnormal information to the cloud computing platform.
In an embodiment, the device further comprises:
the third acquisition module is used for acquiring the standardized abnormal information from the cloud computing platform;
a fourth obtaining module, configured to obtain health resource information corresponding to the hardware health resource and the used health resource;
the second standardized module is used for carrying out standardized processing on the health resource information to obtain standardized health information;
the recovery judging module is used for determining recoverable resources from the standardized abnormal information according to the standardized health information;
and the recovery processing module is used for carrying out recovery processing on the recoverable resource if the recovery state of the recoverable resource is recoverable.
In one embodiment, the recovery determination module includes:
the matching unit is used for matching the standardized health information and the standardized abnormal information according to a preset matching rule, wherein the preset matching rule comprises matching at least one of the following resource information: PCI address, vendor information, model;
and the recovery judging unit is used for determining heterogeneous acceleration resources corresponding to the standardized abnormal information which are successfully matched as the recoverable resources.
In an embodiment, the recovery processing module includes:
a cancellation unit, configured to cancel an abnormal alarm corresponding to the recoverable resource if the abnormal alarm exists;
and the recovery unit is used for setting the use state of the recoverable resource to be available.
According to another aspect of the embodiments of the present application, a heterogeneous acceleration resource health monitoring and exception handling architecture is also provided.
FIG. 8 shows a heterogeneous acceleration resource health monitoring and exception handling architecture according to an embodiment of the present application. As shown in FIG. 8, the architecture includes:
a health identification module 81, comprising: a hardware health monitoring module 811, a device usage health monitoring module 812, and a cloud platform heterogeneous resource used interface 813;
an exception handling module 82, comprising: an exception alarm module 821, an exception response module 822, an exception recovery module 823, a cloud platform alarm interface 824, and a cloud platform heterogeneous resource management interface 825.
In this embodiment, the hardware health monitoring module 811 is configured to implement some or all of the functions of the first monitoring module 702 described above; the device usage health monitoring module 812 is configured to implement some or all of the functions of the second monitoring module 704 described above; and the cloud platform heterogeneous resource used interface 813 is configured to implement some or all of the functions of the second acquisition unit described above.
Specifically, the hardware health monitoring module 811 is configured to determine whether a heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resources of the cloud computing platform; the device usage health monitoring module 812 is configured to determine whether the heterogeneous acceleration resource is a used healthy resource or an allocation failure resource by performing device usage health monitoring on the heterogeneous acceleration resource; and the cloud platform heterogeneous resource used interface 813 is configured to determine the actual usage data of the heterogeneous acceleration resource.
In another embodiment, the exception alarm module 821 is configured to raise exception alarms for the hardware unhealthy resources and the allocation failure resources; the exception response module 822 is configured to implement some or all of the functions of the first response module 706 and the second response module 708, including exception handling for the hardware unhealthy resources and the allocation failure resources; the cloud platform alarm interface 824 is configured to notify the cloud computing platform of exception alarm information; and the cloud platform heterogeneous resource management interface 825 is configured to manage the heterogeneous acceleration resources, including setting their usage state.
The embodiments of the present application can solve the problem in the related art that, because attention is paid only to detecting the common hardware resources of a traditional server, inconsistencies between the registration of virtualized heterogeneous acceleration resources managed by the cloud computing platform and their actual use cannot be identified, causing losses to the cloud computing platform and its users. When a heterogeneous acceleration resource becomes abnormal, its unhealthy state can be perceived quickly, and alarms and recovery can be carried out promptly, which helps ensure the reliability, stability, and timeliness of the cloud platform's management of heterogeneous acceleration resources.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
In one exemplary embodiment, the computer readable storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium capable of storing a computer program.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
In an exemplary embodiment, the electronic apparatus may further include a transmission device connected to the processor, and an input/output device connected to the processor.
For specific examples of this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations, which are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the present application described above may be implemented on a general-purpose computing device. They may be concentrated on a single computing device or distributed across a network of multiple computing devices, and they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from that described herein; alternatively, the modules or steps may each be fabricated as an individual integrated circuit module, or several of them may be fabricated as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above describes only preferred embodiments of the present application and is not intended to limit the present application; those skilled in the art may make various modifications and variations to it. Any modification, equivalent replacement, improvement, and the like made within the principle of the present application shall fall within the protection scope of the present application.

Claims (15)

1. A heterogeneous accelerated resource exception handling method, the method comprising:
determining whether a heterogeneous acceleration resource of a cloud computing platform is a hardware healthy resource or a hardware unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resource;
determining whether the heterogeneous acceleration resource is a used healthy resource or an allocation failure resource by performing device usage health monitoring on the heterogeneous acceleration resource;
performing hardware exception handling on the hardware unhealthy resource;
and performing allocation exception handling on the allocation failure resource.
2. The method of claim 1, wherein prior to determining that the heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource by hardware health monitoring of the heterogeneous acceleration resource of the cloud computing platform, the method further comprises:
determining whether the heterogeneous acceleration resource exists by scanning a PCI slot;
and if the heterogeneous acceleration resource exists, acquiring resource information of the heterogeneous acceleration resource.
3. The method of claim 2, wherein determining that the heterogeneous acceleration resource is a hardware healthy resource or a hardware unhealthy resource by means of hardware health monitoring of the heterogeneous acceleration resource of the cloud computing platform comprises:
calling a corresponding hardware health detection interface according to the resource information of the heterogeneous acceleration resource;
judging the hardware state of the heterogeneous acceleration resource through the hardware health detection interface;
if the hardware state is healthy, determining that the heterogeneous acceleration resource is the hardware healthy resource;
and if the hardware state is unhealthy, determining that the heterogeneous acceleration resource is the unhealthy hardware resource.
4. The method according to claim 1, wherein the method further comprises:
and raising an exception alarm for the hardware unhealthy resource and the allocation failure resource.
5. The method of claim 1, wherein determining whether the heterogeneous acceleration resource is a used healthy resource or an allocation failure resource by performing device usage health monitoring on the heterogeneous acceleration resource comprises:
acquiring allocation data of the heterogeneous acceleration resource;
and determining the used healthy resource and the allocation failure resource according to the allocation data.
6. The method of claim 5, wherein determining the used healthy resource and the allocation failure resource according to the allocation data comprises:
determining actual usage data of the heterogeneous acceleration resource;
and carrying out data comparison on the allocation data and the actual use data of each heterogeneous acceleration resource in sequence, if the allocation data are consistent with the actual use data, determining that the heterogeneous acceleration resource is a used healthy resource, otherwise, determining that the heterogeneous acceleration resource is an allocation failure resource.
7. The method of claim 6, wherein performing allocation exception handling on the allocation failure resource comprises:
and carrying out data updating on the allocation data of the allocation failure resources according to the actual use data.
8. The method of claim 1, wherein performing hardware exception handling on the hardware unhealthy resource comprises:
judging whether the use state of the hardware unhealthy resource is unavailable or not, if not, setting the use state of the hardware unhealthy resource to be unavailable, and setting the recovery state of the hardware unhealthy resource to be recoverable;
judging whether the hardware unhealthy resource is allocated to the client, if so, notifying the cloud computing platform to migrate the client to which the hardware unhealthy resource is allocated, and/or setting the recovery state of the hardware unhealthy resource to be unrecoverable.
9. The method according to claim 1, wherein the method further comprises:
acquiring abnormal resource information corresponding to the hardware unhealthy resource and the allocation failure resource;
carrying out standardized processing on the abnormal resource information to obtain standardized abnormal information;
and reporting the standardized abnormal information to a cloud computing platform.
10. The method according to claim 9, wherein the method further comprises:
acquiring the standardized exception information from the cloud computing platform;
acquiring health resource information corresponding to the hardware health resource and the used health resource;
carrying out standardized processing on the health resource information to obtain standardized health information;
determining recoverable resources from the standardized exception information according to the standardized health information;
and if the recovery state of the recoverable resource is recoverable, carrying out recovery processing on the recoverable resource.
11. The method of claim 10, wherein determining recoverable resources from the standardized exception information based on the standardized health information comprises:
matching the standardized health information and the standardized exception information according to a preset matching rule, wherein the preset matching rule comprises matching at least one of the following resource information: PCI address, vendor information, model;
and determining the heterogeneous acceleration resource corresponding to the successfully matched standardized exception information as the recoverable resource.
12. The method of claim 10, wherein if the recovery status of the recoverable resource is recoverable, performing recovery processing on the recoverable resource comprises:
if the abnormal alarm corresponding to the recoverable resource exists, canceling the abnormal alarm;
and setting the use state of the recoverable resource to be available.
13. A heterogeneous accelerated resource exception handling device, the device comprising:
a first monitoring module, configured to determine whether a heterogeneous acceleration resource of a cloud computing platform is a hardware healthy resource or a hardware unhealthy resource by performing hardware health monitoring on the heterogeneous acceleration resource;
a second monitoring module, configured to determine whether the heterogeneous acceleration resource is a used healthy resource or an allocation failure resource by performing device usage health monitoring on the heterogeneous acceleration resource;
a first response module, configured to perform hardware exception handling on the hardware unhealthy resource;
and a second response module, configured to perform allocation exception handling on the allocation failure resource.
14. A computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of any of claims 1 to 12 when run.
15. An electronic device comprising a memory in which a computer program is stored and a processor arranged to run the computer program to perform the method of any of claims 1 to 12.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210563855.XA CN117149474A (en) 2022-05-23 2022-05-23 Heterogeneous acceleration resource exception processing method and device, storage medium and electronic device
PCT/CN2023/086292 WO2023226601A1 (en) 2022-05-23 2023-04-04 Anomaly processing method and apparatus for heterogeneous acceleration resource, and storage medium and electronic apparatus

Publications (1)

Publication Number Publication Date
CN117149474A true CN117149474A (en) 2023-12-01

Family

ID=88885425

Country Status (2)

Country Link
CN (1) CN117149474A (en)
WO (1) WO2023226601A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9141487B2 (en) * 2013-01-15 2015-09-22 Microsoft Technology Licensing, Llc Healing cloud services during upgrades
US9871857B2 (en) * 2015-04-29 2018-01-16 Microsoft Technology Licensing, Llc Optimal allocation of dynamic cloud computing platform resources
CN106612312A (en) * 2015-10-23 2017-05-03 中兴通讯股份有限公司 Virtualized data center scheduling system and method
CN111694789A (en) * 2020-04-22 2020-09-22 西安电子科技大学 Embedded reconfigurable heterogeneous determination method, system, storage medium and processor
CN114296943A (en) * 2021-12-31 2022-04-08 武汉路特斯汽车有限公司 Resource allocation method, device and equipment based on virtualization technology

Also Published As

Publication number Publication date
WO2023226601A1 (en) 2023-11-30

Legal Events

Date Code Title Description
PB01 Publication