WO2022151951A1 - Task scheduling method and management system - Google Patents

Task scheduling method and management system

Publication number
WO2022151951A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
management system
resource
quota
bmc
Application number
PCT/CN2021/141119
Other languages
French (fr)
Chinese (zh)
Inventor
朱杰
初雨
王亮
贺骞
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Publication of WO2022151951A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Definitions

  • the present application relates to the field of computer technology, and in particular, to a task scheduling method and management system.
  • Computing devices such as servers usually include an independent management module, that is, a Baseboard Management Controller (BMC).
  • AI artificial intelligence
  • the embodiments of the present application provide a task scheduling method and management system, which help to prevent the AI application from using too many resources and affecting the basic services of the BMC.
  • the technical solution is as follows.
  • In a first aspect, a task scheduling method is provided. The method is applied in a BMC that includes at least one AI application, a management system and a basic AI library, and the method includes: the management system, in response to a request of a first AI application among the at least one AI application, uses the basic AI library to generate at least one AI task;
  • the management system allocates a first resource for the first AI task from the resources of the BMC, and the first AI task is an AI task in the at least one AI task;
  • the management system uses the first resource to execute the first AI task, thereby obtaining an execution result;
  • the management system provides the execution result to the first AI application.
  • In this method, a management system for AI tasks is introduced into the BMC, which decouples AI applications from AI tasks and from the underlying basic AI library. The management system uniformly handles resource management and task scheduling for every AI task. This strengthens control over AI tasks, prevents them from using more resources than expected without restriction, and keeps them from occupying so many resources that the basic services of the BMC are affected, which helps to ensure the stability of the BMC's basic services to a certain extent.
  • the management system allocates the first resource for the first AI task from the resources of the BMC, including: the management system allocates the first resource for the first AI task according to the quota, and the first resource does not exceed the quota.
  • quotas are used to limit the resources that can be obtained by AI tasks, so as to prevent the resources used by AI tasks from exceeding the upper limit and affecting the basic business of BMC.
  • The quota includes an overall quota, which indicates the quota of the at least one AI task as a whole. The management system allocating the first resource for the first AI task according to the quota includes: the management system allocates overall resources to the at least one AI task according to the overall quota, where the overall resources do not exceed the overall quota; and the management system allocates the first resource from the overall resources.
  • The quota further includes a proportional quota, which indicates the ratio of the quota of the first AI task to the overall quota. The management system allocating the first resource from the overall resources includes: the management system allocates the first resource from the overall resources according to the proportional quota, where the first resource does not exceed the product of the overall resources and the proportional quota.
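  • The proportional rule above amounts to a simple cap: the first resource is at most the overall resources multiplied by the proportional quota. The following is a minimal sketch of that arithmetic only; the function names, the use of megabytes and the choice to grant `min(request, cap)` are illustrative assumptions, not details from the application.

```python
def proportional_cap(overall_resource_mb: float, proportional_quota: float) -> float:
    """Upper bound for one AI task: overall resources x proportional quota."""
    if not 0.0 <= proportional_quota <= 1.0:
        raise ValueError("proportional quota must be a ratio between 0 and 1")
    return overall_resource_mb * proportional_quota


def allocate_first_resource(requested_mb: float,
                            overall_resource_mb: float,
                            proportional_quota: float) -> float:
    """Grant the request, but never more than the task's proportional cap.

    Granting min(request, cap) is an assumption of this sketch; the source
    only requires that the grant does not exceed the cap.
    """
    cap = proportional_cap(overall_resource_mb, proportional_quota)
    return min(requested_mb, cap)
```

  • For instance, with 20M of overall resources and a proportional quota of 0.25, a task asking for 10M would be capped at 5M in this sketch, while a task asking for 3M would receive its full request.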
  • The method further includes: if the resources occupied by the first AI task exceed the quota, the management system kills the first AI task; or, if the resources occupied by the first AI task exceed the quota, the management system saves the data of the first AI task held in the memory of the BMC to the swap partition, and releases the space that the data occupied in the memory of the BMC.
  • the resources occupied by AI tasks can be released in time when they occupy too many resources, so that resources can be reserved for other AI tasks, thereby helping to improve the overall resource utilization.
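  • The two over-quota reactions described above (kill the task, or move its data to the swap partition and free the memory) can be modelled as a small bookkeeping routine. This is a hedged simulation only: the `AITask` record, the `Policy` enum and the megabyte counters are hypothetical, and no real process is killed and no real swap partition is touched.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    KILL = "kill"          # terminate the over-quota task
    SWAP_OUT = "swap_out"  # move its data to swap and free the memory


@dataclass
class AITask:
    name: str
    memory_mb: int        # data currently resident in BMC memory
    swapped_mb: int = 0   # data currently held in the swap partition
    alive: bool = True


def enforce_quota(task: AITask, quota_mb: int, policy: Policy) -> int:
    """Apply the chosen reaction if the task exceeds its quota.

    Returns the amount of BMC memory (in MB) that was released.
    """
    if task.memory_mb <= quota_mb:
        return 0  # within quota, nothing to do
    freed = task.memory_mb
    if policy is Policy.KILL:
        task.alive = False
    else:
        task.swapped_mb += task.memory_mb  # data is preserved in swap
    task.memory_mb = 0
    return freed
```

  • In both branches the memory is returned to the pool, which matches the stated goal of reserving resources for other AI tasks; the difference is only whether the task's data survives.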
  • The first resource is a resource originally occupied by a second AI task among the at least one AI task, and the priority of the second AI task is lower than the priority of the first AI task. The management system allocating the first resource for the first AI task from the resources of the BMC includes: if the remaining resources do not meet the resource requirements of the first AI task, the management system kills the second AI task and allocates the first resource to the first AI task from the resources released by the second AI task; or, if the remaining resources do not meet the resource requirements of the first AI task, the management system calls the control group (Cgroup) to adjust the resources of the first AI task and the resources of the second AI task.
  • high-priority AI tasks are allowed to preempt the resources of low-priority AI tasks, and high-priority AI tasks are guaranteed to obtain more resources, thereby meeting the requirements of quality of service (QoS).
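  • A minimal sketch of the first preemption alternative (killing lower-priority tasks to free resources) is given below. It only plans which tasks would be preempted. The tuple layout, the convention that a larger number means a higher priority, and the lowest-priority-first victim order are assumptions of this sketch; the Cgroup-based alternative is not modelled.

```python
from typing import List, Tuple

# (priority, name, memory_mb); larger priority value = more important (assumption)
RunningTask = Tuple[int, str, int]


def plan_preemption(free_mb: int,
                    running: List[RunningTask],
                    new_priority: int,
                    need_mb: int) -> Tuple[bool, List[str], int]:
    """Decide whether a new task can be admitted, preempting if required.

    Returns (admitted, names of tasks to kill, free memory after admission).
    If the request cannot be met even after preempting every lower-priority
    task, nothing is preempted and the request is rejected.
    """
    candidates = sorted(t for t in running if t[0] < new_priority)
    available = free_mb
    victims: List[str] = []
    for _priority, name, mb in candidates:
        if available >= need_mb:
            break
        victims.append(name)
        available += mb
    if available < need_mb:
        return False, [], free_mb
    return True, victims, available - need_mb
```

  • Tasks with a priority equal to or higher than the new task are never selected as victims, so a high-priority task can only take resources from lower-priority ones.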
  • The method further includes: the management system determines a business cycle of the BMC based on historical information, where the business cycle indicates the correspondence between resource overhead and time; and the management system determines the first AI task from the at least one AI task according to the business cycle.
  • the business cycle of the basic business is considered when selecting the AI task to be executed, it helps to avoid executing AI tasks with high resource consumption during the peak period of the basic business, thereby ensuring the stability of the BMC basic business.
  • The business cycle includes a business cycle of the basic service of the BMC and a business cycle of the at least one AI task. The management system determining the first AI task from the at least one AI task according to the business cycle includes: the management system determines the peak period of the basic service from the business cycle of the basic service, where the peak period refers to the time period corresponding to the maximum resource overhead in the business cycle; the management system determines the peak period of the at least one AI task from the business cycle of the at least one AI task; and the management system determines the first AI task according to the peak period of the basic service and the peak period of the at least one AI task, where the peak period of the first AI task is different from the peak period of the basic service.
  • The management system using the first resource to execute the first AI task includes: the management system determines a target time according to the business cycle of the basic service of the BMC, where the target time falls in a time period other than the peak period of the basic service; and the management system uses the first resource to execute the first AI task at the target time.
  • the execution time of the AI task can be staggered from the peak period of the basic business, avoiding the execution of the AI task during the peak period of the basic business, thereby avoiding the impact of the AI task on the basic business.
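  • The off-peak selection described above can be sketched as follows, assuming the business cycle is available as a mapping from a time slot (here an hour of the day) to the observed resource overhead. The data layout, the function names and the choice of the least-loaded slot as the target time are illustrative assumptions; the source only requires that the target time lies outside the peak period.

```python
from typing import Dict, Set


def peak_slots(cycle: Dict[int, float]) -> Set[int]:
    """Time slots at which the resource overhead reaches its maximum."""
    top = max(cycle.values())
    return {slot for slot, load in cycle.items() if load == top}


def pick_target_slot(basic_cycle: Dict[int, float]) -> int:
    """Choose an execution slot outside the peak period of the basic service.

    This sketch picks the least-loaded off-peak slot and breaks ties by
    taking the earliest slot.
    """
    peaks = peak_slots(basic_cycle)
    off_peak = {s: v for s, v in basic_cycle.items() if s not in peaks}
    if not off_peak:
        raise ValueError("every slot is a peak slot; no off-peak time exists")
    return min(off_peak, key=lambda s: (off_peak[s], s))
```

  • With a hypothetical cycle of `{0: 20, 6: 35, 12: 80, 18: 50}` (percent CPU by hour), the peak period is hour 12 and the sketch schedules the AI task at hour 0.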
  • The management system determining the business cycle of the BMC based on historical information includes: the management system determines the business cycle from the historical information by using a regression learning algorithm.
  • The management system executing the first AI task includes: the management system executes the first AI task according to an execution plan of the first AI task, where the execution plan indicates the time point at which to start executing the first AI task.
  • The execution plan includes a timed execution plan and an on-demand execution plan. The timed execution plan indicates that the first AI task is executed at a preset time point or once every preset period, and the on-demand execution plan indicates that the first AI task is executed when an instruction is received.
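  • The two kinds of execution plan can be expressed as a single "is this task due now?" check. The sketch below is an assumption-laden illustration: the class names, the use of seconds as the time unit and the `last_run` bookkeeping are not taken from the application.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class TimedPlan:
    start_at: float                 # preset time point (seconds)
    period: Optional[float] = None  # repeat interval; None means run once


@dataclass(frozen=True)
class OnDemandPlan:
    pass


def is_due(plan: Union[TimedPlan, OnDemandPlan],
           now: float,
           last_run: Optional[float] = None,
           instruction_received: bool = False) -> bool:
    """Return True if the AI task should start executing at time `now`."""
    if isinstance(plan, OnDemandPlan):
        return instruction_received      # runs only when told to
    if now < plan.start_at:
        return False                     # preset time point not reached yet
    if last_run is None:
        return True                      # first execution
    if plan.period is None:
        return False                     # one-shot plan already executed
    return now - last_run >= plan.period # periodic re-execution
```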
  • The at least one AI task is a plurality of AI tasks, and the method further includes: the management system determines the execution order of the plurality of AI tasks according to their priorities, where the higher the priority of an AI task, the earlier it is executed.
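  • Priority-based ordering of this kind can be sketched with a stable sort. The convention that a larger number means a higher priority, and the rule that tasks of equal priority keep their submission order, are assumptions of this sketch.

```python
from typing import List, Tuple


def execution_order(tasks: List[Tuple[str, int]]) -> List[str]:
    """Order (name, priority) pairs so that higher priorities run first.

    Python's sort is stable, so tasks with equal priority keep the order
    in which they were submitted.
    """
    ordered = sorted(tasks, key=lambda task: -task[1])
    return [name for name, _priority in ordered]
```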
  • In a second aspect, a management system is provided, which has the function of implementing the first aspect or any optional manner of the first aspect.
  • the management system includes at least one unit, and the at least one unit is configured to implement the method provided in the first aspect or any optional manner of the first aspect.
  • the elements in the management system are implemented in software, and the elements in the management system are program modules. In other embodiments, the elements in the management system are implemented by hardware or firmware.
  • a third aspect provides a BMC, where the BMC includes the management system described in the second aspect, at least one AI application, and a basic AI library.
  • In a fourth aspect, a BMC is provided, including a processor and a memory.
  • the memory stores computer instructions; the processor executes the computer instructions stored in the memory, so that the BMC executes the method provided in the first aspect or various optional manners of the first aspect.
  • a fifth aspect provides a computing device, where the computing device is, for example, a server, and the computing device includes the BMC provided in the fourth aspect.
  • The present application provides a computer-readable storage medium that stores computer instructions, and the computer instructions instruct the BMC to perform the method provided in the first aspect or various optional manners of the first aspect.
  • the present application provides a computer program product comprising computer instructions stored in a computer-readable storage medium.
  • the processor of the BMC may read the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the BMC performs the method provided in the first aspect or various optional manners of the first aspect.
  • a chip is provided, the chip may include programmable logic circuits and/or program instructions, and when the chip is running, it is used to implement the method provided in the first aspect or various optional manners of the first aspect.
  • FIG. 1 is a schematic diagram of a system architecture of a BMC provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a task scheduling method provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a task scheduling method provided by an embodiment of the present application.
  • FIG. 5 is an effect diagram of resource overhead when a management system is provided according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a BMC provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a management system provided by an embodiment of the present application.
  • BMC: supports the industry-standard Intelligent Platform Management Interface (IPMI) specification, which describes the management functions built into the motherboard. These functions include local and remote diagnostics, console support, configuration management, hardware management and troubleshooting. The BMC provides the following features: IPMI 1.0 compatibility; tachometer inputs for fan speed monitoring; pulse-width modulation outputs for fan speed control; button inputs for front panel buttons and switches; one serial port multiplexed with the server console port; remote access and Intelligent Chassis Management Bus (ICMB) support; three I2C master/slave ports (one of which is for the Intelligent Chassis Management Bus); an LPC (Low Pin Count) bus that provides access to three KCS (Keyboard Controller Style) and BT (Block Transfer) interfaces; a 32-bit ARM7 processor; a 160-pin LQFP (Low-profile Quad Flat Package); and firmware for the following interfaces: IPMI, IPMB.
  • Batch processing: a way for a computer system to execute programs, in which a series of predefined or randomly input programs is executed in an orderly manner according to specified rules without human intervention. Batch execution is usually used in various scheduling systems, where it can improve resource utilization and reduce human-computer interaction overhead.
  • Job: a set of program instances that need to be executed to complete a specific computing business, usually corresponding to a set of processes, containers or other runtime entities on one or more computers. In batch systems, jobs are also called "batch jobs".
  • Task: an individual instance in the set of program instances within a job. A task corresponds to a process, container, or other runtime entity on a computer.
  • Current server equipment contains an independent management module, the BMC.
  • The BMC usually adopts a chimney-style architecture for AI applications.
  • "Chimney-style" means that there is no unified management and control across the AI applications, no communication between them, and no data integration.
  • Each AI application stands alone like a chimney, hence the name "chimney application".
  • the BMC in the device is limited by its extremely limited resources and service processing capabilities, and it is difficult to carry the increasing number of AI applications, because AI applications often consume a lot of system resources and processing capabilities.
  • One approach is to rewrite the industry's general-purpose algorithm math libraries, data processing modules and the like for processing chips based on embedded systems (such as ARM64 and ARM32), and to optimize these libraries and modules for the instruction set as far as possible.
  • the corresponding AI applications such as intelligent operation and maintenance can directly call these mathematical libraries, so as to reduce the resource overhead of AI applications during execution as much as possible.
  • For example, the BMC has 1 CPU core and 512M of memory in total. When the main system is running together with conventional system services, 70% of the CPU and 380M of memory are occupied.
  • AI inference applications include, for example, memory failure prediction (15%, 50M), hard disk drive (HDD) failure prediction (5%, 50M), solid state drive (SSD) failure prediction (5%, 50M) and federated learning (20%, 75M). If the trigger conditions of these 4 AI inference applications are met at the same time, they are executed concurrently.
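  • Adding up the figures quoted above shows why uncontrolled concurrent execution is a problem. The short calculation below assumes that each pair means (CPU share in percent, memory in M) and that the CPU percentages refer to the same single core; the variable names are illustrative.

```python
# Figures from the example above, as (CPU share in %, memory in M).
BMC_TOTAL = (100, 512)   # one core, 512M of memory
BASELINE = (70, 380)     # main system plus conventional system services
AI_APPS = {
    "memory failure prediction": (15, 50),
    "HDD failure prediction": (5, 50),
    "SSD failure prediction": (5, 50),
    "federated learning": (20, 75),
}


def concurrent_demand(baseline, apps):
    """Total CPU and memory demand if every AI application runs at once."""
    cpu = baseline[0] + sum(c for c, _ in apps.values())
    mem = baseline[1] + sum(m for _, m in apps.values())
    return cpu, mem


cpu, mem = concurrent_demand(BASELINE, AI_APPS)
print(f"CPU demand: {cpu}% of {BMC_TOTAL[0]}%")   # 115% of 100%
print(f"Memory demand: {mem}M of {BMC_TOTAL[1]}M")  # 605M of 512M
```

  • Under these assumptions, the combined demand is 115% of the CPU and 605M of memory, which exceeds both the single core and the 512M available.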
  • In view of this, this embodiment proposes a management approach that decouples AI applications in the BMC from their AI tasks, and provides a corresponding AI task management system design that makes up for the shortcomings of the above solutions through resource management and control, task scheduling and the like.
  • the solution provided by this embodiment helps the BMC to efficiently execute multiple AI applications in limited resources, limits the resource overhead of the AI applications, and avoids affecting the execution of BMC basic service applications.
  • FIG. 1 is a schematic diagram of a system architecture of a BMC provided by an embodiment of the present application.
  • BMC includes at least one AI application, management system and basic AI library. Each component in FIG. 1 will be described in detail below through (1) to (3).
  • the AI application in the BMC is, for example, an intelligent operation and maintenance application.
  • AI applications include, but are not limited to, fault prediction applications, performance analysis applications, power consumption analysis applications, and the like.
  • the failure prediction application is used to predict the probability of failure of the system hardware managed by the BMC according to the AI algorithm.
  • failure prediction applications include, but are not limited to, memory failure prediction applications and hard disk failure prediction applications. Hard disk failure prediction applications such as HDD failure prediction applications, SSD failure prediction applications, etc.
  • the performance analysis application is used to analyze the performance of the system hardware managed by the BMC according to the AI algorithm.
  • the power consumption analysis application is used to analyze the power consumption of the system hardware managed by the BMC according to the AI algorithm.
  • the power consumption analysis application is further used to find reasonable power consumption for each hardware of the system and provide the found power consumption to the BMC, and the BMC adjusts the power consumption of the system hardware to the power consumption found by the power consumption analysis application.
  • the management system is also called AI task and model management system.
  • The management system allocates resources to AI tasks and schedules them. Specifically, the management system monitors the resource consumption of AI tasks in the BMC and the resource consumption of the BMC's basic services, and schedules jobs intelligently and dynamically based on quotas, priorities, resource consumption, execution plans and business cycles. The management system is software.
  • the management system is the AI runtime framework within the BMC.
  • The management system provides the BMC with a unified runtime control capability for executing AI applications safely and with high performance. This avoids the problem that resource contention, caused by the large resource overhead of AI applications and the uncoordinated concurrency of multiple AI applications, affects the effective execution of the BMC's basic services.
  • the management system includes a historical information regression learning module, a dynamic resource quota management module and a task management (scheduling execution) module.
  • The historical information regression learning module performs self-learning regression optimization based on historical information. Specifically, based on the system business cycle, it reasonably selects and schedules AI tasks according to whether they are memory-intensive (MEM intensive, such as training and federated learning) or CPU-intensive (CPU intensive, such as inference and batch data processing). Once the self-learning regression optimization function is enabled, the management system automatically learns the current BMC load cycle, identifies the load characteristics in different scenarios, and dynamically drives AI tasks (for example, distinguishing training from inference) so that they are scheduled and executed in a differentiated manner within time periods of reasonable resource usage.
  • The dynamic resource quota management module manages and controls the overall resource overhead of the AI tasks, providing a total resource package and an overall resource cap.
  • When executed, the dynamic resource quota management module dynamically controls the allocation and adjustment of resources for AI tasks based on resource quotas.
  • the task management (scheduling execution) module supports the execution plan management of AI tasks.
  • the execution plan includes a timed execution plan, an on-demand execution plan, and the like.
  • The task management (scheduling execution) module also supports priority-based execution management.
  • the task management (scheduling execution) module cooperates with the dynamic resource quota management module to improve the utilization of resources.
  • the basic AI library includes basic math library and data acquisition and processing module.
  • the base math library includes at least one model.
  • the models in the basic math library are, for example, models trained by machine learning algorithms.
  • basic math libraries include a federated learning (Collaborative Metric Learning, CML) model, a neural network (Neural Network, NN) model, a random forest model, a K-means (K-means) model, and the like.
  • FIG. 2 is a flowchart of a task scheduling method 200 provided by an embodiment of the present application.
  • the method 200 includes the following steps S201 to S205.
  • the system architecture on which the method 200 is based is as shown in FIG. 1 above.
  • The BMC in the method 200 is the BMC in FIG. 1; the at least one AI application in the method 200 includes AI application 1, AI application 2, AI application 3 and AI application n in FIG. 1; and the basic AI library in the method 200 is the basic AI library in FIG. 1.
  • the method 200 involves multiple AI tasks and multiple AI applications.
  • The terms "first AI task" and "second AI task" are used to distinguish and describe different AI tasks.
  • The terms "first AI application" and "second AI application" are used to distinguish and describe different AI applications.
  • the method 200 is used in a scenario where multiple AI applications are executed concurrently.
  • the method 200 is described by taking the interaction between the management system and the first AI application as an example.
  • For the interaction between the management system and other AI applications, refer to the interaction process with the first AI application.
  • Step S201 the first AI application generates and sends a request to the management system.
  • the first AI application is an AI application in the BMC.
  • the request of the first AI application is used to instruct the management system to perform at least one AI task.
  • the request of the first AI application includes an identification of at least one model in the base AI library and at least one input parameter of the model.
  • the ID of the model is used to identify the corresponding model in the basic AI library.
  • the first AI application specifies which model in the basic AI library to call for calculation by carrying the model identifier in the request.
  • For example, the request includes the identifier of the random forest model, which instructs the management system to invoke the random forest model to perform the AI task.
  • the input parameter in the request includes the attribute of the system hardware managed by the BMC, and the specific type of the attribute is related to the specific business logic of the first AI application and the type of the targeted system hardware.
  • Take the case where the system hardware is a hard disk as an example.
  • If the first AI application is a hard disk failure prediction application, the input parameters carried in the request are used to predict whether the hard disk will fail, and are specifically health status information of the hard disk, such as the number of scan errors of the hard disk, the reallocation count and the trial count.
  • If the first AI application is a performance analysis application, the input parameters carried in the request are the performance parameters of the hard disk, such as the rotational speed, capacity, average seek time, and transfer rate of the hard disk.
  • Step S202 the management system generates at least one AI task by using the basic AI library in response to the request of the first AI application.
  • AI tasks are tasks that call models in the basic AI library to perform operations. AI tasks are sometimes also referred to as AI jobs. AI tasks include, but are not limited to, training tasks and inference tasks. Training tasks include, but are not limited to, calculating gradient values of the model, calculating model parameters, and the like.
  • the inference task is the task of inference through the trained model.
  • For example, if the request indicates invoking a classification model in the basic AI library, the inference task corresponding to the classification model is to determine a category or the probability of a category.
  • As another example, if the request indicates invoking a regression model in the basic AI library, the inference task corresponding to the regression model is to determine a target value.
  • the AI task is specifically a task of performing operations according to input parameters in the request of the AI application and using the model indicated in the request of the AI application.
  • the management system obtains the identifier of the model and the input parameters of the model from the request of the first AI application.
  • the management system selects a corresponding model from at least one model in the basic AI library according to the identification of the model, inputs the input parameters carried in the request into the model, and processes the input parameters through the model.
  • the process of processing through the model is the AI task.
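  • The request handling described in steps S201 and S202 can be sketched as a lookup in a model registry followed by wrapping the model call as a deferred task. Everything named here is hypothetical: the `Request` record, the `BASIC_AI_LIBRARY` dictionary and the toy `mean_threshold` model merely stand in for the basic AI library and its trained models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Model = Callable[[Sequence[float]], float]


@dataclass(frozen=True)
class Request:
    model_id: str             # identifier of a model in the basic AI library
    inputs: Sequence[float]   # input parameters for that model


# Toy stand-in for the basic AI library: model identifier -> callable model.
BASIC_AI_LIBRARY: Dict[str, Model] = {
    "mean_threshold": lambda xs: float(sum(xs) / len(xs) > 0.5),
}


def generate_ai_task(request: Request,
                     library: Dict[str, Model]) -> Callable[[], float]:
    """Select the model named in the request and bind the inputs to it.

    The returned zero-argument callable is the AI task; the management
    system can run it later, once resources have been allocated.
    """
    try:
        model = library[request.model_id]
    except KeyError:
        raise ValueError(f"unknown model identifier: {request.model_id!r}")
    return lambda: model(request.inputs)
```

  • Returning a deferred callable rather than running the model immediately keeps task generation separate from resource allocation and execution, which mirrors the decoupling of AI applications from AI tasks that the method relies on.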
  • Step S203 the management system allocates the first resource for the first AI task from the resources of the BMC.
  • the resources of the BMC include but are not limited to: computing resources, storage resources and network resources.
  • Computing resources include, but are not limited to, CPU, memory, GPU, and the like.
  • Storage resources include hard disks, such as HDDs, SSDs, and the like.
  • Network resources include bandwidth, internet protocol (IP) addresses, port numbers, and the like.
  • the first AI task is an AI task of at least one AI task.
  • the first resource refers to a resource allocated for the first AI task.
  • the first resource is a part of the resources of the BMC.
  • the first resource is a certain amount of CPU and a certain amount of memory space.
  • the management system manages resources occupied by AI tasks based on a resource quota mechanism. Specifically, quotas are used to provide limits on the total resource consumption of AI tasks.
  • the management system monitors the resources occupied by AI tasks to ensure that the resource usage of AI tasks does not exceed the limit. For example, when the management system allocates resources for the first AI task, the management system allocates the first resource for the first AI task according to the quota. The first resource does not exceed the quota. For example, if the resource required by the first AI task is memory and the quota is n megabytes, then the memory allocated by the management system (the first resource) will not exceed n megabytes.
  • the quota is the content of the BMC's configuration file. Quotas are preset by the user.
  • the management system manages the resources of AI tasks through quotas, which can prevent AI tasks from using resources without restrictions, thereby preventing the resources used by AI tasks from exceeding the upper limit and affecting the basic business of BMC.
  • the mechanism of resource quota includes multiple implementations, and the following is an example of the two implementations.
  • Implementation 1: treat the at least one AI task as a whole, and cap the total resources by introducing an overall quota for these AI tasks.
  • the above quota includes an overall quota.
  • When the management system allocates resources, it allocates overall resources for the at least one AI task according to the overall quota.
  • When resources are allocated for the first AI task, the management system allocates the first resource from the overall resources.
  • the overall quota indicates the quota of at least one AI task population.
  • the overall quota is the sum of the quotas of the individual AI tasks.
  • the overall resource allocated by the management system for at least one AI task does not exceed the overall quota.
  • The resources allocated to a single AI task are part of the overall resources, and are at most the overall resources.
  • If only one AI task is being executed, the resources allocated by the management system for this AI task are at most the overall quota.
  • If multiple AI tasks are being executed concurrently, the resources allocated by the management system for each AI task are less than the overall quota, and the sum of resources allocated by the management system for all AI tasks does not exceed the overall quota.
  • For example, the resource is memory and the overall quota is 20M. If there are currently n AI tasks in total, the management system allocates a total of 20M of memory for the n AI tasks, and the AI tasks share this 20M of memory among themselves. If only one AI task currently needs to be executed, this AI task can obtain up to 20M of memory. If multiple AI tasks are being executed concurrently, the memory allocated for each AI task is less than 20M, and the sum of the memory allocated for all AI tasks does not exceed 20M.
  • a corresponding proportional quota is introduced for each AI task on the basis of the overall quota.
  • the above quotas include not only overall quotas, but also proportional quotas.
  • Proportional quota is also called resource allocation.
  • Proportional quota indicates the proportion of an AI task's quota to the overall quota. The larger the proportional quota, the more resources an AI task can obtain.
  • When the management system allocates resources for the first AI task, it allocates the first resource from the overall resources according to the proportional quota of the first AI task.
  • The proportional quota of the first AI task indicates the proportion of the quota of the first AI task to the overall quota.
  • The first resource does not exceed the product of the overall resources and the proportional quota. For example, if the resource is memory, the overall quota is 20M, and the proportional quota of the first AI task is 0.7, the first resource does not exceed 14M.
  • Proportional quotas are used where multiple AI tasks are executing concurrently. Specifically, if the management system needs to execute multiple AI tasks, for example, not only the first AI task described above but also other AI tasks such as a second AI task, the management system allocates the first resource for the first AI task based on the proportional quota of the first AI task. If the management system executes only the first AI task, the management system optionally allocates resources exceeding the proportional quota to the first AI task, and the resources allocated to the first AI task are at most the overall resources of all AI tasks.
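  • The overall quota and proportional quota described above can be illustrated with a minimal Python sketch. The function name `memory_caps` and the dictionary layout are hypothetical and do not come from the application. The sketch only assumes that a task running alone may use up to the overall quota, while concurrently running tasks are each capped at the overall quota multiplied by their proportional quota.

```python
def memory_caps(overall_quota_mb, proportional_quotas, running_tasks):
    """Return a memory cap in MB for every task in running_tasks.

    overall_quota_mb: overall quota shared by all AI tasks (e.g. 20).
    proportional_quotas: hypothetical mapping, task name -> share in (0, 1].
    running_tasks: names of the AI tasks that currently need to execute.
    """
    if len(running_tasks) == 1:
        # A single task may be given up to the whole overall quota.
        return {running_tasks[0]: float(overall_quota_mb)}
    # Concurrent tasks are each limited to overall quota x proportional quota.
    return {
        name: overall_quota_mb * proportional_quotas[name]
        for name in running_tasks
    }


quotas = {"first": 0.7, "second": 0.3}
concurrent = memory_caps(20, quotas, ["first", "second"])
alone = memory_caps(20, quotas, ["first"])
```

  • With an overall quota of 20M and a proportional quota of 0.7, the first task is capped at about 14M when it runs alongside the second task, and at 20M when it runs alone.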
  • Implementation method 2: Set a specific quota for each AI task.
  • When allocating resources for the first AI task, the management system allocates the first resource for the first AI task according to the correspondence between AI tasks and quotas. The first resource does not exceed the quota of the first AI task.
  • When the resources occupied by an AI task exceed the quota, the management system performs a specified action on the AI task to release the excess resources occupied by the AI task.
  • Specified actions include, but are not limited to, killing the task, swapping data out to a swap partition, and the like.
  • For example, if the resources occupied by the first AI task exceed the quota, the management system kills the first AI task, thereby releasing the resources occupied by the first AI task.
  • For another example, the above-mentioned resources include memory. If the resources occupied by the first AI task exceed the quota, the management system saves the data of the first AI task in the memory of the BMC to the swap partition, and releases the data of the first AI task from the memory of the BMC. This frees up memory space for AI tasks other than the first AI task.
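  • The two specified actions can be sketched as a small in-memory simulation. The `TaskUsage` class, the `enforce_quota` function, and the choice to swap out only the part above the quota are assumptions made for illustration; the application does not specify these details, and no real kill or swap call is made here.

```python
from dataclasses import dataclass


@dataclass
class TaskUsage:
    name: str
    quota_mb: float
    resident_mb: float        # memory currently held in BMC RAM
    swapped_mb: float = 0.0   # memory moved to the swap partition
    alive: bool = True


def enforce_quota(task, action):
    """Apply a specified action to an over-quota task; return the MB freed."""
    excess = task.resident_mb - task.quota_mb
    if excess <= 0:
        return 0.0  # within quota, nothing to do
    if action == "kill":
        freed = task.resident_mb
        task.resident_mb = 0.0
        task.alive = False
        return freed
    if action == "swap":
        # Assumption: only the part above the quota is swapped out.
        task.swapped_mb += excess
        task.resident_mb = task.quota_mb
        return excess
    raise ValueError(f"unknown action: {action}")


swapped_task = TaskUsage("first", quota_mb=10.0, resident_mb=13.0)
freed_by_swap = enforce_quota(swapped_task, "swap")
killed_task = TaskUsage("first", quota_mb=10.0, resident_mb=13.0)
freed_by_kill = enforce_quota(killed_task, "kill")
```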
  • the management system will govern the resource allocation of AI tasks based on the priority of the AI tasks. Specifically, high-priority AI tasks are allowed to preempt the resources of low-priority AI tasks.
  • The implementation of resource preemption includes, but is not limited to, killing tasks or invoking control groups (cgroups) in Linux to dynamically adjust resources. For example, there is a first AI task and a second AI task, and the second AI task has a lower priority than the first AI task.
  • When the management system allocates resources for the first AI task, if the remaining resources do not meet the resource requirements of the first AI task, the management system kills the second AI task and allocates the first resource from the resources released by the second AI task; or, if the remaining resources do not meet the resource requirements of the first AI task, the management system calls cgroups to adjust the resources of the first AI task and the resources of the second AI task. In this way, it helps to improve the utilization of resources.
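  • The kill-based variant of priority preemption can be sketched as follows. The function `select_victims`, the numeric priorities (a larger number means a higher priority), and the policy of preempting the lowest-priority tasks first are hypothetical choices for illustration only.

```python
def select_victims(free_mb, needed_mb, running, new_priority):
    """Pick lower-priority tasks to kill so that needed_mb becomes available.

    running: hypothetical mapping, task name -> (priority, allocated_mb).
    Returns the list of task names to kill, or None if even preempting
    every lower-priority task would not free enough memory.
    """
    lower = sorted(
        (item for item in running.items() if item[1][0] < new_priority),
        key=lambda item: item[1][0],  # lowest priority is preempted first
    )
    victims = []
    for name, (_, allocated_mb) in lower:
        if free_mb >= needed_mb:
            break
        victims.append(name)
        free_mb += allocated_mb
    return victims if free_mb >= needed_mb else None


running = {"second": (1, 6.0)}
victims = select_victims(free_mb=2.0, needed_mb=5.0, running=running, new_priority=2)
```

  • The cgroup-based variant would instead lower the limit of the lower-priority task's control group (for example through the `memory.max` and `cpu.max` files in cgroup v2) rather than killing it; that path is not modeled in this sketch.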
  • The management system also performs regression learning on historical information, and selects suitable AI tasks to execute in suitable time periods. Specifically, the management system determines the business cycle of the BMC based on the historical information; the management system determines the first AI task from the at least one AI task according to the business cycle.
  • the historical information includes the resource overhead of the service in multiple historical time periods.
  • the business includes BMC's basic business and AI tasks.
  • Basic services such as monitoring fan status, logging, etc.
  • the historical information of the basic service of the BMC includes the resource expenditure of the basic service in the historical time period.
  • the historical information of AI tasks includes the resource consumption of AI tasks in historical time periods.
  • the historical information is pre-collected by the BMC.
  • the data acquisition and processing module in the basic AI library in BMC collects and saves the historical information of the system hardware in advance.
  • The business cycle indicates the correspondence between the resource overhead of the business and time.
  • The business cycle takes the form of a curve: the peak of the curve is the maximum resource overhead, and the trough of the curve is the minimum resource overhead.
  • the business cycle includes multiple dimensions such as CPU and memory.
  • The business cycle of the CPU dimension indicates the correspondence between the business's CPU overhead and time.
  • the business cycle of the memory dimension indicates the correspondence between the business's memory overhead and time.
  • The business cycle includes the business cycle of the basic business of the BMC and the business cycle of the at least one AI task.
  • The management system employs a regression learning algorithm to determine the business cycle based on historical information.
  • Regression learning algorithms are, for example, time series forecasting algorithms.
  • Time series forecasting algorithms include, but are not limited to, autoregressive integrated moving average (ARIMA) models, exponential smoothing, cycle identification algorithms, and so on.
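  • As an illustration of one of the listed algorithms, the following is a minimal sketch of simple exponential smoothing applied to a resource-overhead history. The function name, the sample data, and the smoothing factor `alpha` are assumptions; the application does not state which algorithm or parameters the management system actually uses.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for value in series:
        if not smoothed:
            smoothed.append(float(value))  # the first sample seeds the series
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


cpu_history = [10, 20, 30]  # hypothetical CPU overhead samples, in percent
trend = exponential_smoothing(cpu_history, alpha=0.5)
next_forecast = trend[-1]   # one-step-ahead forecast under this model
```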
  • The management system selects the AI task to schedule based on the learned peak period of the basic business. For example, the management system determines the peak period of the basic business from the business cycle of the basic business; the management system determines the peak period of the at least one AI task from the business cycle of the at least one AI task; and the management system determines the first AI task according to the peak period of the basic business and the peak period of the at least one AI task.
  • the peak period refers to the time period corresponding to the maximum resource overhead in the business cycle.
  • For example, if a business cycle is a curve whose ordinate represents resource overhead and whose abscissa represents time, the peak period is the time period corresponding to the peak of the curve.
  • The management system finds, from the at least one AI task, an AI task whose peak period is staggered from the peak period of the basic business, and schedules the found AI task (the first AI task) for execution.
  • The peak period of the above-mentioned first AI task is different from the peak period of the basic business. In this way, off-peak processing is achieved: the resource consumption of AI tasks does not peak at the same time as the resource consumption of the basic business, thereby reducing the impact of AI task execution on the basic business.
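  • The peak-staggering selection can be sketched as follows, assuming each business cycle is available as a list of resource-overhead samples over equal time slots. The function names and the sample curves are hypothetical.

```python
def peak_slot(cycle):
    """Index of the time slot with the largest resource overhead."""
    return max(range(len(cycle)), key=lambda slot: cycle[slot])


def pick_staggered_task(basic_cycle, task_cycles):
    """Return the first AI task whose peak slot differs from the basic business peak."""
    basic_peak = peak_slot(basic_cycle)
    for name, cycle in task_cycles.items():
        if peak_slot(cycle) != basic_peak:
            return name
    return None  # every candidate peaks together with the basic business


basic = [20, 35, 90, 40]            # basic business overhead per slot
tasks = {
    "AI-a": [5, 10, 60, 15],        # peaks in the same slot as the basic business
    "AI-b": [50, 10, 5, 5],         # peaks in a different slot
}
chosen = pick_staggered_task(basic, tasks)
```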
  • Step S204: The management system executes the first AI task by using the first resource, thereby obtaining an execution result.
  • the execution result refers to the result obtained by executing the first AI task.
  • the execution result is the output parameter of the model in the basic AI library.
  • the execution result includes, but is not limited to, a binary classification result, a classification probability, a predicted target value, and the like.
  • For example, the first AI application is an SSD failure prediction application, and the model in the basic AI library called by the management system is a classification model.
  • In this case, the execution result is a binary classification result.
  • Alternatively, the execution result is a probability whose value lies in the interval (0, 1). The larger the execution result, the higher the probability that the SSD is a faulty disk.
  • The management system determines the execution order of the plurality of AI tasks according to the priorities of the plurality of AI tasks. The higher the priority of an AI task, the earlier the AI task is executed.
  • the priority includes three types, namely, high priority, medium priority, and low priority. All AI tasks are divided into high-priority AI tasks, medium-priority AI tasks, and low-priority AI tasks according to their priorities.
  • the management system will set up three kinds of queues, namely high priority queues, medium priority queues and low priority queues. Each queue is used to cache AI tasks with corresponding priorities. When the management system wants to execute a task, it will first obtain and execute the task in the high-priority queue.
  • The management system first determines whether the high-priority queue contains AI tasks. If the high-priority queue contains AI tasks, the management system obtains the AI tasks from the high-priority queue and executes them. If the high-priority queue does not contain AI tasks, the management system continues to determine whether the medium-priority queue contains AI tasks. If the medium-priority queue contains AI tasks, the management system obtains and executes them; if it does not, the management system continues to determine whether the low-priority queue contains AI tasks. Supporting priority-based execution management helps improve resource utilization.
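  • The three-queue scheme described above can be sketched with standard-library deques. The class name `PriorityTaskQueues` and its methods are hypothetical; only the high, medium, low lookup order is taken from the text.

```python
from collections import deque


class PriorityTaskQueues:
    LEVELS = ("high", "medium", "low")  # checked in this order

    def __init__(self):
        self._queues = {level: deque() for level in self.LEVELS}

    def submit(self, task, priority):
        """Cache an AI task in the queue that matches its priority."""
        self._queues[priority].append(task)

    def next_task(self):
        """Return the next task to execute, or None if all queues are empty."""
        for level in self.LEVELS:
            if self._queues[level]:
                return self._queues[level].popleft()
        return None


queues = PriorityTaskQueues()
queues.submit("AI-3", "low")
queues.submit("AI-2", "medium")
queues.submit("AI-1", "high")
order = [queues.next_task() for _ in range(4)]
```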
  • The management system obtains the peak period of the basic business based on historical learning, and determines the execution time of the AI task accordingly. Specifically, the management system determines the target time according to the business cycle of the basic business of the BMC; at the target time, the management system uses the first resource to execute the first AI task.
  • the target time is the execution time point of the first AI task, and the target time is in a time period outside the peak period of the basic business. In this way, the execution time of AI tasks can be staggered from the peak period of the basic business, avoiding the execution of AI tasks during the peak period of the basic business, thereby avoiding the impact of the AI task on the basic business.
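  • One way to choose such a target time is sketched below. The text only requires that the target time lie outside the peak period; picking the off-peak slot with the lowest basic-business overhead is an additional assumption of this sketch, and the function name and sample data are hypothetical.

```python
def pick_target_slot(basic_cycle, peak_slots):
    """Return an off-peak slot index for running the AI task, or None.

    basic_cycle: basic-business overhead per time slot.
    peak_slots: set of slot indices that belong to the peak period.
    """
    candidates = [s for s in range(len(basic_cycle)) if s not in peak_slots]
    if not candidates:
        return None  # the whole cycle is peak; no target time exists
    # Assumption: prefer the quietest off-peak slot.
    return min(candidates, key=lambda s: basic_cycle[s])


basic_cycle = [30, 80, 90, 40, 20, 50]
target = pick_target_slot(basic_cycle, peak_slots={1, 2})
```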
  • setting execution plans for AI tasks is supported.
  • the execution plan indicates the point in time when the execution of the AI task will begin.
  • Execution plans include, but are not limited to, timed execution plans and on-demand execution plans.
  • the execution plan is, for example, a preset configuration.
  • the management system obtains the execution plan of the first AI task, and the management system executes the first AI task according to the execution plan of the first AI task.
  • the execution plan indicates the point in time when the execution of the first AI task is started.
  • The timed execution plan specifies a specific time point or a time period.
  • For example, the execution plan of the first AI task is a timed execution plan that indicates that the first AI task is executed at a preset time point. In this case, the management system obtains the preset time point from the timed execution plan and starts executing the first AI task at the preset time point.
  • For another example, the execution plan of the first AI task is a timed execution plan that indicates that the first AI task is executed once every preset period. In this case, the management system obtains the preset period from the timed execution plan, starts a timer, and starts executing the first AI task each time the preset period elapses.
  • For another example, the execution plan of the first AI task is an on-demand execution plan, which indicates that the first AI task is executed when an instruction is received.
  • the instruction is used to instruct the management system to start executing the first AI task.
  • instructions can be generated and sent to the management system.
  • When the management system receives the instruction, it starts executing the first AI task according to the on-demand execution plan.
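  • The three kinds of execution plan can be sketched as a single start-decision function. The `ExecutionPlan` fields, the `kind` strings, and the use of plain numeric timestamps in seconds are hypothetical; a real management system would obtain them from its preset configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionPlan:
    kind: str                          # "at_time", "periodic" or "on_demand"
    start_at: Optional[float] = None   # preset time point, in seconds
    period: Optional[float] = None     # preset period, in seconds


def should_start(plan, now, last_start=None, instruction_received=False):
    """Decide whether the AI task should start executing at time `now`."""
    if plan.kind == "at_time":
        # Timed plan with a preset time point: run once when it is reached.
        return last_start is None and now >= plan.start_at
    if plan.kind == "periodic":
        # Timed plan with a preset period: run again once the period elapsed.
        return last_start is None or now - last_start >= plan.period
    if plan.kind == "on_demand":
        # On-demand plan: run only when an instruction has been received.
        return instruction_received
    raise ValueError(f"unknown plan kind: {plan.kind}")


at_plan = ExecutionPlan("at_time", start_at=100.0)
periodic_plan = ExecutionPlan("periodic", period=60.0)
on_demand_plan = ExecutionPlan("on_demand")
```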
  • Step S205: The management system provides the execution result to the first AI application.
  • For example, the management system sends the execution result to the first AI application.
  • the management system provides a query interface for execution results.
  • the first AI application calls the query interface and sends a query request to the management system.
  • the management system sends a query response to the first AI application, where the query response includes an execution result.
  • the first AI application receives the query response, and obtains an execution result from the query response.
  • the management system saves the execution result to a specified address, and the first AI application accesses the specified address to obtain the execution result.
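  • The query-interface variant can be sketched as a small in-process store. The class `ResultStore`, the task identifier, and the shape of the query response are hypothetical; the application does not define the request or response format.

```python
class ResultStore:
    """Holds execution results so that AI applications can query them."""

    def __init__(self):
        self._results = {}

    def save(self, task_id, result):
        """Called by the management system after executing an AI task."""
        self._results[task_id] = result

    def query(self, task_id):
        """Query interface: build a query response for the AI application."""
        if task_id not in self._results:
            return {"task_id": task_id, "status": "pending"}
        return {
            "task_id": task_id,
            "status": "done",
            "result": self._results[task_id],
        }


store = ResultStore()
before = store.query("ssd-failure-1")
store.save("ssd-failure-1", 0.83)   # e.g. probability that the SSD is faulty
after = store.query("ssd-failure-1")
```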
  • In the method provided in this embodiment, a management system for AI tasks is introduced into the BMC, thereby decoupling AI applications from AI tasks and the underlying basic AI library. The management system uniformly performs resource management and task scheduling for each AI task, which strengthens the control of AI tasks, prevents AI tasks from using more resources than expected without restriction, and prevents AI tasks from occupying so many resources that the BMC's basic business is affected. This helps, to a certain extent, to ensure the stability of the BMC's basic business.
  • the above-mentioned method 200 is illustrated below with reference to an example.
  • the resources in the method 200 shown in FIG. 2 are the CPU and memory in the example 1.
  • The first AI task in the method 200 shown in FIG. 2 is AI-1 in Example 1, and the second AI task in the method 200 is AI-2 in Example 1.
  • FIG. 3 is a flowchart of Example 1.
  • As shown in FIG. 3, the configuration file is first input to the management system of the BMC, and then the following steps 301 to 313 are executed.
  • the configuration file includes the overall quota for AI features within the BMC as well as the respective proportional quota for each AI task.
  • the overall quota of AI features includes the overall quota of the CPU dimension and the overall quota of the memory dimension.
  • the overall quota for the CPU dimension is 50% CPU.
  • the overall quota for the memory dimension is 20M.
  • The BMC includes three AI tasks: memory failure prediction (AI-1 for short), HDD failure prediction (AI-2 for short), and AI-3.
  • the configuration file sets the proportional quota of AI-1 to 0.7, the proportional quota of AI-2 to 0.2, and the proportional quota of AI-3 to 0.1.
  • AI-1 has high priority, AI-2 has medium priority, and AI-3 has low priority.
  • Step 301: The management system starts to load the configured quotas of the AI tasks. Specifically, the management system allocates a total of 20M of memory for all AI tasks, and allows the AI tasks to occupy at most 50% of the CPU in total.
  • Step 302: The management system detects whether historical information regression learning is enabled. If it is enabled, the following step 303 is performed. If it is not enabled, the following step 304 is performed.
  • Step 303: The management system selects AI-1, AI-2, and AI-3 based on the business cycle.
  • Step 304: The management system loads AI-1, AI-2, and AI-3 into the task execution queue.
  • Step 305: The management system executes tasks AI-1, AI-2, and AI-3 (according to the quota settings). Specifically, if AI-1 is not executing, or AI-1 has a small workload and occupies few resources, AI-2 can occupy more than (20M x 0.2) of memory and (50% x 0.2) of CPU. If both AI-1 and AI-2 are executing and the computing density is high, the management system limits the CPU occupied by AI-1 to (50% x 0.7) and the memory occupied by AI-1 to (20M x 0.7), and limits the CPU occupied by AI-2 to (50% x 0.2) and the memory occupied by AI-2 to (20M x 0.2). For details, refer to steps 306 to 313 in FIG. 3.
  • Step 306: Task AI-2 needs to use additional resources.
  • Step 307: The management system detects whether any of the overall resources corresponding to the overall quota remain. If overall resources remain, go to step 308. If no overall resources remain, go to step 313.
  • Step 308: The management system determines that task AI-1 has not used up the overall resources.
  • Step 309: The management system allocates the remaining overall resources (the resources not used by AI-1) to AI-2.
  • Step 310: Task AI-1 needs to consume more resources (full quota usage).
  • Step 311: The management system reduces the resources of AI-2 to (50% x 0.2), and saves the excess in-memory data of AI-2 to the swap partition.
  • Step 312: Task AI-1 uses the resources of its full quota.
  • Step 313: AI-2 keeps executing within its predetermined quota (no additional resources are allocated).
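  • Steps 306 to 313 can be traced with a small memory-only simulation that uses the quotas from the configuration file of Example 1 (overall quota 20M; proportional quotas 0.7, 0.2, and 0.1). The function names and the starting usage figures are hypothetical, and the CPU dimension is left out for brevity.

```python
OVERALL_MB = 20.0
SHARE = {"AI-1": 0.7, "AI-2": 0.2, "AI-3": 0.1}


def grant_extra(usage, task, extra_mb):
    """Steps 306 to 309 and 313: lend unused overall resources to `task`."""
    remaining = max(OVERALL_MB - sum(usage.values()), 0.0)
    granted = min(extra_mb, remaining)  # 0.0 means the task stays at its quota
    usage[task] += granted
    return granted


def reclaim_for(usage, claimant, borrower):
    """Steps 310 to 312: cut `borrower` back to its quota for `claimant`."""
    borrower_quota = OVERALL_MB * SHARE[borrower]
    swapped_mb = max(usage[borrower] - borrower_quota, 0.0)
    usage[borrower] -= swapped_mb       # the excess data goes to swap
    usage[claimant] = OVERALL_MB * SHARE[claimant]
    return swapped_mb


usage = {"AI-1": 4.0, "AI-2": 4.0, "AI-3": 2.0}  # AI-1 is lightly loaded
granted = grant_extra(usage, "AI-2", 6.0)        # AI-2 borrows spare memory
swapped = reclaim_for(usage, "AI-1", "AI-2")     # AI-1 later needs its full quota
```

  • In this trace AI-2 first grows beyond its own quota by borrowing memory that AI-1 is not using, and is later reduced to its quota again, with the excess moved to swap, once AI-1 claims its full share.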
  • When the self-learning regression optimization function for historical information is turned on, the management system automatically learns the load cycle of the current device, identifies the load characteristics in different scenarios, and dynamically schedules AI jobs (for example, distinguishing training from inference) for execution in different time periods according to their resource usage.
  • the abscissas in Fig. 4 and Fig. 5 both represent time.
  • the ordinates in both Figures 4 and 5 represent the resource overhead (such as memory usage or memory ratio) of AI tasks.
  • the two curves in FIG. 4 and FIG. 5 respectively represent the resource overhead of AI-1 in each time period and the resource overhead of AI-2 in each time period.
  • Figure 4 shows the resource overhead when there is no management system.
  • Figure 5 shows the resource overhead when scheduling is performed by the management system.
  • the bold line in Figure 4 is also called the expected cap value, which means the maximum expected resource overhead, that is, how much resources are expected to be occupied by all AI tasks at most.
  • the bold straight line in Figure 5 represents the overall quota, also known as the total package of AI tasks, and the unit is M.
  • The management system manages and controls the concurrent execution of multiple AI jobs in an orderly manner, and staggers the business peak periods based on historical learning, which ensures both the stable operation of the basic capabilities of the BMC system and the correct execution of resource-intensive AI jobs.
  • the following is an example of the basic hardware structure of the BMC.
  • FIG. 6 is a schematic structural diagram of a BMC provided by an embodiment of the present application.
  • the BMC 600 shown in FIG. 6 is used to implement the method described in FIG. 2 or FIG. 3 above.
  • the BMC 600 shown in FIG. 6 is the BMC in FIG. 1 .
  • BMC 600 includes at least one processor 601 , communication bus 602 , memory 603 and at least one network interface 604 .
  • The processor 601 is, for example, a general-purpose central processing unit (CPU), a network processor (NP), a graphics processing unit (GPU), a neural-network processing unit (NPU), a data processing unit (DPU), a microprocessor, or one or more integrated circuits for implementing the solution of the present application.
  • the processor 601 includes an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the PLD is, for example, a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • the communication bus 602 is used to transfer information between the aforementioned components.
  • the communication bus 602 can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is used in FIG. 6, but it does not mean that there is only one bus or one type of bus.
  • The memory 603 is, for example, a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, or an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation.
  • the memory 603 exists independently, for example, and is connected to the processor 601 through the communication bus 602 .
  • the memory 603 may also be integrated with the processor 601 .
  • the network interface 604 uses any transceiver-like device for communicating with other devices or communication networks.
  • the network interface 604 includes a wired network interface and may also include a wireless network interface.
  • the wired network interface may be, for example, an Ethernet interface.
  • the Ethernet interface can be an optical interface, an electrical interface or a combination thereof.
  • The wireless network interface may be a wireless local area network (WLAN) interface, a cellular network interface, or a combination thereof.
  • the processor 601 may include one or more CPUs, such as CPU0 and CPU1 shown in FIG. 6 .
  • the BMC 600 may include multiple processors, such as the processor 601 and the processor 605 shown in FIG. 6 .
  • Each of these processors can be a single-core processor (single-CPU) or a multi-core processor (multi-CPU).
  • a processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (eg, computer program instructions).
  • The BMC 600 may further include an output device and an input device.
  • the output device communicates with the processor 601 and can display information in a variety of ways.
  • the output device may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like.
  • the input device communicates with the processor 601 and can receive user input in a variety of ways.
  • the input device may be a mouse, a keyboard, a touch screen device, or a sensor device, or the like.
  • the processor 601 implements the methods in the above embodiments by reading the computer instructions 610 stored in the memory 603, or the processor 601 implements the methods in the above embodiments by using internally stored computer instructions.
  • When the processor 601 implements the methods in the above embodiments by reading the computer instructions 610 stored in the memory 603, the memory 603 stores computer instructions for implementing the methods provided in the embodiments of the present application.
  • FIG. 7 is a schematic structural diagram of a management system 70 provided by an embodiment of the present application.
  • the management system 70 shown in FIG. 7 for example, implements the functions of the management system in the method 200 .
  • the management system 70 includes a processing unit 701 and a providing unit 702 .
  • Each unit in the management system 70 is implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • Each unit in the management system 70 is used to perform the corresponding functions of the management system in the above-mentioned method 200 .
  • the processing unit 701 is configured to support the management system 70 to execute S202 to S204.
  • the providing unit 702 is used to support the management system 70 to perform S205.
  • the apparatus embodiment described in FIG. 7 is only illustrative.
  • The division of the above-mentioned units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • Each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned units in FIG. 7 can be implemented either in the form of hardware or in the form of software functional units.
  • The above-mentioned processing unit 701 may be implemented by a software functional unit generated after at least one processor 601 in FIG. 6 reads the program code stored in the memory 603.
  • The above-mentioned units in FIG. 7 can also be implemented by different hardware in the management system 70. For example, the processing unit 701 is implemented by a part of the processing resources in at least one processor 601 in FIG. 6 (for example, one or two cores of a multi-core processor), while the providing unit 702 is implemented by the rest of the processing resources in at least one processor 601 in FIG. 6 (for example, other cores of the multi-core processor), or by programmable devices such as a field-programmable gate array (FPGA) or a coprocessor. Alternatively, the providing unit 702 is implemented by the network interface 604 in FIG. 6. The above functional units can also be implemented by a combination of software and hardware. For example, the providing unit 702 is implemented by a hardware programmable device, and the processing unit 701 is a software functional unit generated after the CPU reads the program code stored in the memory.
  • the meaning of "at least one” refers to one or more.
  • the meaning of "plurality” means two or more.
  • multiple AI applications refers to two or more AI applications.
  • A refers to B, which means that A is the same as B or A is a simple variation of B.
  • The above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • When implemented in software, they can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave).
  • The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media.
  • the available media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), among others.

Abstract

The present application relates to the technical field of computers, and provides a task scheduling method and a management system. In the present application, a management system for AI tasks is introduced into the system architecture of a BMC. The management system interacts with an AI application on the upper layer, and generates an AI task by using an underlying basic AI library. The management system allocates resources of the BMC to each AI task in a unified manner, and schedules each AI task in a unified manner, thereby decoupling the AI application from the AI task and the underlying basic AI library. As a result, the management and control of AI tasks are enhanced, AI tasks are prevented from using more resources than expected without restriction, and AI tasks are prevented from affecting a basic service of the BMC through excessive occupation of resources, which helps to guarantee the stability of the basic service of the BMC.

Description

任务调度方法及管理***Task scheduling method and management system 技术领域technical field
本申请涉及计算机技术领域,特别涉及一种任务调度方法及管理***。The present application relates to the field of computer technology, and in particular, to a task scheduling method and management system.
背景技术Background technique
服务器等计算设备通常包括独立的管理模块,即基板管理控制器(Baseboard Management Controller,BMC)。随着云与大数据集群规模的日益增长,以及云边协同等场景的逐渐增多,为了适应多种不同的场景中运维的快速需求响应(如故障预测、故障自愈、性能亚健康发现等),越来越多的厂商为BMC中增加了基于机器学习、深度学习等多种人工智能(artificial intelligence,AI)应用。Computing devices such as servers usually include an independent management module, that is, a Baseboard Management Controller (BMC). With the increasing scale of cloud and big data clusters and the increasing number of scenarios such as cloud-side collaboration, in order to adapt to the rapid demand response of operation and maintenance in various scenarios (such as fault prediction, fault self-healing, performance sub-health discovery, etc. ), more and more manufacturers have added various artificial intelligence (AI) applications based on machine learning and deep learning to BMC.
然而,目前AI应用经常过多的占用BMC的资源,容易影响BMC基础业务的执行。However, at present, AI applications often occupy BMC resources too much, which easily affects the execution of BMC basic services.
Summary
The embodiments of the present application provide a task scheduling method and a management system, which help prevent AI applications from using so many resources that the basic services of the BMC are affected. The technical solution is as follows.
In a first aspect, a task scheduling method is provided. The method is applied in a BMC that includes at least one AI application, a management system, and a basic AI library. The method includes: in response to a request from a first AI application among the at least one AI application, the management system uses the basic AI library to generate at least one AI task; the management system allocates a first resource to a first AI task from the resources of the BMC, where the first AI task is one of the at least one AI task; the management system uses the first resource to execute the first AI task, thereby obtaining an execution result; and the management system provides the execution result to the first AI application.
The method provided in this embodiment introduces a management system for AI tasks into the BMC, thereby decoupling the AI applications from the AI tasks and from the underlying basic AI library. The management system performs resource management and task scheduling for all AI tasks in a unified manner. This strengthens control over AI tasks, prevents AI tasks from using more resources than expected without restriction, and keeps AI tasks from occupying so many resources that the basic services of the BMC are affected, which helps, to a certain extent, to guarantee the stability of the basic services of the BMC.
Optionally, the management system allocating the first resource to the first AI task from the resources of the BMC includes: the management system allocates the first resource to the first AI task according to a quota, where the first resource does not exceed the quota.
In this optional manner, the quota limits the resources that an AI task can obtain, which prevents the resources used by AI tasks from exceeding an upper limit and affecting the basic services of the BMC.
Optionally, the quota includes an overall quota, which indicates the quota for the at least one AI task as a whole. The management system allocating the first resource to the first AI task according to the quota includes: the management system allocates overall resources to the at least one AI task according to the overall quota, where the overall resources do not exceed the overall quota; and the management system allocates the first resource from the overall resources.
Introducing the overall quota mechanism helps control the total resource overhead of all AI tasks in the BMC and facilitates unified resource management across the AI tasks.
Optionally, the quota further includes a proportional quota, which indicates the proportion of the overall quota that is assigned to the first AI task. The management system allocating the first resource from the overall resources includes: the management system allocates the first resource from the overall resources according to the proportional quota, where the first resource does not exceed the product of the overall resources and the proportional quota.
In this manner, the resources used by each individual AI task are limited at a finer granularity, which helps improve overall resource utilization.
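The two quota levels described above can be illustrated with a minimal Python sketch. The `Quota` class, the `allocate` function, and the use of a single scalar resource are assumptions made for illustration only; the application does not prescribe any particular data structure or unit.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Hypothetical quota record (names are illustrative assumptions)."""
    overall: float     # cap for all AI tasks together, e.g. MB of memory
    proportion: float  # share of the overall resources for one task, 0..1


def allocate(requested: float, overall_resources: float, quota: Quota) -> float:
    """Return the amount of resource granted to one AI task."""
    # The overall resources set aside for AI tasks must not exceed the overall quota.
    pool = min(overall_resources, quota.overall)
    # The per-task grant must not exceed pool * proportional quota.
    cap = pool * quota.proportion
    return min(requested, cap)


quota = Quota(overall=100.0, proportion=0.25)
print(allocate(requested=40.0, overall_resources=120.0, quota=quota))  # 25.0
```

In this sketch a request for 40 units is clipped to 25, because the pool is first capped at the overall quota of 100 and the task may take at most a quarter of it.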
Optionally, after the management system allocates the first resource to the first AI task according to the quota, the method further includes: if the resources occupied by the first AI task exceed the quota, the management system kills the first AI task; or, if the resources occupied by the first AI task exceed the quota, the management system saves the data of the first AI task held in the memory of the BMC to a swap partition and releases the memory space that the data occupied in the BMC.
In this manner, when an AI task occupies too many resources, those resources are released promptly so that they can be allocated to other AI tasks, which helps improve overall resource utilization.
Optionally, the first resource is a resource originally occupied by a second AI task among the at least one AI task, and the priority of the second AI task is lower than the priority of the first AI task. The management system allocating the first resource to the first AI task from the resources of the BMC includes: if the remaining resources do not meet the resource requirement of the first AI task, the management system kills the second AI task and allocates the first resource to the first AI task from the resources released by the second AI task; or, if the remaining resources do not meet the resource requirement of the first AI task, the management system invokes a control group (Cgroup) to adjust the resources of the first AI task and the resources of the second AI task.
In this manner, a high-priority AI task is allowed to preempt the resources of a low-priority AI task, so that high-priority AI tasks obtain more resources and quality of service (QoS) requirements are met.
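The kill-based preemption option can be sketched as follows. This is a simplified illustration under stated assumptions: the `Task` class, the `admit` function, the convention that a larger number means a higher priority, and the single integer resource unit are all hypothetical, and the Cgroup-based adjustment option is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    priority: int  # assumption: larger number = higher priority
    demand: int    # resource units the task needs


def admit(new: Task, running: list[Task], capacity: int) -> Optional[list[str]]:
    """Try to admit `new`; return the names of killed tasks, or None if refused."""
    free = capacity - sum(t.demand for t in running)
    killed: list[str] = []
    # Only tasks with a strictly lower priority may be preempted, lowest first.
    victims = sorted((t for t in running if t.priority < new.priority),
                     key=lambda t: t.priority)
    for victim in victims:
        if free >= new.demand:
            break
        free += victim.demand
        killed.append(victim.name)
    if free < new.demand:
        return None  # even full preemption is not enough; leave `running` untouched
    running[:] = [t for t in running if t.name not in killed]
    running.append(new)
    return killed


running = [Task("hdd_prediction", 1, 50), Task("mem_prediction", 3, 50)]
print(admit(Task("federated_learning", 2, 60), running, capacity=120))
# ['hdd_prediction']
```

Here 20 units are free, so the lower-priority `hdd_prediction` task is killed to make room, while the higher-priority `mem_prediction` task is left alone.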
Optionally, before the management system uses the first resource to execute the first AI task, the method further includes: the management system determines a service cycle of the BMC based on historical information, where the service cycle indicates the correspondence between the resource overhead of a service and time; and the management system determines the first AI task from the at least one AI task according to the service cycle.
In this manner, because the service cycle of the basic services is taken into account when the AI task to be executed is selected, AI tasks with a large resource overhead can be kept from running during the peak period of the basic services, which safeguards the stability of the basic services of the BMC.
Optionally, the service cycle includes the service cycle of the basic services of the BMC and the service cycle of the at least one AI task. The management system determining the first AI task from the at least one AI task according to the service cycle includes: the management system determines the peak period of the basic services from the service cycle of the basic services, where a peak period is the time period corresponding to the maximum resource overhead in a service cycle; the management system determines the peak period of the at least one AI task from the service cycle of the at least one AI task; and the management system determines the first AI task according to the peak period of the basic services and the peak period of the at least one AI task, where the peak period of the first AI task differs from the peak period of the basic services.
In this manner, peaks can be staggered: the resource overhead of the AI tasks does not peak at the same time as the resource overhead of the basic services, which reduces the impact of AI task execution on the basic services.
Optionally, the management system using the first resource to execute the first AI task includes: the management system determines a target time according to the service cycle of the basic services of the BMC, where the target time lies in a time period outside the peak period of the basic services; and the management system uses the first resource to execute the first AI task at the target time.
In this manner, the execution time of the AI task is staggered from the peak period of the basic services, so that AI tasks are not executed during that peak period and do not affect the basic services.
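The off-peak selection of a target time can be sketched in a few lines of Python. The sketch assumes that the service cycle is available as a mapping from hour of day to CPU load in percent, and that the quietest non-peak slot with enough headroom is preferred; the function names, the hourly granularity, and the 100% limit are illustrative assumptions rather than details from the application.

```python
from typing import Optional


def peak_hour(load_by_hour: dict[int, float]) -> int:
    """Hour whose resource overhead is the maximum within the cycle."""
    return max(load_by_hour, key=load_by_hour.get)


def pick_target_hour(basic_load: dict[int, float], task_cost: float,
                     limit: float = 100.0) -> Optional[int]:
    """Pick the quietest non-peak hour in which the task still fits."""
    peak = peak_hour(basic_load)
    candidates = [hour for hour, load in basic_load.items()
                  if hour != peak and load + task_cost <= limit]
    return min(candidates, key=basic_load.get) if candidates else None


# Hypothetical service cycle of the basic services (CPU %, sampled every 6 hours).
basic_load = {0: 30.0, 6: 45.0, 12: 80.0, 18: 60.0}
print(peak_hour(basic_load))               # 12
print(pick_target_hour(basic_load, 35.0))  # 0
```

The same `peak_hour` helper could be applied to the service cycle of each AI task in order to select tasks whose own peak does not coincide with the peak of the basic services.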
Optionally, the management system determining the service cycle of the management system based on historical information includes: the management system uses a regression learning algorithm to determine the service cycle based on the historical information.
In this manner, the service cycle can be learned dynamically and automatically, which reduces configuration complexity.
Optionally, the management system executing the first AI task includes: the management system executes the first AI task according to an execution plan of the first AI task, where the execution plan indicates the time point at which execution of the first AI task starts.
In this manner, an execution plan can be set for an AI task, which improves flexibility.
Optionally, the execution plan includes a timed execution plan and an on-demand execution plan. The timed execution plan indicates that the first AI task is executed at a preset time point or once every preset period. The on-demand execution plan indicates that the first AI task is executed when an instruction is received.
In this manner, both timed execution plans and on-demand execution plans can be set, which covers more application scenarios.
Optionally, the at least one AI task is a plurality of AI tasks, and after the management system acquires the at least one AI task, the method further includes: the management system determines the execution order of the plurality of AI tasks according to their priorities, where an AI task with a higher priority is executed earlier.
In this manner, AI tasks are executed based on priority, which helps improve the overall utilization of resources.
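Priority-based ordering of AI tasks can be sketched with a heap from the Python standard library. The `PriorityRunQueue` class, the task names, and the choice to run tasks of equal priority in submission order are assumptions for illustration; the application only requires that a higher-priority task be executed earlier.

```python
import heapq
from itertools import count


class PriorityRunQueue:
    """Hypothetical run queue: higher priority first, FIFO among equals."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._seq = count()  # submission counter used as a tie-breaker

    def submit(self, name: str, priority: int) -> None:
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), name))

    def drain(self) -> list[str]:
        """Return all queued task names in execution order and empty the queue."""
        order = []
        while self._heap:
            order.append(heapq.heappop(self._heap)[2])
        return order


queue = PriorityRunQueue()
queue.submit("ssd_prediction", 1)
queue.submit("mem_prediction", 3)
queue.submit("power_analysis", 2)
queue.submit("hdd_prediction", 3)
print(queue.drain())
# ['mem_prediction', 'hdd_prediction', 'power_analysis', 'ssd_prediction']
```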
In a second aspect, a management system is provided. The management system has the function of implementing the first aspect or any optional manner of the first aspect. The management system includes at least one unit, and the at least one unit is configured to implement the method provided in the first aspect or any optional manner of the first aspect.
In some embodiments, the units in the management system are implemented in software and are program modules. In other embodiments, the units in the management system are implemented in hardware or firmware. For specific details of the management system provided in the second aspect, refer to the first aspect or any optional manner of the first aspect; they are not repeated here.
In a third aspect, a BMC is provided. The BMC includes the management system described in the second aspect, at least one AI application, and a basic AI library.
In a fourth aspect, a BMC is provided. The BMC includes a processor and a memory. The memory stores computer instructions, and the processor executes the computer instructions stored in the memory, so that the BMC performs the method provided in the first aspect or any optional manner of the first aspect.
In a fifth aspect, a computing device is provided. The computing device is, for example, a server, and it includes the BMC provided in the fourth aspect.
In a sixth aspect, the present application provides a computer-readable storage medium. The computer-readable storage medium stores computer instructions that instruct the BMC to perform the method provided in the first aspect or any optional manner of the first aspect.
In a seventh aspect, the present application provides a computer program product. The computer program product includes computer instructions stored in a computer-readable storage medium. A processor of the BMC can read the computer instructions from the computer-readable storage medium and execute them, so that the BMC performs the method provided in the first aspect or any optional manner of the first aspect.
In an eighth aspect, a chip is provided. The chip may include programmable logic circuits and/or program instructions, and when the chip runs, it implements the method provided in the first aspect or any optional manner of the first aspect.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a system architecture of a BMC provided by an embodiment of the present application;
FIG. 2 is a flowchart of a task scheduling method provided by an embodiment of the present application;
FIG. 3 is a flowchart of a task scheduling method provided by an embodiment of the present application;
FIG. 4 is a diagram showing the resource overhead without the management system, provided by an embodiment of the present application;
FIG. 5 is a diagram showing the resource overhead with the management system, provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a BMC provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a management system provided by an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Some concepts involved in the embodiments of the present application are explained below.
BMC: supports the industry-standard Intelligent Platform Management Interface (IPMI) specification. This specification describes management functions that are built into the motherboard, including local and remote diagnostics, console support, configuration management, hardware management, and troubleshooting. The BMC provides the following: compatibility with IPMI 1.0; tachometer inputs for fan speed monitoring; pulse-width modulator outputs for fan speed control; button inputs for front-panel buttons and switches; one serial port multiplexed with the server console port; remote access and Intelligent Chassis Management Bus (ICMB) support; three I2C master and backup ports (one of which is used for the Intelligent Chassis Management Bus); a Low Pin Count (LPC) bus that provides access to three Keyboard Controller Style (KCS) interfaces and a One-Block Transfer (BT) interface; a 32-bit ARM7 processor; a 160-pin Low Profile Flat Pack (LQFP) package; and firmware for the IPMI and IPMB interfaces.
Batch processing: a way for a computer system to execute programs, in which a series of predefined or randomly input programs is executed in order according to specified rules without human intervention. Batch execution is commonly used in scheduling systems because it improves resource utilization and reduces the overhead of human-computer interaction.
Job: the set of program instances that must be executed to complete a specific computing service, usually corresponding to a group of processes, containers, or other runtime entities on one or more computers. In a batch processing system, a job is also called a "batch job".
Task: an individual instance in the set of program instances within a job. A task usually corresponds to one process, container, or other runtime entity on one computer.
The application of AI technology in the BMC is briefly introduced below.
Current server devices contain an independent management module, the BMC. As cloud and big-data clusters grow in scale and scenarios such as cloud-edge collaboration become more common, operation and maintenance requirements in many different scenarios (such as fault prediction, fault self-healing, and detection of performance sub-health) call for a rapid response. As a result, more and more vendors have added AI applications based on machine learning, deep learning, and similar techniques to the BMC. The BMC usually adopts a chimney-style architecture for AI applications. "Chimney-style" means that there is no unified management and control across the AI applications, no communication between them, and no data integration; each AI application stands alone like a chimney, hence the name "chimney-style application".
However, the BMC in a device is constrained by its own extremely limited resources and service processing capability, and it struggles to host a growing number of AI applications, because AI applications tend to consume large amounts of system resources and processing capability.
To address this problem, the main existing approach is to rewrite the industry's general-purpose algorithm math libraries, data processing modules, and the like for the processing chips of embedded systems (such as ARM64 and ARM32), and to apply instruction-set-based optimizations to these libraries and modules wherever possible, so as to reduce the memory footprint and computing overhead of the algorithm libraries when they are called. The corresponding AI applications, such as intelligent operation and maintenance applications, call these math libraries directly, which reduces the resource overhead of AI applications at run time as far as possible.
However, this approach does not solve the problem of multiple AI applications competing for system resources as more and more AI applications are added to the BMC system. When too many AI applications execute concurrently, they still put excessive load on the BMC system and may even affect the execution of the basic system management functions.
For example, suppose the BMC has 1 CPU core (1C) and 512 MB of memory in total, and the main system together with regular system services already occupies 70% of the CPU and 380 MB of memory. The following AI inference applications need to be introduced: memory fault prediction (15% CPU, 50 MB), hard disk drive (HDD) fault prediction (5% CPU, 50 MB), solid state drive (SSD) fault prediction (5% CPU, 50 MB), and federated learning (20% CPU, 75 MB). If the trigger conditions of these four AI inference applications are met at the same time, the applications execute concurrently. The CPU usage is then 70% + 15% + 5% + 5% + 20% = 115%, and the memory usage is 380 + 50 + 50 + 50 + 75 = 605 MB. Both exceed the resource limits, which inevitably affects the execution of the core BMC system and its services.
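The arithmetic in the example above can be checked with a short Python snippet. The figures are those given in the example; the dictionary layout and the variable names are illustrative assumptions.

```python
# Baseline usage of the main system plus regular services (from the example).
BASE = {"cpu_pct": 70, "mem_mb": 380}
TOTAL_MEM_MB = 512

# (CPU %, memory MB) per AI application, as listed in the example.
AI_APPS = {
    "memory_fault_prediction": (15, 50),
    "hdd_fault_prediction": (5, 50),
    "ssd_fault_prediction": (5, 50),
    "federated_learning": (20, 75),
}

cpu = BASE["cpu_pct"] + sum(c for c, _ in AI_APPS.values())
mem = BASE["mem_mb"] + sum(m for _, m in AI_APPS.values())

print(cpu, mem)                       # 115 605
print(cpu > 100, mem > TOTAL_MEM_MB)  # True True
```

Both totals exceed what the BMC can supply, which is the situation the management system is intended to prevent.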
In view of this, this embodiment proposes a management approach within the BMC that decouples AI applications from their AI tasks, and provides a corresponding design for an AI task management system that remedies the shortcomings of the above approach through resource control, task scheduling, and the like. The solution provided by this embodiment helps the BMC execute multiple AI applications efficiently within limited resources, limits the resource overhead of the AI applications, and avoids affecting the execution of the basic service applications of the BMC.
The system architecture of the embodiments of the present application is described below by way of example.
FIG. 1 is a schematic diagram of a system architecture of a BMC provided by an embodiment of the present application. As shown in FIG. 1, the BMC includes at least one AI application, a management system, and a basic AI library. The components in FIG. 1 are described in detail in (1) to (3) below.
(1) AI application
An AI application in the BMC is, for example, an intelligent operation and maintenance application. AI applications include, but are not limited to, fault prediction applications, performance analysis applications, and power consumption analysis applications. A fault prediction application uses an AI algorithm to predict the probability that system hardware managed by the BMC will fail. Fault prediction applications include, but are not limited to, memory fault prediction applications and hard disk fault prediction applications; examples of the latter are HDD fault prediction applications and SSD fault prediction applications. A performance analysis application uses an AI algorithm to analyze the performance of the system hardware managed by the BMC. A power consumption analysis application uses an AI algorithm to analyze the power consumption of the system hardware managed by the BMC. Optionally, the power consumption analysis application also determines a reasonable power consumption for each piece of system hardware and provides it to the BMC, and the BMC adjusts the power consumption of the system hardware to the value determined by the power consumption analysis application.
(2) Management system
The management system is also called the AI task and model management system. The management system performs resource allocation and task scheduling for AI tasks. Specifically, the management system monitors the resource consumption of the AI tasks in the BMC and the resource consumption of the basic services of the BMC, and schedules jobs intelligently and dynamically based on quotas, priorities, resource consumption, execution plans, service cycles, and the like. The management system is software.
The management system is the AI runtime framework within the BMC. It provides the BMC with unified runtime control for the secure, high-performance execution of AI applications, and avoids the problem that AI applications, because of their large resource overhead, compete for resources when several of them run concurrently without coordination and thereby impair the effective execution of the basic services of the BMC. The management system includes a historical information regression learning module, a dynamic resource quota management module, and a task management (scheduling and execution) module.
The historical information regression learning module performs self-learning regression optimization based on historical information. Specifically, the module selects AI tasks for scheduling according to the system service cycle, whose characteristics are either memory intensive (MEM intensive, for example training and federated learning) or CPU intensive (for example inference and batch data processing). Once the self-learning regression optimization function is enabled, the management system automatically learns the current load cycle of the BMC, identifies the load characteristics in different scenarios, and dynamically drives AI tasks (for example, distinguishing training from inference) so that they are scheduled and executed in time periods with suitable resource usage.
The dynamic resource quota management module controls the total resource overhead of all AI tasks, implementing a pooled resource budget with an overall cap. At run time, the module dynamically controls the allocation and adjustment of resources for AI tasks based on resource quotas.
The task management (scheduling and execution) module supports execution plan management for AI tasks. Execution plans include timed execution plans, on-demand execution plans, and the like. The module also supports priority-based execution management, and it works together with the dynamic resource quota management module to improve resource utilization.
(3) Basic AI library
The basic AI library includes a basic math library and a data acquisition and processing module. The basic math library includes at least one model. A model in the basic math library is, for example, a model trained with a machine learning algorithm. For example, the basic math library includes a federated learning (Collaborative Metric Learning, CML) model, a neural network (NN) model, a random forest model, a K-means model, and the like.
The method flow of the embodiments of the present application is described below by way of example.
FIG. 2 is a flowchart of a task scheduling method 200 provided by an embodiment of the present application. The method 200 includes the following steps S201 to S205.
In some embodiments, the method 200 is based on the system architecture shown in FIG. 1. For example, with reference to FIG. 1, the BMC in the method 200 is the BMC in FIG. 1; the at least one AI application in the method 200 includes AI application 1, AI application 2, AI application 3, and AI application n in FIG. 1; and the basic AI library in the method 200 is the basic AI library in FIG. 1.
The method 200 involves multiple AI tasks and multiple AI applications. The terms "first AI task" and "second AI task" are used to distinguish different AI tasks, and the terms "first AI application" and "second AI application" are used to distinguish different AI applications.
Optionally, the method 200 is used in a scenario in which multiple AI applications execute concurrently. For ease of understanding, the method 200 is described by taking the interaction between the management system and the first AI application as an example. For the interaction between the management system and other AI applications, refer to the interaction flow with the first AI application.
Step S201: The first AI application generates a request and sends it to the management system.
The first AI application is an AI application in the BMC. The request of the first AI application instructs the management system to execute at least one AI task. In some embodiments, the request includes an identifier of at least one model in the basic AI library and at least one input parameter of the model.
The model identifier identifies the corresponding model in the basic AI library. By carrying the model identifier in the request, the first AI application specifies which model in the basic AI library is to be invoked for computation. For example, if the request includes the identifier of the random forest model, the request instructs the management system to invoke the random forest model to execute the AI task.
The input parameters in the request include attributes of the system hardware managed by the BMC. The specific type of attribute depends on the business logic of the first AI application and on the type of system hardware it targets. Take a hard disk as the system hardware. When the first AI application is a failure prediction application, the input parameters carried in the request are used to predict whether the hard disk will fail; specifically, they are health status information of the hard disk, such as its scan error count, reallocation count, and trial count. As another example, when the first AI application is a performance analysis application, the input parameters are performance parameters of the hard disk, such as its rotational speed, capacity, average seek time, and transfer rate.
Step S202: In response to the request of the first AI application, the management system generates at least one AI task using the basic AI library.
An AI task is a task that invokes a model in the basic AI library to perform computation. AI tasks are sometimes also called AI jobs. AI tasks include, but are not limited to, training tasks and inference tasks. Training tasks include, but are not limited to, computing the gradient values of a model and computing model parameters. An inference task performs inference using a trained model. In some embodiments, the request indicates that a classification model in the basic AI library is to be invoked, and the corresponding inference task is to determine a category or the probability of a category. In other embodiments, the request indicates that a regression model in the basic AI library is to be invoked, and the corresponding inference task is to determine a target value.
In some embodiments, an AI task is specifically a task of performing computation on the input parameters in the request of an AI application, using the model indicated in that request. Specifically, the management system obtains the model identifier and the input parameters of the model from the request of the first AI application. Based on the model identifier, the management system selects the corresponding model from the at least one model in the basic AI library, feeds the input parameters carried in the request into the model, and processes the input parameters through the model. The processing performed through the model is the AI task.
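The request handling described above can be sketched as follows. This is a minimal illustration only: the names Request, AITask, BASIC_AI_LIBRARY, and generate_task, as well as the two placeholder models, are assumptions introduced here and do not appear in the application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical stand-ins for models in the basic AI library.
def random_forest(params: List[float]) -> int:
    # Placeholder rule, not a real random forest: flag the disk if any counter is non-zero.
    return int(any(p > 0 for p in params))


def linear_regression(params: List[float]) -> float:
    # Placeholder regression: the mean of the inputs.
    return sum(params) / len(params)


BASIC_AI_LIBRARY: Dict[str, Callable[[List[float]], object]] = {
    "random_forest": random_forest,
    "linear_regression": linear_regression,
}


@dataclass
class Request:
    model_id: str               # identifier of the model to invoke
    input_params: List[float]   # e.g. scan error count, reallocation count


@dataclass
class AITask:
    model: Callable[[List[float]], object]
    input_params: List[float]

    def run(self) -> object:
        # Processing the input parameters through the model is the AI task.
        return self.model(self.input_params)


def generate_task(request: Request) -> AITask:
    """Step S202: look up the model named in the request and bind its inputs."""
    if request.model_id not in BASIC_AI_LIBRARY:
        raise KeyError(f"unknown model identifier: {request.model_id}")
    return AITask(BASIC_AI_LIBRARY[request.model_id], request.input_params)


task = generate_task(Request("random_forest", [0.0, 3.0, 0.0]))
result = task.run()
```

Keeping the model lookup inside the management system, rather than in the AI application, reflects the decoupling that the method aims for: the application only names a model and supplies inputs.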
Step S203: The management system allocates a first resource to the first AI task from the resources of the BMC.
The resources of the BMC include, but are not limited to, computing resources, storage resources, and network resources. Computing resources include, but are not limited to, CPU, memory, and GPU. Storage resources include hard disks, such as HDDs and SSDs. Network resources include bandwidth, internet protocol (IP) addresses, port numbers, and the like.
The first AI task is one of the at least one AI task. The first resource is the resource allocated to the first AI task, and is a portion of the resources of the BMC. For example, the first resource is a certain amount of CPU and a certain amount of memory space.
In some embodiments, the management system controls the resources occupied by AI tasks through a resource quota mechanism. Specifically, a quota limits the total resource consumption of AI tasks. The management system monitors the resources occupied by AI tasks to ensure that their resource usage does not exceed the quota. For example, when allocating resources to the first AI task, the management system allocates the first resource according to the quota, and the first resource does not exceed the quota. For example, if the resource required by the first AI task is memory and the quota is n megabytes, the memory allocated by the management system (the first resource) does not exceed n megabytes. In some embodiments, the quota is part of the BMC's configuration file and is preset by the user.
By managing the resources of AI tasks through quotas, the management system prevents AI tasks from using resources without restriction, and thus prevents the resources used by AI tasks from exceeding the upper limit and affecting the basic services of the BMC.
The resource quota mechanism can be implemented in multiple ways. Two implementations are described below as examples.
Implementation 1: Treat the at least one AI task as a whole, and cap the total resources by introducing an overall quota for these AI tasks.
Specifically, the quota includes an overall quota. When allocating resources, the management system allocates overall resources to the at least one AI task according to the overall quota. When allocating resources to the first AI task, the management system allocates the first resource from the overall resources.
The overall quota indicates the quota for the at least one AI task as a whole. For example, the overall quota is the sum of the quotas of the individual AI tasks. The overall resources that the management system allocates to the at least one AI task do not exceed the overall quota. The resources allocated to a single AI task are a portion of the overall resources and are at most equal to the overall resources. For example, when one AI task is executed, the resources that the management system allocates to that task are at most the overall quota. When multiple AI tasks are executed concurrently, the resources that the management system allocates to each AI task are less than the overall quota, and the sum of the resources allocated to all AI tasks does not exceed the overall quota.
For example, suppose the resource is memory and the overall quota is 20M. If there are currently n AI tasks in total, the management system allocates 20M of memory in total to the n AI tasks, and the AI tasks share these 20M among themselves. If only one AI task currently needs to be executed, that task obtains at most 20M of memory. If multiple AI tasks are currently executed concurrently, the memory allocated to each AI task is less than 20M, and the sum of the memory allocated to all AI tasks does not exceed 20M.
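The 20M example can be illustrated with a small allocator that never lets the sum of all grants exceed the overall quota. This is a sketch under stated assumptions: the class OverallQuota and its methods are hypothetical names, and granting min(requested, remaining) is only one possible policy.

```python
class OverallQuota:
    """Caps the summed allocation of all AI tasks at total_mb."""

    def __init__(self, total_mb: float) -> None:
        self.total_mb = total_mb
        self.allocated = {}  # task name -> MB currently granted

    def remaining(self) -> float:
        return self.total_mb - sum(self.allocated.values())

    def allocate(self, task: str, requested_mb: float) -> float:
        """Grant min(requested, remaining) and return the granted amount."""
        granted = min(requested_mb, self.remaining())
        self.allocated[task] = self.allocated.get(task, 0.0) + granted
        return granted

    def release(self, task: str) -> None:
        self.allocated.pop(task, None)


quota = OverallQuota(20.0)

# A single task may receive up to the whole pool, but no more.
solo = quota.allocate("AI-1", 25.0)
quota.release("AI-1")

# Concurrent tasks share the pool; the second request is trimmed to what is left.
first = quota.allocate("AI-1", 12.0)
second = quota.allocate("AI-2", 12.0)
```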
Introducing the overall quota mechanism helps control the overall resource overhead of all AI tasks in the BMC.
In some embodiments, a proportional quota is introduced for each AI task on top of the overall quota. Specifically, the quota includes not only the overall quota but also proportional quotas. A proportional quota, also called a resource ratio, indicates the proportion of the overall quota that an AI task's quota accounts for; an AI task with a larger proportional quota can obtain more resources. When allocating resources to the first AI task, the management system allocates the first resource from the overall resources according to the proportional quota of the first AI task. The first resource does not exceed the product of the overall resources and the proportional quota. For example, if the resource is memory, the overall quota is 20M, and the proportional quota of the first AI task is 0.7, the first resource does not exceed 14M.
In some embodiments, proportional quotas are used when multiple AI tasks are executed concurrently. Specifically, if the management system is to execute multiple AI tasks, for example not only the first AI task described above but also other AI tasks such as a second AI task, the management system allocates the first resource to the first AI task according to the proportional quota of the first AI task. If the management system executes only the first AI task, the management system optionally allocates to the first AI task resources exceeding its proportional quota, up to the overall resources of all AI tasks.
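As a worked illustration of the proportional quota, the limit for a task can be computed as overall quota × ratio when several tasks run, and as the overall quota itself when the task runs alone. The function memory_limit and its parameters are hypothetical names chosen for this sketch.

```python
from typing import Dict, Set


def memory_limit(task: str, running: Set[str], total_mb: float,
                 ratios: Dict[str, float]) -> float:
    """Upper bound for `task`: the whole pool if it runs alone, else total x ratio."""
    if running == {task}:
        return total_mb
    return total_mb * ratios[task]


ratios = {"AI-1": 0.7, "AI-2": 0.2}

# Concurrent execution: AI-1 is limited to 20M x 0.7, i.e. 14M.
concurrent_limit = memory_limit("AI-1", {"AI-1", "AI-2"}, 20.0, ratios)

# Sole execution: AI-1 may exceed its proportional quota, up to the full 20M.
solo_limit = memory_limit("AI-1", {"AI-1"}, 20.0, ratios)
```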
Implementation 2: Set a specific quota for each AI task individually. When allocating resources to the first AI task, the management system allocates the first resource according to the correspondence between AI tasks and quotas. The first resource does not exceed the quota of the first AI task.
In some embodiments, when the resources occupied by an AI task exceed the quota, the management system performs a specified action on the AI task to release the excess resources. Specified actions include, but are not limited to, killing the job and moving memory contents to a swap partition. For example, if the resources occupied by the first AI task exceed the quota, the management system kills the first AI task, thereby releasing the resources it occupies. As another example, where the resource is memory, if the resources occupied by the first AI task exceed the quota, the management system saves the data of the first AI task from the BMC's memory to the swap partition and releases the space that this data occupied in the BMC's memory, freeing memory for AI tasks other than the first AI task.
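The two enforcement actions can be contrasted in a small model. This sketch only tracks numbers and does not touch real processes or swap space; the names Action and enforce are assumptions, and moving exactly the excess to swap is one possible interpretation.

```python
from enum import Enum
from typing import Tuple


class Action(Enum):
    KILL = "kill"
    SWAP = "swap"


def enforce(usage_mb: float, quota_mb: float, action: Action) -> Tuple[float, float, bool]:
    """Return (resident_mb, swapped_mb, alive) after enforcing the quota."""
    if usage_mb <= quota_mb:
        return usage_mb, 0.0, True           # within quota: nothing to do
    if action is Action.KILL:
        return 0.0, 0.0, False               # task is killed, all memory released
    # SWAP: keep the task alive, move the excess out of memory.
    return quota_mb, usage_mb - quota_mb, True


within = enforce(10.0, 14.0, Action.KILL)
killed = enforce(16.0, 14.0, Action.KILL)
swapped = enforce(16.0, 14.0, Action.SWAP)
```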
In some embodiments, the management system controls the resource allocation of AI tasks based on their priorities. Specifically, a high-priority AI task is allowed to preempt the resources of a low-priority AI task. Resource preemption can be implemented by, without limitation, killing a task or invoking control groups (Cgroup) in Linux to adjust resources dynamically. For example, suppose there are a first AI task and a second AI task, and the priority of the second AI task is lower than that of the first AI task. When the management system allocates resources to the first AI task and the remaining resources do not meet the resource requirement of the first AI task, the management system kills the second AI task and allocates the first resource to the first AI task from the resources released by the second AI task. Alternatively, if the remaining resources do not meet the resource requirement of the first AI task, the management system invokes Cgroup to adjust the resources of the first AI task and of the second AI task. This helps improve resource utilization.
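The kill-based variant of preemption might look like the following sketch. It models memory only and does not call Cgroup; the Task class, the convention that a larger number means higher priority, and the function allocate_with_preemption are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    priority: int          # larger value = higher priority (assumed convention)
    held_mb: float = 0.0   # memory currently held


def allocate_with_preemption(new_task: Task, need_mb: float,
                             running: List[Task], total_mb: float) -> Optional[List[str]]:
    """Grant need_mb to new_task, killing lower-priority tasks only if required.

    Returns the names of killed tasks, or None if the demand cannot be met.
    """
    free = total_mb - sum(t.held_mb for t in running)
    victims = sorted((t for t in running if t.priority < new_task.priority),
                     key=lambda t: t.priority)
    if free + sum(v.held_mb for v in victims) < need_mb:
        return None  # even full preemption would not free enough
    killed = []
    for victim in victims:
        if free >= need_mb:
            break
        running.remove(victim)      # "kill" the task and reclaim its memory
        free += victim.held_mb
        killed.append(victim.name)
    new_task.held_mb = need_mb
    running.append(new_task)
    return killed


running = [Task("AI-2", priority=1, held_mb=10.0)]
first = Task("AI-1", priority=2)
killed = allocate_with_preemption(first, 14.0, running, total_mb=20.0)
```

Checking feasibility before killing anything avoids terminating a low-priority task when the high-priority request could not be satisfied anyway.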
In some embodiments, the management system also applies regression learning to historical information in order to select suitable AI tasks to execute in suitable time periods. Specifically, the management system determines the service cycle of the BMC based on the historical information, and determines the first AI task from the at least one AI task according to the service cycle.
The historical information includes the resource overhead of services in multiple historical time periods. The services include the basic services of the BMC and AI tasks. Basic services include, for example, monitoring fan status and logging. The historical information of the BMC's basic services includes the resource overhead of the basic services in historical time periods. The historical information of AI tasks includes the resource overhead of the AI tasks in historical time periods. In some embodiments, the historical information is collected in advance by the BMC. For example, the data acquisition and processing module in the basic AI library in the BMC collects and saves the historical information of the system hardware in advance.
The service cycle indicates the correspondence between the resource overhead of a service and time. For example, the service cycle takes the form of a curve whose peaks are the maxima of the resource overhead and whose troughs are the minima. In some embodiments, the service cycle covers multiple dimensions such as CPU and memory. The service cycle in the CPU dimension indicates the correspondence between a service's CPU overhead and time. The service cycle in the memory dimension indicates the correspondence between a service's memory overhead and time. In some embodiments, the service cycle includes the service cycle of the BMC's basic services and the service cycle of the at least one AI task.
In some embodiments, the management system uses a regression learning algorithm to determine the service cycle from the historical information. The regression learning algorithm is, for example, a time series forecasting algorithm. Time series forecasting algorithms include, but are not limited to, the autoregressive integrated moving average (ARIMA) model, exponential smoothing, and cycle identification algorithms.
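Of the algorithms listed, exponential smoothing is the simplest to illustrate. The sketch below implements standard simple exponential smoothing, s_t = alpha * x_t + (1 - alpha) * s_(t-1), over a series of resource overhead samples. The application does not specify which variant or smoothing factor is used, so the function name, the value of alpha, and the sample data are assumptions.

```python
from typing import List


def exponential_smoothing(series: List[float], alpha: float) -> List[float]:
    """Simple exponential smoothing; the first smoothed value equals the first sample."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical CPU overhead samples (percent) for one service.
cpu_history = [10.0, 20.0, 20.0]
cpu_trend = exponential_smoothing(cpu_history, alpha=0.5)
```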
In some embodiments, the management system selects which AI task to schedule based on the learned peak period of the basic services. For example, the management system determines the peak period of the basic services from the service cycle of the basic services, determines the peak period of the at least one AI task from the service cycle of the at least one AI task, and determines the first AI task according to the peak period of the basic services and the peak period of the at least one AI task.
The peak period is the time period in the service cycle that corresponds to the maximum resource overhead. For example, if the service cycle is a curve whose ordinate represents resource overhead and whose abscissa represents time, the peak period is the time period corresponding to the peak of the curve.
Optionally, after finding the peak period of the basic services and the peak periods of the AI tasks, the management system finds, from the at least one AI task, an AI task whose peak period is staggered from that of the basic services, and schedules the AI task found (the first AI task) for execution. In other words, the peak period of the first AI task differs from the peak period of the basic services. This achieves off-peak processing: it avoids the resource overhead of the AI task peaking at the same time as that of the basic services, and so reduces the impact of AI task execution on the basic services.
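Off-peak selection can be sketched by representing each service cycle as a list of resource overhead values, one per time slot. Treating every slot within 90% of the maximum as part of the peak period is an assumption of this sketch, as are the function names and the sample profiles.

```python
from typing import Dict, List, Set


def peak_period(profile: List[float], fraction: float = 0.9) -> Set[int]:
    """Indices of time slots whose overhead is at least fraction x the maximum."""
    top = max(profile)
    return {i for i, value in enumerate(profile) if value >= fraction * top}


def pick_staggered(base_profile: List[float],
                   task_profiles: Dict[str, List[float]]) -> List[str]:
    """Names of AI tasks whose peak period does not overlap the basic services' peak."""
    base_peak = peak_period(base_profile)
    return [name for name, profile in task_profiles.items()
            if not (peak_period(profile) & base_peak)]


# Hypothetical overhead per time slot (four slots per cycle).
base = [10.0, 80.0, 90.0, 20.0]           # basic services peak in slot 2
tasks = {
    "AI-1": [5.0, 5.0, 50.0, 5.0],        # also peaks in slot 2
    "AI-2": [40.0, 5.0, 5.0, 10.0],       # peaks in slot 0
}
candidates = pick_staggered(base, tasks)
```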
Step S204: The management system executes the first AI task using the first resource, thereby obtaining an execution result.
The execution result is the result obtained by executing the first AI task. Specifically, the execution result is the output of the model in the basic AI library. The execution result includes, but is not limited to, a binary classification result, a classification probability, or a predicted target value. For example, suppose the first AI application is an SSD failure prediction application and the model in the basic AI library invoked by the management system is a classification model. In this scenario, the execution result is, for example, a binary classification result: 0 indicates that the SSD is a normal disk, and 1 indicates that the SSD is a faulty disk. As another example, the execution result is a probability with a value in the range (0, 1); the larger the value, the more likely the SSD is a faulty disk.
In some embodiments, the management system determines the execution order of multiple AI tasks according to their priorities: the higher the priority of an AI task, the earlier it is executed. For example, there are three priorities: high, medium, and low. All AI tasks are classified by priority into high-priority, medium-priority, and low-priority AI tasks. The management system sets up three queues: a high-priority queue, a medium-priority queue, and a low-priority queue. Each queue buffers the AI tasks of the corresponding priority. When the management system is about to execute a task, it first obtains and executes tasks from the high-priority queue. For example, the management system first checks whether the high-priority queue contains an AI task. If it does, the management system obtains the AI task from the high-priority queue and executes it. If it does not, the management system checks whether the medium-priority queue contains an AI task. If the medium-priority queue does not contain an AI task, the management system checks whether the low-priority queue contains an AI task. Supporting priority-based execution management helps improve resource utilization.
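The three-queue lookup order can be sketched as follows. The class PriorityScheduler and its method names are hypothetical; tasks are represented by plain strings for brevity.

```python
from collections import deque
from typing import Optional


class PriorityScheduler:
    """Three FIFO queues, always drained in high -> medium -> low order."""

    LEVELS = ("high", "medium", "low")

    def __init__(self) -> None:
        self.queues = {level: deque() for level in self.LEVELS}

    def submit(self, task: str, level: str) -> None:
        self.queues[level].append(task)

    def next_task(self) -> Optional[str]:
        # Check the high-priority queue first, then medium, then low.
        for level in self.LEVELS:
            if self.queues[level]:
                return self.queues[level].popleft()
        return None  # all queues are empty


scheduler = PriorityScheduler()
scheduler.submit("AI-3", "low")
scheduler.submit("AI-1", "high")
scheduler.submit("AI-2", "medium")
order = [scheduler.next_task() for _ in range(4)]
```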
In some embodiments, the management system determines the execution time of an AI task based on the peak period of the basic services learned from history. Specifically, the management system determines a target time according to the service cycle of the BMC's basic services, and at the target time executes the first AI task using the first resource. The target time is the point in time at which the first AI task is executed, and it lies in a time period outside the peak period of the basic services. In this way, the execution time of the AI task is staggered from the peak period of the basic services, so that AI tasks are not executed during that peak period and do not affect the basic services.
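Choosing a target time outside the peak period can be sketched as a search for the next off-peak slot in a cyclic schedule. Dividing the service cycle into numbered slots (for example 24 hourly slots) is an assumption of this sketch, as is the function name.

```python
from typing import Optional, Set


def next_off_peak_slot(now: int, base_peak: Set[int],
                       slots_per_cycle: int) -> Optional[int]:
    """First slot at or after `now` (wrapping around) that is outside the peak period."""
    for offset in range(slots_per_cycle):
        slot = (now + offset) % slots_per_cycle
        if slot not in base_peak:
            return slot
    return None  # every slot is a peak slot


# Hypothetical: basic services peak from 09:00 to 11:59 in a 24-slot cycle.
target = next_off_peak_slot(now=9, base_peak={9, 10, 11}, slots_per_cycle=24)
```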
In some embodiments, an execution plan can be set for an AI task. The execution plan indicates the point in time at which execution of the AI task starts. Execution plans include, but are not limited to, timed execution plans and on-demand execution plans. The execution plan is, for example, a preset configuration.
For example, for the first AI task, the management system obtains the execution plan of the first AI task and executes the first AI task according to that plan. The execution plan indicates the point in time at which execution of the first AI task starts.
Optionally, a timed execution plan specifies either a specific point in time or a time period. For example, the execution plan of the first AI task is a timed execution plan that indicates that the first AI task is to be executed at a preset point in time; the management system obtains the preset point in time from the plan and starts executing the first AI task at that time. As another example, the execution plan of the first AI task is a timed execution plan that indicates that the first AI task is to be executed once every preset period; the management system obtains the preset period from the plan, starts a timer, and starts executing the first AI task once every preset period.
As another example, the execution plan of the first AI task is an on-demand execution plan, which indicates that the first AI task is to be executed when an instruction is received. The instruction instructs the management system to start executing the first AI task. An upper-layer service application or an AI application can generate the instruction and send it to the management system when needed. On receiving the instruction, the management system starts executing the first AI task according to the on-demand execution plan.
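The three kinds of plan described above (a fixed time point, a fixed period, and on demand) can be modeled with a single decision function. The class names AtTime, Every, and OnDemand and the function should_start are hypothetical; starting a periodic task immediately on its first check is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class AtTime:
    at: float        # preset point in time (seconds, any monotonic scale)


@dataclass
class Every:
    period: float    # preset period between starts


@dataclass
class OnDemand:
    pass             # starts only when an instruction arrives


Plan = Union[AtTime, Every, OnDemand]


def should_start(plan: Plan, now: float, last_start: Optional[float],
                 instruction_received: bool = False) -> bool:
    """Decide whether the AI task should start at time `now`."""
    if isinstance(plan, AtTime):
        return last_start is None and now >= plan.at
    if isinstance(plan, Every):
        return last_start is None or now - last_start >= plan.period
    return instruction_received


at_plan_early = should_start(AtTime(100.0), now=50.0, last_start=None)
at_plan_due = should_start(AtTime(100.0), now=100.0, last_start=None)
periodic_due = should_start(Every(60.0), now=130.0, last_start=60.0)
on_demand = should_start(OnDemand(), now=0.0, last_start=None, instruction_received=True)
```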
Step S205: The management system provides the execution result to the first AI application.
The management system can provide the execution result in many ways. For example, the management system sends the execution result to the first AI application. In one possible implementation, the management system provides a query interface for execution results. The first AI application calls the query interface to send a query request to the management system. In response to the query request, the management system sends the first AI application a query response that includes the execution result. The first AI application receives the query response and obtains the execution result from it. As another example, the management system saves the execution result to a specified address, and the first AI application accesses the specified address to obtain the execution result.
The method provided in this embodiment introduces a management system for AI tasks into the BMC, decoupling AI applications from AI tasks and from the underlying basic AI library. The management system performs resource management and task scheduling for all AI tasks in a unified manner. This strengthens control over AI tasks, prevents AI tasks from using more resources than expected without restriction, and keeps AI tasks from occupying so many resources that the basic services of the BMC are affected, which helps to some extent to ensure the stability of the BMC's basic services.
Method 200 is illustrated below with an example. The resources in method 200 shown in FIG. 2 are the CPU and memory in Example 1. The first AI task in method 200 is AI-1 in Example 1, and the second AI task in method 200 is AI-2 in Example 1.
Example 1
Refer to FIG. 3, which is a flowchart of Example 1. A configuration file is first input to the management system of the BMC, and then the following steps 301 to 313 are performed.
The configuration file includes the overall quota for AI features in the BMC and the proportional quota of each AI task. The overall quota for AI features includes an overall quota in the CPU dimension and an overall quota in the memory dimension. The overall quota in the CPU dimension is 50% of the CPU. The overall quota in the memory dimension is 20M. The BMC contains the AI task memory failure prediction (AI-1 for short), the AI task HDD failure prediction (AI-2 for short), and AI-3. The configuration file sets the proportional quota of AI-1 to 0.7, that of AI-2 to 0.2, and that of AI-3 to 0.1. AI-1 has high priority, AI-2 has medium priority, and AI-3 has low priority.
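The configuration of Example 1 and the per-task limits it implies can be written out as follows. The application does not specify the file format, so the dictionary layout, the key names, and the helper per_task_limits are assumptions; only the numeric values come from the example.

```python
from typing import Dict

# Hypothetical in-memory form of the Example 1 configuration file.
CONFIG = {
    "overall_quota": {"cpu_percent": 50.0, "memory_mb": 20.0},
    "tasks": {
        "AI-1": {"ratio": 0.7, "priority": "high"},    # memory failure prediction
        "AI-2": {"ratio": 0.2, "priority": "medium"},  # HDD failure prediction
        "AI-3": {"ratio": 0.1, "priority": "low"},
    },
}


def per_task_limits(config: dict) -> Dict[str, Dict[str, float]]:
    """Per-task limit = overall quota x proportional quota, in each dimension."""
    overall = config["overall_quota"]
    return {
        name: {
            "cpu_percent": overall["cpu_percent"] * task["ratio"],
            "memory_mb": overall["memory_mb"] * task["ratio"],
        }
        for name, task in config["tasks"].items()
    }


limits = per_task_limits(CONFIG)
# AI-1: about 35% CPU and 14M memory; AI-2: about 10% and 4M; AI-3: about 5% and 2M.
```

These limits apply when the tasks compete for resources; as described for the proportional quota, a task running alone may use up to the overall quota.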
Step 301: The management system starts and loads the configured quotas of the AI tasks. Specifically, the management system allocates 20M of memory in total to the AI tasks and allows the AI tasks to occupy at most 50% of the CPU.
Step 302: The management system checks whether historical information regression learning is enabled. If it is enabled, step 303 is performed. If it is not enabled, step 304 is performed.
Step 303: The management system selects among AI-1, AI-2, and AI-3 based on the service cycle.
Step 304: The management system loads AI-1, AI-2, and AI-3 into the task execution queue.
Step 305: The management system executes tasks AI-1, AI-2, and AI-3 according to the configured quotas. Specifically, if AI-1 is not executing, or AI-1 has a light workload and occupies few resources, AI-2 may occupy more than 20M × 0.2 of memory and more than 50% × 0.2 of CPU. If both AI-1 and AI-2 are executing and both are computationally intensive, the management system limits the CPU occupied by AI-1 to 50% × 0.7 and the memory occupied by AI-1 to 20M × 0.7, and limits the CPU occupied by AI-2 to 50% × 0.2 and the memory occupied by AI-2 to 20M × 0.2. For details, see steps 306 to 313 in FIG. 3.
Step 306: Task AI-2 needs additional resources.
Step 307: The management system checks whether any of the overall resources corresponding to the overall quota remain. If so, go to step 308. If not, go to step 313.
Step 308: The management system determines that task AI-1 has not used up the overall resources.
Step 309: The management system allocates the remaining overall resources (the resources not used by AI-1) to AI-2.
Step 310: Task AI-1 needs to consume more resources (up to its full quota).
Step 311: The management system reduces AI-2's resources to (50% × 0.2) and saves the extra data in AI-2's memory to swap.
Step 312: Task AI-1 uses its given resources at full quota.
Step 313: AI-2 continues to execute within its set quota (no additional resources are allocated).
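Steps 306 to 313 can be sketched, for a single resource dimension such as memory, as a pool from which a task may borrow unused resources and from which borrowed resources are reclaimed when another task needs its own quota. This is a minimal sketch under stated assumptions, not the implementation of the application: the class `ElasticPool`, its methods, and the `swapped` bookkeeping are hypothetical, and real enforcement (for example through a control group) is not modeled.

```python
class ElasticPool:
    """Hypothetical model of one resource dimension shared by the AI tasks."""

    def __init__(self, total, ratios):
        self.total = total                       # overall quota, e.g. 20 (M)
        self.ratios = dict(ratios)               # proportional quota per task
        self.used = {name: 0.0 for name in ratios}
        self.swapped = {name: 0.0 for name in ratios}

    def quota(self, name):
        """Cap that applies to a task when every task is busy."""
        return self.total * self.ratios[name]

    def free(self):
        """Part of the overall quota that no task currently occupies."""
        return self.total - sum(self.used.values())

    def _reclaim(self, requester):
        # Step 311: shrink every other over-quota task back to its quota and
        # record the excess as moved to swap.
        for name in self.used:
            if name == requester:
                continue
            excess = self.used[name] - self.quota(name)
            if excess > 0:
                self.used[name] = self.quota(name)
                self.swapped[name] += excess

    def request(self, name, amount):
        """Try to give `amount` more to `name`; return what was granted."""
        headroom = max(0.0, self.quota(name) - self.used[name])
        entitled = min(amount, headroom)
        if self.free() < entitled:
            # Steps 310 to 312: the requester is within its own quota, so
            # borrowed resources are taken back first.
            self._reclaim(name)
        # Steps 307 to 309 and 313: anything beyond the quota is granted only
        # from what is left over; if nothing is left, nothing extra is given.
        granted = min(amount, self.free())
        self.used[name] += granted
        return granted


pool = ElasticPool(20.0, {"AI-1": 0.7, "AI-2": 0.2, "AI-3": 0.1})
pool.request("AI-1", 6.0)    # AI-1 is lightly loaded
pool.request("AI-2", 10.0)   # AI-2 borrows beyond its quota
pool.request("AI-1", 8.0)    # AI-1 now needs its full quota
print(pool.used, pool.swapped)
```

In this sketch a borrower is always cut back to exactly its own quota, which mirrors step 311; a gentler policy that reclaims only the amount actually needed would be an equally plausible design.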
In the above method, once the self-learning regression optimization based on historical information is enabled, the management system automatically learns the load cycle of the current device, identifies load characteristics in different scenarios, dynamically drives AI jobs (for example, distinguishing training from inference), and schedules them differently in time periods when resource usage is reasonable.
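The idea of learning a load cycle from history and scheduling AI jobs away from busy periods can be illustrated with the following minimal Python sketch. It is an assumption-laden simplification: plain per-hour averaging stands in for the regression learning that the text mentions, the `fraction` threshold used to mark peak hours is invented for illustration, and the function names are hypothetical.

```python
from collections import defaultdict


def hourly_profile(samples):
    """Average load per hour of day from (hour, load) history samples."""
    buckets = defaultdict(list)
    for hour, load in samples:
        buckets[hour % 24].append(load)
    return {hour: sum(loads) / len(loads) for hour, loads in buckets.items()}


def peak_hours(profile, fraction=0.8):
    """Hours whose average load is at least `fraction` of the maximum load."""
    top = max(profile.values())
    return {hour for hour, load in profile.items() if load >= fraction * top}


def pick_offpeak_hour(profile, fraction=0.8):
    """Return the least loaded hour outside the peak hours, or None."""
    busy = peak_hours(profile, fraction)
    candidates = {h: load for h, load in profile.items() if h not in busy}
    if not candidates:
        return None
    # Ties are broken by the earlier hour so the result is deterministic.
    return min(candidates, key=lambda h: (candidates[h], h))


# Hypothetical history of the BMC base system: (hour of day, load in percent).
history = [(9, 80), (9, 90), (14, 70), (2, 10), (2, 20), (20, 40)]
profile = hourly_profile(history)
print("peak hours:", sorted(peak_hours(profile)))
print("run the AI job at hour:", pick_offpeak_hour(profile))
```

A resource-intensive job such as training would then be started at the returned hour, while a lightweight inference job could be allowed to run in any non-peak hour.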
The beneficial effects of this embodiment are described below with reference to two drawings.
In both FIG. 4 and FIG. 5, the abscissa represents time and the ordinate represents the resource overhead of AI tasks (such as memory usage or memory ratio). The two curves in FIG. 4 and FIG. 5 represent the resource overhead of AI-1 and of AI-2, respectively, in each time period. FIG. 4 shows the resource overhead without the management system. FIG. 5 shows the resource overhead when the management system performs scheduling. The bold straight line in FIG. 4, also called the expected cap, is the expected maximum resource overhead, that is, the most resources that all AI tasks together are expected to occupy. The bold straight line in FIG. 5 represents the overall quota, also called the total AI task package, in units of M.
Referring to FIG. 4, when no management system takes part in resource allocation and task scheduling, AI applications have high resource overhead, the lack of control leads to resource contention, and concurrent execution occupies extra system resources, which affects the execution of the BMC base system. Specifically, as shown in FIG. 4, the resource overhead of AI-1 exceeds the expected maximum, and AI-1 and AI-2 frequently execute concurrently.
Referring to FIG. 5, with the management system for AI tasks introduced, the management system controls the concurrent execution of multiple AI jobs in an orderly manner and steers them away from the service peak periods derived from historical learning, which ensures that the basic capabilities of the BMC base system run stably and that resource-intensive AI jobs execute correctly.
The basic hardware structure of the BMC is described below by way of example.
FIG. 6 is a schematic structural diagram of a BMC provided by an embodiment of this application. The BMC 600 shown in FIG. 6 is used to implement the method described above with reference to FIG. 2 or FIG. 3.
Optionally, with reference to FIG. 1, the BMC 600 shown in FIG. 6 is the BMC in FIG. 1.
The BMC 600 includes at least one processor 601, a communication bus 602, a memory 603, and at least one network interface 604.
The processor 601 is, for example, a general-purpose central processing unit (CPU), a network processor (NP), a graphics processing unit (GPU), a neural-network processing unit (NPU), a data processing unit (DPU), a microprocessor, or one or more integrated circuits for implementing the solutions of this application. For example, the processor 601 includes an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD is, for example, a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The communication bus 602 is used to transfer information between the above components. The communication bus 602 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, it is represented by a single thick line in FIG. 6, but this does not mean that there is only one bus or only one type of bus.
The memory 603 is, for example, a read-only memory (ROM) or another type of static storage device that can store static information and instructions; a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; an electrically erasable programmable read-only memory (EEPROM); a compact disc read-only memory (CD-ROM) or other optical disk storage; optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like); a magnetic disk storage medium or another magnetic storage device; or any other medium that can be used to carry or store desired computer instructions in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 603 may, for example, exist independently and be connected to the processor 601 through the communication bus 602. The memory 603 may also be integrated with the processor 601.
The network interface 604 uses any transceiver-type apparatus to communicate with other devices or communication networks. The network interface 604 includes a wired network interface and may also include a wireless network interface. The wired network interface may be, for example, an Ethernet interface. The Ethernet interface may be an optical interface, an electrical interface, or a combination thereof. The wireless network interface may be a wireless local area network (WLAN) interface, a cellular network interface, or a combination thereof.
In a specific implementation, as an embodiment, the processor 601 may include one or more CPUs, such as CPU0 and CPU1 shown in FIG. 6.
In a specific implementation, as an embodiment, the BMC 600 may include multiple processors, such as the processor 601 and the processor 605 shown in FIG. 6. Each of these processors may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
In a specific implementation, as an embodiment, the BMC 600 may further include an output device and an input device. The output device communicates with the processor 601 and can display information in multiple ways. For example, the output device may be a liquid crystal display (LCD), a light-emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device communicates with the processor 601 and can receive user input in multiple ways. For example, the input device may be a mouse, a keyboard, a touchscreen device, or a sensing device.
Optionally, the processor 601 implements the methods in the above embodiments by reading the computer instructions 610 stored in the memory 603, or by means of internally stored computer instructions. In the former case, the memory 603 stores the computer instructions that implement the methods provided by the embodiments of this application.
For more details on how the processor 601 implements the above functions, refer to the descriptions in the foregoing method embodiments; they are not repeated here.
FIG. 7 is a schematic structural diagram of a management system 70 provided by an embodiment of this application. The management system 70 shown in FIG. 7 implements, for example, the functions of the management system in method 200.
Referring to FIG. 7, the management system 70 includes a processing unit 701 and a providing unit 702. Each unit in the management system 70 is implemented in whole or in part by software, hardware, firmware, or any combination thereof. The units in the management system 70 are used to perform the corresponding functions of the management system in method 200. Specifically, the processing unit 701 is used to support the management system 70 in performing S202 to S204. The providing unit 702 is used to support the management system 70 in performing S205.
The apparatus embodiment described in FIG. 7 is merely illustrative. For example, the division into the above units is only a division by logical function; in an actual implementation there may be other ways of dividing, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. The functional units in the embodiments of this application may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The units in FIG. 7 may be implemented in the form of hardware or in the form of software functional units. For example, when implemented in software, the processing unit 701 may be a software functional unit generated after at least one processor 601 in FIG. 6 reads the program code stored in the memory 603. The units in FIG. 7 may also be implemented by different hardware in the management system 70. For example, the processing unit 701 is implemented by part of the processing resources of at least one processor 601 in FIG. 6 (for example, one or two cores of a multi-core processor), while the providing unit 702 is implemented by the remaining processing resources of the at least one processor 601 in FIG. 6 (for example, the other cores of the multi-core processor), or by a programmable device such as a field-programmable gate array (FPGA) or a coprocessor. Alternatively, the providing unit 702 is implemented by the network interface 604 in FIG. 6. Clearly, the above functional units may also be implemented by a combination of software and hardware; for example, the providing unit 702 is implemented by a hardware programmable device, while the processing unit 701 is a software functional unit generated after the CPU reads the program code stored in the memory.
In the description of the embodiments of this application, unless otherwise stated, "at least one" means one or more, and "multiple" means two or more. For example, multiple AI applications means two or more AI applications. "A refers to B" means that A is the same as B or that A is a simple variation of B.
The embodiments in this specification are described progressively. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state disk (SSD)).
As described above, the foregoing embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.

Claims (22)

  1. A task scheduling method, wherein the method is applied to a baseboard management controller (BMC), the BMC comprises at least one artificial intelligence (AI) application, a management system, and a basic AI library, and the method comprises:
    the management system, in response to a request from a first AI application of the at least one AI application, generates at least one AI task using the basic AI library;
    the management system allocates a first resource to a first AI task from the resources of the BMC, wherein the first AI task is one of the at least one AI task;
    the management system executes the first AI task using the first resource, thereby obtaining an execution result;
    the management system provides the execution result to the first AI application.
  2. The method according to claim 1, wherein the management system allocating the first resource to the first AI task from the resources of the BMC comprises:
    the management system allocates the first resource to the first AI task according to a quota, wherein the first resource does not exceed the quota.
  3. The method according to claim 2, wherein the quota comprises an overall quota, the overall quota indicates a quota for the at least one AI task as a whole, and the management system allocating the first resource to the first AI task according to the quota comprises:
    the management system allocates overall resources to the at least one AI task according to the overall quota, wherein the overall resources do not exceed the overall quota;
    the management system allocates the first resource from the overall resources.
  4. The method according to claim 3, wherein the quota further comprises a proportional quota, the proportional quota indicates the proportion of the overall quota that the quota of the first AI task represents, and the management system allocating the first resource from the overall resources comprises:
    the management system allocates the first resource from the overall resources according to the proportional quota, wherein the first resource does not exceed the product of the overall resources and the proportional quota.
  5. The method according to claim 2, wherein after the management system allocates the first resource to the first AI task according to the quota, the method further comprises:
    if the resources occupied by the first AI task exceed the quota, the management system kills the first AI task; or,
    if the resources occupied by the first AI task exceed the quota, the management system saves the data of the first AI task in the memory of the BMC to a swap partition and releases the space that the data occupies in the memory of the BMC.
  6. The method according to claim 1, wherein the first resource is a resource originally occupied by a second AI task of the at least one AI task, the priority of the second AI task is lower than the priority of the first AI task, and the management system allocating the first resource to the first AI task from the resources of the BMC comprises:
    if the remaining resources do not meet the resource requirement of the first AI task, the management system kills the second AI task and allocates the first resource to the first AI task from the resources released by the second AI task; or,
    if the remaining resources do not meet the resource requirement of the first AI task, the management system invokes a control group (Cgroup) to adjust the resources of the first AI task and the resources of the second AI task.
  7. The method according to claim 1, wherein before the management system executes the first AI task using the first resource, the method further comprises:
    the management system determines a service cycle of the BMC based on historical information, wherein the service cycle indicates the correspondence between the resource overhead of a service and time;
    the management system determines the first AI task from the at least one AI task according to the service cycle.
  8. The method according to claim 7, wherein the service cycle comprises a service cycle of a basic service of the BMC and a service cycle of the at least one AI task, and the management system determining the first AI task from the at least one AI task according to the service cycle comprises:
    the management system determines a peak period of the basic service from the service cycle of the basic service, wherein a peak period is the time period corresponding to the maximum resource overhead in a service cycle;
    the management system determines a peak period of the at least one AI task from the service cycle of the at least one AI task;
    the management system determines the first AI task according to the peak period of the basic service and the peak period of the at least one AI task, wherein the peak period of the first AI task differs from the peak period of the basic service.
  9. The method according to claim 1, wherein the management system executing the first AI task using the first resource comprises:
    the management system determines a target time according to the service cycle of the basic service of the BMC, wherein the target time lies in a time period outside the peak period of the basic service;
    the management system executes the first AI task using the first resource at the target time.
  10. The method according to any one of claims 7 to 9, wherein the management system determining the service cycle of the management system based on historical information comprises:
    the management system determines the service cycle based on the historical information by using a regression learning algorithm.
  11. The method according to claim 1, wherein the management system executing the first AI task comprises:
    the management system executes the first AI task according to an execution plan of the first AI task, wherein the execution plan indicates the time point at which execution of the first AI task starts.
  12. The method according to claim 11, wherein the execution plan comprises a scheduled execution plan and an on-demand execution plan, the scheduled execution plan indicates that the first AI task is executed at a preset time point or once every preset period, and the on-demand execution plan indicates that the first AI task is executed when an instruction is received.
  13. The method according to claim 1, wherein the at least one AI task is a plurality of AI tasks, and after the management system obtains the at least one AI task, the method further comprises:
    the management system determines the execution order of the plurality of AI tasks according to their priorities, wherein the higher the priority of an AI task, the earlier that AI task is executed.
  14. A management system, wherein the management system is applied to a baseboard management controller (BMC), the BMC comprises at least one artificial intelligence (AI) application, the management system, and a basic AI library, and the management system comprises:
    a processing unit, configured to generate at least one AI task using the basic AI library in response to a request from a first AI application of the at least one AI application;
    the processing unit is further configured to allocate a first resource to a first AI task from the resources of the BMC, wherein the first AI task is one of the at least one AI task;
    the processing unit is further configured to execute the first AI task using the first resource, thereby obtaining an execution result;
    a providing unit, configured to provide the execution result to the first AI application.
  15. The management system according to claim 14, wherein the processing unit is configured to allocate the first resource to the first AI task according to a quota, wherein the first resource does not exceed the quota.
  16. The management system according to claim 15, wherein the quota comprises an overall quota, the overall quota indicates a quota for the at least one AI task as a whole, and the processing unit is configured to: allocate overall resources to the at least one AI task according to the overall quota, wherein the overall resources do not exceed the overall quota; and allocate the first resource from the overall resources.
  17. The management system according to claim 16, wherein the quota further comprises a proportional quota, the proportional quota indicates the proportion of the overall quota that the quota of the first AI task represents, and the processing unit is configured to allocate the first resource from the overall resources according to the proportional quota, wherein the first resource does not exceed the product of the overall resources and the proportional quota.
  18. The management system according to claim 15, wherein the processing unit is further configured to: if the resources occupied by the first AI task exceed the quota, kill the first AI task; or, if the resources occupied by the first AI task exceed the quota, save the data of the first AI task in the memory of the BMC to a swap partition and release the space that the data occupies in the memory of the BMC.
  19. The management system according to claim 14, wherein the first resource is a resource originally occupied by a second AI task of the at least one AI task, the priority of the second AI task is lower than the priority of the first AI task, and the processing unit is further configured to: if the remaining resources do not meet the resource requirement of the first AI task, kill the second AI task and allocate the first resource to the first AI task from the resources released by the second AI task; or, if the remaining resources do not meet the resource requirement of the first AI task, invoke a control group (Cgroup) to adjust the resources of the first AI task and the resources of the second AI task.
  20. A baseboard management controller (BMC), wherein the BMC comprises the management system according to any one of claims 14 to 19, at least one AI application, and a basic AI library.
  21. A baseboard management controller (BMC), comprising:
    a processor and a memory;
    the memory is configured to store computer instructions;
    the processor is configured to execute the computer instructions stored in the memory, so that the BMC performs the method according to any one of claims 1 to 13.
  22. A computer-readable storage medium, wherein the computer-readable storage medium comprises computer instructions, and the computer instructions instruct a baseboard management controller (BMC) to perform the method according to any one of claims 1 to 13.
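For illustration only, the priority-based ordering of claim 13 and the choice of a lower-priority task to give up resources in claim 6 can be sketched in Python as follows. The sketch is not part of the claims: the `PRIORITY` mapping and the functions `execution_order` and `pick_victim` are hypothetical, and the sketch covers only the selection of the task to be killed, not the alternative of adjusting resources through a control group.

```python
import heapq

# Hypothetical numeric ranks: a smaller number means a higher priority.
PRIORITY = {"high": 0, "medium": 1, "low": 2}


def execution_order(tasks):
    """Order (name, priority) pairs so that higher priority runs earlier.

    The submission index breaks ties, so equal-priority tasks keep the order
    in which they were submitted.
    """
    heap = [(PRIORITY[prio], index, name)
            for index, (name, prio) in enumerate(tasks)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


def pick_victim(running, new_priority):
    """Pick the lowest-priority running task that ranks below the new task.

    `running` maps task name to priority. Returns None when no running task
    has a lower priority than the new task.
    """
    lower = [(PRIORITY[prio], name) for name, prio in running.items()
             if PRIORITY[prio] > PRIORITY[new_priority]]
    return max(lower)[1] if lower else None


print(execution_order([("AI-3", "low"), ("AI-1", "high"), ("AI-2", "medium")]))
print(pick_victim({"AI-2": "medium", "AI-3": "low"}, "high"))
```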
PCT/CN2021/141119 2021-01-13 2021-12-24 Task scheduling method and management system WO2022151951A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110044643.6 2021-01-13
CN202110044643.6A CN114764371A (en) 2021-01-13 2021-01-13 Task scheduling method and management system

Publications (1)

Publication Number Publication Date
WO2022151951A1 (en) 2022-07-21

Family

ID=82363650

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/141119 WO2022151951A1 (en) 2021-01-13 2021-12-24 Task scheduling method and management system

Country Status (2)

Country Link
CN (1) CN114764371A (en)
WO (1) WO2022151951A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070255833A1 (en) * 2006-04-27 2007-11-01 Infosys Technologies, Ltd. System and methods for managing resources in grid computing
CN104301257A (en) * 2014-09-17 2015-01-21 华为技术有限公司 Resource distribution method, device and equipment
CN108052384A (en) * 2017-12-27 2018-05-18 联想(北京)有限公司 A kind of task processing method, service platform and electronic equipment
CN110995614A (en) * 2019-11-05 2020-04-10 华为技术有限公司 Computing power resource allocation method and device
CN111367679A (en) * 2020-03-31 2020-07-03 中国建设银行股份有限公司 Artificial intelligence computing power resource multiplexing method and device
CN111738488A (en) * 2020-05-14 2020-10-02 华为技术有限公司 Task scheduling method and device

Also Published As

Publication number Publication date
CN114764371A (en) 2022-07-19

Similar Documents

Publication Publication Date Title
US11507430B2 (en) Accelerated resource allocation techniques
JP6278320B2 (en) End-to-end data center performance control
US7945913B2 (en) Method, system and computer program product for optimizing allocation of resources on partitions of a data processing system
JP3978199B2 (en) Resource utilization and application performance monitoring system and monitoring method
US8910153B2 (en) Managing virtualized accelerators using admission control, load balancing and scheduling
US11206193B2 (en) Method and system for provisioning resources in cloud computing
Hashem et al. MapReduce scheduling algorithms: a review
CN110221920B (en) Deployment method, device, storage medium and system
CN109564528B (en) System and method for computing resource allocation in distributed computing
WO2021136137A1 (en) Resource scheduling method and apparatus, and related device
WO2017010922A1 (en) Allocation of cloud computing resources
JP2013515991A (en) Method, information processing system, and computer program for dynamically managing accelerator resources
US11467874B2 (en) System and method for resource management
JP2015146154A (en) Job scheduling apparatus, job scheduling method and job scheduling program
US20140201371A1 (en) Balancing the allocation of virtual machines in cloud systems
Han et al. Energy efficient VM scheduling for big data processing in cloud computing environments
CN112181613A (en) Heterogeneous resource distributed computing platform batch task scheduling method and storage medium
WO2022151951A1 (en) Task scheduling method and management system
US11776087B2 (en) Function-as-a-service (FAAS) model for specialized processing units
CN115098269A (en) Resource allocation method, device, electronic equipment and storage medium
CN112416538B (en) Multi-level architecture and management method of distributed resource management framework
US11729119B2 (en) Dynamic queue management of network traffic
Huang et al. Hestia: A Cost-Effective Multi-dimensional Resource Utilization for Microservices Execution in the Cloud
US20220035429A1 (en) Method and system for intelligent power distribution management
US20210405728A1 (en) Dynamic power capping of computing systems and subsystems contained therein

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 21919138; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

122 EP: PCT application non-entry in European phase
    Ref document number: 21919138; Country of ref document: EP; Kind code of ref document: A1