WO2022211980A1 - Planet-scale, fully managed artificial intelligence infrastructure service - Google Patents
- Publication number: WO2022211980A1 (PCT/US2022/019213)
- Authority: WIPO (PCT)
- Prior art keywords: workloads, workload, received, resource, processor
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5072—Grid computing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
- G06F9/5088—Techniques for rebalancing the load in a distributed system involving task migration
Definitions
- PLANET-SCALE FULLY MANAGED ARTIFICIAL INTELLIGENCE INFRASTRUCTURE SERVICE
- general-purpose infrastructure-as-a-service (IaaS) cloud-based environments have significant limitations, as AI workloads are fundamentally different and necessitate purpose-built AI infrastructure.
- managing the minutiae of current infrastructure presents substantial challenges to data scientists trying to accelerate the algorithmic innovations of AI.
- a computerized method for managing AI workloads in a cloud infrastructure platform is described.
- a set of distributed infrastructure resources are integrated into the cloud infrastructure platform via native support interfaces.
- AI workloads are received from a plurality of tenants, wherein the AI workloads include training workloads and inferencing workloads and resource subsets of the set of distributed infrastructure resources are assigned to the received AI workloads.
- the received AI workloads are scheduled for execution on the assigned resource subsets and based on the scheduling of the AI workloads, they are executed on the assigned resource subsets.
- FIG. 1 is a block diagram illustrating a system configured for providing infrastructure service for artificial intelligence (AI) workloads;
- FIG. 2 is a block diagram illustrating a runtime plane of the system of FIG. 1;
- FIG. 3 is a block diagram illustrating an infrastructure plane of the system of FIG. 1;
- FIG. 4 is a flowchart illustrating a method for managing AI workloads in a cloud infrastructure platform;
- FIG. 5 is a block diagram illustrating a hierarchical scheduling subsystem configured for scheduling AI workloads; and
- FIG. 6 is a block diagram of an example computing device for implementing aspects disclosed herein.
- In FIGs. 1 to 6, the systems are illustrated as schematic drawings. The drawings may not be to scale.
- aspects of the disclosure provide a computerized method and system for managing the execution of artificial intelligence (AI) workloads, such as training and inferencing workloads, using a diverse, distributed pool of infrastructure resources.
- Distributed infrastructure resources are integrated into the cloud infrastructure platform via native support interfaces, enabling many entities to make use of their own infrastructure to add to the global pool of resources.
- AI workloads are received from a plurality of tenants and resource subsets of the set of distributed infrastructure resources are assigned to the received AI workloads, including securing the AI workloads from each other using containers, enabling multiple AI workloads to be executed securely on the same server.
- the received AI workloads are scheduled for execution on the assigned resource subsets and, based on the scheduling of the AI workloads, they are then executed on the assigned resource subsets.
- Cloud infrastructure includes hardware accelerators, computer networking and storage — all of which are bundled together in a workload-aware manner.
- AI workloads (e.g., Deep Learning Training (DLT) and inferencing) are commonly run on general-purpose cloud-based IaaS, which requires data scientists to set up their DLT problems, execute them, and solve any resultant problems that arise from today's IaaS.
- DLT workloads are growing exponentially (e.g., 10x per year).
- the industry is responding to this uptick in DLT workloads by including more hardware in the IaaS environments, e.g., buying more graphics processing units (GPUs) or other hardware accelerators, adding more nodes, and building out more distributed clusters.
- as the models continue to grow exponentially, it becomes untenable to grow IaaS systems in such an exponential manner.
- There are limits to the size of cloud infrastructures, from a practical standpoint. Aspects of the disclosure solve these and other technical problems in unconventional ways.
- the disclosed examples provide a “Singularity” service that increases efficiencies from today’s fixed infrastructure resource (including hardware accelerators, networking, storage, etc.) and drives the most technical efficiencies as the models continue to grow or as the number of DLT jobs and/or other AI workloads increase.
- the disclosed service operates in an unconventional manner by allowing for an IaaS or other infrastructure to grow to accommodate large numbers of DLT jobs or function as smaller groups of IaaSs that facilitate different DLT job processing.
- Conventional general-purpose IaaSs are not able to handle these large increases in DLT jobs because today's general-purpose IaaSs are developed to be workload-agnostic.
- in contrast, the disclosed service is designed for purpose-built workloads that may be efficiently processed in an IaaS.
- the AI infrastructure service of the disclosure is operable with all AI workloads, including training (e.g., workloads for training new or updated AI models) and inferencing (e.g., workloads for using trained AI models to evaluate and make inferences from data).
- an example of the disclosed service is a fully managed, globally distributed, multi-tenant AI infrastructure service with native support for third-party (3P) hardware (e.g., from different companies than the company operating a cloud environment), custom silicon, application-specific integrated circuits (ASICs), GPUs, central processing units (CPUs), and first-party (1P) hardware (from the company operating the cloud environment), for DLT training and inferencing workloads.
- an AI planet-scale computer infrastructure is used for training and inferencing at any scale, with the highest technical efficiency and differentiated capabilities, which significantly improve the productivity of data scientists.
- the disclosed service manages 3P (e.g., GPU and field-programmable gate array (FPGA)) and 1P AI hardware capacity and enables high-level services, like AZURE® Machine Learning (ML), to build experiences and tools to serve customers.
- any kind of AI job may be migrated using the disclosed techniques. Such jobs may be long-running (e.g., processing for several hours, days, weeks, or months).
- the disclosed embodiments and examples mention the Azure cloud service provided by MICROSOFT CORPORATION, headquartered in Redmond, Washington, USA, but any large-scale cloud infrastructure may utilize the disclosed service.
- the disclosure provides high-efficiency AI training and inferencing by driving the high utilization of resources.
- Secure, fine-grained multi-tenancy service is provided with high-density containerized hosting.
- such service may be provided using Hyper-V isolated containers on bare-metal machines.
- the disclosed service is able to both securely and densely pack multiple 1P and 3P tenants on the same hosts, enabling highly efficient use of compute and AI hardware capacity across the cloud service.
- High-density workloads that belong to different tenants are enabled. For example, AI workloads can run alongside search workloads.
- the disclosure provides multiplexing or interspersing of inferencing and training workloads on the same shared pool of resources.
- By sharing the same pool of cloud-wide resources for both inferencing and training, more efficient scheduling and packing of workloads is enabled, maximizing use of hardware capacity and handling fluctuations in the mix of workloads and demand for resources of the shared pool.
- in conventional systems, inferencing workloads and training workloads are on different pools of resources, fragmenting the capacity.
- the disclosed service multiplexes the training and inferencing workloads on the same pool of cloud resources (e.g., hardware accelerators, compute resources, networking resources, and storage resources, etc.).
- DLT workloads and inferencing workloads need topological collocation of the nodes and the hardware associated with a job.
- the disclosed service intersperses inferencing workloads on top of or in between training workloads, helping drive efficiencies and finish more jobs through the IaaS.
- the disclosed service provides cloud-wide (e.g., global), topology- and workload-aware scheduling of AI workloads.
- a global scheduler is provided to exploit the heterogeneity of workloads (e.g., differing attributes between training jobs, inferencing jobs, etc.) and to provide dynamic, topology-aware scheduling of resources across the entire AI hardware capacity in the cloud.
- the disclosed scheduler is able to transparently preempt any running job, live migrate any running job, and/or elastically scale up/down and load balance the workers of the service to drive the highest utilization without impacting performance or incurring downtime. Additionally, the disclosed scheduler is configured to be aware of all the jobs across the entire IaaS (e.g., a global view of the workload(s) across the entire IaaS).
- the scheduler used by the disclosed service is configured to identify groups of GPUs/CPUs/hardware accelerators that are not being efficiently utilized and therefore migrate jobs on such groups to other GPUs/CPUs/hardware accelerators by transparently checkpointing and verifying processor device states for migration to occur.
- the scheduler is further configured to monitor and/or track workloads that are currently running and hardware capacity that is currently available anywhere around the world in the cloud of the disclosed service. Additionally, the scheduler is configured to decide if and/or when to preempt a job, migrate a job, scale up or scale down the job, or load-balance between different workers for a job.
- the disclosed service is configured to manage AI workloads in a priority-driven and/or tier-driven manner.
- the scheduler may consider the designated tier of a given job (or an inferencing model) or associated job submitter.
- Each tier may be defined with different technical requirements. For example, if a job is submitted with the highest tier level, indicating a best-capacity tier, the job is run with the least preemption, the equivalent of running on dedicated cloud resources. If a job is submitted at a middle tier, some preemption or migration may be experienced that "slows" the job somewhat but drives efficiencies and improves the overall utilization of the fixed pool of resources.
- DLT training and inferencing jobs may be scheduled based, at least partially, on their associated tier, which may be specific to the job, the customer, and/or the capacity kind.
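The tier-driven behavior described above can be sketched as a simple preemption policy. This is a minimal illustration, not the patent's implementation; the tier names, priority values, and GPU-count accounting are all assumptions made for the example.

```python
from dataclasses import dataclass

# Illustrative tiers; the description names only a "best-capacity" tier
# and a middle tier, so these labels are assumed.
TIER_PRIORITY = {"premium": 3, "standard": 2, "spot": 1}
TIER_PREEMPTIBLE = {"premium": False, "standard": True, "spot": True}

@dataclass
class Job:
    name: str
    tier: str
    gpus: int

def pick_victims(running, incoming):
    """Choose lower-tier jobs to preempt so `incoming` can run.

    Returns victims ordered lowest tier first, or [] if not enough
    preemptible capacity can be freed.
    """
    candidates = sorted(
        (j for j in running
         if TIER_PREEMPTIBLE[j.tier]
         and TIER_PRIORITY[j.tier] < TIER_PRIORITY[incoming.tier]),
        key=lambda j: TIER_PRIORITY[j.tier],
    )
    victims, freed = [], 0
    for job in candidates:
        if freed >= incoming.gpus:
            break
        victims.append(job)
        freed += job.gpus
    return victims if freed >= incoming.gpus else []
```

Under this sketch, a best-capacity job displaces spot-tier work first and standard-tier work only if needed, while a best-capacity job itself is never chosen as a victim.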
- the disclosed system is configured to provide reliable and performant AI infrastructure. Without reliable infrastructure, utilization will always be sub-optimal, because planned and unplanned failures result in lost GPU hours and productivity. For example, if a large job is running for months on hundreds of nodes and GPUs, eventually some of the GPUs will become unhealthy or need to be upgraded during the job's processing. This has an impact on the customer workload. By virtue of how AI workloads operate, any stall in the health of a GPU may stall the entire AI workload job and progress may be stopped. Worse still, if the job or model has not been checkpointed, precious processing may be lost. To overcome this, the disclosed system provides capabilities such as transparent preemption, dynamic load-balancing, defragmentation, and elasticity that all enable a highly reliable infrastructure.
- the disclosure deeply integrates the bare-metal computing, networking, and the driver stacks of 1P and 3P accelerators by providing at least the following technical contributions: (i) a bandwidth-optimal distributed barrier and rendezvous protocol implementation directly inside the backend network communication stack to implement a distributed agreement protocol among an ensemble of accelerator devices and worker processes, and (ii) transparent and consistent checkpointing and restoration of process and device state to enable transparent preemptive scheduling, failover, live migration, and dynamic elasticity, all without impacting the model convergence and without requiring any help from the user or frameworks.
- the disclosed service provides for AI jobs to be checkpointed so that their device state may be captured and then restored on other nodes, without impacting the correctness of the model or the model’s convergence — at the infrastructure layer.
- the disclosed service is configured to provide global distribution of inferencing endpoints for (a) predictable single digit millisecond latencies at 99th percentile (P99), anywhere around the world and (b) high availability in the face of regional disasters.
- the inferencing model may be deployed across different geographic regions and run in the closest region.
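Closest-region endpoint selection with failover, as described above, can be sketched as follows. The latency table and health set are hypothetical inputs; the disclosure does not specify this mechanism.

```python
def pick_endpoint(latency_ms, healthy):
    """Pick the healthy region with the lowest measured client latency.

    latency_ms: dict of region name -> measured latency in milliseconds.
    healthy: set of regions currently serving (survivors of any
    regional disaster).
    """
    candidates = {r: l for r, l in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)
```

The failover behavior falls out of the health filter: if the closest region is down, the request is routed to the next-closest healthy deployment of the inferencing model.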
- the disclosed service is configured to provide vertical integration for both 3P and 1P hardware.
- the example architecture illustrated in FIG. 1 below is designed for the future, with built-in extensibility to remain agile as new scenarios and technologies emerge.
- the disclosed design is flexible with respect to the following: providing first-class support for both 3P and 1P AI accelerators; providing disaggregated and aggregated topologies; providing non-uniform backend network configurations; providing an extensible, layered architecture; enabling extensible scheduling systems for customizability by tenants; enabling extensible heterogeneous accelerators, devices, and/or hardware; and providing a compiler toolchain that is agnostic of AI training and inferencing frameworks.
- the disclosure provides a unified abstraction on top of a wide range of both 3P and 1P AI accelerators, and can map a given training job or an inferencing endpoint across a mix of heterogeneous device types to drive the highest efficiency.
- the disclosed service is configured to support and drive a cloud computing environment’s disaggregation strategy and/or other similar strategies associated with other cloud platforms.
- Aggregated topologies include devices that are physically attached to the servers, such that one does not need to go through a backend network.
- Disaggregated topologies include a rack of compute nodes and a rack of hardware accelerators that may make use of a backend network. The disclosed service abstracts both of these topologies.
- the disclosed service is configured to support a variety of non-uniform backend network architectures envisioned by different first party and third-party hardware manufacturers.
- the disclosed service provides a layered architecture that supports extensibility at every level, including pluggable data planes (e.g., the orchestration layer extensibility supports plugging in alternate data planes or an orchestrator below its scheduler to support Kubernetes running in a customer’s private data center), pluggable scheduling subsystems (e.g., the scheduling layer extensibility supports plugging in alternate schedulers and custom policies below its control plane to support gradual migration to the disclosed service), and pluggable heterogeneous device types and accelerators (e.g., the disclosure is designed to enable a consistent model for provisioning and scaling accelerator devices with a pluggable device provider interface, including quantum-computing devices).
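The pluggable device-provider idea can be sketched as an abstract interface that heterogeneous accelerator backends implement. The interface name, methods, and mock backend below are assumptions for illustration, not the patent's actual API.

```python
from abc import ABC, abstractmethod

class AcceleratorProvider(ABC):
    """Hypothetical provider interface so heterogeneous 1P/3P devices
    (GPUs, FPGAs, even quantum devices) can be provisioned and scaled
    through one consistent model."""

    @abstractmethod
    def provision(self, count: int) -> list:
        """Allocate `count` devices and return their identifiers."""

    @abstractmethod
    def release(self, device_ids: list) -> None:
        """Return devices to the free pool."""

class MockGpuProvider(AcceleratorProvider):
    """In-memory stand-in for a real GPU backend."""

    def __init__(self):
        self._next = 0
        self._live = set()

    def provision(self, count):
        ids = [f"gpu-{self._next + i}" for i in range(count)]
        self._next += count
        self._live.update(ids)
        return ids

    def release(self, device_ids):
        self._live.difference_update(device_ids)
```

A scheduler written against `AcceleratorProvider` never needs to know which vendor's device library sits behind a given provider, which is the point of the pluggable layer.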
- the disclosed service is configured to provide a compiler toolchain that is agnostic of AI training and inferencing frameworks.
- the service does not rely on any help from the user or frameworks for providing its core capabilities. It is designed to be agnostic of AI training and inferencing frameworks and tools. It does not require the user to opt into any specific framework, compiler toolchain or library.
- the service integrates at the level of device drivers and the device-to-device communication channels for supporting various hardware specific capabilities.
- the disclosed service provides a highly scalable AI infrastructure.
- the service is designed to scale across hundreds of datacenters and tens of thousands of accelerators, training models with trillions of parameters.
- the service may be configured to cross geographical boundaries as well.
- the architecture is also capable of treating training jobs and inferencing services as equal when they originate from data centers as well as from on-premises sources.
- FIG. 1 is a block diagram illustrating a system 100 configured for providing infrastructure service for AI workloads according to an embodiment.
- the system 100 includes a control plane 102, a runtime plane 104, and an infrastructure plane 106.
- the system 100 is a distributed computing infrastructure system that includes hardware devices distributed across many different locations (e.g., a global or planet-scale distributed system).
- the system 100 is configured specifically to enable the execution of AI workloads, such that the hardware, firmware, and/or software of the system 100 is configured to enable efficient execution of tasks associated with AI workloads.
- the system 100 may include hardware, firmware, and/or software configured specifically to enable the execution of other types of workloads without departing from the description.
- the control plane 102 includes a manageability subsystem 108, pluggable data planes 110, and a global scheduling subsystem 112.
- the control plane 102 is configured to receive or accept AI workloads and associated data through a variety of extensible or pluggable data planes 110 that may be defined by the tenants of the system (e.g., plugging in an alternate data plane below the scheduler to support Kubernetes or another similar system running in a tenant’s private data center).
- Those AI workloads are scheduled for execution on the infrastructure of the system 100 (e.g., the infrastructure plane 106), as described herein.
- the manageability subsystem 108 includes hardware, firmware, and/or software configured to provide interactive processing of AI workload requests to tenants. Further, the manageability subsystem 108 is configured to provide all infrastructure resources of the system 100 in all regions of the system's operation. In some examples, the manageability subsystem 108 includes manageability replicas in various regions of the system 100 such that the infrastructure resources of the system 100 are multi-mastered by various replicas as an interface between tenants and the system 100. The manageability subsystem 108 may be decoupled from the global scheduling subsystem 112.
- the global scheduling subsystem 112 includes hardware, firmware, and/or software configured to schedule AI workloads/jobs for execution on the infrastructure resources of the system 100 as described herein.
- the global scheduling subsystem 112 includes hierarchical schedulers: global scheduler(s), regional schedulers, and coordinator services.
- the global scheduler is responsible for preparing schedules corresponding to the AI workloads (e.g., jobs, models, and/or pods) and handing them over to the regional schedulers.
- the regional scheduler is responsible for managing and reporting regional capacity with the global scheduler and then also executing the schedule received from the global scheduler.
- the coordinator service is responsible for translating the schedules into physical resource allocations across clusters of infrastructure resources within a region.
- the coordinator service may also constitute or otherwise be closely associated with the reliability subsystem 122 as described herein.
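The three-level hierarchy above (global scheduler, regional scheduler, coordinator service) might be sketched as follows. The class names, the capacity-based placement heuristic, and the slot counts are illustrative assumptions, not the disclosed algorithm.

```python
class Coordinator:
    """Translates a schedule into physical allocations across clusters."""

    def __init__(self, cluster_slots):
        self.cluster_slots = dict(cluster_slots)  # cluster name -> free slots

    def allocate(self, job, slots):
        for cluster, free in self.cluster_slots.items():
            if free >= slots:
                self.cluster_slots[cluster] = free - slots
                return (job, cluster)
        raise RuntimeError(f"no cluster fits {job}")

class RegionalScheduler:
    """Reports regional capacity upward and executes received schedules."""

    def __init__(self, region, coordinator):
        self.region, self.coordinator = region, coordinator

    def capacity(self):
        return sum(self.coordinator.cluster_slots.values())

    def execute(self, schedule):
        return [self.coordinator.allocate(job, slots) for job, slots in schedule]

class GlobalScheduler:
    """Prepares a schedule and hands it to a regional scheduler."""

    def __init__(self, regions):
        self.regions = regions  # region name -> RegionalScheduler

    def schedule(self, job, slots):
        # Illustrative heuristic: pick the region reporting the most
        # free capacity; the real placement policy is not specified here.
        region = max(self.regions.values(), key=lambda r: r.capacity())
        return region.execute([(job, slots)])
```

Note the direction of information flow matches the description: capacity reports flow up from coordinators through regional schedulers, while schedules flow down from the global scheduler.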
- the global scheduling subsystem 112 is described in greater detail below.
- the runtime plane 104 includes subsystems configured to enable the AI workloads to be distributed to and executed on the infrastructure plane 106 as described herein. Such subsystems may include a monitoring subsystem 114, a compilation subsystem 116, a communication subsystem 118, and/or a load balancing subsystem 120. Further, the runtime plane 104 includes a reliability subsystem 122 configured for securing the reliability of execution of AI workloads while enabling such workloads to be checkpointed and/or migrated throughout the infrastructure resources of the system 100. The runtime plane 104 further includes AI accelerator provider models 124 that are configured to enable the use of a variety of libraries and/or configurations for managing AI accelerators when executing AI workloads. The runtime plane 104 is described in greater detail below.
- the infrastructure plane 106 includes hardware, firmware, and/or software for executing the AI workloads based on the schedules provided by the control plane 102 and instructions received from the runtime plane 104.
- the infrastructure plane 106 includes hosting and activation subsystems 126, infrastructure resources 128, and devices/AI accelerators 130.
- the infrastructure plane 106 is described in greater detail below.
- FIG. 2 is a block diagram 200 illustrating a runtime plane 204 of the system 100 of FIG. 1 according to an embodiment.
- the runtime plane 204 is substantially the same as the runtime plane 104 described above with respect to FIG. 1.
- the runtime plane 204 includes a monitoring subsystem 214, a compilation subsystem 216, a communication subsystem 218, a load balancing subsystem 220, a reliability subsystem 222, and AI accelerator provider models 224.
- the reliability subsystem 222 includes routines for interacting with AI workloads to ensure their reliability.
- the routines include failover 232, suspend 234, resume 236, migrate 238, scale 240, checkpoint 242, and restore 244.
- the checkpoint 242 and restore 244 routines may be configured as the core routines and the other routines (failover 232, suspend 234, resume 236, migrate 238, and scale 240) may be configured to use checkpoint 242 and/or restore 244 routines to achieve the desired results.
- the checkpoint 242 routine is configured to save the state of an AI workload as it is executed, such that the saved state can be used to continue execution of the AI workload from the saved point in time.
- Checkpoint 242 may be used to perform the suspend 234 routine to halt the execution of an AI workload for a period of time and/or to perform the migrate 238 routine to save the state of the AI workload such that it can be moved to another set of infrastructure resources for continued execution.
- the restore 244 routine is configured to take a saved state of an AI workload as input and restore the execution of the AI workload on infrastructure resources starting at the point of the saved state.
- the restore 244 routine may be used to perform the resume 236 routine and/or to restore the execution of an AI workload that has been migrated to another set of infrastructure resources based on a migrate 238 routine.
- the failover 232 routine is configured to checkpoint the state of an AI workload based on detection of a failure of the current infrastructure resources and to restore the AI workload on a new set of infrastructure resources, such that the AI workload recovers from the detected failure.
- the scale 240 routine is configured to scale up and/or scale down the quantity, quality, and/or type of infrastructure resources being used to execute an AI workload. For instance, if additional infrastructure resources are available, an AI workload may be scaled up to make use of those additional infrastructure resources. Alternatively, if a new AI workload requires some infrastructure resources in use executing a current AI workload, the current AI workload may be scaled down to free up some resources for the new AI workload (e.g., the new AI workload may be associated with a higher priority or tier than the current AI workload).
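As described above, migrate (and suspend/resume) can be composed from the checkpoint and restore primitives. The sketch below stands in for real process and device state with a simple step counter; it is an illustration of the composition, not the disclosed implementation.

```python
class Workload:
    """Toy AI workload whose only state is a training-step counter."""

    def __init__(self, name):
        self.name, self.step, self.host = name, 0, None

    def run_steps(self, n):
        self.step += n

def checkpoint(workload):
    """Capture enough state to resume the workload elsewhere."""
    return {"name": workload.name, "step": workload.step}

def restore(state, host):
    """Rebuild a workload from a saved state on a new host."""
    w = Workload(state["name"])
    w.step, w.host = state["step"], host
    return w

def migrate(workload, new_host):
    """Move a workload without losing progress: checkpoint, then restore."""
    return restore(checkpoint(workload), new_host)
```

Failover follows the same pattern, with the checkpoint taken (or the latest one reused) when a failure is detected rather than on demand.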
- the reliability subsystem 222 further includes a rendezvous protocol 246 configured to synchronize or otherwise enforce synchronization on AI workloads upon which the above-described routines are to be applied. For instance, if an AI workload is going to be migrated, the rendezvous protocol 246 is configured to synchronize the operations of the system such that the resources involved in the migration are not altered during the migration process. Such a rendezvous protocol 246 may include use of locking or forming a barrier such that processes that are otherwise not associated with the migration do not affect the migration inadvertently.
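A rendezvous barrier of the kind described can be illustrated with threads: every worker involved must arrive before any proceeds, so no worker mutates resources mid-migration. `threading.Barrier` here is a local stand-in for the backend-network protocol implementation the disclosure describes.

```python
import threading

def run_with_barrier(n_workers, results):
    """Each worker records arrival, rendezvouses, then records resumption.

    Because the barrier releases only once all workers have arrived,
    every "arrived" entry precedes every "resumed" entry.
    """
    barrier = threading.Barrier(n_workers)

    def worker(i):
        results.append(("arrived", i))
        barrier.wait()  # rendezvous point: block until all workers arrive
        results.append(("resumed", i))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In the migration case, the "resumed" phase is where the checkpoint or restore work would safely run, with stragglers held at the barrier until the ensemble agrees.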
- the AI accelerator provider models 224 are configured to enable the use of various software stacks, including 3P libraries 248 (e.g., libraries provided by tenants of the system 100) and/or 1P libraries 250 (e.g., libraries provided by the entity that manages the system 100).
- 3P libraries 248 may include a 3P-specific management library (ML) 252, 3P-specific multi-GPU communications library (MGCL) 254, and 3P-specific GPU library (GPUL) 256.
- 1P libraries 250 may include a management library 264, a communication library 266, and/or a compiler toolchain 268.
- the runtime plane 204 enables tenants to make use of a wide variety of software stacks and associated libraries, including their own software stacks, to execute AI workloads within the described system 100 based on its extensible, flexible configuration.
- FIG. 3 is a block diagram 300 illustrating an infrastructure plane 306 of the system 100 of FIG. 1 according to an embodiment.
- the infrastructure plane 306 is substantially the same as the infrastructure plane 106 of FIG. 1, as described above.
- the infrastructure plane 306 includes a hosting and activation subsystem 326, infrastructure resources 328, and devices and AI accelerators 330.
- the hosting and activation subsystem 326 includes host agents 370 and containers 372.
- the host agents 370 enable and organize the hosting of AI workloads on the infrastructure resources 328.
- the containers 372 (e.g., copy-on-write containers) keep different AI workloads (e.g., workloads from different tenants) separate and secure from each other, even when they are being executed on the same host.
- a host controlled by a host agent 370 may be a device that includes a set of infrastructure resources 328 that are configured to execute an AI workload or at least a portion thereof.
- some resources of a host may be used to execute an AI workload from one tenant, while other resources of the host may be used to execute an AI workload of another tenant at the same time.
- the containers 372 are configured such that the two separated AI workloads are prevented from interacting in any manner while they are being executed.
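A minimal sketch of this separation property (in Python; the `Container` and `Host` classes and their members are invented for illustration and do not reflect the actual container technology): two tenants' workloads share a host object, but each sees only its own container state.

```python
class Container:
    """Sketch of a secure container: each workload gets private state and
    holds no reference to any other container on the same host."""
    def __init__(self, tenant, workload):
        self.tenant = tenant
        self.workload = workload
        self._state = {}          # private to this container

    def write(self, key, value):
        self._state[key] = value

    def read(self, key):
        return self._state.get(key)

class Host:
    """One host whose resources are shared between tenants, with
    containers keeping the co-located workloads apart."""
    def __init__(self):
        self.containers = []

    def launch(self, tenant, workload):
        container = Container(tenant, workload)
        self.containers.append(container)
        return container

host = Host()
a = host.launch("tenant-a", "training-job")
b = host.launch("tenant-b", "inference-job")
a.write("secret", 42)
print(b.read("secret"))   # None: tenant-b cannot see tenant-a's state
```

Real isolation would of course rely on kernel or hypervisor mechanisms rather than object encapsulation; the sketch only captures the invariant that co-hosted workloads cannot interact.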
- the infrastructure resources 328 include a service fabric 374 interface, storage resources 376, networking resources 378, compute resources 380 which may include bare metal blades 382 (e.g., physical processing devices) and virtual machines 384, and other resources 386 (e.g., integration infrastructure resources).
- the infrastructure resources 328 are primarily provided for use by the entity that is offering services of the system 100 (e.g., IP resources), but in other examples, the infrastructure resources 328 may also include resources provided by other entities (e.g., 3P resources) such as resources owned and used by tenants of the system 100.
- Such integration may be enabled via the 3P libraries 248 and other configurations described above.
- the devices and AI accelerators 330 include GPUs 388, FPGA devices 390, other 3P devices 392, and other IP devices 394.
- the described processes may further be enabled by backend networks 396 and/or associated devices.
- the execution of AI workloads may uniquely benefit from the use of GPUs 388, FPGAs 390, and/or other specialized hardware.
- infrastructure resources 328, such as compute resources 380, may be linked to GPUs 388, for instance, such that a compute resource 380 provides instructions to the GPU 388 for how to execute steps of the AI workload.
- Such execution then takes advantage of specialized architecture of the GPU 388, such as the GPU 388 having many cores enabling parallel processing of data to a significant degree beyond the capabilities of the compute resources 380.
- the backend networks 396 are configured to support a variety of non-uniform backend network architectures that may be envisioned by a variety of entities that use the system, such as IP and 3P hardware manufacturers. Such backend networks 396 may be used to provide links between disaggregated topologies of compute nodes (e.g., compute resources 380) and hardware accelerators (e.g., GPUs 388).
- FIG. 4 is a flowchart illustrating a method 400 for managing AI workloads in a cloud infrastructure platform according to an embodiment.
- the cloud infrastructure platform of method 400 is a system such as system 100 of FIG.1.
- a set of distributed infrastructure resources (e.g., hosting and activation subsystems 126, infrastructure resources 128, and/or devices/AI accelerators 130 of the infrastructure plane 106) is integrated into the cloud infrastructure platform via native support interfaces of those resources.
- the native support interfaces may include interfaces and/or libraries of the providers of the resources, such as the 3P libraries 248 and IP libraries 250 of FIG. 2.
- a tenant of the cloud infrastructure platform may provide a subset of infrastructure resources for integration into the platform based on provided libraries, such that the tenant and/or other tenants of the platform may use those resources in execution of AI workloads.
- AI workloads are received from a plurality of tenants, wherein the received AI workloads include training workloads and inferencing workloads.
- the tenants provide AI workloads for execution on the platform via interfaces such as pluggable data planes 110 as described herein.
- resource subsets of the distributed infrastructure resources are assigned to the received AI workloads.
- the assignment of resource subsets to the AI workloads is performed by a global scheduling system 112 as described herein. Assigning the resources may include determining resource requirements of an AI workload and then identifying a subset of infrastructure resources that satisfy those requirements (e.g., an AI workload that requires the use of four GPUs in parallel may be assigned to a node of the system that has at least four GPUs).
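The requirement-matching step can be sketched as follows (a Python illustration; the node and workload dictionary shapes are assumptions made for this sketch, not the actual scheduler's data model):

```python
def assign_resources(workload, nodes):
    """Return the first node whose free resources satisfy the workload's
    requirements (e.g., a 4-GPU job needs a node with >= 4 free GPUs)."""
    for node in nodes:
        if all(node["free"].get(resource, 0) >= amount
               for resource, amount in workload["requires"].items()):
            return node["name"]
    return None  # no resource subset satisfies the requirements

nodes = [
    {"name": "node-1", "free": {"gpu": 2, "cpu": 16}},
    {"name": "node-2", "free": {"gpu": 8, "cpu": 64}},
]
job = {"name": "train-model", "requires": {"gpu": 4}}
print(assign_resources(job, nodes))  # node-2
```

A production scheduler would weigh many more constraints (locality, tiers, fragmentation); this only shows the requirements-to-capacity match described above.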
- the assignment of a subset of resources to an AI workload may include rearranging of other AI workloads with respect to the subset of resources. For instance, assigning a resource subset to an AI workload may include saving a state checkpoint of an AI workload that is currently being executed on a first resource subset, migrating that AI workload to a second resource subset, restoring the saved state checkpoint of the migrated AI workload on the second resource subset, and then assigning at least a portion of the first resource subset to another AI workload. In some examples, such processes may be performed using routines of a reliability subsystem 222 as described herein.
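The four-step rearrangement described above might look like the following in outline (a Python sketch; the function name, the placement map, and the checkpoint contents are illustrative assumptions):

```python
def rearrange(running, first_subset, second_subset, incoming_workload):
    """Sketch of the checkpoint -> migrate -> restore -> reassign flow that
    frees `first_subset` for `incoming_workload`."""
    # 1. Save a state checkpoint of the workload on the first subset.
    checkpoint = {"workload": running[first_subset], "state": "step-900"}
    # 2.-3. Migrate the workload and restore the checkpoint on the
    # second resource subset.
    running[second_subset] = checkpoint["workload"]
    del running[first_subset]
    # 4. Assign (a portion of) the freed first subset to the other workload.
    running[first_subset] = incoming_workload
    return running

placement = rearrange({"gpus-0-3": "low-tier-job"},
                      "gpus-0-3", "gpus-4-7", "high-tier-job")
print(placement)  # {'gpus-4-7': 'low-tier-job', 'gpus-0-3': 'high-tier-job'}
```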
- the received AI workloads are scheduled for execution on the assigned resource subsets.
- a global scheduling subsystem 112 generates a schedule for the AI workloads as described herein.
- scheduling the execution of the AI workloads may include scheduling training workloads and inferencing workloads on the same infrastructure resources, such that the two types of workloads are multiplexed on those resources (e.g., execution of a training workload is interspersed with execution of an inferencing workload on an infrastructure resource, such as a GPU).
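Such interleaving can be sketched as a simple time-slice loop (a Python illustration; the step/request structure is an assumption, and real multiplexing would operate at the accelerator level):

```python
def multiplex(training_steps, inference_requests):
    """Interleave a training workload with inferencing on one shared
    resource: inference is latency-sensitive, so queued requests are
    drained between training steps rather than after the whole job."""
    timeline = []
    for step in training_steps:
        timeline.append(("train", step))
        while inference_requests:            # drain pending inference
            timeline.append(("infer", inference_requests.pop(0)))
    return timeline

timeline = multiplex(["s1", "s2"], ["q1", "q2"])
print(timeline)
# [('train', 's1'), ('infer', 'q1'), ('infer', 'q2'), ('train', 's2')]
```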
- AI workloads are associated with priorities or tiers that affect how resources are assigned and how AI workloads are scheduled to be executed on those resources. For instance, lower-tier AI workloads may be more likely to be migrated to other resources to make space for higher-tier AI workloads, or higher-tier AI workloads may be scheduled for a greater share of resource usage time than lower-tier AI workloads, as described herein.
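The tier behavior might be sketched as follows (Python; the tier names and weights are invented for illustration and are not specified by the described system):

```python
def schedule_shares(workloads):
    """Give higher-tier workloads a larger share of resource time; the
    lowest-tier workload is also the first candidate for migration."""
    weights = {"premium": 4, "standard": 2, "best-effort": 1}
    total = sum(weights[w["tier"]] for w in workloads)
    shares = {w["name"]: weights[w["tier"]] / total for w in workloads}
    # Lowest weight -> most likely to be migrated to make space.
    evict_first = min(workloads, key=lambda w: weights[w["tier"]])["name"]
    return shares, evict_first

shares, evict_first = schedule_shares([
    {"name": "job-a", "tier": "premium"},
    {"name": "job-b", "tier": "best-effort"},
])
print(shares, evict_first)  # {'job-a': 0.8, 'job-b': 0.2} job-b
```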
- the AI workloads are executed based on the scheduling of the AI workloads on the assigned resource subsets.
- the AI workloads are hosted in a hosting and activation subsystem 126 and then infrastructure resources 128 and/or devices/AI accelerators 130 are used to execute the AI workloads.
- assigning and executing AI workloads on resource subsets includes isolating the AI workloads from each other in secure containers, whereby AI workloads associated with different tenants are securely executed alongside each other (e.g., on resources associated with the same server).
- executing AI workloads are monitored based on the performance of the cloud infrastructure platform and, based on that monitoring, the scheduling of the AI workloads is adjusted.
- the adjusting of the scheduling may include preempting an AI workload, migrating an AI workload, scaling up an AI workload, scaling down an AI workload, and/or load-balancing between two or more AI workloads.
- schedule adjustment may be performed by a global scheduling subsystem 112 or other component of the system 100.
- FIG. 5 is a block diagram illustrating a hierarchical scheduling subsystem 500 configured for scheduling AI workloads 512 according to an embodiment.
- the scheduling subsystem 500 is included in a system such as system 100 of FIG. 1.
- the scheduling subsystem 500 may be substantially the same as the global scheduling subsystem 112 of FIG. 1.
- the scheduling subsystem 500 includes a global scheduler 502 and multiple regional schedulers 504, coordinator services 506, and associated infrastructure resources 508.
- the global scheduler 502 is configured to use the global capacity data 510 (e.g., data indicating the current state of resource usage throughout the associated global infrastructure system, including resource usage in each region of the system) and AI workloads 512 to generate a global schedule 514 that schedules the AI workloads 512 to be executed on the infrastructure resources 508.
- the global schedule 514 includes regional schedules 520 for each region of the system, which are then provided to the regional schedulers 504 associated with those regions (e.g., the regional schedule 520 of a region is provided to the regional scheduler 504 associated with that particular region).
- the regional schedulers 504 monitor the current regional capacity data 516 of the infrastructure resources 508 associated with the respective regions and that regional capacity data 516 is provided to the global scheduler 502 periodically or based on a pattern or a triggering event. Further, the regional schedulers 504 receive the regional AI workloads 518 associated with their regions from the global scheduler 502 from the set of AI workloads 512. The regional schedulers 504 are also configured to instruct the coordinator services 506 to execute the associated regional schedules 520 using the data of the regional AI workloads 518 (each region includes a regional scheduler 504 and a coordinator service 506).
- the coordinator services 506 are configured to receive a regional schedule 522 and associated regional AI workloads 524 from an associated regional scheduler 504 and to use the reliability routines 526 (e.g., the routines of the reliability subsystem 222 of FIG. 2 as described above) to cause the regional AI workloads 524 to be executed using infrastructure resources 508 of the region based on the regional schedule 522.
- a coordinator service 506 may be configured to allocate a subset of the infrastructure resources 508 of the region to a regional AI workload 524 and cause that workload 524 to be executed on those allocated resources 508.
- a coordinator service 506 may be configured to checkpoint, restore, migrate, and/or perform other reliability routines 526 to arrange the use of the infrastructure resources 508 according to the regional schedule 522.
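The global-to-regional flow of FIG. 5 can be condensed into a sketch (Python; the data shapes, the capacity-greedy placement rule, and the function names are assumptions for illustration only):

```python
def global_schedule(workloads, regional_capacity):
    """Sketch of the global scheduler: split the workload set into
    per-region schedules based on reported regional capacity data."""
    regions = sorted(regional_capacity, key=regional_capacity.get,
                     reverse=True)
    schedules = {region: [] for region in regions}
    for workload in workloads:
        # Place each workload in the region with the most free capacity.
        region = max(regions, key=lambda r: regional_capacity[r])
        schedules[region].append(workload)
        regional_capacity[region] -= 1
    return schedules

def coordinator_execute(regional_workloads):
    """Sketch of a coordinator service: run each regional workload."""
    return [f"running:{w}" for w in regional_workloads]

sched = global_schedule(["w1", "w2", "w3"], {"us": 2, "eu": 1})
print(sched)                              # {'us': ['w1', 'w2'], 'eu': ['w3']}
print(coordinator_execute(sched["us"]))   # ['running:w1', 'running:w2']
```

The sketch omits the feedback loop in which regional capacity data 516 flows back to the global scheduler periodically or on triggering events.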
- FIG. 6 is a block diagram of an example computing device 600 for implementing aspects disclosed herein, and is designated generally as computing device 600.
- Computing device 600 is but one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the examples disclosed herein. Neither should computing device 600 be interpreted as having any dependency or requirement relating to any one or combination of components/modules illustrated.
- the examples disclosed herein may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program components, being executed by a computer or other machine, such as a personal data assistant or other handheld device.
- program components including routines, programs, objects, components, data structures, and the like, refer to code that performs particular tasks, or implement particular abstract data types.
- the disclosed examples may be practiced in a variety of system configurations, including personal computers, laptops, smart phones, mobile tablets, hand-held devices, consumer electronics, specialty computing devices, etc.
- the disclosed examples may also be practiced in distributed computing environments when tasks are performed by remote-processing devices that are linked through a communications network.
- Computing device 600 includes a bus 610 that directly or indirectly couples the following devices: computer-storage memory 612, one or more processors 614, one or more presentation components 616, input/output (I/O) ports 618, I/O components 620, a power supply 622, and a network component 624. While computing device 600 is depicted as a seemingly single device, multiple computing devices 600 may work together and share the depicted device resources. For example, memory 612 may be distributed across multiple devices, and processor(s) 614 may be housed in different devices.
- Bus 610 represents what may be one or more busses (such as an address bus, data bus, or a combination thereof). Although the various blocks of FIG. 6 are shown with lines for the sake of clarity, delineating various components may be accomplished with alternative representations.
- a presentation component such as a display device is an I/O component in some examples, and some examples of processors have their own memory. No distinction is made between such categories as “workstation,” “server,” “laptop,” “hand-held device,” etc., as all are contemplated within the scope of FIG. 6.
- Memory 612 may take the form of the computer-storage media references below and operatively provide storage of computer- readable instructions, data structures, program modules and other data for the computing device 600.
- memory 612 stores one or more of an operating system, a universal application platform, or other program modules and program data. Memory 612 is thus able to store and access data 612a and instructions 612b that are executable by processor 614 and configured to carry out the various operations disclosed herein.
- memory 612 includes computer-storage media in the form of volatile and/or nonvolatile memory, removable or non-removable memory, data disks in virtual environments, or a combination thereof.
- Memory 612 may include any quantity of memory associated with or accessible by the computing device 600.
- Memory 612 may be internal to the computing device 600 (as shown in FIG. 6), external to the computing device 600 (not shown), or both (not shown).
- Examples of memory 612 include, without limitation, random access memory (RAM); read only memory (ROM); electronically erasable programmable read only memory (EEPROM); flash memory or other memory technologies; CD-ROM, digital versatile disks (DVDs) or other optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices; memory wired into an analog computing device; or any other medium for encoding desired information and for access by the computing device 600. Additionally, or alternatively, the memory 612 may be distributed across multiple computing devices 600, for example, in a virtualized environment in which instruction processing is carried out on multiple devices 600.
- “computer storage media,” “computer- storage memory,” “memory,” and “memory devices” are synonymous terms for the computer-storage memory 612, and none of these terms include carrier waves or propagating signaling.
- Processor(s) 614 may include any quantity of processing units that read data from various entities, such as memory 612 or I/O components 620. Specifically, processor(s) 614 are programmed to execute computer-executable instructions for implementing aspects of the disclosure. The instructions may be performed by the processor, by multiple processors within the computing device 600, or by a processor external to the client computing device 600. In some examples, the processor(s) 614 are programmed to execute instructions such as those illustrated in the flow charts discussed below and depicted in the accompanying drawings. Moreover, in some examples, the processor(s) 614 represent an implementation of analog techniques to perform the operations described herein. For example, the operations are performed by an analog client computing device 600 and/or a digital client computing device 600.
- Presentation component(s) 616 present data indications to a user or other device.
- Exemplary presentation components include a display device, speaker, printing component, vibrating component, etc.
- I/O ports 618 allow computing device 600 to be logically coupled to other devices including I/O components 620, some of which may be built in.
- Example I/O components 620 include, for example but without limitation, a microphone, joystick, game pad, satellite dish, scanner, printer, wireless device, etc.
- the computing device 600 may operate in a networked environment via the network component 624 using logical connections to one or more remote computers.
- the network component 624 includes a network interface card and/or computer-executable instructions (e.g., a driver) for operating the network interface card. Communication between the computing device 600 and other devices may occur using any protocol or mechanism over any wired or wireless connection.
- network component 624 is operable to communicate data over public, private, or hybrid (public and private) networks using a transfer protocol, between devices wirelessly using short-range communication technologies (e.g., near-field communication (NFC), BLUETOOTH branded communications, or the like), or a combination thereof.
- Network component 624 communicates over wireless communication link 626 and/or a wired communication link 626a to a cloud resource 628 across network 630.
- Various different examples of communication links 626 and 626a include a wireless connection, a wired connection, and/or a dedicated link, and in some examples, at least a portion is routed through the internet.
- examples of the disclosure are capable of implementation with numerous other general- purpose or special-purpose computing system environments, configurations, or devices.
- Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with aspects of the disclosure include, but are not limited to, smart phones, mobile tablets, mobile computing devices, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, gaming consoles, microprocessor- based systems, set top boxes, programmable consumer electronics, mobile telephones, mobile computing and/or communication devices in wearable or accessory form factors (e.g., watches, glasses, headsets, or earphones), network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, virtual reality (VR) devices, augmented reality (AR) devices, mixed reality (MR) devices, holographic device, and the like.
- Such systems or devices may accept input from the user in any way, including from input devices such as a keyboard or pointing device, via gesture input, proximity input (such as by hovering), and/or via voice input.
- Examples of the disclosure may be described in the general context of computer- executable instructions, such as program modules, executed by one or more computers or other devices in software, firmware, hardware, or a combination thereof.
- the computer- executable instructions may be organized into one or more computer-executable components or modules.
- program modules include, but are not limited to, routines, programs, objects, components, and data structures that perform particular tasks or implement particular abstract data types.
- aspects of the disclosure may be implemented with any number and organization of such components or modules. For example, aspects of the disclosure are not limited to the specific computer-executable instructions or the specific components or modules illustrated in the figures and described herein. Other examples of the disclosure may include different computer-executable instructions or components having more or less functionality than illustrated and described herein.
- aspects of the disclosure transform the general-purpose computer into a special-purpose computing device when configured to execute the instructions described herein.
- An example system for managing AI workloads in a cloud infrastructure platform comprises: at least one processor of the cloud infrastructure platform; and at least one memory of the cloud infrastructure platform comprising computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the at least one processor to: integrate a set of distributed infrastructure resources via native support interfaces; receive AI workloads from a plurality of tenants, wherein the AI workloads include training workloads and inferencing workloads; assign resource subsets of the set of distributed infrastructure resources to the received AI workloads; schedule the received AI workloads for execution on the assigned resource subsets; and execute the AI workloads based on the scheduling of the AI workloads on the assigned resource subsets.
- An example computerized method for managing AI workloads in a cloud infrastructure platform comprises: integrating, by at least one processor of the cloud infrastructure platform, a set of distributed infrastructure resources via native support interfaces; receiving, by the at least one processor, AI workloads from a plurality of tenants, wherein the AI workloads include training workloads and inferencing workloads; assigning, by the at least one processor, resource subsets of the set of distributed infrastructure resources to the received AI workloads; scheduling, by the at least one processor, the received AI workloads for execution on the assigned resource subsets; and executing, by the at least one processor, the AI workloads based on the scheduling of the AI workloads on the assigned resource subsets.
- One or more computer storage media have computer-executable instructions for managing AI workloads in a cloud infrastructure platform that, upon execution by a processor, cause the processor to at least: integrate a set of distributed infrastructure resources via native support interfaces; receive AI workloads from a plurality of tenants, wherein the AI workloads include training workloads and inferencing workloads; assign resource subsets of the set of distributed infrastructure resources to the received AI workloads; schedule the received AI workloads for execution on the assigned resource subsets; and execute the AI workloads based on the scheduling of the AI workloads on the assigned resource subsets.
- examples include any combination of the following:
- assigning the resource subsets to the received AI workloads includes isolating the AI workloads from each other in secure containers, whereby AI workloads associated with different tenants are securely executed alongside each other.
- assigning resource subsets of the set of distributed infrastructure resources to the received AI workloads further includes: saving a state checkpoint of a first AI workload that is being executed on a first resource subset; migrating the first AI workload to a second resource subset; restoring the saved state checkpoint of the first AI workload on the second resource subset; and assigning at least a portion of the first resource subset to a second AI workload.
- scheduling the received AI workloads for execution on the assigned resource subsets includes multiplexing execution of at least two AI workloads on at least one resource of an assigned resource subset.
- the at least two AI workloads include a training workload and an inferencing workload; and wherein the multiplexing of execution of the training workload and the inferencing workload on the at least one resource is based on differing resource use between the training workload and the inferencing workload.
- further comprising: monitoring, by the at least one processor, the executing of the AI workloads based on performance of the cloud infrastructure platform; and based on the monitoring, adjusting, by the at least one processor, the scheduling of the AI workloads, whereby performance of the cloud infrastructure platform is improved, and wherein the adjusting includes at least one of the following: preempting an AI workload, migrating an AI workload, scaling up an AI workload, scaling down an AI workload, and load-balancing between at least two AI workloads.
- each AI workload of the received AI workloads is associated with a priority tier; and wherein assigning resource subsets to the received AI workloads and scheduling the received AI workloads for execution on the assigned resource subsets are based on the associated priority tiers of the AI workloads.
- the embodiments illustrated and described herein as well as embodiments not specifically described herein but within the scope of aspects of the claims constitute an exemplary means for integrating, by at least one processor of the cloud infrastructure platform, a set of distributed infrastructure resources via native support interfaces; exemplary means for receiving, by the at least one processor, AI workloads from a plurality of tenants, wherein the AI workloads include training workloads and inferencing workloads; exemplary means for assigning, by the at least one processor, resource subsets of the set of distributed infrastructure resources to the received AI workloads; exemplary means for scheduling, by the at least one processor, the received AI workloads for execution on the assigned resource subsets; and exemplary means for executing, by the at least one processor, the AI workloads based on the scheduling of the AI workloads on the assigned resource subsets.
- Computer readable media comprise computer storage media and communication media.
- Computer storage media include volatile and nonvolatile, removable and non-removable memory implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or the like.
- Computer storage media are tangible and mutually exclusive to communication media.
- Computer storage media are implemented in hardware and exclude carrier waves and propagated signals. Computer storage media for purposes of this disclosure are not signals per se.
- Exemplary computer storage media include hard disks, flash drives, solid-state memory, phase change random-access memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that may be used to store information for access by a computing device.
- communication media typically embody computer readable instructions, data structures, program modules, or the like in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP22712724.8A EP4315057A1 (en) | 2021-03-30 | 2022-03-08 | Planet-scale, fully managed artificial intelligence infrastructure service |
CN202280022711.1A CN117015763A (en) | 2021-03-30 | 2022-03-08 | Planetary-scale fully managed artificial intelligence infrastructure service |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IN202141014649 | 2021-03-30 | ||
IN202141014649 | 2021-03-30 | ||
US17/361,208 US20220318674A1 (en) | 2021-03-30 | 2021-06-28 | Planet-scale, fully managed artificial intelligence infrastructure service |
US17/361,208 | 2021-06-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022211980A1 true WO2022211980A1 (en) | 2022-10-06 |
Family
ID=80937234
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2022/019213 WO2022211980A1 (en) | 2021-03-30 | 2022-03-08 | Planet-scale, fully managed artificial intelligence infrastructure service |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP4315057A1 (en) |
WO (1) | WO2022211980A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220360643A1 (en) * | 2018-11-30 | 2022-11-10 | Vmware, Inc. | Distributed inline proxy |
- 2022-03-08 WO PCT/US2022/019213 patent/WO2022211980A1/en active Application Filing
- 2022-03-08 EP EP22712724.8A patent/EP4315057A1/en active Pending
Non-Patent Citations (1)
Title |
---|
CHAUDHARY, SHUBHAM, ET AL.: "Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning", PROCEEDINGS OF THE 12TH IEEE/ACM INTERNATIONAL CONFERENCE ON UTILITY AND CLOUD COMPUTING, ACM, NEW YORK, NY, USA, 15 April 2020 (2020-04-15), pages 1-16, XP058553032, ISBN: 978-1-4503-6894-0, DOI: 10.1145/3342195.3387555 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220360643A1 (en) * | 2018-11-30 | 2022-11-10 | Vmware, Inc. | Distributed inline proxy |
US11882196B2 (en) * | 2018-11-30 | 2024-01-23 | VMware LLC | Distributed inline proxy |
Also Published As
Publication number | Publication date |
---|---|
EP4315057A1 (en) | 2024-02-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10467725B2 (en) | Managing access to a resource pool of graphics processing units under fine grain control | |
US20210089344A1 (en) | Methods and apparatus to deploy a hybrid workload domain | |
US10496503B2 (en) | Healing cloud services during upgrades | |
EP3270289A2 (en) | Container-based multi-tenant computing infrastructure | |
US10387179B1 (en) | Environment aware scheduling | |
US9661071B2 (en) | Apparatus, systems and methods for deployment and management of distributed computing systems and applications | |
US20190318240A1 (en) | Training machine learning models in distributed computing systems | |
JP5497201B2 (en) | Method for allocating resources, computer program for allocating resources, and system for allocating resources | |
US10176004B2 (en) | Workload-aware load balancing to minimize scheduled downtime during maintenance of host or hypervisor of a virtualized computing system | |
US10109030B1 (en) | Queue-based GPU virtualization and management system | |
US20190250946A1 (en) | Migrating a software container taking into account resource constraints | |
US10628199B2 (en) | Restoring and powering-off workloads during workflow execution based on policy triggers | |
US20220318674A1 (en) | Planet-scale, fully managed artificial intelligence infrastructure service | |
JP2013518330A5 (en) | ||
US11740921B2 (en) | Coordinated container scheduling for improved resource allocation in virtual computing environment | |
Gogouvitis et al. | Seamless computing in industrial systems using container orchestration | |
CN112099917B (en) | Regulation and control system containerized application operation management method, system, equipment and medium | |
KR20190028210A (en) | Cloud service method and system for deployment of artificial intelligence application using container | |
EP4315057A1 (en) | Planet-scale, fully managed artificial intelligence infrastructure service | |
US20220318052A1 (en) | Scheduler for planet-scale computing system | |
WO2022078060A1 (en) | Tag-driven scheduling of computing resources for function execution | |
CN117015763A (en) | Planetary-scale fully managed artificial intelligence infrastructure service | |
US11017417B1 (en) | Using incentives to manage computing resources | |
WO2022211981A1 (en) | Scheduler for planet-scale computing system | |
CN117099083A (en) | Scheduler for a planetary level computing system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22712724 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 202280022711.1 Country of ref document: CN |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2022712724 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 2022712724 Country of ref document: EP Effective date: 20231030 |