CN117271057A - Large model deployment method, device and product based on serverless computing - Google Patents


Info

Publication number
CN117271057A
CN117271057A (application CN202311249504.2A)
Authority
CN
China
Prior art keywords
model
target
large model
scheme
virtual environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311249504.2A
Other languages
Chinese (zh)
Inventor
Li Yang
Li Zhenhua
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202311249504.2A
Publication of CN117271057A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45595 Network integration; Enabling network access in virtual machine instances
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a large model deployment method, device and product based on serverless computing, relating to the technical field of deep learning and comprising the following steps: acquiring a deployment request of a target large model; sampling from a plurality of cloud virtual environment schemes by using a Bayesian optimization algorithm to obtain a target cloud virtual environment scheme; optimizing the computation graph of the target large model to obtain an optimized computation graph; generating an optimal model parallelization scheme by using a deep reinforcement learning model; deploying the target large model according to the target cloud virtual environment scheme and the optimal model parallelization scheme, and executing an inference service with the target large model to obtain an inference performance result; adjusting the parameters of the Bayesian optimization algorithm according to the inference performance result; performing multiple iterations of these steps; and finally deploying the target large model according to the target cloud virtual environment scheme and optimal model parallelization scheme generated in the last iteration, and executing the inference service with the target large model by a serverless computing method.

Description

Large model deployment method, device and product based on serverless computing
Technical Field
The present application relates to the technical field of deep learning, and in particular to a large model deployment method, device and product based on serverless computing.
Background
With the development of deep learning technology, deep neural network (Deep Neural Network, DNN) large models have been widely applied in fields such as computer vision and speech recognition, and have become important back-end support for many real-time online services. To improve service efficiency, many real-time online services opt to deploy their pre-trained large models in public clouds (e.g., Alibaba Cloud, Tencent Cloud) and provide users with corresponding inference services.
However, deploying a large model requires selecting a virtual machine instance of a specific type, with suitable hardware and operating system, namely a cloud virtual environment scheme. Since public clouds offer a huge number of cloud virtual environment schemes, it is difficult to manually select a suitable one; an unsuitable choice easily leads either to an over-provisioned configuration that wastes deployment funds or to an under-provisioned configuration that slows model inference.
Therefore, there is a need for a large model deployment method, device and product based on serverless computing that improve the cost effectiveness of large-model cloud deployment.
Disclosure of Invention
In view of the foregoing, embodiments of the present application provide a large model deployment method, device and product based on serverless computing to overcome, or at least partially address, the above problems.
In a first aspect of the embodiments of the present application, a large model deployment method based on serverless computing is provided, where the method includes:
acquiring a deployment request of a target large model; the target large model denotes a base model whose computation graph contains more than 100 million parameters;
sampling from a plurality of cloud virtual environment schemes by using a Bayesian optimization algorithm according to the deployment request, to obtain a target cloud virtual environment scheme;
optimizing the computation graph of the target large model to obtain an optimized computation graph;
generating an optimal model parallelization scheme from the optimized computation graph by using a deep reinforcement learning model; the optimal model parallelization scheme denotes a scheme for parallelizing a plurality of computing nodes of the target large model across a plurality of computing devices;
deploying the target large model according to the target cloud virtual environment scheme and the optimal model parallelization scheme, and executing an inference service with the target large model to obtain an inference performance result of the target large model;
adjusting parameters of the Bayesian optimization algorithm according to the inference performance result;
repeating the above steps for multiple iterations until an iteration stop condition is reached;
and deploying the target large model according to the target cloud virtual environment scheme and optimal model parallelization scheme generated in the last iteration, and executing the inference service with the target large model by a serverless computing method.
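The iterative loop described in the steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: every callable (`sample_scheme`, `optimize_graph`, `plan_parallelism`, `measure_inference`, `update_bo`) is a hypothetical stand-in for a component the method describes.

```python
def deploy_large_model(schemes, optimize_graph, sample_scheme,
                       plan_parallelism, measure_inference, update_bo,
                       max_iters=10):
    """Sketch of the iterative joint-optimization loop.

    Hypothetical stand-ins for the described components:
      sample_scheme(schemes)      -> target cloud virtual environment scheme
                                     (Bayesian-optimization sampling)
      optimize_graph()            -> optimized computation graph
      plan_parallelism(graph)     -> model parallelization scheme (DRL)
      measure_inference(s, plan)  -> inference performance result of a trial
      update_bo(result)           -> adjusts the Bayesian optimizer's parameters
    """
    final = None
    for _ in range(max_iters):  # iterate until the stop condition (here: max_iters)
        scheme = sample_scheme(schemes)            # sample an environment scheme
        plan = plan_parallelism(optimize_graph())  # optimize graph, plan parallelism
        result = measure_inference(scheme, plan)   # trial deployment + inference
        update_bo(result)                          # feed the result back to BO
        final = (scheme, plan, result)
    # the scheme and plan from the last iteration are used for final deployment
    return final
```

With dummy components, the loop returns the last iteration's scheme and plan, which is exactly what the final deployment step consumes.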
In an optional implementation, the sampling from a plurality of cloud virtual environment schemes by using a Bayesian optimization algorithm to obtain a target cloud virtual environment scheme includes:
predicting the inference cost of each cloud virtual environment scheme by using the Bayesian optimization algorithm; each cloud virtual environment scheme at least includes, for its corresponding environment: CPU core count, CPU clock speed, CUDA core count, GPU clock speed, GPU count, and cloud virtual environment memory size; the inference cost denotes the product of the time required for inference and the price of the cloud virtual environment scheme;
and determining, among the plurality of cloud virtual environment schemes, the cloud virtual environment scheme with the lowest inference cost as the target cloud virtual environment scheme.
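The cost definition and lowest-cost selection above can be sketched as follows. The field names and numeric values are invented for illustration, and `predict_cost` stands in for the Bayesian optimizer's cost prediction; this is not the patent's actual surrogate model.

```python
def inference_cost(scheme):
    """Inference cost = time required for inference x price of the scheme."""
    return scheme["est_time_h"] * scheme["price_per_hour"]

def select_target_scheme(schemes, predict_cost):
    """Return the scheme with the lowest predicted inference cost.
    predict_cost stands in for the Bayesian optimizer's prediction."""
    return min(schemes, key=predict_cost)

# Hypothetical schemes carrying the kinds of attributes listed above
# (core counts, GPU count, memory); all numbers are made up.
example_schemes = [
    {"cpu_cores": 8,  "gpu_count": 1, "mem_gb": 32,
     "price_per_hour": 1.2, "est_time_h": 0.5},
    {"cpu_cores": 16, "gpu_count": 2, "mem_gb": 64,
     "price_per_hour": 2.0, "est_time_h": 0.2},
]
```

Note that the pricier scheme can still win: here the second scheme costs more per hour but finishes inference faster, giving it the lower time-times-price cost.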
In an optional implementation, the Bayesian optimization algorithm includes a constraint condition indicating that the algorithm must determine the target cloud virtual environment scheme within a preset search time.
In an optional embodiment, the generating an optimal model parallelization scheme from the optimized computation graph by using the deep reinforcement learning model includes:
encoding the optimized computation graph into a deep learning operator sequence and inputting it into the deep reinforcement learning model, the deep reinforcement learning model outputting a device sequence, where the devices in the device sequence correspond one-to-one to the deep learning operators in the operator sequence; the deep learning operator sequence and the device sequence together form a model parallelization scheme;
deploying the target large model according to the target cloud virtual environment scheme and the model parallelization scheme, and executing an inference service with the target large model to obtain an inference performance result of the target large model;
adjusting model parameters of the deep reinforcement learning model according to the inference performance result;
repeating the above steps multiple times to complete iterative training of the deep reinforcement learning model and obtain a trained deep reinforcement learning model;
and inputting the deep learning operator sequence into the trained deep reinforcement learning model to obtain a target device sequence, where the target device sequence and the deep learning operator sequence form the optimal model parallelization scheme.
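The one-to-one operator-to-device mapping described above can be sketched as follows. The operator names and device labels are hypothetical, and the round-robin fallback is purely illustrative: it takes the place of the trained deep reinforcement learning policy, which would instead weigh computation parallelism against inter-device communication overhead.

```python
def place_operators(operator_seq, devices, policy=None):
    """Produce a device sequence matching the operator sequence one-to-one.

    `policy` stands in for the trained deep reinforcement learning model;
    in its absence a naive round-robin placement is used for illustration.
    """
    if policy is None:
        policy = lambda i, op: devices[i % len(devices)]
    device_seq = [policy(i, op) for i, op in enumerate(operator_seq)]
    # operator sequence + device sequence together form the parallelization scheme
    return list(zip(operator_seq, device_seq))

plan = place_operators(["embed", "attention", "mlp", "softmax"],
                       ["gpu:0", "gpu:1"])
```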
In an alternative embodiment, the executing an inference service with the target large model by a serverless computing method includes:
acquiring, through a serverless interface, an inference service request input by a client;
abstracting the inference service request into a feature vector and transmitting the feature vector to the target large model;
performing, by the target large model, the inference service according to the feature vector to obtain an inference result;
and returning the inference result to the client.
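The serverless request path above (request in, feature vector to the model, result back to the client) can be sketched as a single handler. `featurize` and `model_infer` are hypothetical stand-ins for the request-abstraction step and the deployed target large model; this is a sketch, not a real serverless runtime.

```python
def handle_request(raw_request, featurize, model_infer):
    """Serverless-style request path: abstract the request into a feature
    vector, hand it to the deployed model, and return the inference result.
    featurize and model_infer are hypothetical stand-ins."""
    features = featurize(raw_request)   # request -> feature vector
    result = model_infer(features)      # feature vector -> inference result
    return result                       # result is returned to the client
```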
In an alternative embodiment, the executing an inference service with the target large model includes:
when the target large model is started for the first time, identifying and running target code blocks with a probe to complete the start of the target large model; the target code blocks include at least: a hardware detection code block, a computation graph construction code block, and a CUDA initialization code block;
preserving the start-up state of the target code blocks;
when the target large model is started for the K-th time, directly retrieving the preserved start-up state of the target code blocks to complete the start of the target large model, K being any integer greater than 1;
and executing the inference service after the target large model has started.
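The start-state caching described above can be sketched as follows. The cache, the block names, and the zero-argument callables are all hypothetical; real probed blocks (hardware detection, graph construction, CUDA initialization) would produce device handles and graph state rather than plain values.

```python
_start_state_cache = {}  # start-up states preserved across model starts

def start_model(model_id, target_blocks):
    """Start a model, caching the state of its probed target code blocks.

    On the first start, every target code block is run and its resulting
    state is preserved. On the K-th start (K > 1), the preserved state is
    retrieved directly instead of re-running the blocks. target_blocks
    maps a hypothetical block name to a zero-argument callable.
    """
    if model_id not in _start_state_cache:
        _start_state_cache[model_id] = {
            name: run_block() for name, run_block in target_blocks.items()
        }
    return _start_state_cache[model_id]
```

The point of the design is that expensive one-time work (e.g., CUDA initialization) is paid only on the cold start, which is exactly the cost that dominates serverless cold-start latency.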
In an alternative embodiment, the optimizing the computation graph of the target large model to obtain an optimized computation graph includes:
replacing, with a tensor algebra superoptimizer, one or more source subgraphs in the computation graph of the target large model with target subgraphs, where each target subgraph is functionally equivalent to its source subgraph and has higher computational performance than the source subgraph.
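The subgraph substitution above can be illustrated with a deliberately tiny model: the graph is flattened to a list of operator names and each rewrite maps a contiguous source operator sequence to a target sequence. Real computation graphs are DAGs and real tensor algebra superoptimizers verify functional equivalence of the substituted subgraphs; both are abstracted away in this sketch.

```python
def substitute_subgraphs(graph, rewrites):
    """Replace source subgraphs with functionally equivalent, faster targets.

    graph:    flat list of operator names (a toy stand-in for a DAG)
    rewrites: list of (source_sequence, target_sequence) pairs
    """
    for source, target in rewrites:
        n = len(source)
        rewritten = []
        i = 0
        while i < len(graph):
            if graph[i:i + n] == source:
                rewritten.extend(target)  # substitute the equivalent subgraph
                i += n
            else:
                rewritten.append(graph[i])
                i += 1
        graph = rewritten
    return graph
```

For example, fusing a matrix multiply with the addition that follows it is a classic substitution of this kind.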
A second aspect of the embodiments of the present application further provides a large model deployment device based on serverless computing, including:
a deployment request acquisition module, configured to acquire a deployment request of a target large model; the target large model denotes a base model whose computation graph contains more than 100 million parameters;
a cloud virtual environment scheme sampling module, configured to sample from a plurality of cloud virtual environment schemes by using a Bayesian optimization algorithm according to the deployment request, to obtain a target cloud virtual environment scheme;
a computation graph optimization module, configured to optimize the computation graph of the target large model to obtain an optimized computation graph;
a model parallelization scheme generation module, configured to generate an optimal model parallelization scheme from the optimized computation graph by using a deep reinforcement learning model; the optimal model parallelization scheme denotes a scheme for parallelizing a plurality of computing nodes of the target large model across a plurality of computing devices;
an inference module, configured to deploy the target large model according to the target cloud virtual environment scheme and the optimal model parallelization scheme, and execute an inference service with the target large model to obtain an inference performance result of the target large model;
a parameter adjustment module, configured to adjust parameters of the Bayesian optimization algorithm according to the inference performance result;
an iteration module, configured to perform multiple iterations of the above steps until an iteration stop condition is reached;
and a deployment module, configured to deploy the target large model according to the target cloud virtual environment scheme and optimal model parallelization scheme generated in the last iteration, and execute the inference service with the target large model by a serverless computing method.
In an alternative embodiment, the cloud virtual environment scheme sampling module includes:
an inference cost prediction sub-module, configured to predict the inference cost of each cloud virtual environment scheme by using the Bayesian optimization algorithm; each cloud virtual environment scheme at least includes, for its corresponding environment: CPU core count, CPU clock speed, CUDA core count, GPU clock speed, GPU count, and cloud virtual environment memory size; the inference cost denotes the product of the time required for inference and the price of the cloud virtual environment scheme;
and a determination sub-module, configured to determine, among the plurality of cloud virtual environment schemes, the cloud virtual environment scheme with the lowest inference cost as the target cloud virtual environment scheme.
In an optional implementation, the Bayesian optimization algorithm includes a constraint condition indicating that the algorithm must determine the target cloud virtual environment scheme within a preset search time.
In an alternative embodiment, the model parallelization scheme generation module includes:
a device sequence output sub-module, configured to encode the optimized computation graph into a deep learning operator sequence and input it into the deep reinforcement learning model, the deep reinforcement learning model outputting a device sequence, where the devices in the device sequence correspond one-to-one to the deep learning operators in the operator sequence; the deep learning operator sequence and the device sequence together form a model parallelization scheme;
an inference sub-module, configured to deploy the target large model according to the target cloud virtual environment scheme and the model parallelization scheme, and execute an inference service with the target large model to obtain an inference performance result of the target large model;
a parameter adjustment sub-module, configured to adjust model parameters of the deep reinforcement learning model according to the inference performance result;
an iteration sub-module, configured to repeat the above steps multiple times to complete iterative training of the deep reinforcement learning model and obtain a trained deep reinforcement learning model;
and an optimal model parallelization scheme generation sub-module, configured to input the deep learning operator sequence into the trained deep reinforcement learning model to obtain a target device sequence, where the target device sequence and the deep learning operator sequence form the optimal model parallelization scheme.
In an alternative embodiment, the deployment module includes:
an acquisition sub-module, configured to acquire, through a serverless interface, an inference service request input by a client;
a feature vector generation sub-module, configured to abstract the inference service request into a feature vector and send the feature vector to the target large model;
an inference service sub-module, configured to perform, by the target large model, the inference service according to the feature vector to obtain an inference result;
and a return sub-module, configured to return the inference result to the client.
In an alternative embodiment, the inference module includes:
a first start-up sub-module, configured to, when the target large model is started for the first time, identify and run target code blocks with a probe to complete the start of the target large model; the target code blocks include at least: a hardware detection code block, a computation graph construction code block, and a CUDA initialization code block;
a preservation sub-module, configured to preserve the start-up state of the target code blocks;
a second start-up sub-module, configured to, when the target large model is started for the K-th time, directly retrieve the preserved start-up state of the target code blocks to complete the start of the target large model, K being any integer greater than 1;
and a service sub-module, configured to execute the inference service after the target large model has started.
In an alternative embodiment, the computation graph optimization module includes:
a subgraph replacement sub-module, configured to replace one or more source subgraphs in the computation graph of the target large model with target subgraphs, where each target subgraph is functionally equivalent to its source subgraph and has higher computational performance than the source subgraph.
A third aspect of the embodiments of the present application further provides an electronic device, including a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the serverless computing-based large model deployment method according to the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application further provides a computer-readable storage medium on which a computer program/instruction is stored, which, when executed by a processor, implements the steps of the serverless computing-based large model deployment method according to the first aspect of the embodiments of the present application.
A fifth aspect of the embodiments of the present application further provides a computer program product which, when run on an electronic device, causes a processor to perform the steps of the serverless computing-based large model deployment method according to the first aspect of the embodiments of the present application.
Embodiments of the present application provide a large model deployment method, device and product based on serverless computing, where the method includes: acquiring a deployment request of a target large model, the target large model denoting a base model whose computation graph contains more than 100 million parameters; sampling from a plurality of cloud virtual environment schemes by using a Bayesian optimization algorithm according to the deployment request, to obtain a target cloud virtual environment scheme; optimizing the computation graph of the target large model to obtain an optimized computation graph; generating an optimal model parallelization scheme from the optimized computation graph by using a deep reinforcement learning model, the optimal model parallelization scheme denoting a scheme for parallelizing a plurality of computing nodes of the target large model across a plurality of computing devices; deploying the target large model according to the target cloud virtual environment scheme and the optimal model parallelization scheme, and executing an inference service with the target large model to obtain an inference performance result of the target large model; adjusting parameters of the Bayesian optimization algorithm according to the inference performance result; repeating the above steps for multiple iterations until an iteration stop condition is reached; and deploying the target large model according to the target cloud virtual environment scheme and optimal model parallelization scheme generated in the last iteration, and executing the inference service with the target large model by a serverless computing method.
Beneficial effects:
Embodiments of the present application combine a Bayesian optimization algorithm with a deep reinforcement learning model to adaptively discover an optimal cloud virtual environment scheme and model parallelization scheme. Specifically, by continuously adjusting the parameters of the Bayesian optimization algorithm over multiple iterations, the algorithm learns the inference performance of the target large model and thereby determines a suitable cloud virtual environment scheme from the plurality of candidates; meanwhile, the deep reinforcement learning model generates an optimal model parallelization scheme that balances computation parallelism against communication overhead among devices. Embodiments of the present application can therefore simultaneously find an economical and efficient cloud virtual environment scheme and model parallelization scheme, realizing joint optimization of the cloud virtual environment and model parallelization, and thus improving the cost effectiveness of large-model cloud deployment.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of the steps of a large model deployment method based on serverless computing provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a large model deployment method provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a large model deployment device provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings in the embodiments of the present application. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Deep learning is currently a standard technique in many fields such as computer vision, speech recognition, and natural language processing. In recent years, deep neural network large models have become increasingly important back-end support for many real-time online services such as Siri and Instagram. These models are pre-trained, general-purpose large models: a general base model on which many task-specific models are built by lightweight adaptation (without training from scratch), while exhibiting high prediction accuracy due to their huge scale of operations and parameters (e.g., GPT-3 is a base model with 175 billion parameters that can be adapted via natural language prompts to perform language translation, question answering, formula generation, and so on). Since prediction accuracy is guaranteed by the large model, the performance of a real-time online service largely depends on the response time for processing user requests, which includes network transmission time, task scheduling time, inference time (i.e., the execution time of the DNN inference service), and so on. Within the response time, inference time generally occupies the major part, especially for large models. Inference time is therefore generally taken as the main limiting factor for quality of service (Quality of Service, QoS) in DNN-driven real-time online services.
To improve service efficiency, many real-time online services opt to deploy their pre-trained large models in public clouds (e.g., Alibaba Cloud, Tencent Cloud) and provide users with corresponding inference services. However, deploying a large model requires selecting a virtual machine instance of a specific type, with suitable hardware and operating system, namely a cloud virtual environment scheme, while the number of cloud virtual environment schemes provided by public clouds is huge (often exceeding 100). It is therefore difficult to manually select a suitable scheme: an over-provisioned configuration wastes deployment funds, while an under-provisioned configuration reduces the model's inference speed and prolongs the inference service time.
In view of the above problems, embodiments of the present application provide a large model deployment method, device and product based on serverless computing, so as to solve problems such as unsuitable cloud virtual environment schemes and improve the cost effectiveness of large-model cloud deployment. The serverless computing-based large model deployment method provided by the embodiments of the present application is described in detail below through several embodiments and application scenarios, with reference to the accompanying drawings.
A first aspect of the embodiments of the present application provides a large model deployment method based on serverless computing. Referring to FIG. 1, which is a flow chart of the steps of the method, the method includes:
Step S101: acquiring a deployment request of a target large model, where the target large model denotes a base model whose computation graph contains more than 100 million parameters.
In this embodiment, the target large model is a large model, that is, a general-purpose base model (the number of operators and parameters contained in the computation graph of such a base model is often quite huge, with more than 100 million parameters). After receiving a deployment request sent by a client, models for specific tasks, for example a language translation model, a question answering model, or a formula generation model, are built on the base model through lightweight adaptation (without training from scratch). From the deployment request, the computation graph of the target large model and a plurality of cloud virtual environment schemes available for it can be obtained.
And step S102, sampling from a plurality of cloud virtual environment schemes by using a Bayesian optimization algorithm according to the deployment request to obtain a target cloud virtual environment scheme.
The embodiment of the application proposes to adaptively select a suitable target cloud virtual environment scheme from a plurality of cloud virtual environment schemes through a Bayesian optimization algorithm. The cloud virtual environment scheme represents a combination of computing resources, i.e., a combination of specific types of cloud servers with different hardware and operating systems. The cloud virtual environment scheme plays an important role in reasoning performance and running cost of the large model. The bayesian optimization algorithm is a sequential design strategy for global optimization of black box functions, which does not take on any functional form, and the objective of black box functions is to optimize an objective function f (x), which can query the value of the objective function f (x) at x without knowing any other information (e.g. gradient information) and specific formulas of the objective function f (x).
In the related art, the cloud virtual environment scheme required for deployment is often selected manually from the plurality of cloud virtual environment schemes. However, when the number of cloud virtual environment schemes is large (generally more than 100), it is difficult for manual selection to find the optimal scheme: an over-provisioned configuration wastes money, while an under-provisioned configuration reduces the reasoning speed and prolongs the reasoning time. This embodiment therefore uses a Bayesian optimization algorithm to adaptively select the optimal target cloud virtual environment scheme from the plurality of cloud virtual environment schemes.
In an optional implementation manner, step S102, sampling from a plurality of cloud virtual environment schemes by using a Bayesian optimization algorithm to obtain a target cloud virtual environment scheme, includes:
Step S1021, predicting the reasoning cost of each cloud virtual environment scheme by using the Bayesian optimization algorithm; each cloud virtual environment scheme at least includes, for the corresponding environment: the number of CPU cores, the CPU clock speed, the number of CUDA cores, the GPU clock speed, the number of GPUs, and the memory size of the cloud virtual environment. The reasoning cost represents the product of the time required for reasoning and the price of the cloud virtual environment scheme.
The inference cost represents the product of the time required for inference (i.e., the time that the target large model, deployed according to the cloud virtual environment scheme, needs to perform the inference of the reasoning service) and the price of the cloud virtual environment scheme (i.e., the monetary cost of deploying according to the scheme).
Step S1022, determining, from the plurality of cloud virtual environment schemes, the cloud virtual environment scheme with the lowest reasoning cost as the target cloud virtual environment scheme.
In this embodiment, the expected gain of each cloud virtual environment scheme (determined by its predicted inference cost) is estimated by using the Bayesian optimization algorithm, and the cloud virtual environment scheme with the highest expected gain (i.e., the lowest predicted reasoning cost) is selected as the target cloud virtual environment scheme, thereby completing the sampling of the cloud virtual environment scheme.
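Steps S1021–S1022 can be sketched as follows. The cost predictor below is a hypothetical stand-in for the fitted Bayesian surrogate, and the scheme fields loosely follow the feature list above (CPU cores, GPUs, price); none of the concrete values come from the embodiment.

```python
# Sketch of S1021 (predict each scheme's reasoning cost) and S1022
# (pick the scheme with the lowest predicted cost). Stand-in predictor.
schemes = [
    {"name": "A", "cpu_cores": 8,  "gpus": 1, "price_per_s": 0.002},
    {"name": "B", "cpu_cores": 16, "gpus": 2, "price_per_s": 0.005},
    {"name": "C", "cpu_cores": 32, "gpus": 4, "price_per_s": 0.012},
]

def predict_inference_time(scheme):
    """Hypothetical stand-in for the surrogate: more GPUs -> faster."""
    return 8.0 / scheme["gpus"]

def predicted_inference_cost(scheme):
    # Reasoning cost = predicted inference time x price of the scheme.
    return predict_inference_time(scheme) * scheme["price_per_s"]

# S1022: select the minimum-cost scheme as the target scheme.
target = min(schemes, key=predicted_inference_cost)
```

In this toy table the cheapest scheme is not the most powerful one: the larger instances are faster but their higher price dominates the time-times-price product.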
In an optional implementation manner, the bayesian optimization algorithm includes a constraint condition, where the constraint condition indicates that the bayesian optimization algorithm needs to determine the target cloud virtual environment scheme within a preset search time.
Because the number of cloud virtual environment schemes is very large, directly and exhaustively searching for the scheme with the lowest cost would make the search cost too high and the search time too long. To solve the problem within a limited search time, this embodiment proposes adding a constraint condition to the Bayesian optimization algorithm. The constraint limits the search time: a suitable target cloud virtual environment scheme must be determined within the search time. The length of the search time may be set according to the actual application and is not limited in this embodiment.
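The search-time constraint can be sketched as a budgeted loop that returns the best scheme found so far once the preset budget is exhausted. The `evaluate` callback is a hypothetical stand-in for one cost measurement; this is an illustration of the constraint, not the embodiment's optimizer.

```python
import time

def budgeted_search(candidates, evaluate, budget_s):
    """Evaluate candidates until the search-time budget runs out,
    then return the cheapest candidate seen so far."""
    deadline = time.monotonic() + budget_s
    best, best_cost = None, float("inf")
    for c in candidates:
        if time.monotonic() >= deadline:
            break  # constraint: stop when the search time is used up
        cost = evaluate(c)
        if cost < best_cost:
            best, best_cost = c, cost
    return best

best = budgeted_search(["a", "b", "c"],
                       evaluate=lambda c: {"a": 3, "b": 1, "c": 2}[c],
                       budget_s=1.0)
```

With a generous budget all candidates are evaluated; with a tight one the loop degrades gracefully to the best result found in time.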
Step S103, optimizing the computation graph of the target large model to obtain an optimized computation graph.
Machine learning frameworks typically abstract the computing operations of a large model into a computation graph, which is composed of multiple computing operations; the structure of the graph represents the execution ordering between the computing operations. Specifically, the computation graph may be denoted as G and consists of n operations, denoted as O = {O1, O2, ..., On}. G has a set of directed edges; each edge connects two operations and represents the dependency between them. If a directed edge connects Oi and Oj, then operation Oj can only start after Oi has completed. To achieve real-time reasoning for the large model, the configuration cost of the cloud virtual environment scheme obtained through Bayesian optimization is generally high, which leads to a high reasoning cost after the target large model is deployed. In this regard, this embodiment proposes optimizing the computation graph of the target large model, so as to improve the computing performance of the graph, improve the reasoning service efficiency of the target large model, and reduce the reasoning cost.
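The dependency structure described above (Oj may only start after its predecessor Oi completes) is exactly a directed acyclic graph, and a valid execution order is a topological order. A minimal sketch with an assumed four-operation graph:

```python
# Sketch of the computation-graph dependency structure: each key maps an
# operation to the set of operations it depends on (its in-edges).
from graphlib import TopologicalSorter  # Python 3.9+

edges = {
    "O2": {"O1"},          # edge O1 -> O2: O2 starts only after O1
    "O3": {"O1"},
    "O4": {"O2", "O3"},
}
# A topological order is one valid execution sequence for the graph G.
order = list(TopologicalSorter(edges).static_order())
```

Any schedule the framework produces must respect this partial order; parallelization (step S104) chooses devices on top of it.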
Step S103, optimizing the computation graph of the target large model to obtain an optimized computation graph, includes:
replacing one or more source subgraphs in the computation graph of the target large model with target subgraphs by using a tensor algebra superoptimizer, where each target subgraph is functionally equivalent to its source subgraph and has higher computing performance than the source subgraph.
In this embodiment, the computation graph of the target large model contains some inefficient subgraphs that severely reduce the speed of model reasoning. This embodiment adopts the adaptive graph substitution method of the Tensor Algebra SuperOptimizer (TASO): the computation graph is optimized by adaptively replacing an inefficient subgraph (source subgraph) in the computation graph with a functionally equivalent subgraph of higher computing performance (target subgraph), so as to reduce the reasoning cost.
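The substitution idea can be sketched over a linearized operator sequence: find an occurrence of a source pattern and splice in the equivalent target. This is a toy illustration of graph substitution, not TASO itself (TASO matches true subgraphs and verifies equivalence with tensor algebra identities); the fused operator name is hypothetical.

```python
def replace_subgraph(ops, source, target):
    """Replace every occurrence of the `source` operator run with the
    functionally equivalent, faster `target` run."""
    out, i = [], 0
    while i < len(ops):
        if ops[i:i + len(source)] == source:
            out.extend(target)        # substitute the equivalent subgraph
            i += len(source)
        else:
            out.append(ops[i])
            i += 1
    return out

# Example: fuse a matmul followed by an elementwise add into one fused op.
graph = ["matmul", "add", "relu", "matmul", "add"]
optimized = replace_subgraph(graph, ["matmul", "add"], ["fused_matmul_add"])
```

The optimized sequence computes the same function with fewer kernel launches, which is the source of the performance gain such substitutions target.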
Step S104, generating an optimal model parallelization scheme from the optimized computation graph by using a deep reinforcement learning model; the optimal model parallelization scheme represents a parallelization scheme that places the multiple computing nodes of the target large model on multiple computing devices.
When deploying the target large model, in addition to a suitable cloud virtual environment scheme, the parallelism of computation needs to be considered. Specifically, to perform the inference service, each computing operation of the large model needs to be placed on a corresponding computing device (e.g., a CPU core or a GPU) that executes it. In the related art, the model parallelization scheme is usually designed manually by the service provider; for a large model with a very large computation graph there are numerous alternative parallelization schemes, and it is difficult for manual selection to arrive at an optimal or near-optimal one. Deep reinforcement learning is an artificial intelligence method that combines the perception capability of deep learning with the decision-making capability of reinforcement learning and acts directly on its input. This embodiment uses a deep reinforcement learning model to determine the optimal model parallelization scheme, which can be represented as P = (p1, p2, ..., pn): for any computing operation Oi in the computation graph G, there is a corresponding pi in P that represents the device on which Oi is placed. Therefore, through the optimal model parallelization scheme, this embodiment explicitly places the computing operations of the target large model's computation graph on multiple computing devices such as GPUs and CPUs, so as to accelerate reasoning and balance computing parallelism against communication overhead between devices.
Step S104, generating an optimal model parallelization scheme from the optimized computation graph by using a deep reinforcement learning model, includes:
Step S1041, encoding the optimized computation graph into a deep learning operator sequence and inputting it into the deep reinforcement learning model; the deep reinforcement learning model outputs a device sequence, and the devices in the device sequence correspond one-to-one to the deep learning operators in the deep learning operator sequence. The deep learning operator sequence and the device sequence together form a model parallelization scheme.
In this embodiment, the information of all computing operations in the optimized computation graph is encoded as a data sequence (the deep learning operator sequence), e.g., O = {O1, O2, ..., On}, and input into the deep reinforcement learning model. The output of the deep reinforcement learning model is structured as a device sequence corresponding to the input deep learning operator sequence, e.g., P = (p1, p2, ..., pn). The elements of the two sequences correspond one-to-one: for any deep learning operator Oi in the input sequence, there is a corresponding element pi in the output device sequence P that represents the device on which Oi is placed.
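The sequence-to-sequence shape of step S1041 can be sketched as follows. The round-robin policy is a hypothetical stand-in for the deep reinforcement learning model; only the interface (operator sequence in, equal-length device sequence out) reflects the step above.

```python
# Sketch of S1041's interface: an operator sequence O goes in, a device
# sequence P of the same length comes out; zipping them gives Oi -> pi.
def stand_in_policy(op_sequence, devices):
    """Hypothetical placement policy: round-robin over devices."""
    return [devices[i % len(devices)] for i in range(len(op_sequence))]

ops = ["O1", "O2", "O3", "O4"]
device_seq = stand_in_policy(ops, devices=["gpu:0", "gpu:1"])
placement = dict(zip(ops, device_seq))  # the model parallelization scheme
```

The trained policy replaces the round-robin stub, but the parallelization scheme it emits has exactly this operator-to-device mapping structure.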
Step S1042, deploying the target large model according to the target cloud virtual environment scheme and the model parallelization scheme, and executing a reasoning service by using the target large model to obtain a reasoning performance result of the target large model.
According to the target cloud virtual environment scheme, the target large model is deployed on the corresponding CPUs and GPUs, and according to the correspondence between the deep learning operator sequence and the elements of the device sequence in the model parallelization scheme, the computing operations of the target large model are deployed on the corresponding hardware devices. After the deployment of the target large model is completed, the corresponding reasoning service can be executed by using the target large model. From the result of the reasoning service, the reasoning performance result of the target large model can be obtained through testing; the reasoning performance result characterizes the reasoning performance of the target large model, such as the time required to execute the reasoning service, the reasoning speed, and the accuracy of the reasoning result.
Step S1043, adjusting the model parameters of the deep reinforcement learning model according to the reasoning performance result.
Step S1044, repeating the above steps multiple times to complete the iterative training of the deep reinforcement learning model and obtain a trained deep reinforcement learning model. Specifically, training can be stopped when a preset number of training rounds or a model convergence condition is reached, yielding the trained deep reinforcement learning model.
Step S1045, inputting the deep learning operator sequence into the trained deep reinforcement learning model to obtain a target device sequence, where the target device sequence and the deep learning operator sequence form the optimal model parallelization scheme.
In this embodiment, a Bayesian optimization algorithm is used to decide the next cloud virtual environment scheme to sample (the target cloud virtual environment scheme) so as to reduce the reasoning cost. After the target cloud virtual environment scheme is obtained by sampling, the computation graph of the target large model is determined according to that scheme, and the deep reinforcement learning model is iteratively trained on the computation graph to formulate the optimal model parallelization scheme. In each iteration, the model parameters of the deep reinforcement learning model are adjusted according to the obtained reasoning performance result; after multiple iterations, the training of the deep reinforcement learning model is completed, so that it learns the reasoning performance of the target large model. It can then determine, from the input deep learning operator sequence, the optimal model parallelization scheme among the multiple candidate schemes, improving the reasoning performance of the target large model after deployment.
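The training loop S1042–S1044 can be sketched as: sample a placement, deploy and measure a performance result, and nudge the policy toward better placements. Everything here is a hypothetical stand-in — a tabular score in place of the deep RL model and a simulated latency in place of the real reasoning-service run — kept only to show the loop's shape.

```python
import random

# Stand-in iterative training loop for the placement policy.
random.seed(0)
ops = ["O1", "O2"]
devices = ["gpu:0", "gpu:1"]
scores = {(o, d): 0.0 for o in ops for d in devices}  # stand-in parameters

def measure_latency(placement):
    # Simulated performance result: colocating both ops is cheapest here.
    return 1.0 if len(set(placement.values())) == 1 else 2.0

for _ in range(200):
    placement = {o: random.choice(devices) for o in ops}   # S1041: sample
    latency = measure_latency(placement)                   # S1042: deploy + measure
    for o, d in placement.items():                         # S1043: adjust parameters
        scores[(o, d)] += (2.0 - latency)                  # reward good placements

# S1044/S1045: after training, read out the best device per operator.
best = {o: max(devices, key=lambda d: scores[(o, d)]) for o in ops}
```

Even this crude update learns the structure of the simulated cost (both operators end up colocated); the embodiment's deep RL model plays the same role with a learned policy instead of a score table.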
Step S105, deploying the target large model according to the target cloud virtual environment scheme and the optimal model parallelization scheme, and executing a reasoning service by using the target large model to obtain a reasoning performance result of the target large model.
According to the target cloud virtual environment scheme, corresponding CPU and GPU resources are configured for the target large model, and according to the correspondence between the deep learning operator sequence and the elements of the device sequence in the optimal model parallelization scheme, the computing operations of the target large model are deployed on the corresponding hardware devices. After the deployment of the target large model is completed, the corresponding reasoning service can be executed by using the target large model. From the result of the reasoning service, the reasoning performance result of the target large model can be obtained through testing; it characterizes the reasoning performance of the target large model after deployment according to the target cloud virtual environment scheme and the optimal model parallelization scheme, such as the time required to execute the reasoning service, the reasoning speed, and the accuracy of the reasoning result.
Step S106, adjusting the parameters of the Bayesian optimization algorithm according to the reasoning performance result.
Step S107, iterating the above steps multiple times until an iteration stop condition is reached. Specifically, the iteration stop condition may be a preset number of iterations or an iteration time limit.
Referring to fig. 2, fig. 2 shows a flow diagram of the large model deployment method. As shown in fig. 2, for a deployment request of a target large model, a plurality of selectable cloud virtual environment schemes in a cloud server pool are determined (corresponding to flow node 1 in fig. 2); the Bayesian optimization algorithm samples from the plurality of cloud virtual environment schemes and determines a target cloud virtual environment scheme (flow node 2); the TASO tensor algebra superoptimizer is used to optimize the computation graph of the target large model; the optimized computation graph is input into the deep reinforcement learning model, which is iteratively trained multiple times, and an optimal model parallelization scheme is selected from the plurality of model parallelization schemes (flow node 3); the target large model is then deployed according to the obtained target cloud virtual environment scheme and optimal model parallelization scheme, and a reasoning performance result is generated in real time from the reasoning results in an execution environment based on server non-perception computing; finally, the parameters of the Bayesian optimization algorithm are adjusted and optimized (flow node 4), so that a better target cloud virtual environment scheme can be selected in the next iteration. As shown in fig. 2, flow nodes 2, 3 and 4 complete one round of training; through multiple iterations, the parameters of the Bayesian optimization algorithm are optimized multiple times, so that the performance of the target large model is learned and, within a limited search time, the cloud virtual environment scheme with the lowest reasoning cost is selected from the cloud virtual environment schemes as the target cloud virtual environment scheme.
Step S108, deploying the target large model according to the target cloud virtual environment scheme and the optimal model parallelization scheme generated in the last iteration, and executing the reasoning service by using the target large model through a server non-perception computing method.
The embodiment of the application combines a Bayesian optimization algorithm and a deep reinforcement learning model to adaptively discover the optimal cloud virtual environment scheme and model parallelization scheme. Specifically, the embodiment continuously adjusts the parameters of the Bayesian optimization algorithm through multiple iterations, so that the algorithm learns the reasoning performance of the target large model and thereby determines a suitable cloud virtual environment scheme from the plurality of candidates; the deep reinforcement learning model generates the optimal model parallelization scheme so as to balance computing parallelism against communication overhead between devices. By jointly using Bayesian optimization and deep reinforcement learning, the embodiment can adaptively find an economical and efficient cloud virtual environment scheme and model parallelization scheme within a limited search time, realizing the double joint optimization of cloud virtual environment and model parallelization, minimizing the reasoning cost while satisfying the search time constraint, and further improving the cost benefit of large model cloud deployment.
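The double joint optimization above can be compressed into a two-level loop: an outer loop samples a cloud scheme, an inner search picks a placement for that scheme, and the measured cost of the pair feeds back. The two-entry tables and multiplicative cost model below are hypothetical stand-ins for the Bayesian optimizer, the deep RL model, and the real measurements.

```python
# Stand-in double loop: outer over cloud schemes, inner over placements.
schemes = {"small": 1.0, "large": 3.0}           # scheme -> price factor
placements = {"split": 2.0, "colocated": 1.0}    # placement -> time factor

def measured_cost(scheme, placement):
    return schemes[scheme] * placements[placement]  # cost = price x time

best = None
for s in schemes:                                # outer: Bayesian-optimization role
    p = min(placements, key=lambda q: measured_cost(s, q))  # inner: placement search
    c = measured_cost(s, p)
    if best is None or c < best[2]:
        best = (s, p, c)
```

The point of the joint search is visible even at this scale: the best pair is found by evaluating scheme and placement together, since neither choice is optimal in isolation.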
In an alternative embodiment, the executing a reasoning service by using the target large model through a server non-perception computing method includes:
and acquiring an inference service request input by the user terminal through the server non-perception interface.
Abstracting the reasoning service request into a feature vector, and sending the feature vector to the target large model.
And carrying out reasoning service by the target large model according to the feature vector to obtain a reasoning result.
And returning the reasoning result to the user side.
During the execution of a reasoning service, the reasoning service request input by the user side is conventionally transmitted directly to the server side, and the server side then performs the entire computation with the deployed target large model to complete the reasoning service. For example, for a face recognition service, the user side sends the collected face image together with the face recognition request to the server side, and the server side analyzes the face image and generates a face recognition result. However, this approach does not consider the protection of user data privacy and easily leaks the user side's information to the server side.
This embodiment proposes using server non-perception computing technology: the reasoning service request input by the user side is acquired through the server non-perception interface, where the request includes the user data required to execute the reasoning service. The data input by the user side is kept on the terminal side of the large model reasoning service; the server non-perception interface abstracts the reasoning service request into a feature vector that cannot be reverse-decoded, and then sends the feature vector to the target large model deployed on the server side. Finally, the target large model performs the reasoning service according to the input feature vector and returns the obtained reasoning result to the user side. For the face recognition service, the user side sends the collected face image and the face recognition request to the server non-perception interface; the interface abstracts the face image into a feature vector that cannot be reverse-decoded, the feature vector is input into the target large model on the server side for face recognition analysis, and a face recognition result is generated. In this way, the server side is prevented from acquiring the data input by the user side, user privacy is protected, and the information security of the user is improved.
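The one-way abstraction step can be sketched with a hash-based featurizer: the server receives only a vector derived from a cryptographic digest, from which the raw request cannot be recovered. This featurizer is a hypothetical stand-in for the embodiment's real abstraction (which would preserve task-relevant features), illustrating only the irreversibility property.

```python
import hashlib

def abstract_to_feature_vector(raw_request: bytes, dim: int = 8):
    """One-way featurization sketch: the vector is derived from a SHA-256
    digest, so the raw request cannot be reverse-decoded from it."""
    digest = hashlib.sha256(raw_request).digest()
    return [digest[i] / 255.0 for i in range(dim)]  # normalized to [0, 1]

# Client side computes the vector; only `vec` crosses to the server side.
vec = abstract_to_feature_vector(b"face-image-bytes")
```

The design choice illustrated: the abstraction runs on the terminal side, so the raw image never leaves the user's device; the server-side model consumes only the vector.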
In an alternative embodiment, the executing a reasoning service by using the target large model includes:
when the target large model is started for the first time, identifying and starting target code blocks by using a probe to complete the startup of the target large model; the target code blocks include at least: a hardware detection code block, a computation graph construction code block, and a CUDA initialization code block.
And reserving the starting state of the target code block.
when the target large model is started for the K-th time, directly acquiring the preserved startup state of the target code blocks to complete the startup of the target large model, where K represents any integer greater than 1.
And after the target large model is started, executing an inference service.
Because the embodiment of the application jointly adopts the Bayesian optimization algorithm and the deep reinforcement learning model, it requires many iterations and repeatedly runs time-consuming reasoning service experiments, which lowers the deployment efficiency of the target large model. By dynamically tracing the basic code blocks inside the target large model, it was found that the root cause of the long trial runs is that every start of the target large model executes heavy startup code blocks, including hardware detection code blocks, computation graph construction code blocks, and Compute Unified Device Architecture (CUDA) initialization code blocks. These blocks consume a large amount of startup time, are independent of the input of the reasoning service, and yet account for more than 98% of the time spent executing the reasoning service.
To improve reasoning service efficiency, this embodiment improves the original startup mechanism of the target large model. Specifically, when the target large model is started for the first time, the probe identifies the target code blocks among the code blocks and starts them in advance all at once, completing the first startup of the target large model, and the startup state of the target code blocks is preserved. Therefore, at the next startup of the target large model, the preserved startup state of the target code blocks can be used directly to complete the startup: the heavy code blocks do not need to be restarted, the time of repeatedly starting them is saved, and the startup state can be reused when the next reasoning service is executed, thereby reducing the search cost.
For example, during the multiple iterations, step S105 is executed each time: after the target large model is deployed according to the target cloud virtual environment scheme and the optimal model parallelization scheme, it needs to be started so that the reasoning service can be executed and the reasoning performance result of the target large model obtained. When step S105 is executed for the first time, the probe identifies and starts the target code blocks to complete the startup of the target large model; when step S105 is executed for the K-th time, the startup state of the target code blocks preserved at the first execution of step S105 is acquired directly to complete the startup, thereby reducing the time the target large model spends executing the reasoning service during the multiple iterations.
In an alternative embodiment, when executing step S104, the deep reinforcement learning model needs multiple rounds of iterative training. In each round, step S1042 is executed: the target large model is deployed according to the target cloud virtual environment scheme and the model parallelization scheme and then started, so that the reasoning service is executed to obtain the reasoning performance result. Executing step S104 therefore requires starting the target large model many times. Specifically, when step S1042 is executed for the first time, the probe identifies and starts the target code blocks to complete the startup of the target large model; when step S1042 is executed for the K-th time, the startup state of the target code blocks preserved at the first execution of step S1042 is acquired directly and reused to complete the startup, thereby reducing the time spent executing the reasoning service during the iterative training of the deep reinforcement learning model and improving reasoning efficiency.
The second aspect of the embodiments of the present application further provides a large model deployment device based on server non-aware computing, referring to fig. 3, fig. 3 shows a schematic structural diagram of the large model deployment device, as shown in fig. 3, where the device includes:
The deployment request acquisition module is used for acquiring a deployment request of the target large model; the target large model represents a basic model whose computation graph contains more than 100 million parameters;
the cloud virtual environment scheme sampling module is used for sampling from a plurality of cloud virtual environment schemes by using a Bayesian optimization algorithm according to the deployment request to obtain a target cloud virtual environment scheme;
the computation graph optimization module is used for optimizing the computation graph of the target large model to obtain an optimized computation graph;
the model parallelization scheme generating module is used for generating an optimal model parallelization scheme from the optimized computation graph by using a deep reinforcement learning model; the optimal model parallelization scheme represents a parallelization scheme that places the multiple computing nodes of the target large model on multiple computing devices;
the reasoning module is used for deploying the target large model according to the target cloud virtual environment scheme and the optimal model parallelization scheme, and executing reasoning service by using the target large model to obtain a reasoning performance result of the target large model;
the parameter adjustment module is used for adjusting the parameters of the Bayesian optimization algorithm according to the reasoning performance result;
The iteration module is used for carrying out multiple iterations according to the steps until reaching the iteration stop condition;
the deployment module is used for deploying the target large model according to the generated target cloud virtual environment scheme and the optimal model parallelization scheme in the last iteration process, and executing reasoning service by using the target large model by adopting a server non-perception calculation method.
In an alternative embodiment, the cloud virtual environment scheme sampling module includes:
the reasoning cost prediction sub-module is used for predicting the reasoning cost of each cloud virtual environment scheme by using the Bayesian optimization algorithm; each cloud virtual environment scheme at least includes, for the corresponding environment: the number of CPU cores, the CPU clock speed, the number of CUDA cores, the GPU clock speed, the number of GPUs, and the memory size of the cloud virtual environment; the reasoning cost represents the product of the time required for reasoning and the price of the cloud virtual environment scheme;
and the determining sub-module is used for determining, from the plurality of cloud virtual environment schemes, the cloud virtual environment scheme with the lowest reasoning cost as the target cloud virtual environment scheme.
In an optional implementation manner, the bayesian optimization algorithm includes a constraint condition, where the constraint condition indicates that the bayesian optimization algorithm needs to determine the target cloud virtual environment scheme within a preset search time.
In an alternative embodiment, the model parallelization scheme generating module includes:
the device sequence output sub-module is used for encoding the optimized computation graph into a deep learning operator sequence and inputting it into the deep reinforcement learning model; the deep reinforcement learning model outputs a device sequence, and the devices in the device sequence correspond one-to-one to the deep learning operators in the deep learning operator sequence; the deep learning operator sequence and the device sequence form a model parallelization scheme;
the reasoning sub-module is used for deploying the target large model according to the target cloud virtual environment scheme and the model parallelization scheme, and executing reasoning service by using the target large model to obtain a reasoning performance result of the target large model;
the parameter adjustment sub-module is used for adjusting model parameters of the deep reinforcement learning model according to the reasoning performance result;
The iteration sub-module is used for repeating the steps for a plurality of times to finish the iterative training of the deep reinforcement learning model and obtain a trained deep reinforcement learning model;
and the optimal model parallelization scheme generation sub-module is used for inputting the deep learning operator sequence into the trained deep reinforcement learning model to obtain a target equipment sequence, and the target equipment sequence and the deep learning operator sequence form the optimal model parallelization scheme.
In an alternative embodiment, the deployment module includes:
the acquisition sub-module is used for acquiring an inference service request input by a user terminal through a server non-perception interface;
the feature vector generation sub-module is used for abstracting the reasoning service request into a feature vector and sending the feature vector to the target large model;
the reasoning service sub-module is used for carrying out reasoning service according to the feature vector by the target big model to obtain a reasoning result;
and the returning sub-module is used for returning the reasoning result to the user side.
In an alternative embodiment, the reasoning module includes:
the first startup sub-module is used for, when the target large model is started for the first time, identifying and starting target code blocks by using a probe to complete the startup of the target large model; the target code blocks include at least: a hardware detection code block, a computation graph construction code block, and a CUDA initialization code block;
the preservation sub-module is used for preserving the startup state of the target code blocks;
the second startup sub-module is used for, when the target large model is started for the K-th time, directly acquiring the preserved startup state of the target code blocks to complete the startup of the target large model, where K represents any integer greater than 1;
and the service sub-module is used for executing the reasoning service after the target large model is started.
In an alternative embodiment, the computation graph optimization module includes:
and a sub-graph replacement sub-module, configured to replace one or more source sub-graphs in the computation graph of the target large model with a target sub-graph, where the target sub-graph is functionally equivalent to the source sub-graph, and the computation performance of the target sub-graph is higher than that of the source sub-graph.
Illustratively, the large model deployment system based on server non-perception computing described in this embodiment is implemented on TensorFlow, and a prototype system was built on server instances rented from Microsoft Azure. To fully evaluate the effectiveness and adaptability of the system, a number of experiments were performed using commonly used lightweight DNN models and large models, including natural language processing models (the recurrent neural network language model RNNLM and the bidirectional-encoder language representation model BERT) and online image classification models (the third-generation Inception deep convolutional neural network Inception-V3 and the 19-layer deep convolutional neural network VGG19). The experimental results show that, compared with the related art, the large model deployment system of this embodiment improves the reasoning speed of the lightweight DNN models by 60% and reduces the optimization cost by 98%; it improves the reasoning speed of the large models by 21% and reduces the optimization cost by 84%. In addition, compared with heuristic baselines such as greedy search, it reduces the reasoning cost by 51% and the search time by 57% for the lightweight DNN models, and saves 74% of the reasoning cost and 38% of the search time for the basic models.
The embodiment of the present application further provides an electronic device; referring to fig. 4, fig. 4 is a schematic diagram of the electronic device according to the embodiment of the present application. As shown in fig. 4, the electronic device 100 includes a memory 110 and a processor 120 communicatively connected through a bus; the memory 110 stores a computer program that can run on the processor 120 to implement the steps of the serverless-computing-based large model deployment method disclosed in the embodiments of the present application.
Embodiments of the present application further provide a computer-readable storage medium having stored thereon a computer program/instructions which, when executed by a processor, implement the steps of the serverless-computing-based large model deployment method disclosed in the embodiments of the present application.
Embodiments of the present application further provide a computer program product which, when run on an electronic device, causes a processor to perform the steps of the serverless-computing-based large model deployment method disclosed in the embodiments of the present application.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts among the embodiments may be cross-referenced.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, electronic devices, and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications to those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the appended claims be interpreted as covering the preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the present application.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in the process, method, article, or terminal device comprising the element.
The foregoing describes in detail a large model deployment method, apparatus, and product based on serverless computing. Specific examples have been used herein to illustrate the principles and embodiments of the present application, and the description of the foregoing examples is intended only to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may make modifications to the specific embodiments and application scope in accordance with the ideas of the present application; in view of the above, the contents of this description should not be construed as limiting the present application.

Claims (10)

1. A serverless-computing-based large model deployment method, the method comprising:
acquiring a deployment request for a target large model, wherein the target large model represents a foundation model whose computation graph contains more than 100 million parameters;
sampling from a plurality of cloud virtual environment schemes by using a Bayesian optimization algorithm according to the deployment request, to obtain a target cloud virtual environment scheme;
optimizing the computation graph of the target large model to obtain an optimized computation graph;
generating an optimal model parallelization scheme from the optimized computation graph by using a deep reinforcement learning model, wherein the optimal model parallelization scheme represents a parallelization scheme of a plurality of computing nodes of the target large model across a plurality of computing devices;
deploying the target large model according to the target cloud virtual environment scheme and the optimal model parallelization scheme, and executing an inference service with the target large model to obtain an inference performance result of the target large model;
adjusting parameters of the Bayesian optimization algorithm according to the inference performance result;
performing multiple iterations of the above steps until an iteration stop condition is reached; and
deploying the target large model according to the target cloud virtual environment scheme and the optimal model parallelization scheme generated in the last iteration, and executing the inference service with the target large model by using a serverless computing method.
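The iterative loop of claim 1 can be sketched in Python as follows. The environment schemes, the sampler, the planner, and the profiler are all illustrative stand-ins for the patent's Bayesian-optimization sampler, deep-reinforcement-learning planner, and real deployment runs; none of the names or numbers come from the patent.

```python
# Hedged sketch of the iterative deployment loop of claim 1, with toy
# stand-ins for the Bayesian-optimization sampler and the DRL planner.

ENV_SCHEMES = [
    {"name": "vm-small", "gpus": 1, "price": 1.0},   # hypothetical schemes
    {"name": "vm-large", "gpus": 4, "price": 3.2},
]

def sample_environment(history):
    # Stand-in for the Bayesian-optimization sampler: profile each scheme
    # once, then keep exploiting the one with the lowest observed cost.
    tried = {env["name"] for env, _ in history}
    untried = [e for e in ENV_SCHEMES if e["name"] not in tried]
    if untried:
        return untried[0]
    return min(history, key=lambda h: h[1])[0]

def plan_parallelization(env):
    # Stand-in for the DRL planner: round-robin eight graph nodes over GPUs.
    return [node % env["gpus"] for node in range(8)]

def profile(env, plan):
    # Stand-in for deploying and running the inference service;
    # inference cost = time required * price of the scheme.
    latency = 8.0 / env["gpus"]
    return latency * env["price"]

history = []
for _ in range(5):                        # iterate until the stop condition
    env = sample_environment(history)
    history.append((env, profile(env, plan_parallelization(env))))

best_env, best_cost = min(history, key=lambda h: h[1])
print(best_env["name"], best_cost)        # scheme used for final deployment
```

In this toy setting the larger instance wins because its latency reduction outweighs its higher price, mirroring the cost trade-off the real sampler explores.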
2. The serverless-computing-based large model deployment method of claim 1, wherein the sampling from a plurality of cloud virtual environment schemes by using a Bayesian optimization algorithm to obtain a target cloud virtual environment scheme comprises:
predicting an inference cost of each cloud virtual environment scheme by using the Bayesian optimization algorithm, wherein each cloud virtual environment scheme comprises at least the following information about its environment: CPU core count, CPU clock speed, CUDA core count, GPU clock speed, GPU count, and cloud virtual environment memory; the inference cost represents the product of the time required for inference and the price of the cloud virtual environment scheme; and
determining, from the cloud virtual environment schemes, the cloud virtual environment scheme with the lowest inference cost as the target cloud virtual environment scheme.
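The cost rule of claim 2 — inference cost equals predicted inference time multiplied by the scheme's price — can be sketched as below. The feature names and the crude time predictor are illustrative only; the patent's actual surrogate is the Bayesian optimization model.

```python
# Hedged sketch of the claim-2 selection rule. predicted_time is a toy
# stand-in for the Bayesian surrogate; scheme fields mirror the resource
# information listed in the claim but their values are invented.

def predicted_time(scheme):
    # Crude stand-in: more GPUs, CUDA cores, and clock speed mean a
    # shorter predicted inference time.
    compute = scheme["gpu_count"] * scheme["cuda_cores"] * scheme["gpu_clock_ghz"]
    return 1e6 / compute

def inference_cost(scheme):
    # Inference cost = time required for inference * price of the scheme.
    return predicted_time(scheme) * scheme["price_per_hour"]

schemes = [
    {"name": "gpu-1x", "gpu_count": 1, "cuda_cores": 2560,
     "gpu_clock_ghz": 1.6, "price_per_hour": 0.9},
    {"name": "gpu-4x", "gpu_count": 4, "cuda_cores": 2560,
     "gpu_clock_ghz": 1.6, "price_per_hour": 3.0},
]

# The scheme with the lowest predicted inference cost becomes the target.
target = min(schemes, key=inference_cost)
print(target["name"])
```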
3. The serverless-computing-based large model deployment method of claim 2, wherein the Bayesian optimization algorithm includes a constraint condition indicating that the Bayesian optimization algorithm must determine the target cloud virtual environment scheme within a preset search time.
4. The serverless-computing-based large model deployment method of claim 1, wherein the generating an optimal model parallelization scheme from the optimized computation graph by using a deep reinforcement learning model comprises:
encoding the optimized computation graph into a deep learning operator sequence and inputting it into the deep reinforcement learning model, the deep reinforcement learning model outputting a device sequence, wherein the devices in the device sequence are in one-to-one correspondence with the deep learning operators in the deep learning operator sequence, and the deep learning operator sequence and the device sequence form a model parallelization scheme;
deploying the target large model according to the target cloud virtual environment scheme and the model parallelization scheme, and executing an inference service with the target large model to obtain an inference performance result of the target large model;
adjusting model parameters of the deep reinforcement learning model according to the inference performance result;
repeating the above steps to complete iterative training of the deep reinforcement learning model, obtaining a trained deep reinforcement learning model; and
inputting the deep learning operator sequence into the trained deep reinforcement learning model to obtain a target device sequence, wherein the target device sequence and the deep learning operator sequence form the optimal model parallelization scheme.
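The one-to-one correspondence between operator sequence and device sequence in claim 4 can be illustrated as follows. The trivial round-robin placement stands in for the trained deep reinforcement learning model; the operator names and device labels are invented.

```python
# Illustrative sketch of the claim-4 parallelization scheme: the optimized
# computation graph is encoded as an operator sequence, and a placement
# policy (here a round-robin stand-in for the trained DRL model) emits one
# device per operator, forming the (operator, device) scheme.

operator_sequence = ["embed", "matmul", "softmax", "matmul", "layernorm"]
NUM_DEVICES = 2

def place(operators, num_devices):
    # One device per operator, in one-to-one correspondence.
    return [f"gpu:{i % num_devices}" for i in range(len(operators))]

device_sequence = place(operator_sequence, NUM_DEVICES)
scheme = list(zip(operator_sequence, device_sequence))
print(scheme)
```

A trained policy would condition each placement decision on the operator embedding and the devices chosen so far, but the output format — one device per operator — is the same.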
5. The serverless-computing-based large model deployment method of claim 1, wherein the executing an inference service with the target large model by using the serverless computing method comprises:
acquiring an inference service request input by a user side through a serverless interface;
abstracting the inference service request into a feature vector and transmitting the feature vector to the target large model;
performing, by the target large model, the inference service according to the feature vector to obtain an inference result; and
returning the inference result to the user side.
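The request path of claim 5 can be sketched as a serverless-style handler. The featurizer and the model below are placeholders, and the handler signature is a generic assumption rather than any specific platform's API.

```python
import json

# Hedged sketch of the claim-5 inference path: a handler receives a request
# through a serverless interface, abstracts it into a feature vector, runs
# the (placeholder) model, and returns the result to the user side.

def to_feature_vector(request):
    # Toy featurizer: character count and word count stand in for the
    # real abstraction of the inference request.
    text = request["input"]
    return [len(text), len(text.split())]

def model_infer(features):
    # Placeholder for the deployed target large model.
    return {"label": "positive" if features[1] > 3 else "negative"}

def handler(event):
    request = json.loads(event)        # request from the serverless interface
    features = to_feature_vector(request)
    result = model_infer(features)
    return json.dumps(result)          # inference result returned to the user

print(handler('{"input": "a fine day for serverless inference"}'))
```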
6. The serverless-computing-based large model deployment method of claim 1, wherein executing an inference service with the target large model comprises:
when the target large model is started for the first time, identifying and starting target code blocks with a probe to complete startup of the target large model, the target code blocks including at least hardware detection code blocks, computation graph construction code blocks, and CUDA initialization code blocks;
saving the starting state of the target code blocks;
when the target large model is started for the K-th time, directly acquiring the saved starting state of the target code blocks to complete startup of the target large model, where K represents any integer greater than 1; and
executing the inference service after the target large model is started.
7. The serverless-computing-based large model deployment method of claim 1, wherein the optimizing the computation graph of the target large model to obtain an optimized computation graph comprises:
replacing one or more source sub-graphs in the computation graph of the target large model with target sub-graphs by using a tensor algebra superoptimizer, the target sub-graphs being functionally equivalent to the source sub-graphs and having higher computational performance than the source sub-graphs.
8. A serverless-computing-based large model deployment apparatus, the apparatus comprising:
a deployment request acquisition module, configured to acquire a deployment request for a target large model, wherein the target large model represents a foundation model whose computation graph contains more than 100 million parameters;
a cloud virtual environment scheme sampling module, configured to sample from a plurality of cloud virtual environment schemes by using a Bayesian optimization algorithm according to the deployment request, to obtain a target cloud virtual environment scheme;
a computation graph optimization module, configured to optimize the computation graph of the target large model to obtain an optimized computation graph;
a model parallelization scheme generation module, configured to generate an optimal model parallelization scheme from the optimized computation graph by using a deep reinforcement learning model, wherein the optimal model parallelization scheme represents a parallelization scheme of a plurality of computing nodes of the target large model across a plurality of computing devices;
an inference module, configured to deploy the target large model according to the target cloud virtual environment scheme and the optimal model parallelization scheme, and execute an inference service with the target large model to obtain an inference performance result of the target large model;
a parameter adjustment module, configured to adjust parameters of the Bayesian optimization algorithm according to the inference performance result;
an iteration module, configured to perform multiple iterations of the above steps until an iteration stop condition is reached; and
a deployment module, configured to deploy the target large model according to the target cloud virtual environment scheme and the optimal model parallelization scheme generated in the last iteration, and execute the inference service with the target large model by using a serverless computing method.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the serverless-computing-based large model deployment method of any of claims 1-7.
10. A computer-readable storage medium having stored thereon a computer program/instructions which, when executed by a processor, implement the steps of the serverless-computing-based large model deployment method of any of claims 1-7.
CN202311249504.2A 2023-09-26 2023-09-26 Large model deployment method, device and product based on server non-perception calculation Pending CN117271057A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311249504.2A CN117271057A (en) 2023-09-26 2023-09-26 Large model deployment method, device and product based on server non-perception calculation


Publications (1)

Publication Number Publication Date
CN117271057A true CN117271057A (en) 2023-12-22

Family

ID=89213921


Country Status (1)

Country Link
CN (1) CN117271057A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117992078A (en) * 2024-04-03 2024-05-07 山东浪潮科学研究院有限公司 Automatic deployment method for reasoning acceleration service based on TensorRT-LLM model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination