CN112860402A - Dynamic batch processing task scheduling method and system for deep learning inference service - Google Patents

Dynamic batch processing task scheduling method and system for deep learning inference service

Info

Publication number
CN112860402A
CN112860402A
Authority
CN
China
Prior art keywords
batch
size
upper limit
deep learning
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110192645.XA
Other languages
Chinese (zh)
Other versions
CN112860402B (en)
Inventor
张德宇
罗云臻
张尧学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202110192645.XA priority Critical patent/CN112860402B/en
Publication of CN112860402A publication Critical patent/CN112860402A/en
Application granted granted Critical
Publication of CN112860402B publication Critical patent/CN112860402B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a dynamic batch processing task scheduling method and system for deep learning inference service. In the method, the number of queue-waiting tasks at each batch departure time and the departing batch size are described by a two-dimensional Markov process, the steady-state probability of the two-dimensional Markov process is determined, and the average service delay in the deep learning inference service system is determined according to the steady-state probability; an optimization model over the upper limit of the batch size is then constructed to jointly optimize the average service delay and the memory usage, and the optimization model is solved to determine the upper limit of the batch size of the batch processing tasks. The invention has the advantages of suiting dynamic environments and achieving better average service delay and memory occupation.

Description

Dynamic batch processing task scheduling method and system for deep learning inference service
Technical Field
The invention relates to the technical field of edge computing and cloud computing, in particular to a dynamic batch processing task scheduling method and system for deep learning inference service.
Background
Due to the excellent performance of deep learning in fields such as image processing and natural language processing, and the increasing popularity of mobile devices running systems such as Android and iOS, mobile devices can now provide a great number of intelligent applications to end users. For example, there are over 16,500 mobile applications on Google Play that use deep learning as a core component to provide intelligent services ranging from computer vision to text and audio processing. Specific examples include Seeing AI, a mobile phone app developed by Microsoft that assists visually impaired people in recognizing their surroundings through the camera, and Adobe Scan, which converts images to text using deep-learning-based text recognition.
A common way for mobile devices to provide intelligent services using deep learning is to run inference on pre-trained models. However, deep learning model inference has high requirements in terms of energy, memory, and computation cycles. Although some mobile neural network accelerators, such as NPUs and TPUs, have been released to speed up on-device deep learning inference, their computing power is still very limited, and it is difficult to guarantee a high quality of service.
To provide efficient mobile intelligence services, a more effective solution is to offload model inference onto powerful edge or cloud servers. As the application range of deep learning models keeps expanding, the demand for deep learning inference has grown rapidly in recent years, as can be observed from information released by leading high-tech companies. Specifically, DLIS, a dedicated Deep Learning Inference Service platform deployed by Microsoft, receives hundreds of thousands of deep learning inference requests every second, and the deep learning inference demand of Facebook's data centers has tripled within two years.
In mobile applications such as AR and VR, the critical issue is the strict low latency requirement, typically in the millisecond range. With the significant increase in the amount of deep learning inference task requests, this strict low latency requirement becomes a challenge even for powerful GPU servers.
Due to the highly parallel computing architecture of the GPU, batching inputs together can significantly improve computational efficiency. By analyzing the throughput of representative deep learning models on two GPU servers under different batch sizes, the throughput of different batch sizes for different deep learning models shown in FIG. 1 is obtained, which shows that batching inputs can greatly improve throughput. Meanwhile, by studying the relationship between batched inputs and video memory occupation, the relationship between batch size and video memory occupation for different deep learning models shown in FIG. 2 is obtained; in the worst case the video memory occupation reaches 2558 MB.
However, existing research on improving the throughput of deep learning inference services and reducing their delay through batch processing is discussed in a static environment, meaning that the tasks of the inference service are assumed to be waiting statically at the server. In an actual network service, however, tasks arrive at random, so how to reasonably use batch processing to optimize the deep learning inference service under random arrivals has not been studied in depth in the prior art and is of practical significance.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the technical problems in the prior art, the invention provides a dynamic batch processing task scheduling method and system for deep learning inference service that suits dynamic environments and achieves better average service delay and memory occupation.
In order to solve the technical problems, the technical scheme provided by the invention is as follows: a dynamic batch processing task scheduling method of deep learning inference service, in which the number of queue-waiting tasks at each batch departure time and the departing batch size are described by a two-dimensional Markov process, the steady-state probability of the two-dimensional Markov process is determined, and the average service delay in the deep learning inference service system is determined according to the steady-state probability;
optimizing the average service delay and the memory usage over the upper limit of the batch size of the batch processing tasks by the optimization model shown in formula (1),
min_b  E(W(b)) + γ·m_b,   subject to  1 ≤ b ≤ B ≤ N,  λ < B·μ_B        (1)
In formula (1), E(W(b)) is the average service delay corresponding to the batch size upper limit b, b is the upper limit of the batch size of the batch processing tasks, W(b) is the service delay, γ is the weight of the memory usage relative to the average service delay, m_b is the memory usage corresponding to the batch size upper limit b, B is the maximum value of the upper limit of the batch size, N is the maximum number of tasks waiting in the batch processing task queue, λ is the task arrival rate, and μ_B is the service rate when the batch size is B. Solving the optimization model of formula (1) determines the upper limit of the batch size of the batch processing tasks.
Further, the average service delay is calculated by equation (2),
E(W(b)) = E(L) / (λ·(1 − P_block))        (2)
In formula (2), E(W(b)) is the average service delay corresponding to the batch size upper limit b, L is the average task number, λ is the task arrival rate, and P_block is the blocking probability of a task.
Further, the average task number is determined by equation (3),
[Formula (3) appears as an image in the original document.]
the blocking probability is determined by equation (4),
[Formula (4) appears as an image in the original document.]
In formulas (3) and (4), E(L) is the average task number, n is the number of waiting tasks in the batch processing task queue, r is the batch size, a is the lower limit of the batch size of the batch processing tasks, b is the upper limit of the batch size of the batch processing tasks, π_{n,r} is the steady-state probability that the number of waiting tasks is n and the batch size is r, π_{n,0} is the steady-state probability that the number of waiting tasks is n and the batch size is 0, and π_{N,r} is the steady-state probability that the number of waiting tasks is N and the batch size is r.
Further, the solving process of the optimization model comprises the following steps:
initializing the upper limit of the batch size of the batch processing task and the step length for adjusting the upper limit of the batch size in each iteration; taking the sum of the average service delay and the memory usage corresponding to the upper limit of the batch size as a convergence parameter; and in each iteration, adjusting the upper limit of the batch size according to the step length, and when the convergence parameter obtained in the current iteration is larger than the convergence parameter of the previous iteration, taking the upper limit of the batch size obtained in the current iteration as the optimal solution output by the optimization model.
Further, in the first iteration, the method further includes a process of correcting the adjustment direction of the step size: and when the difference between the average service delay obtained in the first iteration and the average service delay corresponding to the initialized upper limit of the batch size is larger than a preset threshold value, changing the adjustment direction for adjusting the upper limit of the batch size.
A dynamic batch processing task scheduling system of deep learning inference service carries out task scheduling according to the dynamic batch processing task scheduling method of the deep learning inference service.
Compared with the prior art, the invention has the following advantages: compared with the traditional single-task processing method, the processing speed is greatly improved; compared with the batch processing method with the optimal fixed batch size, the speed is greatly improved and the video memory occupation is noticeably reduced; compared with a greedy dynamic batch processing method, the video memory occupation is greatly reduced while the service delay remains essentially the same.
Drawings
Fig. 1 shows the throughput of different deep learning models in the prior art.
Fig. 2 is a graph of the memory occupancy of different deep learning models in the prior art.
FIG. 3 is a flow chart illustrating an embodiment of the present invention.
Fig. 4 is a schematic diagram illustrating the relationship between the batch size and the inference throughput rate (a) and GPU utilization rate (b) of the deep learning models GoogLeNet and DenseNet-169 on an NVIDIA RTX 2080 GPU in an embodiment of the present invention.
Fig. 5 is a schematic diagram illustrating the relationship between the batch size and the inference throughput rate (a) and GPU utilization rate (b) of the deep learning models GoogLeNet and DenseNet-169 on an NVIDIA Titan Xp GPU in an embodiment of the present invention.
Fig. 6 is a schematic diagram of the GPU video memory occupancy of GoogLeNet and DenseNet-169 inference on the NVIDIA RTX 2080 GPU (a) and the NVIDIA Titan Xp GPU (b).
Fig. 7 is a schematic diagram of the service delay and blocking probability of the deep learning model GoogLeNet inference service on the NVIDIA RTX 2080 GPU under dynamic batch processing (a = 1) and static batch processing (a = b), with a task arrival rate of 990 tasks/second during the service process.
Fig. 8 is a schematic diagram illustrating the relationship between the dynamic batch lower limit and the service delay of the deep learning model GoogLeNet inference service on the NVIDIA RTX 2080 GPU under a fixed dynamic batch upper limit in an embodiment of the present invention.
Fig. 9 is a schematic diagram comparing the video memory occupation of the deep learning model GoogLeNet inference service on the NVIDIA RTX 2080 GPU under dynamic batch processing and static batch processing in an embodiment of the present invention.
Fig. 10 is a diagram illustrating the queuing model corresponding to the deep learning inference service system model in an embodiment of the present invention.
Fig. 11 is a diagram comparing the deep learning models GoogLeNet and DenseNet-169 inference services on the NVIDIA RTX 2080 GPU in the real case and the model-prediction case, where the task arrival rate is 990 tasks/second for GoogLeNet and 330 tasks/second for DenseNet-169, in an embodiment of the present invention.
Fig. 12 is a schematic diagram illustrating how the GoogLeNet inference service is affected by the batch size upper limit b and the arrival rate λ in an embodiment of the present invention.
Fig. 13 is a schematic diagram comparing the service delay of the method of the present invention with different static batch processing under varying task arrival rates for the deep learning model GoogLeNet inference service on the NVIDIA RTX 2080 GPU in an embodiment of the present invention.
Fig. 14 is a schematic diagram comparing the video memory occupation of the method of the present invention with different static batch processing and greedy dynamic batch processing under varying task arrival rates for the deep learning model GoogLeNet inference service on the NVIDIA RTX 2080 GPU in an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.
According to the dynamic batch processing task scheduling method of the deep learning inference service of this embodiment, the number of queue-waiting tasks at each batch departure time and the departing batch size are described by a two-dimensional Markov process, the steady-state probability of the two-dimensional Markov process is determined, and the average service delay in the deep learning inference service system is determined according to the steady-state probability;
optimizing the average service delay and the memory usage over the upper limit of the batch size of the batch processing tasks by the optimization model shown in formula (1),
min_b  E(W(b)) + γ·m_b,   subject to  1 ≤ b ≤ B ≤ N,  λ < B·μ_B        (1)
In formula (1), E(W(b)) is the average service delay corresponding to the batch size upper limit b, b is the upper limit of the batch size of the batch processing tasks, W(b) is the service delay, γ is the weight of the memory usage relative to the average service delay, m_b is the memory usage corresponding to the batch size upper limit b, B is the maximum value of the upper limit of the batch size, N is the maximum number of tasks waiting in the batch processing task queue, λ is the task arrival rate, and μ_B is the service rate when the batch size is B. Solving the optimization model of formula (1) determines the upper limit of the batch size of the batch processing tasks.
In the present embodiment, the average service delay is determined by the calculation of equation (2),
E(W(b)) = E(L) / (λ·(1 − P_block))        (2)
In formula (2), E(W(b)) is the average service delay corresponding to the batch size upper limit b, L is the average task number, λ is the task arrival rate, and P_block is the blocking probability of a task; the remaining parameters are defined as above.
In this embodiment, the average task number is determined by equation (3),
[Formula (3) appears as an image in the original document.]
the blocking probability is determined by equation (4),
[Formula (4) appears as an image in the original document.]
In formulas (3) and (4), E(L) is the average task number, n is the number of waiting tasks in the batch processing task queue, r is the batch size, a is the lower limit of the batch size, b is the upper limit of the batch size, π_{n,r} is the steady-state probability that the number of waiting tasks is n and the batch size is r, π_{n,0} is the steady-state probability that the number of waiting tasks is n and the batch size is 0, and π_{N,r} is the steady-state probability that the number of waiting tasks is N and the batch size is r; the remaining parameters are defined as above.
In this embodiment, the solving process of the optimization model includes: initializing the upper limit of the batch size of the batch processing task and the step length for adjusting the upper limit of the batch size in each iteration; taking the sum of the average service delay and the memory usage corresponding to the upper limit of the batch size as a convergence parameter; and in each iteration, adjusting the upper limit of the batch size according to the step length, and when the convergence parameter obtained in the current iteration is larger than the convergence parameter of the previous iteration, taking the upper limit of the batch size obtained in the current iteration as the optimal solution output by the optimization model.
In this embodiment, during the first iteration, the method further includes a process of correcting the adjustment direction of the step size: and when the difference between the average service delay obtained in the first iteration and the average service delay corresponding to the initialized upper limit of the batch size is larger than a preset threshold value, changing the adjustment direction for adjusting the upper limit of the batch size.
A dynamic batch processing task scheduling system of deep learning inference service carries out task scheduling according to the dynamic batch processing task scheduling method of the deep learning inference service.
The method of the present embodiment is verified and analyzed through a specific simulation experiment. In the experiment, by analyzing the pattern of actual network deep learning inference services, the batch-based deep learning inference service system is described as follows.
The GPU server receives deep learning model inference tasks from a large number of mobile devices or other terminals; the task arrival process follows a Poisson distribution with rate λ, the inference process follows a general distribution, and the inference delay depends on the current batch size. When the GPU server runs only one deep learning inference model, the inference process follows a deterministic distribution since there are no other tasks competing with it. In the more realistic scenario where multiple model inference services run on a single server, the inference delay is modeled by a general distribution to describe the varying delays caused by competition for computing resources among the services.
According to this pattern of the deep learning inference service system, in the experiment of this embodiment the batch-based deep learning inference service system is modeled as an M/G(a,b)/1/N queuing model following D. G. Kendall's notation, where M indicates that the task arrival process obeys a Poisson distribution, G indicates that the inference process obeys a general distribution, a is the lower limit of the inference batch size, b is the upper limit of the inference batch size, 1 is the number of servers, and N is the maximum number of tasks waiting in the batch processing task queue of the queuing system. After the queuing model is established, it is analyzed to obtain the average service delay. Specifically, a closed-form formula for the average service delay of the deep learning inference service system is constructed theoretically, and its result comprises both the queuing delay and the inference delay. Extensive experimental analysis shows that, as the number of images that must be loaded onto the video card grows with the batch and the amount of intermediate data generated during inference increases, the video memory occupancy grows linearly with the batch size, and this relationship can be described by a linear function. Finally, the closed-form formula and the linear function are combined into the objective function of the optimization problem, whose optimization variable is the batch size upper limit b. Since the batch size takes discrete values and the search space is small, the problem can be solved by traversing the search space.
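As an illustration of this traversal-based solution, the Python sketch below enumerates the candidate upper bounds b from 1 to B and keeps the one minimizing E(W(b)) + γ·m_b; the callables avg_service_delay and memory_usage are hypothetical placeholders standing in for the queuing-model evaluation and the fitted linear memory function, not code from the patent.

```python
# Minimal sketch: solve the optimization problem by traversing the discrete
# search space of the batch-size upper bound b (1 <= b <= B).
def solve_by_traversal(avg_service_delay, memory_usage, B, gamma):
    best_b, best_cost = None, float("inf")
    for b in range(1, B + 1):
        cost = avg_service_delay(b) + gamma * memory_usage(b)   # E(W(b)) + gamma * m_b
        if cost < best_cost:
            best_b, best_cost = b, cost
    return best_b, best_cost

# Toy stand-in models, for illustration only.
toy_delay = lambda b: 1.0 / b + 0.002 * b        # delay that first drops, then flattens
toy_memory = lambda b: 500 + 32 * b              # linear memory model m_b = k*b + m_0
print(solve_by_traversal(toy_delay, toy_memory, B=64, gamma=0.001))
```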
In currently popular deep learning frameworks such as TensorFlow, PyTorch, MXNet, and DyNet, when GPU acceleration is used, the images in a batch are organized into a matrix, which is passed to the computation of the deep learning model already loaded on the GPU. On one hand, batch processing lets all images share the convolution kernel weights in convolution operations, reducing the invocation of model parameters and thus the delay; on the other hand, batch processing exploits the parallelism of convolution operations and fully-connected-layer neuron operations, which further utilizes the parallel computing power of the GPU architecture. In the experiments, deep learning model inference is carried out on an RTX 2080 GPU with 8 GB of video memory and a Titan X Pascal (Titan Xp) GPU with 12 GB of memory, using the CUDA v10.0.130 interface and the cuDNN v7.6.2 acceleration library, with PyTorch as the deep learning framework.
In this embodiment, for comparison, the throughput and video memory occupancy of 5 typical deep learning models were tested at different batch sizes, with an input image size of 224 × 224 pixels. Considering that the number of tasks in the queue during service is an arbitrary integer rather than a power of 2, the two models DenseNet-169 and GoogLeNet were chosen, as shown in FIGS. 4 and 5; their FLOPs (floating point operations) are 3.2 × 10^9 and 1.41 × 10^9, respectively. As the batch size is varied from 1 to 64, the throughput first increases rapidly with the batch size and then fluctuates around a certain point, which shows that batch processing can accelerate model inference to a considerable extent; the other tested models exhibit the same characteristics. For GoogLeNet running on the RTX 2080 and the Titan Xp, batch processing improves throughput by up to 7.67 times and 25.27 times, respectively, and the throughput and the GPU utilization follow almost the same trend as the batch size changes. For statistical validity, the throughput and video memory occupancy results in the experiments shown in FIGS. 4 and 5 are averages over 100 runs.
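A throughput/memory measurement of this kind can be sketched with PyTorch and torchvision as below; the patent does not disclose its measurement scripts, so the warm-up count, number of timed runs, and use of random input tensors are assumptions.

```python
# Hedged sketch: measure inference throughput and peak GPU memory of GoogLeNet
# at several batch sizes on a CUDA GPU (details such as warm-up are assumed).
import time
import torch
from torchvision.models import googlenet

device = torch.device("cuda")
model = googlenet(weights=None).eval().to(device)   # randomly initialized GoogLeNet

for batch_size in [1, 2, 4, 8, 16, 32, 64]:
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        for _ in range(5):                 # warm-up runs
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(100):               # timed runs
            model(x)
        torch.cuda.synchronize()
    elapsed = time.time() - start
    throughput = batch_size * 100 / elapsed                     # images per second
    mem_mb = torch.cuda.max_memory_allocated(device) / 2**20    # peak memory in MB
    print(f"batch={batch_size:3d}  {throughput:8.1f} img/s  {mem_mb:7.1f} MB")
```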
In the experiment, a curve was fitted by the least squares method to analyze the relationship between the batch size and the inference delay. The experiments give the batch inference delay at batch size r as τ_r = v·r + τ_0, where v denotes the slope by which the inference delay grows with the batch size as I/O operations increase, τ_0 denotes the intercept of the inference delay, and v > 0, τ_0 > 0. From this, the service rate μ_r (in batches/s) at batch size r can be expressed as
μ_r = 1 / (v·r + τ_0).
At batch size r, the throughput rate (in images/s) is r × μ_r.
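The fitting step described above can be sketched as follows; the measured delays in the example are illustrative numbers, not the values behind FIGS. 4 and 5.

```python
# Sketch: least-squares fit of the batch inference delay tau_r = v*r + tau_0,
# then the service rate mu_r = 1/(v*r + tau_0) (batches/s) and throughput r*mu_r.
import numpy as np

batch_sizes = np.array([1, 2, 4, 8, 16, 32, 64])
delays_s = np.array([0.004, 0.005, 0.007, 0.011, 0.019, 0.035, 0.067])  # illustrative

v, tau0 = np.polyfit(batch_sizes, delays_s, deg=1)    # slope v > 0, intercept tau_0 > 0

def service_rate(r):
    return 1.0 / (v * r + tau0)                       # mu_r, in batches per second

for r in batch_sizes:
    print(f"r={r:3d}  mu_r={service_rate(r):7.2f} batch/s  "
          f"throughput={r * service_rate(r):8.1f} img/s")
```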
In a scenario where a server provides deep learning model inference services, a large number of clients may submit tasks to the server, and the server organizes the tasks into a queue. The maximum number of tasks waiting in the batch task queue is denoted by N in the experiment. According to service level agreements (SLA), response time is an important metric for cloud or network services; since an excessively large N may cause tasks at the tail of the queue to time out due to high delay, N should not be too large and is set to 128 in this experiment. The service delay W consists of two parts: the queuing delay and the inference delay. To mimic the random arrival of inference tasks in the experiments, the task arrival process is assumed to follow a Poisson distribution.
In a practical system, the tasks arrive randomly and cannot always be served immediately, so the deep learning service delay includes not only the inference delay but also the queuing delay. In the experiment, the service delay W and the blocking probability P_block of GoogLeNet on the RTX 2080 GPU are tested under different system states. The batch size processed by the server is determined as follows: if the number of waiting tasks is less than the batch size upper limit b, the server processes all of them as one batch; otherwise, the batch size processed by the server is the upper limit b. Considering that the batch size is limited by the GPU memory, a maximum value B is set for b in the experiment, i.e., b ≤ B and B ≤ N, with B set to 64. B·μ_B corresponds to the maximum throughput rate of a service process. The flow intensity is defined as
ρ = λ / (B·μ_B),
where μ_B represents the service rate when the batch size is B and λ the task arrival rate; ρ < 1 is required, since once λ is greater than or equal to the maximum throughput rate B·μ_B, further increasing the arrival rate λ only increases P_block.
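A minimal sketch of the batching rule and the stability condition just described is given below; the numeric parameter values are assumptions chosen only to mirror the ρ = 0.75, B = 64 setting used in the experiment.

```python
# Sketch: server-side batch-size decision (serve min(n, b) once at least a tasks
# wait) and the flow-intensity check rho = lambda / (B * mu_B) < 1.
def next_batch_size(n_waiting, a, b):
    if n_waiting < a:
        return None                 # keep waiting until at least a tasks are queued
    return min(n_waiting, b)        # serve all waiting tasks, capped at the upper limit b

def flow_intensity(arrival_rate, B, mu_B):
    return arrival_rate / (B * mu_B)    # rho; must stay below 1 for a stable system

mu_B = 990.0 / (0.75 * 64)              # assumed so that rho = 0.75 at lambda = 990 tasks/s
print(next_batch_size(40, a=1, b=32), flow_intensity(990.0, B=64, mu_B=mu_B))
```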
Comparing the cases a = 1 (a is the lower limit of the batch size; a = 1 corresponds to dynamic batch sizes) and a = b (i.e., a fixed batch size), and setting ρ to 0.75, the task arrival rate is λ = 990, i.e., an average of 990 tasks per second. As shown in FIG. 7, in the case a = 1 the service delay decreases as b is increased while b is still small, and only fluctuates slightly as b increases further. The service delay in the case a = b is larger than in the case a = 1, because a task that arrives first must wait until there are at least b tasks in the queue before batch processing can start. FIG. 8 shows that, with the batch upper limit b fixed, increasing the value of the lower limit a increases the service delay; the corresponding video memory occupation is shown in FIG. 9. It can thus be determined that dynamic batch sizes perform better than fixed batch sizes. In addition, when a = b = 1, i.e., no batching is performed, the average service delay is 781 ms and P_block is 83%. P_block follows almost the same trend as the service delay W, because the lower the service delay, the lower the queuing delay of the tasks, and thus the lower the probability that the queue is full when a task arrives.
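For concreteness, the arrival rate quoted above follows from the chosen flow intensity; the value of B·μ_B below is implied by the patent's numbers rather than stated explicitly:

```latex
\lambda = \rho \, B\mu_B = 0.75 \times B\mu_B = 990\ \text{tasks/s}
\quad\Longrightarrow\quad
B\mu_B = \frac{990}{0.75} = 1320\ \text{images/s}
```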
In deep learning inference computation, GPU video memory is an important resource of the GPU computing process. Unlike CPU computation, a physical GPU cannot precisely limit the video memory usage of a process, but the GPU can be virtualized through remote API technology and PCI pass-through technology, or allocated to different containers for different services through nvidia-docker. The GPU server provides all of its video memory to a process and prepares to allocate more memory when the process applies for it. Because Out-of-Memory and Page Fault are fatal errors for the GPU, the process in which the fault occurs stops running, and different processes compete for GPU memory, it is necessary to reduce the GPU memory usage of a process as much as possible without affecting its running speed. Because each image needs to be loaded into memory during inference and an output tensor is generated at each layer of the neural network, the memory usage m_r is linear in the batch size r. As shown in FIG. 6, the relationship between memory usage and batch size can be expressed as m_r = k·r + m_0, where k is the slope representing the growth of video memory usage with the batch size r, m_0 represents the video memory usage of loading the deep learning model, and k > 0, m_0 > 0.
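A least-squares fit of this linear memory model can be sketched as follows; the sample occupancy values are illustrative and anchored only by the 2558 MB worst case mentioned earlier, not taken from FIG. 6.

```python
# Sketch: fit the linear video-memory model m_r = k*r + m_0 from occupancy
# measured at a few batch sizes (values below are illustrative).
import numpy as np

batch_sizes = np.array([1, 4, 8, 16, 32, 64])
memory_mb = np.array([980, 1080, 1210, 1480, 2010, 2558])

k, m0 = np.polyfit(batch_sizes, memory_mb, deg=1)    # slope k > 0, intercept m_0 > 0
m = lambda r: k * r + m0                             # m_r, used as m_b in formula (1)
print(f"k={k:.1f} MB/image, m0={m0:.1f} MB, predicted m_48={m(48):.0f} MB")
```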
In the deep learning inference service system based on dynamic batch processing, the size of each batch is constrained by a ≤ r ≤ b: when the number of waiting tasks reaches at least a, the server starts batch inference, and the number of tasks in one batch cannot exceed b. The batch size is limited by the GPU video memory, i.e., the maximum value of b is B (b ≤ B). To obtain the average service delay, the number of waiting tasks in the system at any time needs to be analyzed; the evolution of this number over time is a non-Markov process because the service process obeys a general distribution and, in addition, the inference delay depends on the batch size between a and b. To simplify the analysis, the embedded Markov chain (eMC) technique is first used in this experiment to obtain the transition probabilities of a Markov process with two dimensions: the number of queue-waiting tasks n and the batch size r. The embedded Markov process records the number of waiting tasks at the completion of each batch, referred to in this embodiment as the batch departure time, as shown in FIG. 10. The probability matrix of the system state at an arbitrary time is obtained through the relationship between the system states at the batch departure times and the state probabilities at arbitrary times. In this embodiment, X(t) = (n(t), r(t)) denotes the two-dimensional Markov process formed by the evolution of the number of queue-waiting tasks and the departing batch size at the batch departure times, where t indexes the departure times, n(t) is the number of queue-waiting tasks at the t-th batch departure, and r(t) is the size of the t-th departing batch.
The proof that the relationship between the number of waiting tasks and the departing batch size of the deep learning inference service can be represented by a two-dimensional Markov process is as follows:
Let V_{t,t+1}(r) denote the number of tasks arriving between the batch departure times t and t+1 when the batch size is r. The transition relations between n(t) and r(t) in the different cases are derived as follows:
n (t) < a, that is, the number of tasks in the queue at the batch departure time t is less than a, and the server needs to wait until a-n (t) tasks arrive before the inference of the batch size a can be carried out, so that n (t +1) ═ Vt,t+1(a) And r (t +1) ═ a.
A ≦ n (t) ≦ b, i.e., the number of tasks at the time t of batch departure is between a and b, and all n (t) tasks will be inferred as a batch. Thus, there is n (t +1) ═ Vt,t+1(n (t)) and r (t +1) ═ n (t).
N (t) > b, i.e. the number of tasks at the time t when the batch leaves is greater than b, the first b tasks will be inferred as one batch. Thus, there is n (t +1) ═ n (t) — b + Vt,t+1(b) And r (t +1) ═ b.
It can thus be determined that the values of n(t+1) and r(t+1) are determined by n(t), r(t) and V_{t,t+1}(r(t+1)), and that V_{t,t+1}(r(t+1)) follows a Poisson distribution and is memoryless between the batch departure times. Therefore, the relationship between the number of waiting tasks of the deep learning inference service and the departing batch size can be represented by a two-dimensional Markov process.
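The transition rules above can be exercised in a small simulation of the embedded process; the sketch below assumes Poisson arrivals at rate λ and the linear batch service time v·r + τ_0, and the parameter values are illustrative assumptions.

```python
# Sketch: simulate X(t) = (n(t), r(t)) at batch departure times following the
# three cases above, with Poisson arrivals during each batch's (assumed linear)
# service time and a queue capacity of N waiting tasks.
import numpy as np

rng = np.random.default_rng(0)
lam, v, tau0 = 990.0, 0.0005, 0.004     # assumed arrival rate and delay-model parameters
a, b, N = 1, 32, 128

def arrivals_during_batch(r):
    return rng.poisson(lam * (v * r + tau0))          # V_{t,t+1}(r)

def step(n):
    if n < a:                       # wait until a batch of size a can be formed
        r_next, n_next = a, arrivals_during_batch(a)
    elif n <= b:                    # serve all waiting tasks as one batch
        r_next, n_next = n, arrivals_during_batch(n)
    else:                           # serve the first b tasks, the rest keep waiting
        r_next, n_next = b, n - b + arrivals_during_batch(b)
    return min(n_next, N), r_next   # the finite queue truncates the waiting tasks

n, waiting = 0, []
for _ in range(10000):
    n, r = step(n)
    waiting.append(n)
print("mean number of waiting tasks at departures:", np.mean(waiting))
```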
From the Markov property of the process X(t), the probability of each state of the system can be analyzed. The state space of X(t) is composed of the number of remaining tasks, ranging from 0 to N, and the departing batch size, ranging from a to b. A probability transition matrix with dimensions (N+1)(b−a+1) × (N+1)(b−a+1) is used to describe the transitions. To simplify the analysis, the original matrix is divided into sub-matrices, each of size (b−a+1) × (b−a+1), whose elements are denoted θ(·); the value in the parentheses of θ(·) represents a batch size between a and b. To facilitate the derivation, first define, for a batch of r tasks being processed, the probability that j tasks arrive at the server during its service.
The batch size in the parentheses of θ(·) equals a if n(t) ≤ a, equals n(t) if a < n(t) ≤ b, and equals b if n(t) > b. The value of θ(·) is determined in the following cases:
n (t) is less than or equal to b, and N (t +1) is less than or equal to N-1. In this case, the next batch to be reasoned is equal in size to the batch size
Figure BDA0002945694400000126
If n (t) < a, reasoning after waiting for one task to arrive, otherwise, performing batch reasoning on all tasks. Probability of n (t +1) tasks in the server at the completion of the next batch, etcIn that
Figure BDA0002945694400000127
Namely, it is
Figure BDA0002945694400000131
In
Figure BDA0002945694400000132
The other θ (·) is 0.
N (t) ≦ b and N (t +1) ═ N. For the same reason as in the previous case, in this case the batch size to be reasoned for the next batch is equal to
Figure BDA00029456944000001321
Since N (t +1) ═ N, it means that at least N tasks arrive in the service process of the batch of tasks. The probability of there being n (t +1) tasks in the server at the completion of the next batch is equal to
Figure BDA0002945694400000133
Namely, it is
Figure BDA0002945694400000134
In
Figure BDA0002945694400000135
The other θ (·) is 0.
B < N (t). ltoreq.N and N (t + 1). ltoreq.N-1. In this case, the next batch size to be inferred
Figure BDA0002945694400000136
To achieve N (t +1) ≦ N-1, the number of tasks reached is equal to N (t +1) - (N (t) -b). The probability of n (t +1) tasks being present in the server at the completion of the next batch is equal to
Figure BDA0002945694400000137
Namely, it is
Figure BDA0002945694400000138
In
Figure BDA0002945694400000139
The other θ (·) is 0.
B < N (t) ≦ N and N (t +1) ═ N. In this case, the next batch size to be inferred
Figure BDA00029456944000001310
Since N (t +1) ═ N, at least N (t +1) - (N (t) — b) tasks arrive during the batch inference process. The probability of there being n (t +1) tasks in the server at the completion of the next batch is equal to
Figure BDA00029456944000001311
Namely, it is
Figure BDA00029456944000001312
In
Figure BDA00029456944000001313
The other θ (·) is 0.
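Under the additional assumption that the batch service time is deterministic (so that the number of arrivals during a batch of size r is Poisson with mean λ·(v·r + τ_0)), the four cases above can be assembled into a transition matrix as sketched below. For brevity the sketch tracks only the number of waiting tasks n at departure epochs, since the next batch size is a deterministic function of n; the helper names and parameter values are assumptions, not the patent's.

```python
# Sketch: transition matrix of the embedded chain over the number of waiting
# tasks at departure epochs, following the four cases above. Arrivals during a
# batch of size r are assumed Poisson with mean lam * (v*r + tau0).
import numpy as np
from scipy.stats import poisson

lam, v, tau0 = 990.0, 0.0005, 0.004     # assumed parameters
a, b, N = 1, 32, 128

def arrival_pmf(j, r):
    """Probability that j tasks arrive while a batch of r tasks is served."""
    return poisson.pmf(j, lam * (v * r + tau0))

def next_batch(n):
    return max(a, min(n, b))            # batch size started after a departure with n waiting

P = np.zeros((N + 1, N + 1))
for n in range(N + 1):
    r = next_batch(n)
    carry = max(n - b, 0)               # tasks still waiting if n > b
    for n1 in range(carry, N):          # n(t+1) <= N-1 needs exactly n1 - carry arrivals
        P[n, n1] = arrival_pmf(n1 - carry, r)
    P[n, N] = 1.0 - P[n, :N].sum()      # n(t+1) = N: at least N - carry arrivals

assert np.allclose(P.sum(axis=1), 1.0)
```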
From the transition probability matrix, the steady-state probability matrix of X(t) at the batch departure times can be derived by solving the balance equations of the embedded chain. The entries of this matrix give the steady-state probability that, at a batch departure time, there are n tasks in the server's queue and the departing batch size is r. Since all states of the system are to be described, the steady-state probability matrix at the batch departure times must be related to the steady-state probability matrix π of the system at an arbitrary time, where π_{n,r} denotes the probability of each state in π. The functional relationship between the two is as follows. For 0 ≤ n ≤ N − 1, the steady-state probabilities of the system states are given by expressions (presented as image formulas in the original) for the cases r = a, a + 1 ≤ r ≤ b − 1, and r = b, where s_r denotes the mean inference delay when the batch size is r. The steady-state probabilities in the states with n = N are likewise given for r = a, a + 1 ≤ r ≤ b − 1, and r = b, in terms of p_{n,r}(0), the probability that there are n tasks in the queue and the remaining service time of a batch of size r is 0; the expression for p_{n,r}(0) is also given as an image formula in the original.
After the above analysis, the important performance indexes can be obtained by calculation. The average number of tasks E(L) in the system, given by formula (3), comprises both the tasks in the queue and the tasks in service; the average service delay E(W) then follows from Little's law, and the blocking probability P_block is given by formula (4). From λ(1 − P_block), the effective arrival rate can be calculated.
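Continuing the simplified sketch above, the stationary distribution of the embedded chain can be solved numerically and the metrics then follow the verbal description of formulas (3) and (4) (whose exact expressions appear only as images in the original) together with Little's law. Note that, as a rough approximation, the departure-epoch distribution is used here in place of the arbitrary-time distribution that the patent derives, so this is illustrative rather than the patent's exact computation.

```python
# Sketch: stationary distribution of the embedded chain and approximate metrics.
# E(L) counts queued plus in-service tasks (cf. formula (3)), P_block is the
# full-queue probability (cf. formula (4)), and E(W) follows Little's law with
# effective arrival rate lam * (1 - P_block).
import numpy as np

def stationary(P):
    """Solve pi = pi P with sum(pi) = 1 as a least-squares linear system."""
    m = P.shape[0]
    A = np.vstack([P.T - np.eye(m), np.ones(m)])
    rhs = np.zeros(m + 1); rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return pi

def performance_metrics(P, lam, a, b, N):
    pi = stationary(P)
    in_service = np.array([max(a, min(n, b)) for n in range(N + 1)])
    E_L = float(np.dot(pi, np.arange(N + 1) + in_service))   # queued + in-service tasks
    P_block = float(pi[N])                                   # probability the queue is full
    E_W = E_L / (lam * (1.0 - P_block))                      # Little's law
    return E_L, P_block, E_W

# Usage with the matrix P and parameters from the previous sketch:
# print(performance_metrics(P, lam, a, b, N))
```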
Through the experiments and the above analysis, it can be determined that in the dynamic-batching deep learning inference service the service delay first decreases and then fluctuates as the upper bound b of the batch size increases, while the video memory occupancy increases monotonically with the batch size. Therefore, in different system states a good balance between delay and memory usage can be achieved by adjusting the value of the upper bound b, which yields the optimization model shown in formula (1). Among the constraints in formula (1), the flow intensity ρ = λ/(B·μ_B) < 1 is required for practical reasons to ensure stable operation of the system, and 1 ≤ b ≤ B is the range of the upper bound of the batch size. Experiments show that E(W(b)) decreases when b is increased for smaller values of b and fluctuates only slightly as b increases further, so γ can be set to a smaller value if the system is not very sensitive to video memory occupation. The solution variable of the optimization problem is the batch size upper bound b, and the goal is to minimize the sum of the service delay E(W(b)) and γ·m_b.
Since the queuing model can accurately predict the service delay under different system states, the influence of the upper bound b of the batch size and the arrival rate λ on the average service delay E(W(b)) can be analyzed with the queuing model, as shown in FIG. 12. The following is found:
When λ < b·μ_b, i.e., b·μ_b is above the arrival rate λ, the service delay E(W(b)) decreases only slightly with increasing b. As can be seen from FIGS. 4 and 5, b·μ_b increases monotonically with b, where μ_b denotes the service rate for batch size b. Since b is the upper limit of the batch size during service, b·μ_b is the upper limit of the throughput rate during service. When b·μ_b is higher than the arrival rate, the server can infer the arriving tasks in time, and further increasing b brings only a marginal reduction of the service delay.
When b·μ_b is equal to or only slightly below λ, the service delay E(W(b)) increases dramatically as b decreases. In this case the task queue of the queuing system builds up a backlog, so that each newly arriving task faces a nearly full task queue.
When λ > b·μ_b, i.e., b·μ_b is below the arrival rate λ, the service delay E(W(b)) keeps increasing as b decreases. In a queuing system with limited queue capacity, the throughput rate is already saturated once the arrival rate exceeds it; in this case lowering the throughput rate, i.e., lowering the value of b, further increases the queuing delay of the tasks.
FIG. 12 reflects how E(W(b)) varies with the batch size upper bound b and the arrival rate λ for the GoogLeNet inference service; the other deep learning models listed in FIG. 1 show the same characteristics due to the similarity of their inference processes.
In this embodiment, after the optimization model shown in formula (1) is determined, it is solved through an iterative process, which is more efficient than brute-force search. The pseudocode of the iterative algorithm is given as an image in the original; its steps, referenced below by their line numbers, are as follows.
the method comprises the following steps: first, when
Figure BDA0002945694400000171
A relatively low average service delay and memory usage can be achieved. For equation λ ═ b μbB in (1) is solved to obtain
Figure BDA0002945694400000172
Will be lambda tau0V (1- λ v) rounding up is to ensure
Figure BDA0002945694400000173
(lines 1-2). Then, the values of E (W (b)) and E (W (b-1)) corresponding to the current b are calculated to obtain the surge condition of the service delay. Because E (W (b)) and E (W (b-1)) correspond to E (W (b)))
Figure BDA0002945694400000174
And
Figure BDA0002945694400000175
average service delay (line 3). Finally, the value of b is adjusted according to the trade-off parameters γ and k. When: 1) when E (W (b-1)) -E (W (b)) < gamma k, it means that the weight of the video memory usage in (OP) is greater than the burst rate of the service delay, and the value of b needs to be reduced to reduce the video memory usage. 2) E (W (b-1)) -E (W (b)) > gamma k, indicating that b can be increased further to obtain lower clothesTraffic delay (lines 7-8, 11). 3) Changing the value of b results in E (W (b)) + gammambBecomes large, and the value of b is the optimal solution b*(lines 15-16). Due to E (W (b)) and mbRespectively, with the monotone decreasing and increasing of b, so E (W (b)) + gamma mbIn the definition domain [1, B]Must have a minimum value, the worst case for the algorithm to search is b*At the end points of the defined domain.
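The steps above can be condensed into the following Python sketch; since the patent's pseudocode is reproduced only as an image, the variable names, step handling, and stopping test here are a reconstruction of the textual description rather than the original algorithm, and E_W stands for any callable that evaluates the average service delay from the queuing model.

```python
# Sketch of the iterative search for the batch-size upper bound b*: start from
# b = ceil(lam*tau0 / (1 - lam*v)) so that b*mu_b >= lam (lines 1-2), compare the
# delay surge E(W(b-1)) - E(W(b)) with gamma*k to pick a direction (line 3 and
# cases 1)/2)), and stop once E(W(b)) + gamma*m_b gets worse (case 3)).
import math

def search_b(E_W, lam, v, tau0, gamma, k, m0, B, step=1):
    mem = lambda b: k * b + m0                       # linear memory model m_b
    cost = lambda b: E_W(b) + gamma * mem(b)         # objective of formula (1)

    b = min(B, max(1, math.ceil(lam * tau0 / (1 - lam * v))))
    direction = -1 if b > 1 and E_W(b - 1) - E_W(b) < gamma * k else 1

    best_b, best_cost = b, cost(b)
    while 1 <= b + direction * step <= B:
        b += direction * step
        c = cost(b)
        if c > best_cost:                            # objective became worse: stop
            break
        best_b, best_cost = b, c
    return best_b
```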
The method is implemented on the deep learning framework PyTorch, and the performance of the optimization method is evaluated with an NVIDIA RTX 2080 GPU and the deep learning model GoogLeNet. FIG. 13 compares the service delay of the method of the present invention with different static batch processing configurations for the GoogLeNet inference service on the NVIDIA RTX 2080 GPU under varying task arrival rates, where the arrival rate transitions through 330, 800, 730, 930, 1120, 990, 330, 530, 670, 400 tasks/second and 50000 tasks arrive at each arrival rate. FIG. 14 compares, under the same varying task arrival rates, the video memory occupation of the method of the present invention with different static batch processing configurations and with greedy dynamic batch processing for the GoogLeNet inference service on the NVIDIA RTX 2080 GPU. Compared with the single-task processing method, the optimization method is 31 times faster; compared with the batch processing method with the optimal fixed batch size, as shown in FIGS. 13 and 14, the method of the present invention is 2.2 times faster and its GPU video memory occupation is 0.8 times as large; compared with the greedy dynamic batch processing method, the GPU video memory occupation is only 0.3 times as large while the service delay is essentially the same.
The foregoing is merely illustrative of the preferred embodiments of the invention and is not to be construed as limiting the invention in any way. Although the present invention has been described with reference to the preferred embodiments, it is not intended to be limited thereto. Therefore, any simple modification, equivalent change or variation made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, shall fall within the protection scope of the technical solution of the present invention.

Claims (6)

1. A dynamic batch processing task scheduling method of deep learning inference service is characterized in that:
describing the number of queue-waiting tasks at each batch departure time and the departing batch size by a two-dimensional Markov process, determining the steady-state probability of the two-dimensional Markov process, and determining the average service delay in the deep learning inference service system according to the steady-state probability;
optimizing the average service delay and the memory usage over the upper limit of the batch size of the batch processing tasks by the optimization model shown in formula (1),
min_b  E(W(b)) + γ·m_b,   subject to  1 ≤ b ≤ B ≤ N,  λ < B·μ_B        (1)
In formula (1), E(W(b)) is the average service delay corresponding to the batch size upper limit b, b is the upper limit of the batch size of the batch processing tasks, W(b) is the service delay, γ is the weight of the memory usage relative to the average service delay, m_b is the memory usage corresponding to the batch size upper limit b, B is the maximum value of the upper limit of the batch size, N is the maximum number of tasks waiting in the batch processing task queue, λ is the task arrival rate, and μ_B is the service rate when the batch size is B; solving the optimization model of formula (1) determines the upper limit of the batch size of the batch processing tasks.
2. The dynamic batch task scheduling method of deep learning inference service of claim 1, wherein:
the average service delay is determined by the calculation of equation (2),
E(W(b)) = E(L) / (λ·(1 − P_block))        (2)
In formula (2), E(W(b)) is the average service delay corresponding to the batch size upper limit b, L is the average task number, λ is the task arrival rate, and P_block is the blocking probability of a task.
3. The dynamic batch task scheduling method of deep learning inference service of claim 2, characterized in that:
the average task number is determined by equation (3),
[Formula (3) appears as an image in the original document.]
the blocking probability is determined by equation (4),
[Formula (4) appears as an image in the original document.]
In formulas (3) and (4), E(L) is the average task number, n is the number of waiting tasks in the batch processing task queue, r is the batch size, a is the lower limit of the batch size of the batch processing tasks, b is the upper limit of the batch size of the batch processing tasks, π_{n,r} is the steady-state probability that the number of waiting tasks is n and the batch size is r, π_{n,0} is the steady-state probability that the number of waiting tasks is n and the batch size is 0, and π_{N,r} is the steady-state probability that the number of waiting tasks is N and the batch size is r.
4. The method for scheduling the dynamic batch processing task of the deep learning inference service as claimed in claim 3, wherein the solving process of the optimization model comprises:
initializing the upper limit of the batch size of the batch processing task and the step length for adjusting the upper limit of the batch size in each iteration; taking the sum of the average service delay and the memory usage corresponding to the upper limit of the batch size as a convergence parameter; and in each iteration, adjusting the upper limit of the batch size according to the step length, and when the convergence parameter obtained in the current iteration is larger than the convergence parameter of the previous iteration, taking the upper limit of the batch size obtained in the current iteration as the optimal solution output by the optimization model.
5. The method for scheduling the dynamic batch processing task of the deep learning inference service as claimed in claim 4, wherein in the first iteration, the method further comprises a process of correcting the adjustment direction of the step size: and when the difference between the average service delay obtained in the first iteration and the average service delay corresponding to the initialized upper limit of the batch size is larger than a preset threshold value, changing the adjustment direction for adjusting the upper limit of the batch size.
6. A dynamic batch processing task scheduling system of deep learning inference service, characterized in that task scheduling is performed according to the dynamic batch processing task scheduling method of deep learning inference service of any one of claims 1 to 5.
CN202110192645.XA 2021-02-20 2021-02-20 Dynamic batch task scheduling method and system for deep learning reasoning service Active CN112860402B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110192645.XA CN112860402B (en) 2021-02-20 2021-02-20 Dynamic batch task scheduling method and system for deep learning reasoning service

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110192645.XA CN112860402B (en) 2021-02-20 2021-02-20 Dynamic batch task scheduling method and system for deep learning reasoning service

Publications (2)

Publication Number Publication Date
CN112860402A true CN112860402A (en) 2021-05-28
CN112860402B CN112860402B (en) 2023-12-05

Family

ID=75988278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110192645.XA Active CN112860402B (en) 2021-02-20 2021-02-20 Dynamic batch task scheduling method and system for deep learning reasoning service

Country Status (1)

Country Link
CN (1) CN112860402B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113961328A (en) * 2021-10-26 2022-01-21 深圳大学 Task processing method and device, storage medium and electronic equipment
CN117376423A (en) * 2023-12-08 2024-01-09 西南民族大学 Deep learning reasoning service scheduling method, system, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106792779A (en) * 2016-12-30 2017-05-31 浙江大学 It is a kind of to permit and exempting from the cellular network connection control method of licensed band work
US20180121601A1 (en) * 2016-10-28 2018-05-03 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing
CN110312272A (en) * 2019-07-23 2019-10-08 中南大学 A kind of network services block resource allocation methods and storage medium
CN112346866A (en) * 2020-11-05 2021-02-09 中国科学院计算技术研究所 GPU (graphics processing Unit) scheduling method and system based on asynchronous data transmission

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121601A1 (en) * 2016-10-28 2018-05-03 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing
CN106792779A (en) * 2016-12-30 2017-05-31 浙江大学 It is a kind of to permit and exempting from the cellular network connection control method of licensed band work
CN110312272A (en) * 2019-07-23 2019-10-08 中南大学 A kind of network services block resource allocation methods and storage medium
CN112346866A (en) * 2020-11-05 2021-02-09 中国科学院计算技术研究所 GPU (graphics processing Unit) scheduling method and system based on asynchronous data transmission

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
C.ZHOU ET AL.: ""Delay-Aware IoT Task Scheduling in Space-Air-Ground Integrated Network"", 《2019 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM)》 *
LIN, LEI, QIAN WANG, AND ADEL W. SADEK.: ""Border crossing delay prediction using transient multi-server queueing models"", 《TRANSPORTATION RESEARCH PART A: POLICY AND PRACTICE64 (2014)》 *
PANDA, GOPINATH, ABHIJIT DATTA BANIK, AND DIBYAJYOTI GUHA.: ""Stationary analysis and optimal control under multiple working vacation policy in a GI/M (a, b)/1 queue"", 《JOURNAL OF SYSTEMS SCIENCE AND COMPLEXITY 31 (2018)》 *
ZHANG, DEYU, ET AL.: ""Delay-optimal proactive service framework for block-stream as a service"", 《IEEE WIRELESS COMMUNICATIONS LETTERS 7.4 (2018)》, pages 598 - 601 *
ZHAO, WENJUAN, XIUSHUANG WANG, SHUNFU JIN, WUYI YUE, AND YUTAKA TAKAHASHI.: ""An Energy Efficient Task Scheduling Strategy in a Cloud Computing System and its Performance Evaluation using a Two-Dimensional Continuous Time Markov Chain Model"", 《ELECTRONICS 8》 *
何华; 林闯; 赵增华; 庞善臣: "Modeling and performance analysis of Hadoop fair scheduling using deterministic and stochastic Petri nets", Journal of Computer Applications
王斐: "Research on cloud service optimization technology based on *** scheduling and stochastic algorithms", China Doctoral Dissertations Full-text Database, Basic Sciences
赵海军; 崔梦天; 李明东; 何先波: "Research on QoS performance of broadband wireless access networks based on CTMC and state space models", Acta Electronica Sinica

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113961328A (en) * 2021-10-26 2022-01-21 深圳大学 Task processing method and device, storage medium and electronic equipment
CN117376423A (en) * 2023-12-08 2024-01-09 西南民族大学 Deep learning reasoning service scheduling method, system, equipment and storage medium
CN117376423B (en) * 2023-12-08 2024-03-12 西南民族大学 Deep learning reasoning service scheduling method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN112860402B (en) 2023-12-05

Similar Documents

Publication Publication Date Title
JP6539236B2 (en) System and method for use in effective neural network deployment
US20220391771A1 (en) Method, apparatus, and computer device and storage medium for distributed training of machine learning model
CN108885571B (en) Input of batch processing machine learning model
US10140572B2 (en) Memory bandwidth management for deep learning applications
CN113950066A (en) Single server part calculation unloading method, system and equipment under mobile edge environment
US20210295168A1 (en) Gradient compression for distributed training
CN109657793B (en) Model training method and device, storage medium and electronic equipment
EP3602419B1 (en) Neural network optimizer search
CN110929839B (en) Method and device for training neural network, electronic equipment and computer storage medium
CN112860402B (en) Dynamic batch task scheduling method and system for deep learning reasoning service
CN113469355B (en) Multi-model training pipeline in distributed system
CN110795235B (en) Method and system for deep learning and cooperation of mobile web
CN110795246A (en) Resource utilization rate prediction method and device
CN110531996B (en) Particle swarm optimization-based computing task unloading method in multi-micro cloud environment
WO2019001323A1 (en) Signal processing system and method
CN114356540A (en) Parameter updating method and device, electronic equipment and storage medium
CN114514536A (en) Neural network training in distributed systems
CN114240506A (en) Modeling method of multi-task model, promotion content processing method and related device
CN109840597B (en) Model prediction method and device, electronic equipment and storage medium
CN110489955B (en) Image processing, device, computing device and medium applied to electronic equipment
CN111858916B (en) Method and device for clustering sentences
WO2023284347A1 (en) Task execution method and apparatus
CN113361621B (en) Method and device for training model
CN110782017B (en) Method and device for adaptively adjusting learning rate
CN114298329A (en) Model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant