CN114416286A - Resource quota processing method and device for PS (parameter server) node - Google Patents

Resource quota processing method and device for PS (parameter server) node

Info

Publication number
CN114416286A
CN114416286A CN202111621545.0A CN202111621545A CN114416286A CN 114416286 A CN114416286 A CN 114416286A CN 202111621545 A CN202111621545 A CN 202111621545A CN 114416286 A CN114416286 A CN 114416286A
Authority
CN
China
Prior art keywords
resource
utilization rate
quota
resource quota
current value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111621545.0A
Other languages
Chinese (zh)
Inventor
王�锋
李丰存
高延庆
王迪
钱玉磊
余建平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202111621545.0A priority Critical patent/CN114416286A/en
Publication of CN114416286A publication Critical patent/CN114416286A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/4557Distribution of virtual machine instances; Migration and load balancing

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiment of the invention provides a method and a device for processing resource quota of a PS node, wherein the method comprises the following steps: acquiring the resource utilization rate, the resource quota rated value and the resource quota current value of the current PS node in the PS cluster; when the resource utilization rate is smaller than the lower limit threshold of the resource utilization rate and the rated value of the resource quota is smaller than the current value of the resource quota, carrying out capacity reduction processing on the current value of the resource quota; and when the resource utilization rate is greater than the upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, carrying out capacity expansion processing on the current value of the resource quota. The embodiment of the invention not only realizes the dynamic adjustment of the resource quota of the PS node, but also avoids the increase or reduction of the resource quota of the PS node by simply increasing or reducing the number of the PS nodes, thereby fully utilizing the system resources.

Description

Resource quota processing method and device for PS (parameter server) node
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a method and an apparatus for processing a resource quota of a PS node.
Background
As data scale and the parameter scale of machine learning models grow, fast model convergence becomes a challenge, and the Parameter Server (PS) distributed training architecture (hereinafter referred to as the PS architecture) is widely used to accelerate convergence.
In the PS architecture, the parameters of the model are sharded and distributed to different PS nodes, and each PS node is responsible for storing and updating its parameter shard; the training data is partitioned and distributed to different Worker nodes, which perform data-parallel training, calculate the gradients used to update the parameters of the model, and report the gradients to the PS nodes. A PS node updates the parameters of the model after receiving the gradients.
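The division of labor described above can be sketched as follows. This is a minimal illustration only: the `PSNode` class, the `shard_for` placement rule, and the plain SGD update are assumptions made for the example and are not specified in the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PSNode:
    """Holds one shard of the model parameters (hypothetical structure)."""
    params: Dict[str, float] = field(default_factory=dict)

    def apply_gradient(self, grads: Dict[str, float], lr: float) -> None:
        # Plain SGD step; the source does not name an optimizer.
        for name, grad in grads.items():
            self.params[name] -= lr * grad


def shard_for(name: str, num_nodes: int) -> int:
    """Toy deterministic rule mapping a parameter name to a PS node index."""
    return sum(name.encode()) % num_nodes


def push(nodes: List[PSNode], grads: Dict[str, float], lr: float = 0.1) -> None:
    """A Worker reports gradients; each one is routed to the owning PS node."""
    per_node: Dict[int, Dict[str, float]] = {}
    for name, grad in grads.items():
        per_node.setdefault(shard_for(name, len(nodes)), {})[name] = grad
    for index, node_grads in per_node.items():
        nodes[index].apply_gradient(node_grads, lr)


# Two PS nodes share three parameters, all initialised to 1.0.
nodes = [PSNode(), PSNode()]
for name in ("w0", "w1", "b"):
    nodes[shard_for(name, len(nodes))].params[name] = 1.0

# One Worker pushes gradients for two of the parameters.
push(nodes, {"w0": 0.5, "b": -1.0})
```

Each PS node only ever touches its own shard, which is why the resource footprint discussed in the rest of this document is tracked per PS node.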
At present, most technical schemes adopt static PS resource configuration: a resource configuration is selected before model training starts, and the model is trained on the PS nodes under that configuration until the training task finishes. Some technical schemes support elastic scaling of the PS nodes, mainly by increasing or decreasing the number of PS nodes with the goal of shortening the completion time of the training task.
Neither static PS resource configuration nor horizontal elastic scaling aimed at minimizing the completion time of the training task makes full use of system resources, so the utilization rate of system resources is low.
Disclosure of Invention
In view of the foregoing problems, embodiments of the present invention are proposed to provide a resource quota processing method and apparatus for a PS node, which overcome the foregoing problems or at least partially solve the foregoing problems.
In order to solve the above problem, according to a first aspect of an embodiment of the present invention, a method for processing a resource quota of a PS node is disclosed, including: acquiring the resource utilization rate, the resource quota rated value and the resource quota current value of a current PS node in a PS cluster, wherein the current PS node is used for storing and updating parameters of a model; when the resource utilization rate is smaller than a preset lower limit threshold of the resource utilization rate and the rated value of the resource quota is smaller than the current value of the resource quota, carrying out capacity reduction processing on the current value of the resource quota; and when the resource utilization rate is greater than a preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, performing capacity expansion processing on the current value of the resource quota.
Optionally, when the resource utilization rate is greater than the preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, the method further includes: when no vacant resource exists, if the resource utilization rate is smaller than the rated threshold of the resource utilization rate, prohibiting capacity expansion processing on the current value of the resource quota and continuing to train the model.
Optionally, when the resource utilization rate is greater than the preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, the method further includes: when no vacant resource exists, if the resource utilization rate is greater than or equal to the rated threshold of the resource utilization rate, saving the training progress and parameters of the model and terminating the training of the model.
Optionally, the method further comprises: when the resource utilization rate is smaller than the lower limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, prohibiting capacity reduction processing on the current value of the resource quota.
Optionally, the method further comprises: when the resource utilization rate is greater than the upper limit threshold of the resource utilization rate and the rated value of the resource quota is less than or equal to the current value of the resource quota, prohibiting capacity expansion processing on the current value of the resource quota.
Optionally, before the obtaining of the resource utilization rate, the resource quota rated value, and the resource quota current value of the current PS node in the PS cluster, the method further includes: before the training task of the model is started, if historical training record data of the model exists, estimating resource quota data of the training task of the model at the current PS node according to the historical training record data.
Optionally, obtaining the rated value of the resource quota includes: fitting a resource occupancy growth curve according to the resource occupancy of the model in the training stage; estimating the completion time point of the training task of the model according to the average training time of a single sample of the model and the number of samples; and obtaining the rated value of the resource quota according to the resource occupancy growth curve and the completion time point.
According to a second aspect of the embodiments of the present invention, there is also disclosed a device for processing resource quotas of PS nodes, including: the acquisition module is used for acquiring the resource utilization rate, the resource quota rated value and the resource quota current value of the current PS node in the PS cluster, wherein the current PS node is used for storing and updating the parameters of the model; the capacity reduction module is used for carrying out capacity reduction processing on the current value of the resource quota when the resource utilization rate is smaller than a preset lower limit threshold of the resource utilization rate and the rated value of the resource quota is smaller than the current value of the resource quota; and the capacity expansion module is used for carrying out capacity expansion processing on the current value of the resource quota when the resource utilization rate is greater than a preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota.
Optionally, the apparatus further comprises: a maintaining module, configured to, when the resource utilization rate is greater than the preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, prohibit capacity expansion processing on the current value of the resource quota and continue to train the model if no vacant resource exists and the resource utilization rate is smaller than the rated threshold of the resource utilization rate.
Optionally, the apparatus further comprises: a termination module, configured to, when the resource utilization rate is greater than the preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, save the training progress and parameters of the model and terminate the training of the model if no vacant resource exists and the resource utilization rate is greater than or equal to the rated threshold of the resource utilization rate.
Optionally, the apparatus further comprises: a prohibiting module, configured to prohibit capacity reduction processing on the current value of the resource quota when the resource utilization rate is smaller than the lower limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota.
Optionally, the prohibiting module is further configured to prohibit performing capacity expansion processing on the current value of the resource quota when the resource utilization is greater than the upper threshold of the resource utilization and the rated value of the resource quota is less than or equal to the current value of the resource quota.
Optionally, the apparatus further comprises: an estimation module, configured to, before the acquisition module acquires the resource utilization rate, the resource quota rated value and the resource quota current value of the current PS node in the PS cluster and before the training task of the model is started, estimate resource quota data of the training task of the model at the current PS node according to historical training record data of the model, if such data exists.
Optionally, the acquisition module includes: a curve fitting module, configured to fit a resource occupancy growth curve according to the resource occupancy of the training stage of the model; a time point calculating module, configured to estimate the completion time point of the training task of the model according to the average training time of a single sample of the model and the number of samples; and a rated value acquisition module, configured to obtain the rated value of the resource quota according to the resource occupancy growth curve and the completion time point.
Compared with the prior art, the technical scheme provided by the embodiment of the invention has the following advantages:
the resource quota processing scheme of the PS node provided by the embodiment of the invention obtains the resource utilization rate, the resource quota rated value and the resource quota current value of the current PS node in the PS cluster. When the resource utilization rate is smaller than a preset lower limit threshold of the resource utilization rate and the rated value of the resource quota is smaller than the current value of the resource quota, carrying out capacity reduction processing on the current value of the resource quota; and when the resource utilization rate is greater than a preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, performing capacity expansion processing on the current value of the resource quota. The embodiment of the invention carries out real-time monitoring on the resource utilization rate of the PS node, compares the resource utilization rate with a resource utilization rate lower limit threshold and a resource utilization rate upper limit threshold respectively, compares a resource quota rated value with a resource quota current value, and finally carries out capacity reduction processing or capacity expansion processing on the resource quota current value according to a comparison result. The embodiment of the invention not only realizes the dynamic adjustment of the resource quota of the PS node, but also avoids the increase or reduction of the resource quota of the PS node by simply increasing or reducing the number of the PS nodes, thereby fully utilizing the system resources.
Drawings
Fig. 1 is a flowchart illustrating steps of a method for processing resource quotas of a PS node according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a resource occupancy growth curve according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of an automatic flexible capacity expansion and reduction scheme of a PS node resource quota under a PS architecture according to an embodiment of the present invention;
Fig. 4 is a block diagram of a resource quota processing apparatus of a PS node according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of a resource quota processing method of a PS node according to an embodiment of the present invention is shown. The resource quota processing method of the PS node may specifically include the following steps:
Step 101, acquiring a resource utilization rate, a resource quota rated value and a resource quota current value of a current PS node in a PS cluster.
In the embodiment of the invention, the PS cluster is regarded as a resource pool: the training task of one model occupies different resource quotas on different PS nodes of the PS cluster, and the training tasks of all models share the resources of each PS node. The resource allocation problem of the PS nodes is modeled as a bin-packing problem, in which the number of bins equals the number N of PS nodes in the PS cluster, the capacity of each bin equals the total resource amount of the corresponding PS node, and the resource quotas of one model training task on the N PS nodes correspond to the volumes of N items, one placed in each bin. It should be noted that, unlike the conventional bin-packing problem, the volumes of the N items can still change after packing; that is, the resource quotas of a model training task on the N PS nodes change dynamically with the adjustment instructions of the elastic scaling algorithm.
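The bin model can be pictured with a small sketch. The `PSBin` class and its method names are hypothetical, and the numbers are made up; the only point illustrated is that an item's volume (a task's quota on one PS node) may be resized after placement as long as the bin's capacity is respected.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PSBin:
    """One PS node viewed as a bin; capacity is its total resource amount."""
    capacity: float
    quotas: Dict[str, float] = field(default_factory=dict)  # task -> quota

    def vacant(self) -> float:
        """Resources on this node not yet granted to any training task."""
        return self.capacity - sum(self.quotas.values())

    def resize(self, task: str, new_quota: float) -> bool:
        """Change a task's quota in place; refuse if the bin would overflow."""
        others = sum(q for t, q in self.quotas.items() if t != task)
        if new_quota < 0 or others + new_quota > self.capacity:
            return False
        self.quotas[task] = new_quota
        return True


# Illustrative numbers: a 16-unit node shared by two training tasks.
node = PSBin(capacity=16.0, quotas={"task_a": 4.0, "task_b": 8.0})
grew = node.resize("task_a", 6.0)      # accepted: 6 + 8 fits in 16
refused = node.resize("task_a", 10.0)  # rejected: 10 + 8 exceeds 16
```

Because several tasks share one bin, growing one task's quota reduces the vacant resources available to the others on the same node.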
In the embodiment of the invention, the rated value of the resource quota of the current PS node is obtained by pre-estimation. How to predict the resource quota rating of the current PS node will be described in detail later.
Step 102, when the resource utilization rate is smaller than a preset lower limit threshold of the resource utilization rate and the rated value of the resource quota is smaller than the current value of the resource quota, carrying out capacity reduction processing on the current value of the resource quota.
In the embodiment of the present invention, a lower limit threshold (low_bound) of the resource utilization rate may be preset. The resource utilization rate is compared with the lower limit threshold of the resource utilization rate, and the rated value of the resource quota is compared with the current value of the resource quota. If the resource utilization rate is less than the lower limit threshold of the resource utilization rate and the rated value of the resource quota is less than the current value of the resource quota, capacity reduction processing is carried out on the current value of the resource quota of the current PS node. It should be noted that the capacity reduction processing may be used to flexibly reduce the resource quota of the current PS node.
Step 103, when the resource utilization rate is greater than a preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, performing capacity expansion processing on the current value of the resource quota.
In the embodiment of the present invention, an upper limit threshold (upper_bound) of the resource utilization rate may be preset. The resource utilization rate is compared with the upper limit threshold of the resource utilization rate, and the rated value of the resource quota is compared with the current value of the resource quota. If the resource utilization rate is greater than the upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, capacity expansion processing is performed on the current value of the resource quota of the current PS node. It should be noted that the capacity expansion processing may be used to flexibly increase the resource quota of the current PS node.
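The two comparisons in steps 102 and 103 amount to a small decision function. The sketch below is one possible reading: the `Action` enum and the `decide` function are hypothetical names, and the default threshold values are illustrative assumptions, since the source only says the thresholds are preset.

```python
from enum import Enum


class Action(Enum):
    SHRINK = "shrink"
    EXPAND = "expand"
    HOLD = "hold"


def decide(utilization: float, rated: float, current: float,
           low_bound: float = 0.3, upper_bound: float = 0.8) -> Action:
    """Apply the conditions of steps 102 and 103 to one PS node.

    The default thresholds are assumptions for illustration only.
    """
    if utilization < low_bound and rated < current:
        return Action.SHRINK  # step 102: quota exceeds the estimated need
    if utilization > upper_bound and rated >= current:
        return Action.EXPAND  # step 103: estimated need is not yet covered
    return Action.HOLD        # any other combination leaves the quota alone
```

For example, `decide(0.10, rated=4.0, current=8.0)` selects capacity reduction, while `decide(0.10, rated=8.0, current=4.0)` holds the quota because the rated value suggests the resources will be needed later in training.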
The resource quota processing scheme of the PS node provided by the embodiment of the invention obtains the resource utilization rate, the resource quota rated value and the resource quota current value of the current PS node in the PS cluster. When the resource utilization rate is smaller than a preset lower limit threshold of the resource utilization rate and the rated value of the resource quota is smaller than the current value of the resource quota, carrying out capacity reduction processing on the current value of the resource quota; and when the resource utilization rate is greater than a preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, performing capacity expansion processing on the current value of the resource quota. The embodiment of the invention carries out real-time monitoring on the resource utilization rate of the PS node, compares the resource utilization rate with a resource utilization rate lower limit threshold and a resource utilization rate upper limit threshold respectively, compares a resource quota rated value with a resource quota current value, and finally carries out capacity reduction processing or capacity expansion processing on the resource quota current value according to a comparison result. The embodiment of the invention not only realizes the dynamic adjustment of the resource quota of the PS node, but also avoids the increase or reduction of the resource quota of the PS node by simply increasing or reducing the number of the PS nodes, thereby fully utilizing the system resources.
In a preferred embodiment of the present invention, one implementation of performing capacity expansion processing on the current value of the resource quota, when the resource utilization rate is greater than the preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, is as follows: the capacity expansion processing is performed only when both of these conditions are satisfied and the current PS node also has vacant resources.
In a preferred embodiment of the present invention, when the resource utilization rate is greater than a preset upper threshold of resource utilization rate, and the rated value of the resource quota is greater than or equal to the current value of the resource quota, if there is no vacant resource of the current PS node and the resource utilization rate is less than the rated threshold of the resource utilization rate, the capacity expansion processing on the current value of the resource quota is prohibited, and the model continues to be trained.
In a preferred embodiment of the present invention, when the resource utilization rate is greater than the preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, if the current PS node has no vacant resources and the resource utilization rate is greater than or equal to the rated threshold of the resource utilization rate, the current value of the resource quota of the current PS node is not enough to support continued training of the model; in that case the training progress and parameters of the model are saved and the training of the model is terminated.
In practical applications, the rated threshold of the resource utilization rate may be set to 100% according to practical application conditions. The rated threshold of the resource utilization rate is larger than the upper limit threshold of the resource utilization rate.
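The three outcomes described in the preceding embodiments, once the upper limit threshold is exceeded and the rated value is at least the current value, can be summarized in a short sketch. The function name and the returned strings are hypothetical; the default rated threshold of 1.0 follows the 100% setting mentioned above.

```python
def handle_upper_breach(utilization: float, vacant: float,
                        rated_threshold: float = 1.0) -> str:
    """Pick an outcome after the expansion conditions of step 103 hold.

    `vacant` is the amount of unallocated resources on the current PS node.
    """
    if vacant > 0:
        return "expand"              # vacant resources exist, so grow the quota
    if utilization < rated_threshold:
        return "continue_training"   # nothing to grant, but training can go on
    return "save_and_terminate"      # quota exhausted: checkpoint, then stop
```

Saving progress and parameters before termination means the training task does not have to start over once it is launched again.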
In a preferred embodiment of the present invention, when the resource utilization rate is less than the lower limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, the capacity reduction processing on the current value of the resource quota is prohibited. The reason is that an estimated rated value greater than or equal to the current value of the resource quota indicates that a later stage of the model's training task may trigger capacity expansion processing on the current value of the resource quota.
In a preferred embodiment of the present invention, when the resource utilization rate is greater than the upper limit threshold of the resource utilization rate and the rated value of the resource quota is less than or equal to the current value of the resource quota, the capacity expansion processing on the current value of the resource quota is prohibited. The reason is that an estimated rated value less than or equal to the current value of the resource quota means that the training task of the model can already be completed smoothly with the current quota.
In a preferred embodiment of the present invention, before the resource utilization rate, the resource quota rated value, and the resource quota current value of the current PS node in the PS cluster are acquired, and before the training task of the model is started, it is determined whether historical training record data of the model exists (including record data of successful training tasks and record data of failed training tasks). If historical training record data of the model exists, resource quota data of the training task of the model at the current PS node is estimated according to the historical training record data, and the training task of the model is started according to the resource quota data. If no historical training record data of the model exists, the resource quota data required for the training task to proceed stably and run to completion is estimated from the definition file of the model, and the training task of the model is started according to that resource quota data.
In a preferred embodiment of the present invention, one implementation of obtaining the rating of the resource quota is to fit a resource occupancy growth curve according to the resource occupancy of the training phase of the model; estimating the completion time point of the training task of the model according to the training average time consumption duration of the single sample data of the model and the number of the sample data; and acquiring a rated value of the resource quota according to the resource occupancy increasing curve and the completion time point.
In practical application, in the training process of the model, the number of parameters of the model can expand, and the resource occupation amount of the PS node can increase along with the expansion. The increasing phase of the resource occupancy can be divided into the following two phases: a loading phase and a training phase. The embodiment of the invention mainly aims at fitting a resource occupation growth curve of the resource occupation amount in the training phase.
Referring to fig. 2, a schematic diagram of a resource occupancy growth curve according to an embodiment of the present invention is shown. In fig. 2, the abscissa represents each time point in the model training process, and the ordinate represents the resource occupancy of the PS node. In embodiments of the present invention, resources include, but are not limited to, memory resources, CPU resources, GPU resources, and the like.
When the completion time point of the training task of the model is estimated, the average training time of a single sample can be multiplied by the number of samples to obtain the total duration of the training task; then, taking either the training start time of the first sample or the start time of the training task as the starting point, the time point reached after the total duration is taken as the completion time point of the training task. For example, if the average training time of a single sample is 3 minutes, the start time of the training task is 05-29 00:00, and the number of samples is 5, then the completion time point is 05-29 00:15.
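The arithmetic of the example above can be reproduced with standard date handling. The function name is hypothetical, and the year in the start time is an arbitrary placeholder, because the source gives only the month, day and time.

```python
from datetime import datetime, timedelta


def estimate_completion(start: datetime, avg_per_sample: timedelta,
                        num_samples: int) -> datetime:
    """Completion time = start time + average time per sample * sample count."""
    return start + avg_per_sample * num_samples


# Values from the example: 3 minutes per sample, 5 samples, start 05-29 00:00.
# The year 2021 is a placeholder only.
start = datetime(2021, 5, 29, 0, 0)
done = estimate_completion(start, timedelta(minutes=3), 5)
```

Here `done` falls 15 minutes after `start`, which matches the completion time of 05-29 00:15 in the example.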
When the rated value of the resource quota is obtained according to the resource occupancy growth curve and the completion time point, the resource occupancy corresponding to the completion time point on the resource occupancy growth curve can be used as the rated value of the resource quota. For example, if the completion time point is 05-29 00:15 and the resource occupancy corresponding to 05-29 00:15 on the curve is 4 GB, the rated value of the resource quota is 4 GB.
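A sketch of reading the rated value off a fitted curve follows. The source does not state the functional form of the resource occupancy growth curve, so the straight-line least-squares fit here is purely an assumption, and the sample measurements are invented so that the result lines up with the 4 GB example.

```python
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * t + intercept."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_t


# Invented measurements: (minutes since training start, occupancy in GB).
samples = [(0.0, 1.0), (3.0, 1.6), (6.0, 2.2), (9.0, 2.8)]
slope, intercept = fit_line(samples)

# Evaluate the fitted curve at the estimated completion point (minute 15).
completion_minute = 15.0
rated_quota_gb = slope * completion_minute + intercept
```

With these invented samples the extrapolated occupancy at minute 15 is about 4 GB. A real growth curve may well be non-linear, in which case a different model would be fitted in the same place.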
Based on the above description about the embodiment of the resource quota processing method for the PS node, an automatic flexible capacity expansion and reduction scheme for the resource quota of the PS node in the PS architecture is described below.
Referring to fig. 3, a flowchart of an automatic flexible capacity expansion and reduction scheme for PS node resource quotas under a PS architecture according to an embodiment of the present invention is shown. The scheme mainly comprises a monitoring module, a capacity expansion module, a capacity reduction module and a starting module.
The training data or sample data are split and distributed to different Worker nodes, which train on them in parallel, calculate the gradients used to update the model parameters, and report those gradients to the PS node. Meanwhile, the Worker nodes write the training progress of the model into the database.
The monitoring module acquires the resource utilization rate from the PS nodes in the PS cluster and writes it into the database. The monitoring module then judges whether the resource utilization rate is greater than the resource utilization rate upper limit threshold or less than the resource utilization rate lower limit threshold. When the resource utilization rate is greater than the upper limit threshold, the monitoring module notifies the capacity expansion module to perform capacity expansion processing; when the resource utilization rate is less than the lower limit threshold, it notifies the capacity reduction module to perform capacity reduction processing.
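The threshold check performed by the monitoring module can be illustrated as follows. This is a minimal sketch: the `Action` enum, the function name `check_utilization`, and the threshold values 0.3 and 0.8 are hypothetical and are not taken from the embodiment.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    NOTIFY_EXPANSION = "notify_expansion"
    NOTIFY_REDUCTION = "notify_reduction"


def check_utilization(utilization, lower_threshold, upper_threshold):
    """Decide which module, if any, the monitoring module should notify."""
    if utilization > upper_threshold:
        return Action.NOTIFY_EXPANSION
    if utilization < lower_threshold:
        return Action.NOTIFY_REDUCTION
    return Action.NONE


# Hypothetical thresholds: 30% lower limit, 80% upper limit.
print(check_utilization(0.9, lower_threshold=0.3, upper_threshold=0.8))
```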
The capacity reduction module performs capacity reduction processing on the current value of the resource quota when the ideal resource quota rated value estimated for the completion of model training is smaller than the current value of the resource quota. If the resource utilization rate is less than the lower limit threshold of the resource utilization rate, but the estimated ideal resource quota rated value is greater than or equal to the current value of the resource quota, capacity reduction processing is not performed, because capacity expansion processing may be triggered subsequently.
When the ideal resource quota rated value estimated for the completion of model training is greater than or equal to the current value of the resource quota, and vacant resources exist in the PS cluster, the capacity expansion module calculates the resource quota that can be added according to the estimated ideal resource quota rated value and the vacant resources in the PS cluster, and performs capacity expansion according to the calculated resource quota. If the PS cluster has no vacant resources, it is determined whether the resource utilization rate has reached 100%. If the resource utilization rate is less than 100%, capacity expansion is abandoned and training of the model continues; if the resource utilization rate has reached 100%, the current value of the resource quota is insufficient to support continued training of the model, so the training progress of the model is saved and the starting module is notified to restart the training task.
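The branches of the capacity expansion module described above can be sketched as a single decision function. The embodiment does not give a formula for the quota increase; capping it at the smaller of the gap to the rated value and the vacant resources is an assumption, as are the function name `handle_expansion` and the returned labels.

```python
def handle_expansion(rated, current, vacant, utilization):
    """Return (decision, new_quota) for a capacity expansion request.

    rated, current and vacant are in the same unit (for example GB);
    utilization is a fraction between 0.0 and 1.0.
    """
    if rated < current:
        # The expansion condition (rated >= current) is not met.
        return "no_expansion", current
    if vacant > 0:
        # Assumption: grow toward the rated value, limited by vacant resources.
        increase = min(rated - current, vacant)
        return "expand", current + increase
    if utilization < 1.0:
        # No vacant resources, but the node is not yet exhausted.
        return "continue_training", current
    # No vacant resources and utilization has reached 100%.
    return "save_progress_and_restart", current


print(handle_expansion(rated=4.0, current=2.0, vacant=1.0, utilization=0.9))
```

In this sketch the function only returns a decision; saving the training progress and notifying the starting module would be carried out by whatever component calls it.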
Before starting a training task of a certain model (including a new task and a restarted old task), the starting module judges whether historical training record data of the model exist. If historical training record data of the model exist, a resource quota rated value is estimated according to the historical training record data, and the task is started according to the resource quota rated value; if historical training record data of the model do not exist, a resource quota initial value is estimated according to the definition file of the model, and the task is started according to the resource quota initial value. This reduces, to some extent, the probability of resource capacity expansion or reduction after model training begins.
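The choice made by the starting module can be sketched as below. How the rated value is derived from historical records is not detailed in this paragraph, so taking the maximum of past peak occupancies is purely an assumption, as are the function name `choose_start_quota` and its parameters.

```python
def choose_start_quota(history_peaks_gb, definition_estimate_gb):
    """Pick the quota used to start a training task.

    history_peaks_gb: peak occupancies from earlier runs (hypothetical format).
    definition_estimate_gb: estimate derived from the model definition file.
    """
    if history_peaks_gb:
        # Assumption: use the largest historical peak as the rated value.
        return max(history_peaks_gb), "rated"
    return definition_estimate_gb, "initial"


print(choose_start_quota([3.5, 4.0], definition_estimate_gb=2.0))  # (4.0, 'rated')
print(choose_start_quota([], definition_estimate_gb=2.0))          # (2.0, 'initial')
```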
According to the embodiment of the invention, when the training task of the model is started, the resource quota rated value can be estimated from the historical training record data and the model is trained according to that rated value, which reduces the probability of elastic capacity expansion or reduction during model training. During training of the model, the resource utilization rate of the PS node is monitored and the current value of the resource quota of the PS node is elastically expanded or reduced, which improves the stability of the model training task and the resource utilization rate.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 4, a block diagram of a structure of a PS node resource quota processing apparatus according to an embodiment of the present invention is shown, where the PS node resource quota processing apparatus may specifically include the following modules:
an obtaining module 41, configured to obtain a resource utilization rate, a resource quota rated value, and a resource quota current value of a current PS node in the PS cluster, where the current PS node is used to store and update parameters of the model;
a capacity reduction module 42, configured to perform capacity reduction processing on the current value of the resource quota when the resource utilization rate is smaller than a preset lower threshold of the resource utilization rate and the rated value of the resource quota is smaller than the current value of the resource quota;
and the capacity expansion module 43 is configured to perform capacity expansion processing on the current value of the resource quota when the resource utilization rate is greater than a preset upper threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota.
In a preferred embodiment of the present invention, the apparatus further comprises:
a maintaining module, configured to, when the resource utilization rate is greater than the preset upper limit threshold of the resource utilization rate and the resource quota rated value is greater than or equal to the current value of the resource quota, prohibit capacity expansion processing on the current value of the resource quota and continue training the model if no vacant resource exists and the resource utilization rate is less than the rated threshold of the resource utilization rate.
In a preferred embodiment of the present invention, the apparatus further comprises:
a termination module, configured to, when the resource utilization rate is greater than the preset upper limit threshold of the resource utilization rate and the resource quota rated value is greater than or equal to the current value of the resource quota, save the training progress and parameters of the model and terminate the training of the model if no vacant resource exists and the resource utilization rate is greater than or equal to the rated threshold of the resource utilization rate.
In a preferred embodiment of the present invention, the apparatus further comprises:
and the forbidding module is used for forbidding to perform capacity reduction processing on the current value of the resource quota when the resource utilization rate is less than the lower limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota.
In a preferred embodiment of the present invention, the prohibiting module is further configured to prohibit a capacity expansion process on the current value of the resource quota when the resource utilization is greater than the upper threshold of the resource utilization and the rated value of the resource quota is less than or equal to the current value of the resource quota.
In a preferred embodiment of the present invention, the apparatus further comprises:
an estimating module, configured to, before the obtaining module 41 obtains the resource utilization rate, the resource quota rated value, and the current value of the resource quota of the current PS node in the PS cluster, and before the training task of the model is started, estimate resource quota data of the training task of the model at the current PS node according to historical training record data of the model, if such data exist.
In a preferred embodiment of the present invention, the obtaining module 41 includes:
the curve fitting module is used for fitting a resource occupation growth curve according to the resource occupation amount of the training stage of the model;
the time point calculating module is used for predicting the completion time point of the training task of the model according to the training average time consumption duration of the single sample data of the model and the number of the sample data;
and the rated value acquisition module is used for acquiring the rated value of the resource quota according to the resource occupancy increase curve and the completion time point.
Since the device embodiment is basically similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method and the device for processing the resource quota of the PS node provided by the present invention are described in detail above, and a specific example is applied in the present document to explain the principle and the implementation of the present invention, and the description of the above embodiment is only used to help understand the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (14)

1. A method for processing resource quotas of PS nodes is characterized by comprising the following steps:
acquiring the resource utilization rate, the resource quota rated value and the resource quota current value of a current PS node in a PS cluster, wherein the current PS node is used for storing and updating parameters of a model;
when the resource utilization rate is smaller than a preset lower limit threshold of the resource utilization rate and the rated value of the resource quota is smaller than the current value of the resource quota, carrying out capacity reduction processing on the current value of the resource quota;
and when the resource utilization rate is greater than a preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, performing capacity expansion processing on the current value of the resource quota.
2. The method of claim 1, wherein when the resource utilization rate is greater than the preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, the method further comprises:
and when no vacant resource exists, if the resource utilization rate is smaller than the rated threshold of the resource utilization rate, forbidding to carry out capacity expansion processing on the current value of the resource quota, and continuing to train the model.
3. The method of claim 1, wherein when the resource utilization rate is greater than the preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota, the method further comprises:
and when no vacant resource exists, if the resource utilization rate is greater than or equal to the resource utilization rate rated threshold, saving the training progress and parameters of the model, and terminating the training of the model.
4. The method of claim 1, further comprising:
and when the resource utilization rate is smaller than the lower limit threshold of the resource utilization rate and the rated value of the resource quota is larger than or equal to the current value of the resource quota, forbidding the capacity reduction processing on the current value of the resource quota.
5. The method of claim 1, further comprising:
and when the resource utilization rate is greater than the upper threshold of the resource utilization rate and the rated value of the resource quota is less than or equal to the current value of the resource quota, prohibiting the capacity expansion processing on the current value of the resource quota.
6. The method of claim 1, wherein before the acquiring of the resource utilization rate, the resource quota rated value and the resource quota current value of the current PS node in the PS cluster, the method further comprises:
before the training task of the model is started, if historical training record data of the model exist, resource quota data of the training task of the model at the current PS node are estimated according to the historical training record data.
7. The method of claim 1, wherein acquiring the resource quota rated value comprises:
fitting a resource occupation growth curve according to the resource occupation amount of the model in the training stage;
estimating the completion time point of the training task of the model according to the training average time consumption duration of the single sample data of the model and the number of the sample data;
and acquiring the rated value of the resource quota according to the resource occupancy increasing curve and the completion time point.
8. A resource quota processing apparatus of a PS node, comprising:
the acquisition module is used for acquiring the resource utilization rate, the resource quota rated value and the resource quota current value of the current PS node in the PS cluster, wherein the current PS node is used for storing and updating the parameters of the model;
the capacity reduction module is used for carrying out capacity reduction processing on the current value of the resource quota when the resource utilization rate is smaller than a preset lower limit threshold of the resource utilization rate and the rated value of the resource quota is smaller than the current value of the resource quota;
and the capacity expansion module is used for carrying out capacity expansion processing on the current value of the resource quota when the resource utilization rate is greater than a preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota.
9. The apparatus of claim 8, further comprising:
and the maintaining module is used for forbidding to perform capacity expansion processing on the current value of the resource quota and continuing to train the model if no vacant resource exists and the resource utilization rate is less than the rated threshold of the resource utilization rate when the resource utilization rate is greater than a preset upper threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota.
10. The apparatus of claim 8, further comprising:
and the termination module is used for saving the training progress and parameters of the model and terminating the training of the model if the resource utilization rate is greater than a preset upper limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota and no vacant resource exists and the resource utilization rate is greater than or equal to the rated threshold of the resource utilization rate.
11. The apparatus of claim 8, further comprising:
and the forbidding module is used for forbidding to perform capacity reduction processing on the current value of the resource quota when the resource utilization rate is less than the lower limit threshold of the resource utilization rate and the rated value of the resource quota is greater than or equal to the current value of the resource quota.
12. The apparatus of claim 11, wherein the prohibiting module is further configured to prohibit capacity expansion processing on the current value of the resource quota when the resource utilization rate is greater than the upper limit threshold of the resource utilization rate and the rated value of the resource quota is less than or equal to the current value of the resource quota.
13. The apparatus of claim 8, further comprising:
and the estimation module is used for estimating the resource quota data of the training task of the model at the current PS node according to the historical training record data if the historical training record data of the model exists before the acquisition module acquires the resource utilization rate, the resource quota rated value and the current value of the resource quota of the current PS node in the PS cluster and before the training task of the model is started.
14. The apparatus of claim 8, wherein the obtaining module comprises:
the curve fitting module is used for fitting a resource occupation growth curve according to the resource occupation amount of the training stage of the model;
the time point calculating module is used for predicting the completion time point of the training task of the model according to the training average time consumption duration of the single sample data of the model and the number of the sample data;
and the rated value acquisition module is used for acquiring the rated value of the resource quota according to the resource occupancy increase curve and the completion time point.
CN202111621545.0A 2021-12-23 2021-12-23 Resource quota processing method and device for PS (packet switched) node Pending CN114416286A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111621545.0A CN114416286A (en) 2021-12-23 2021-12-23 Resource quota processing method and device for PS (packet switched) node

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111621545.0A CN114416286A (en) 2021-12-23 2021-12-23 Resource quota processing method and device for PS (packet switched) node

Publications (1)

Publication Number Publication Date
CN114416286A true CN114416286A (en) 2022-04-29

Family

ID=81269050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111621545.0A Pending CN114416286A (en) 2021-12-23 2021-12-23 Resource quota processing method and device for PS (packet switched) node

Country Status (1)

Country Link
CN (1) CN114416286A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113568746A (en) * 2021-07-27 2021-10-29 北京达佳互联信息技术有限公司 Load balancing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110413391B (en) Deep learning task service quality guarantee method and system based on container cluster
US10572285B2 (en) Method and apparatus for elastically scaling virtual machine cluster
US20170331705A1 (en) Resource Scaling Method on Cloud Platform and Cloud Platform
CN113010260A (en) Elastic expansion method and system for container quantity
CN110865842B (en) OTA upgrading method and equipment
CN110289994B (en) Cluster capacity adjusting method and device
CN104113576A (en) Method and device for updating client
CN114416286A (en) Resource quota processing method and device for PS (packet switched) node
CN104104645B (en) A kind of cross-platform method for managing resource and system
CN106775470B (en) Data storage method and system
CN106464733A (en) Method and device for adjusting virtual resources in cloud computing
CN113190405B (en) Node health detection method and device, electronic equipment and storage medium
CN104484222A (en) Virtual machine dispatching method based on hybrid genetic algorithm
CN106100901B (en) Flow velocity control method and device
CN111858200B (en) Throughput control method and device in system test and electronic equipment
CN105491117A (en) Flow chart data processing system and method for real time data analysis
CN111124673A (en) Data acquisition system and method
CN103578274B (en) A kind of traffic flow forecasting method and device
CN116627659B (en) Model check point file storage method, device, equipment and storage medium
CN112395045A (en) Virtual machine recovery and resource adjustment method thereof
CN115658116B (en) Storage cluster upgrade control method, device, equipment and storage medium
CN107368355B (en) Dynamic scheduling method and device of virtual machine
CN109002264B (en) Method and device for determining data distribution based on system capacity expansion
CN113822307A (en) Image prediction method, device and storage medium
CN111176814A (en) Task execution method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination