CN116089477A

CN116089477A - Distributed training method and system

Info

Publication number: CN116089477A
Application number: CN202310374312.8A
Authority: CN
Inventors: 高礼; 殷贞玲
Original assignee: Honor Device Co Ltd
Current assignee: Honor Device Co Ltd
Priority date: 2023-04-10
Filing date: 2023-04-10
Publication date: 2023-05-09
Anticipated expiration: 2043-04-10
Also published as: CN116089477B

Abstract

The application provides a distributed training method and a distributed training system. The method comprises the following steps: creating a cache process of a first cache service corresponding to the first data set in a plurality of cache nodes of the first cluster; creating a first computing task process corresponding to a first computing task in a plurality of first computing nodes of a first cluster, and setting a first cache service as input of the first computing task, wherein the first computing nodes belong to a computing node group; determining whether to expand the capacity according to the first cache service, the first computing task and the first computing node; if the capacity expansion is determined, a cache process of the first cache service is created in the first computing node, so that in the process of training the first computing task, the first computing node reads data from the cache process in the first computing node to complete training of the first computing task process on the first computing node. Therefore, the data required by training can be read locally in the training process of the calculation task, and the training speed is improved.

Description

Distributed training method and system

Technical Field

The present application relates to the field of terminal devices, and in particular, to a distributed training method and system.

Background

At present, a large number of machine learning tasks are trained in a cloud primary container environment, so that computing resources are conveniently and efficiently utilized. The data sets required for training are typically stored in remote storage services such as obs, nfs, etc., and the training task on the compute node requires reading the remote data sets.

The training time of the training task includes the reading time and the calculating time of the training data. When the calculation time of the training task is smaller than the remote data reading time, the data reading speed limits the task training speed, so that the training speed is slower.

Disclosure of Invention

In order to solve the technical problems, the application provides a distributed training method and system, which can automatically expand the cache process corresponding to the data set required by the calculation task training to the calculation node where the calculation task is located, so that the data required by the training can be read locally in the calculation task training process, and the training speed is improved.

In a first aspect, the present application provides a distributed training method, the method comprising: creating a cache process of a first cache service corresponding to a first data set in a plurality of cache nodes of a first cluster, wherein the first data set is a data set in a remote database outside the first cluster; creating a first computing task process corresponding to a first computing task in a plurality of first computing nodes of a first cluster, and setting a first cache service as input of the first computing task, wherein the first computing nodes belong to a computing node group; determining whether to expand the capacity according to the first cache service, the first computing task and the first computing node; if the capacity expansion is determined, a cache process of the first cache service is created in the first computing node, so that in the process of training the first computing task, the first computing node reads data from the cache process in the first computing node to complete training of the first computing task process on the first computing node. Therefore, the buffer process corresponding to the data set required by the calculation task training can be automatically expanded to the calculation node where the calculation task is located, so that the data required by the training can be read locally in the calculation task training process, and the training speed is improved.

According to a first aspect, determining whether to expand according to a first cache service, a first computing task, and a first computing node includes: the method comprises the steps of obtaining first characteristic data corresponding to a first cache service, second characteristic data corresponding to a first computing task and third characteristic data corresponding to a computing node group; extracting a first feature vector from the first feature data, extracting a second feature vector from the second feature data, and extracting a third feature vector from the third feature data; obtaining a combined feature vector according to the first feature vector, the second feature vector and the third feature vector; and inputting the combined feature vector into a trained capacity expansion decision model, and outputting a decision result of whether capacity expansion is performed or not by the capacity expansion decision model.

According to a first aspect, the capacity expansion decision model is a classification model.

According to a first aspect, the first characteristic data comprises statistics, cache setting information and cache application information of the first data set.

According to a first aspect, the statistical information of the first data set includes a total file size, a total number of files, a file format of the first data set; the cache setting information of the first data set comprises cache capacity, cache medium and cache process number; the cache application information of the first data set includes the number of computing tasks to which the cache of the first data set is applied, and the computing task history information to which the cache of the first data set is applied.

According to the first aspect, the second characteristic data includes any one or more of the following: task priority, user information, applied CPU resources, applied GPU resources, applied memory resources, used input data information, corresponding algorithm types and historical execution information.

According to the first aspect, the third characteristic data includes any one or more of the following: the CPU information, the GPU information, the memory information and the solid state disk information which can be allocated by each computing node are allocated, and the network topology structure of each computing node is located.

According to a first aspect, a caching process of a first caching service corresponding to a first data set is created in a plurality of caching nodes of a first cluster, including: receiving a first cache service creation request; acquiring the data volume of a first data set; if the data volume of the first data set is smaller than the data volume threshold value, setting the cache capacity of a cache process of the first cache service to be equal to the data volume of the first data set; setting a cache initialization tag and a cache service tag for a first cache service resource corresponding to a first data set; sending a first instruction to a first cluster, wherein the first instruction carries a first cache service resource; according to a first instruction, a cache process of a first cache service corresponding to a first data set is created in a plurality of cache nodes with cache initialization tags in a first cluster; the data in the first data set is loaded into a caching process.

According to a first aspect, determining whether to expand according to a first cache service, a first computing task, and a first computing node includes: if the available storage resources of the first computing node are greater than the data amount of the first data set, a capacity expansion is determined.

According to a first aspect, determining whether to expand according to a first cache service, a first computing task, and a first computing node includes: acquiring the priority of a first computing task; and if the priority of the first computing task is higher than the preset level, determining the capacity expansion.

According to a first aspect, determining whether to expand according to a first cache service, a first computing task, and a first computing node includes: acquiring historical training speed of an algorithm of a first computing task; and if the historical training speed of the algorithm of the first computing task is smaller than a preset speed value, determining expansion.

According to a first aspect, each caching process stores all data of the first data set.

In a second aspect, the present application provides a distributed training system comprising a control node and a first cluster, wherein: a control node for: creating a cache process of a first cache service corresponding to a first data set in a plurality of cache nodes of a first cluster, wherein the first data set is a data set in a remote database outside the first cluster; creating a first computing task process corresponding to a first computing task in a plurality of first computing nodes of a first cluster, and setting a first cache service as input of the first computing task, wherein the first computing nodes belong to a computing node group; determining whether to expand the capacity according to the first cache service, the first computing task and the first computing node; if the capacity expansion is determined, a cache process of a first cache service is created in the first computing node; and the first computing nodes in the first cluster are used for reading data from the cache processes in the first computing nodes in the process of training the first computing tasks so as to complete the training of the first computing task processes on the first computing nodes.

In a third aspect, the present application provides an electronic device, comprising: a memory and a processor, the memory coupled to the processor; the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the distributed training method of any of the first aspects.

In a fourth aspect, the present application provides a computer readable storage medium comprising a computer program which, when run on an electronic device, causes the electronic device to perform the distributed training method of any one of the preceding aspects.

Drawings

FIG. 1 is an exemplary diagram of a system architecture for distributed training correlation as shown by way of example;

FIG. 2 is an exemplary diagram illustrating the deployment of a distributed caching service in connection with distributed training;

FIG. 3 is an exemplary diagram of an exemplary illustrated caching service creation process;

FIG. 4 is an exemplary diagram of an exemplary illustrated caching service application process;

fig. 5 is an exemplary diagram of an exemplary illustrated cache service expansion process;

FIG. 6 is a schematic diagram illustrating a process of obtaining a result of whether to expand from input data required by an expansion decision model;

fig. 7 is a diagram illustrating the capacity expansion and reading of different data sets.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone.

The terms first and second and the like in the description and in the claims of embodiments of the present application are used for distinguishing between different objects and not necessarily for describing a particular sequential order of objects. For example, the first target object and the second target object, etc., are used to distinguish between different target objects, and are not used to describe a particular order of target objects.

In the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.

In the description of the embodiments of the present application, unless otherwise indicated, the meaning of "a plurality" means two or more. For example, the plurality of processing units refers to two or more processing units; the plurality of systems means two or more systems.

Fig. 1 is a diagram illustrating an exemplary architecture of a distributed training-related system. Referring to fig. 1, in this embodiment, a distributed training related system may include a cluster 1, a remote server, and a control node. Wherein, the control node is deployed with a management service, and the management service is an application on the control node device. Wherein the remote server has data sets stored therein, these data sets stored in the remote server are referred to herein as remote data sets for convenience of description.

Wherein cluster 1 is located in a cloud native environment, i.e., a k8s machine environment. The remote server and the control node are both k8s external machines.

It should be noted that, although only one data set is illustrated in the remote servers in fig. 1, it should be understood that each remote server may store a plurality of data sets, and is not limited to one.

Also, while only one remote server is illustrated in FIG. 1, it should be understood that the number of remote servers in a distributed training-related system may be multiple.

It should be noted that the control node may be located in the cluster 1, or may be located outside the cluster 1.

It will be appreciated that the management service may be deployed at a plurality of nodes, each of which may act as a control node. The management service may follow a micro-service architecture.

With continued reference to FIG. 1, cluster 1 includes a set of compute nodes and a set of cache initialization nodes, as well as other sets of nodes. Wherein the computing node group may include a plurality of computing nodes, and the cache initialization node group may include a plurality of cache nodes. These node groups are pre-partitioned by an administrator. The nodes in the computing node group are called computing nodes, and each computing node is provided with a GPU tag; nodes in the cache initialization node group are called cache nodes, and each cache node is provided with a cpu label, and the cpu is called a cache initialization label. Thus, depending on whether the label of the node is GPU or cpu, it is possible to distinguish between compute nodes and cache nodes. It should be noted that, the cpu tag indicates that the node with the tag belongs to the cache initialization node group, and the tag GPU indicates that the node with the tag belongs to the computing node group, and at this time, the cpu and the GPU are only tag names. In other embodiments, nodes in the cache initialization node group may use other labels besides cpu, and nodes in the computing node group may use other labels besides GPU.

Wherein, each computing node is provided with a cache client daemon. The cache client daemon is configured to create a cache client on a node that has been set with a cache client tag client-id upon detecting that the node has been set with a cache client tag client-id.

The other node groups are provided with an api interface service, a controller, a scheduler and a database. Devices outside the cluster 1, such as control nodes, may call controllers, schedulers, etc. in the cloud native environment (i.e., the k8s machine environment) through the api interface service to control the nodes within the cluster 1 to perform corresponding operations. Thus, the management service in the control node may control the nodes in the cluster 1 by sending instructions or information to the api interface service in other node groups in the cluster 1.

FIG. 2 is an exemplary diagram illustrating deployment of a distributed caching service in connection with distributed training. The process of distributed training is described below in connection with fig. 1, 2 and subsequent fig. 3, 4 and 5.

First, a description is given of a creation process of the distributed cache service.

Referring to fig. 1, a process 1 illustrates creation of a cache service. The process 1 comprises the following steps:

1.1, a management service in a control node sends an instruction 1 for creating a cache service resource to an api interface service, wherein the instruction 1 carries the cache service resource, and the cache service resource is provided with a cache initialization tag cpu and a cache service tag cache-id.

Each cache service corresponds to one cache-id, and the cache-ids corresponding to different cache services are different. The cache initialization tag cpu is used for indicating that a cache service is created on a node in the cache initialization node group. The scheduler and controller within the cluster determines to create a cache service on nodes having the same tag (i.e., cache initialization tag cpu) based on the cache initialization tag cpu. The cache service tag cache-id is used to indicate which cache service is created.

1.2, the api interface service calls a controller, a scheduler and the like to create a caching process worker of the caching service in a plurality of caching nodes of the caching initialization node group according to the instruction 1.

It should be noted that, although fig. 1 shows only one cache node that creates a cache process worker for a cache service, it should be understood that, in an application, the cache process worker may be created in a plurality of cache nodes.

The method comprises the steps that a caching process worker is created in which caching nodes of a caching initialization node group, and schedulers in other node groups are determined according to a preset scheduling policy. After creating the cache process worker in the cache node, the controller sets a cache-id tag for the cache node that has created the cache process worker to indicate that the node has created a cache service that is tagged as cache-id. While a cache node without a cache process worker has a cache initialization tag cpu but no cache-id tag.

1.3, preloading the original data of the data set 1 in the remote server into a caching process worker of the caching node.

In this way, the data set 1 is cached in the caching process worker of the caching node, and the computing nodes in the cluster 1 do not need to read data from a remote server in the process of training the computing task, but read data from the caching process worker of the caching node.

For a detailed procedure of 1.1 to 1.3, please refer to the flow chart shown in fig. 3.

Fig. 3 is an exemplary diagram of an exemplary illustrated caching service creation process. Referring to fig. 3, in this embodiment, the cache service creation process may include the following steps:

s301, the management service in the control node receives a cache service creation request.

In an application, a user may issue a cache service creation request to a management service by clicking on an option to create a cache service in the management service.

S302, the management service responds to the operation of selecting the target data set by the user, and obtains the related information of the target data set, wherein the related information of the target data set comprises the data volume of the target data set.

In this embodiment, the target data set is data set 1 in fig. 1.

The user may select a target data set from the current list of data sets. Wherein the list of data sets may comprise all data sets in the respective remote servers to which the control node is currently connected. The database of the management service of the control node stores information of all data sets of the remote server, and the data set information includes information such as data amount of the data sets. In one example, the data set information stored in the database of the management service may be input in advance by the user.

For example. Assuming that the dataset list includes dataset 1, dataset 2, dataset 3, dataset 4, and dataset 5 datasets, wherein the dataset in the remote server in fig. 2 is dataset 1, when the user selects dataset 1 in the dataset list, then dataset 1 is the target dataset, and the management service obtains relevant information of dataset 1 from the database, the relevant information including the data volume of dataset 1.

S303, the management service judges whether the data volume of the target data set is smaller than a data volume threshold, if so, step S305 is executed, and otherwise, step S304 is executed.

In one example, the database threshold may be set according to the hard disk storage capacity of the node. For example, the data amount threshold may be set equal to 60% of the hard disk storage capacity.

In this embodiment, the management service sets the buffer capacity of a single buffer process worker in the buffer service according to the data amount of the target data set and the data amount threshold.

The caching service may include one or more caching processes worker. The number of caching processes worker included in the caching service may be set based on the data volume of the data set and the data volume threshold.

S304, setting the buffer capacity of a single buffer process worker to be equal to a data quantity threshold value, closing the elastic scheduling, and executing step S306.

When the data volume of the target data set is greater than the data volume threshold, it indicates that the data set is too large. At this time, all original data of a whole target data set are cached by using a plurality of caching process workers, the caching capacity of a single caching process worker is set to be equal to a data quantity threshold, and the number of the caching process workers is equal to the quotient obtained by dividing the data quantity of the target data set by the data quantity threshold. In the case of no integer division, the number of caching processes worker is obtained by means of a round-up.

When the data volume of the target data set is greater than the data volume threshold, the flexible schedule does not need to be turned on to cache the data in the data set to the computing node local, and therefore the flexible schedule is turned off. The management service may set a status flag bit for each cache process worker, where the status flag bit is used to indicate whether resilient scheduling is on. For example, in one example, if the status flag bit of the cache process worker is 1, indicating that the resilient dispatch of the cache process worker is on, the cache process worker may be expanded to the compute node; if the status flag bit of the caching process worker is 0, the elastic scheduling of the caching process worker is closed, and the caching process worker cannot be expanded to the computing node.

S305, setting the buffer capacity of a single buffer process worker to be equal to the data volume of the target data set, and starting the flexible scheduling.

When the data volume of the target data set is smaller than or equal to the data volume threshold, one caching process worker is used for caching all original data of the whole target data set, and the caching capacity of a single caching process worker is set to be equal to the data volume of the target data set. In one example, the number of initialization caching processes, worker, may be set to 2, taking into account both resource utilization and data read speed. In this way, the method can not only resist the influence of possible single-point faults, but also save storage resources to a large extent.

Assuming that the data size of the data set 1 in fig. 2 is smaller than the data size threshold, 2 buffer processes are set for the data set 1, and the buffer capacity of a single buffer process is set to be equal to the data size of the data set 1, and the function of flexibly scheduling the buffer process is started, so that the buffer process of the data set 1 can be flexibly scheduled.

S306, setting an affinity tag for the cache service resource.

The affinity tag in this step includes a cache service tag cache-id and a cache initialization tag cpu, where the cache service tag cache-id may control creation of a cache process worker, and the cache initialization tag may control cache process worker initialization scheduling to a cache initialization node group. The affinity weight of the initialization tag may be set lower than the cache-service tag cache-id to ensure preferential scheduling to the cache-service tag cache-id node.

Wherein steps S301 to S306 are completed in the management service before step 1.1 in fig. 1. After step S306, the management service performs step 1.1 in fig. 1.

After step S306, the management service generates instruction 1 in step 1.1 in fig. 1, and then performs step 1.1, so that cluster 1 receives instruction 1.

S307, creating a caching process worker.

In this step, the k8s scheduler and controller in cluster 1 creates a caching process worker according to instruction 1. Wherein the scheduler is used for node scheduling (i.e. determining which nodes to create the cache service), and the controller is used for creating the cache process worker on the corresponding node according to the scheduling result (i.e. the nodes scheduled by the scheduler).

Referring to fig. 2, it is assumed that the current creation results in that a cache process worker1 and a cache process worker2 are created on 2 cache nodes (i.e., the cache node 1 and the cache node 2 in fig. 2), respectively.

S308, starting a preloading task, and loading the original data of the target data set to a caching process worker.

The management service timing task queries k8s, and whether the caching process is created to be completed. If so, k8s initiates the data preloading task.

With the embodiment shown in fig. 3, a user may create a plurality of cache services, respectively, and after the creation is completed, the cache services may be displayed in a cache service list. A list of cache services may be displayed in the interface for creating the computing task for the user to select a cache service from the list of cache services when creating the computing task.

Then, a description will be given of a creation process of the computing task (corresponding to the cache service application process).

Continuing with FIG. 1, process 2 therein represents creating a computing task. Process 2 may include the steps of:

2.1, the management service in the control node sends an instruction 2 for creating the computing task resource to the api interface service, wherein the instruction 2 carries cache service information and the computing task resource.

The cache service information may include a cache service tag cache-id.

2.2, creating a computing task process on the computing node.

The method comprises the steps of creating a computing task process in which computing nodes of a computing node group, wherein the computing task process is determined by schedulers in other node groups according to resource conditions of all computing nodes and a preset scheduling policy.

And 2.3, setting a cache client tag client-id for the computing node where the computing task process is located.

The role of the client side tab is to control client side creation. When a node is tagged with a client, a cache client daemon in the management service creates a corresponding cache client at the node.

2.4, setting a cache client daemon on the computing node of the cache client label client-id, and after detecting that the computing node is set with the cache client label client-id, creating a cache client on the computing node.

The cache client is used for reading data from the cache process worker.

Each cache client corresponds to all the cache process workers of one cache service, that is, each cache client can read data from all the cache process workers of one cache service.

2.1 to 2.4 refer to fig. 4.

Fig. 4 is an exemplary diagram of an exemplary illustrated caching service application process. Referring to fig. 4, in this embodiment, the cache service application process may include the following steps:

s401, the management service configures the cache service as input of the computing task according to the operation of selecting the cache service when the user creates the computing task.

In the case where a caching service is available, the user may select the caching service as input when creating a computing task.

For example, assuming that there are currently 3 cache services available, cache service 1, cache service 2, and cache service 3, then the 3 cache services are all selectable by the user when creating computing task X. Assuming the user selects cache service 1, the management service configures cache service 1 as an input to computing task X.

Assume that cache service 1 is the cache service created by the embodiment shown in fig. 3.

And S402, after submitting the calculation task to the api interface service, the management service dispatches the calculation task to the calculation node, namely, creates a calculation task process on the calculation node.

The management service sends an instruction 2 for creating the computing task resource to the api interface service, namely, the management service is considered to submit the computing task to the api interface service.

Referring to fig. 2, assume that the present computing task is scheduled to 2 computing nodes, namely computing node 1 and computing node 2 in fig. 2. The computing process 1 in computing node 1 and the computing process 2 in computing node 2 run distributed tasks of the computing task. I.e. the current calculation task is distributed to the calculation node 1 and the calculation node 2.

S403, the controller in the cluster 1 detects the use condition of the cache service resource at regular time, if the cache service is used by the computing node, the node is marked with a cache client label client-id corresponding to the cache service.

The database inside the cluster 1 stores the use of the cache resources. The content of the use case of the cache resource may be: which cache service is used by the computing container. The controller of the other node groups of cluster 1 then tags the nodes where these computing containers are located with cache client tags client-ids based on the detected usage of cache service resources.

Thus, the compute node with the compute task in FIG. 1 is provided with a tag GPU and a cache client tag client-id.

S404, after the client daemon perceives that the computing node has a cache client label, the cache client is created at the computing node.

S405, the computing task reads data in a caching process worker corresponding to the caching service through the caching client.

Then, a capacity expansion process of the cache service will be described.

With continued reference to fig. 1, process 3 represents a capacity expansion process. Process 3 may include the steps of:

and 3.1, managing service decision expansion, setting cache service labels cache-id for the computing nodes, and updating the corresponding number of cache processes worker.

Thus, the compute nodes with compute tasks, cache services in FIG. 1 are provided with a tag GPU, a cache service tag cache-id, and a cache client tag client-id.

3.2, the expansion cache service is used for creating a computing task process on the computing node.

And 3.3, reading data from the caching process worker in the caching node, and storing the data into the caching process worker in the computing node.

3.1 to 3.3 refer to fig. 5.

Fig. 5 is an exemplary diagram of an exemplary illustrated cache service expansion process. Referring to fig. 5, in this embodiment, the cache service expansion process may include the following steps:

S501, under the condition of starting the elastic scheduling, the management service detects that a computing node using the cache service does not have a cache process worker locally, and triggers an elastic scheduling task, wherein the computing task in the computing node is a computing task a.

The computing tasks may also be referred to herein as training tasks, which are tasks that are trained using data in a dataset.

S502, acquiring input data required by the capacity expansion decision model, and inputting the input data into the capacity expansion decision model to obtain a decision result of whether to expand the capacity aiming at the computing task a.

The input data required for the capacity expansion decision model may be determined according to a specific model. The capacity expansion decision model may infer whether to currently expand or not based on historical data.

The capacity expansion decision model can be a two-class model, and the model can be obtained by training a training data set by collecting data of different scenes based on manual experience. For example, the capacity expansion decision model may be a decision tree, logistic regression, svm, etc. model. The capacity expansion decision model is a classification model in the machine learning field, for example, the capacity expansion decision model may be an LGBM model, which is not limited to this model, and other models may be used as the capacity expansion decision model.

The input data required by the capacity expansion decision model may include the following data:

(1) Cache service characterization data associated with a computing task including, but not limited to:

statistical information of the original data set, namely total file size, total file number and file format.

Buffer setting information, namely buffer capacity, buffer media (ram, ssd) and buffer workbench number; the cache setting information may also be referred to as cache node details.

Caching application information: the number of computational tasks to which the cache of the native dataset applies, and the computational task history information to which the cache of the native dataset applies. Caching application information may also be referred to as using task details.

(2) Computing task feature data associated with a computing task, including, but not limited to:

task priority, user information, application cpu resource, application gpu resource, application memory resource, used input data information, corresponding algorithm type and history execution information.

(3) Computing node group feature data associated with a computing task including, but not limited to:

and each computing node can distribute information such as idle CPU, gpu, memory, solid state disk and the like, and each computing node distributes information such as CPU, gpu, memory, solid state disk and the like, and the network topology structure of each computing node is located. The process of obtaining the result of whether to expand or not from the input data required by the expansion decision model is shown in fig. 6. Fig. 6 is a schematic diagram illustrating a process of obtaining a result of whether to expand from input data required for the expansion decision model.

Referring to fig. 6, after the feature data is obtained, a cache service feature vector is extracted from the cache service feature data, a calculation task feature vector is extracted from the calculation task feature data, and a calculation node group feature vector is extracted from the calculation node group feature data;

and then combining the cache service feature vector, the calculation task feature vector and the calculation node group feature vector into a combined feature vector, inputting the combined feature vector into a trained capacity expansion decision model, and outputting a decision result of whether capacity expansion is performed by the capacity expansion decision model.

Wherein the capacity expansion decision model is a trained model.

The training process of the capacity expansion decision model can comprise the following steps:

constructing a first classification model and setting initial parameter values;

acquiring a plurality of sets of sample data, each set of sample data comprising: combining the feature vector samples and corresponding decision result tag data;

and training the first classification model by using a plurality of groups of sample data to obtain a trained first classification model, and taking the trained first classification model as a trained capacity expansion decision model.

The process of obtaining the combined feature vector sample in the sample data is consistent with the process of obtaining the combined feature vector from the input data required by the capacity expansion decision model, and will not be described herein.

The training of the first classification model by using a plurality of groups of sample data to obtain a trained first classification model may be:

determining a first classification model obtained after training the previous group of sample data as an initial classification model corresponding to the sample data of the group;

inputting the combined feature vector samples in the group of sample data into an initial classification model to obtain a decision result output by the initial classification model, and recording the decision result as an output decision result;

according to the difference between the output decision result and the decision result label data in the sample data, adjusting the parameter value in the initial classification model, and taking the classification model with the parameter value adjusted at the present time as a first classification model obtained after training the sample data;

judging whether the convergence condition of training is met, if so, stopping training, and taking a first classification model obtained after training the sample data of the group as a trained capacity-expanding decision model; otherwise, training of the next set of sample data is continued.

The first classification model corresponding to the first group of sample data is a constructed first classification model with initial parameter values.

Of course, the above is merely an exemplary illustration of the training method, and is not limited to the present embodiment, and the present embodiment may not be limited to the training method listed above.

In this embodiment, by adopting a classification model in the machine learning field to decide whether to expand the caching process worker to the computing node, the decision accuracy can be improved.

Of course, in addition to using the classification model to determine whether to expand, other manners may be used to determine whether to expand, which is not limited by the embodiment. For example, a capacity expansion decision mode based on a preset rule.

In one example, the preset rule may be: if the available storage resources of the computing node are greater than the data volume of the data set, the capacity is expanded, otherwise, if the available storage resources of the computing node are less than or equal to the data volume of the data set, the capacity is not expanded.

In one example, the preset rule may be: if the priority of the computing task is higher than the preset level, capacity expansion is performed, otherwise, if the priority of the computing task is lower than or equal to the preset level, capacity expansion is not performed.

In one example, the preset rule may be: if the historical training speed of the algorithm for the computing task is less than the preset speed value, the capacity is expanded, otherwise, if the historical training speed of the algorithm for the computing task is greater than or equal to the preset speed value, the capacity is not expanded.

S503, judging whether the decision result is capacity expansion, if yes, executing step S504, otherwise executing step S507.

S504, marking cache service labels of all computing nodes related to the computing task a, counting the total node number n of the cache service labels, setting the number of the expanding cache processes worker to be equal to the node number n, and updating the cloud primary resources.

The way to update the cloud native resources is: and submitting a request to the cloud primary service, and updating the number of caching process workers.

The process of steps S501 to S504 corresponds to the combined process of deciding whether to expand or not and step 3.1 in fig. 1.

S505, the dispatcher creates a caching process worker at a computing node for marking the caching service label cache-id according to the caching service label cache-id.

Step S505 corresponds to step 3.2 in fig. 1.

Referring to fig. 2, as a result of the creation, a caching process worker3 and a caching process worker4 are created on 2 computing nodes (i.e., the computing node 1 and the computing node 2 in fig. 2), respectively.

S506, the cache client of the computing process in the computing node reads data from the cache process worker of the cache node, caches the read data into the cache process worker local to the computing node, and ends.

Step S506 corresponds to step 3.3 in fig. 1.

In this way, the subsequent computing node can directly read the data set data from the local caching process worker. Taking fig. 2 as an example, during the training process, the computing process 1 may directly read data from the local caching process worker 3. Similarly, during the training process, the computing process 2 may directly read data from the local caching process worker4. Therefore, the cache service is flexibly scheduled to the computing nodes, so that the computing tasks in the computing nodes can directly read data from the local area in the distributed training process, the data reading speed is improved, and the distributed training speed of the computing tasks is improved.

S507, the cache client of the computing process in the computing node reads data from the cache process worker of the cache node, and the process is ended.

The difference between the speed of reading data by the calculation task in the case of capacity expansion and in the case of no capacity expansion is described by way of comparison.

Fig. 7 is a diagram illustrating the capacity expansion and reading of different data sets. Referring to fig. 7, in the data set 1, the data set 2, the data set 3 and the data set 4, the data set 1 expands the cache service worker to the computing node, so that the computing task using the data set 1 can achieve the effect of locally reading data by the computing node, and the training speed is high. Data set 2 and data set 3 are not expanded and other node data in the cluster is read, so the training speed of the calculation task using data set 2 and data set 3 is slower.

Therefore, after capacity expansion, the computing task can read data locally at the computing node, so that the training speed is high.

According to the distributed training method, the cache process corresponding to the data set required by the calculation task training can be automatically expanded to the calculation node where the calculation task is located, so that the data required by the training can be read locally in the calculation task training process, and the training speed is improved.

Particularly, under the condition that the memory resources of the computing nodes are insufficient and data needs to be cached by using disk resources, the embodiment can adaptively expand the cache process under the condition that the cloud primary scheduler is not changed.

In addition, according to the distributed training method, data required by training can be read locally at the expanded computing node, remote storage service is not required to be accessed, the pressure of the remote storage service can be relieved, and the problem that the performance of the remote storage service is reduced when a large number of computing tasks read the remote storage service is avoided.

Further, according to the distributed training method of the embodiment, data required by training can be read locally at the expanded computing node, remote storage service is not required to be accessed, occupation of communication bandwidth is reduced, communication resources are saved, and the problem that when a large model performs distributed training, a large amount of bandwidth resources are required for parameter exchange of different nodes, and the remote data reading occupies a certain bandwidth resource to reduce exchange performance among parameters is solved.

The embodiment also provides a distributed training system, which includes a control node and a first cluster, wherein:

a control node for:

Creating a cache process of a first cache service corresponding to a first data set in a plurality of cache nodes of a first cluster, wherein the first data set is a data set in a remote database outside the first cluster;

creating a first computing task process corresponding to a first computing task in a plurality of first computing nodes of a first cluster, and setting a first cache service as input of the first computing task, wherein the first computing nodes belong to a computing node group;

determining whether to expand the capacity according to the first cache service, the first computing task and the first computing node;

if the capacity expansion is determined, a cache process of a first cache service is created in the first computing node;

and the first computing nodes in the first cluster are used for reading data from the cache processes in the first computing nodes in the process of training the first computing tasks so as to complete the training of the first computing task processes on the first computing nodes.

The first cluster is referred to as cluster 1 in fig. 1, which is the aforementioned cluster 1.

The embodiment of the application also provides an electronic device, which comprises a memory and a processor, wherein the memory is coupled with the processor, and stores program instructions, and when the program instructions are executed by the processor, the electronic device can make the electronic device execute the distributed training method.

It will be appreciated that the electronic device, in order to achieve the above-described functions, includes corresponding hardware and/or software modules that perform the respective functions. The steps of an algorithm for each example described in connection with the embodiments disclosed herein may be embodied in hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application in conjunction with the embodiments, but such implementation is not to be considered as outside the scope of this application.

The present embodiment also provides a computer storage medium having stored therein computer instructions which, when executed on an electronic device, cause the electronic device to perform the above-described related method steps to implement the distributed training method in the above-described embodiments.

The present embodiment also provides a computer program product which, when run on a computer, causes the computer to perform the above-described related steps to implement the distributed training method in the above-described embodiments.

In addition, the embodiment of the application also provides a device, which can be a chip, a component or a module, and the device can comprise a processor and a memory which are connected; the memory is configured to store computer-executable instructions, and when the device is running, the processor may execute the computer-executable instructions stored in the memory, so that the chip performs the distributed training method in the above method embodiments.

The electronic device, the computer storage medium, the computer program product, or the chip provided in this embodiment are used to execute the corresponding methods provided above, so that the beneficial effects thereof can be referred to the beneficial effects in the corresponding methods provided above, and will not be described herein.

It will be appreciated by those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

Any of the various embodiments of the application, as well as any of the same embodiments, may be freely combined. Any combination of the above is within the scope of the present application.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application may be essentially or a part contributing to the prior art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a device (may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

The steps of a method or algorithm described in connection with the disclosure of the embodiments disclosed herein may be embodied in hardware, or may be embodied in software instructions executed by a processor. The software instructions may be comprised of corresponding software modules that may be stored in random access Memory (Random Access Memory, RAM), flash Memory, read Only Memory (ROM), erasable programmable Read Only Memory (Erasable Programmable ROM), electrically Erasable Programmable Read Only Memory (EEPROM), registers, hard disk, a removable disk, a compact disc Read Only Memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.

Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.

Claims

1. A distributed training method, the method comprising:

creating a first computing task process corresponding to a first computing task in a plurality of first computing nodes of the first cluster, and setting the first cache service as input of the first computing task, wherein the first computing nodes belong to a computing node group;

if the capacity expansion is determined, a cache process of the first cache service is created in the first computing node, so that in the process of training the first computing task, the first computing node reads data from the cache process in the first computing node to complete training of the first computing task process on the first computing node.

2. The method of claim 1, wherein determining whether to expand based on the first cache service, the first computing task, the first computing node, comprises:

The first characteristic data corresponding to the first cache service, the second characteristic data corresponding to the first computing task and the third characteristic data corresponding to the computing node group are obtained;

extracting a first feature vector from the first feature data, extracting a second feature vector from the second feature data, and extracting a third feature vector from the third feature data;

obtaining a combined feature vector according to the first feature vector, the second feature vector and the third feature vector;

and inputting the combined feature vector into a trained capacity expansion decision model, and outputting a decision result of whether capacity expansion is performed or not by the capacity expansion decision model.

3. The method of claim 2, wherein the capacity expansion decision model is a classification model.

4. The method of claim 2, wherein the first characteristic data comprises statistics, cache setup information, and cache application information for the first data set.

5. The method of claim 4, wherein the statistics of the first data set include a total file size, a total number of files, a file format of the first data set; the cache setting information of the first data set comprises cache capacity, cache media and cache process quantity; the cache application information of the first data set comprises the number of computing tasks of the cache of the first data set and the historical information of the computing tasks of the cache of the first data set.

6. The method of claim 2, wherein the second characteristic data comprises any one or more of the following:

task priority, user information, applied CPU resources, applied GPU resources, applied memory resources, used input data information, corresponding algorithm types and historical execution information.

7. The method of claim 2, wherein the third characteristic data comprises any one or more of the following:

the CPU information, the GPU information, the memory information and the solid state disk information which can be allocated by each computing node are allocated, and the network topology structure of each computing node is located.

8. The method of claim 1, wherein creating a caching process for the first caching service corresponding to the first data set in the plurality of caching nodes of the first cluster comprises:

receiving a first cache service creation request;

acquiring the data volume of a first data set;

if the data volume of the first data set is smaller than a data volume threshold value, setting the cache capacity of a cache process of a first cache service to be equal to the data volume of the first data set;

Setting a cache initialization tag and a cache service tag for a first cache service resource corresponding to the first data set;

sending a first instruction to the first cluster, wherein the first instruction carries the first cache service resource;

according to the first instruction, a cache process of a first cache service corresponding to the first data set is created in a plurality of cache nodes with the cache initialization tag in the first cluster;

and loading the data in the first data set into the caching process.

9. The method of claim 1, wherein determining whether to expand based on the first cache service, the first computing task, the first computing node, comprises:

and if the available storage resources of the first computing node are larger than the data volume of the first data set, determining expansion.

10. The method of claim 1, wherein determining whether to expand based on the first cache service, the first computing task, the first computing node, comprises:

acquiring the priority of the first computing task;

and if the priority of the first computing task is higher than a preset level, determining expansion.

11. The method of claim 1, wherein determining whether to expand based on the first cache service, the first computing task, the first computing node, comprises:

acquiring the historical training speed of an algorithm of the first computing task;

and if the historical training speed of the algorithm of the first computing task is smaller than a preset speed value, determining expansion.

12. The method of claim 1, wherein each of the caching processes stores all of the data of the first data set.

13. A distributed training system, the system comprising a control node and a first cluster, wherein:

the control node is configured to:

creating a cache process of a first cache service corresponding to a first data set in a plurality of cache nodes of the first cluster, wherein the first data set is a data set in a remote database outside the first cluster;

if the capacity expansion is determined, a cache process of the first cache service is created in the first computing node;

the first computing node in the first cluster is configured to read data from the cache process in the first computing node during training of the first computing task, so as to complete training of a first computing task process on the first computing node.

14. An electronic device, comprising:

a memory and a processor, the memory coupled with the processor;

the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the distributed training method of any of claims 1 to 12.

15. A computer readable storage medium comprising a computer program, characterized in that the computer program, when run on an electronic device, causes the electronic device to perform the distributed training method of any of claims 1 to 12.