CN118051779A - Automatic parameter searching method and device for large model training and electronic equipment

Automatic parameter searching method and device for large model training and electronic equipment

Info

Publication number
CN118051779A
Authority
CN
China
Prior art keywords
parameter
training
parameters
model training
configuration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410438532.7A
Other languages
Chinese (zh)
Other versions
CN118051779B (en)
Inventor
Wang Yu (汪玉)
Huang Zixiao (黄子潇)
Ning Xuefei (宁雪妃)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202410438532.7A
Publication of CN118051779A
Application granted
Publication of CN118051779B
Active legal status
Anticipated expiration

Abstract

The invention relates to the technical field of deep learning, and in particular to an automatic parameter searching method and device for large model training, and an electronic device. The method comprises the following steps: acquiring a parameter configuration file, which comprises a large model training framework name, a plurality of parameters, and a parameter interval of each parameter; determining a target model training framework according to the large model training framework name, and determining training processes of all configuration combinations according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter; and launching the training processes of all configuration combinations, and determining an optimal parameter combination for large model training from the training results of those processes based on evaluation indexes. By enumerating and training the parameter configuration combinations through the target model training framework, the optimal parameter configuration combination can be obtained automatically. This solves the problem that the current process of determining an optimal parameter configuration is tedious and time-consuming and therefore lengthens the model development cycle, improves the efficiency with which users determine the optimal parameter configuration, and reduces development cost.

Description

Automatic parameter searching method and device for large model training and electronic equipment
Technical Field
The invention relates to the technical field of deep learning, and in particular to an automatic parameter searching method and device for large model training, and an electronic device.
Background
Large models are neural network models with a very large number of parameters (typically ten billion or more) and are widely used in fields such as images, text, and audio. Before large model training frameworks appeared, users had to implement the model structure themselves; when the model was large, they also had to implement the model's parallel strategy themselves to speed up training and reduce GPU memory usage. However, developing a model parallel strategy sets a high development threshold: an inefficient distributed training strategy may greatly reduce the model's training efficiency or even produce abnormal training results, and implementing parallel strategies manually, model by model, easily leads to poor code maintainability and scalability. Open-source large model frameworks that support efficient distributed training therefore reduce users' development cost for model parallel strategies and are an important research direction in large model training.
In the related art, large model training frameworks supporting efficient distributed training include: the large model training framework Megatron-LM, built on the deep learning framework PyTorch; the open-source large model training framework Megatron-DeepSpeed, which adopts the Zero Redundancy Optimizer (ZeRO) memory optimization technique; and large model training frameworks that integrate seamlessly with the mainstream deep learning framework PyTorch.
However, the process of determining an optimal parameter configuration with a large model training framework is still tedious and time-consuming, and when the user changes the cluster topology or the machine model, the optimal configuration for model training must be searched for again manually, which lengthens the model development cycle and needs to be solved.
Disclosure of Invention
The invention provides an automatic parameter searching method and device for large model training, and an electronic device, which solve the problem that the current process of determining an optimal parameter configuration is tedious and time-consuming and therefore lengthens the model development cycle, improve the efficiency with which users determine the optimal parameter configuration, and reduce development cost.
To achieve the above object, an embodiment of a first aspect of the present invention provides an automatic parameter searching method for large model training, comprising the following steps:
acquiring a parameter configuration file, wherein the parameter configuration file comprises a large model training framework name, a plurality of parameters for permutation and combination, and a parameter interval of each parameter, and the parameters comprise model structure parameters and parallel training parameters;
determining a target model training framework according to the large model training framework name, and determining training processes of all configuration combinations according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter; and
launching the training processes of all the configuration combinations, and determining an optimal parameter combination for large model training from the training results of the training processes of all the configuration combinations based on evaluation indexes.
According to one embodiment of the present invention, after the parameter configuration file is acquired, the method further includes:
identifying a target parameter, among the plurality of parameters in the parameter configuration file, for which no parameter interval is given; and
acquiring a default parameter interval of the target parameter, and using the default parameter interval as the parameter interval of the target parameter.
According to one embodiment of the present invention, after the target model training framework is determined according to the large model training framework name, the method further includes:
checking, by means of the target model training framework, whether the plurality of parameters contain incompatible parameters that do not meet preset compatibility conditions; and
if the plurality of parameters contain incompatible parameters that do not meet the preset compatibility conditions, raising an error alert for the incompatible parameters.
According to one embodiment of the present invention, after the error alert is raised for the incompatible parameters, the method further includes:
receiving a parameter modification instruction fed back by a user for the incompatible parameters; and
modifying the incompatible parameters based on the parameter modification instruction.
According to one embodiment of the present invention, determining the training processes of all configuration combinations according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter includes:
acquiring the number of iterations of each training run from the parameter configuration file;
determining the configuration combinations of all parameters according to the plurality of parameters and the parameter interval of each parameter; and
determining the training processes of all configuration combinations based on the number of iterations and the configuration combinations of all parameters.
According to one embodiment of the present invention, determining the optimal parameter combination for large model training from the training results of the training processes of all configuration combinations based on the evaluation indexes includes:
acquiring, from the parameter configuration file, the number of optimal parameter combinations to retain;
acquiring an evaluation index value of each configuration combination based on the training results; and
determining the optimal parameter combination based on the number of optimal parameter combinations to retain and the evaluation index value of each configuration combination.
According to one embodiment of the present invention, when the training processes of all configuration combinations are launched, the method further includes:
recording configuration combinations whose training fails to start.
With the automatic parameter searching method for large model training provided by the embodiment of the invention, a parameter configuration file comprising a large model training framework name, a plurality of parameters, and a parameter interval of each parameter is acquired; a target model training framework is determined according to the framework name; training processes of all configuration combinations are determined according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter; the training processes of all configuration combinations are launched; and the optimal parameter combination for large model training is determined from their training results based on evaluation indexes. By enumerating and training the parameter configuration combinations through the target model training framework, the optimal parameter configuration combination can be obtained automatically, which solves the problem that the current process of determining an optimal parameter configuration is tedious and time-consuming and therefore lengthens the model development cycle, improves the efficiency with which users determine the optimal parameter configuration, and reduces development cost.
To achieve the above object, an embodiment of a second aspect of the present invention provides an automatic parameter searching device for large model training, comprising:
an acquisition module configured to acquire a parameter configuration file, wherein the parameter configuration file comprises a large model training framework name, a plurality of parameters for permutation and combination, and a parameter interval of each parameter, and the parameters comprise model structure parameters and parallel training parameters;
a first determining module configured to determine a target model training framework according to the large model training framework name, and determine training processes of all configuration combinations according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter; and
a second determining module configured to launch the training processes of all the configuration combinations, and determine an optimal parameter combination for large model training from the training results of the training processes of all the configuration combinations based on evaluation indexes.
According to one embodiment of the present invention, after the parameter configuration file is acquired, the acquisition module is further configured to:
identify a target parameter, among the plurality of parameters in the parameter configuration file, for which no parameter interval is given; and
acquire a default parameter interval of the target parameter, and use the default parameter interval as the parameter interval of the target parameter.
According to one embodiment of the present invention, after the target model training framework is determined according to the large model training framework name, the first determining module further includes:
a verification unit configured to check, by means of the target model training framework, whether the plurality of parameters contain incompatible parameters that do not meet preset compatibility conditions; and
an error reporting unit configured to raise an error alert for the incompatible parameters when the plurality of parameters contain incompatible parameters that do not meet the preset compatibility conditions.
According to one embodiment of the present invention, after the error alert is raised for the incompatible parameters, the error reporting unit is further configured to:
receive a parameter modification instruction fed back by a user for the incompatible parameters; and
modify the incompatible parameters based on the parameter modification instruction.
According to one embodiment of the present invention, the first determining module is specifically configured to:
acquire the number of iterations of each training run from the parameter configuration file;
determine the configuration combinations of all parameters according to the plurality of parameters and the parameter interval of each parameter; and
determine the training processes of all configuration combinations based on the number of iterations and the configuration combinations of all parameters.
According to an embodiment of the present invention, the second determining module is specifically configured to:
acquire, from the parameter configuration file, the number of optimal parameter combinations to retain;
acquire an evaluation index value of each configuration combination based on the training results; and
determine the optimal parameter combination based on the number of optimal parameter combinations to retain and the evaluation index value of each configuration combination.
According to an embodiment of the present invention, when the training processes of all configuration combinations are launched, the second determining module is further configured to:
record configuration combinations whose training fails to start.
With the automatic parameter searching device for large model training provided by the embodiment of the invention, a parameter configuration file comprising a large model training framework name, a plurality of parameters, and a parameter interval of each parameter is acquired; a target model training framework is determined according to the framework name; training processes of all configuration combinations are determined according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter; the training processes of all configuration combinations are launched; and the optimal parameter combination for large model training is determined from their training results based on evaluation indexes. By enumerating and training the parameter configuration combinations through the target model training framework, the optimal parameter configuration combination can be obtained automatically, which solves the problem that the current process of determining an optimal parameter configuration is tedious and time-consuming and therefore lengthens the model development cycle, improves the efficiency with which users determine the optimal parameter configuration, and reduces development cost.
To achieve the above object, an embodiment of a third aspect of the present invention provides an electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the automatic parameter searching method for large model training as described in the above embodiments.
To achieve the above object, an embodiment of a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, the program being executed by a processor to implement the automatic parameter searching method for large model training as described in the above embodiments.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a method for automatic searching of parameters for large model training according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of automatic searching for parameters for large model training in accordance with another embodiment of the present invention;
FIG. 3 is a block diagram of an automatic parameter searching apparatus for large model training according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The following describes an automatic parameter searching method and device for large model training, and an electronic device, according to embodiments of the invention, with reference to the accompanying drawings.
Before introducing the automatic parameter searching method for large model training provided by the embodiment of the invention, the large model training frameworks in the related art that implement efficient parallel strategies are briefly introduced.
In the related art: (1) the large model training framework Megatron-LM, built on the deep learning framework PyTorch, implements efficient parallel strategies including model parallelism, data parallelism, and pipeline parallelism, and also supports mixed-precision training, which improves computational performance and reduces memory consumption; (2) the mainstream open-source large model training framework Megatron-DeepSpeed adopts the Zero Redundancy Optimizer memory optimization technique to reduce the GPU memory occupied by model training, and retains good scalability as the number of cluster machine nodes grows; (3) large model training frameworks generally integrate seamlessly with the mainstream deep learning framework PyTorch and provide efficient GPU memory optimization techniques and model parallel policy interfaces, through which users can adjust suitable model parameters and parallel strategies.
However, current mainstream large model training frameworks simply expose these interfaces to users. In one training run a user can manually adjust the model structure, the number of parameters, and the parallel strategies used during training; to obtain the highest training efficiency on a specific cluster, the user often has to try many permutations and combinations of parameter configurations, wait for the training efficiency data of each run, and record it. Only after all combinations have been tried can the more suitable training parameter configuration be selected from the recorded data. This process is very tedious and time-consuming, and when the user changes the cluster topology or the machine model, the optimal configuration for model training often has to be searched for again manually, so the model development cycle becomes long.
To address these problems, the embodiment of the invention provides an automatic parameter searching method for large model training, which performs enumeration training on the provided parameter configuration combinations through a target model training framework, collects indexes such as GPU memory usage and training speed in real time from the runs of the different configuration combinations, and sorts and filters their results. This helps a user efficiently search out the optimal parameter configuration and parallel strategy configuration under a given hardware topology and model structure, and reduces the user's development cost in searching for a training configuration.
FIG. 1 is a flow chart of an automatic parameter searching method for large model training according to one embodiment of the present invention.
Illustratively, as shown in FIG. 1, the automatic parameter searching method for large model training includes the following steps:
In step S101, a parameter configuration file is acquired, where the parameter configuration file includes a large model training framework name, a plurality of parameters for permutation and combination, and a parameter interval of each parameter, and the parameters include model structure parameters and parallel training parameters.
It will be appreciated that the parameter configuration file may be provided by a user and includes a large model training framework name, a plurality of parameters for permutation and combination, and a parameter interval of each parameter, where the parameters include model structure parameters and parallel training parameters. A large model training framework is a tool dedicated to training large-scale deep learning models: it supports efficient large-scale parallel computation and can handle large-scale data and models. Current mainstream large model training frameworks include TensorFlow (developed by Google, supporting distributed training, with a strong ecosystem and extensive community support), PyTorch (developed by Facebook, with a compact, easy-to-use API and flexible dynamic-graph features), PaddlePaddle (飞桨, developed by Baidu and its open-source community, supporting a variety of hardware platforms and application scenarios), and the like. To achieve better search results under a specific model configuration, the user may provide a plurality of parameters for permutation and combination. Common parameters include: (1) model structure parameters, such as num-layers (number of layers of the large language model), hidden-size (hidden layer dimension), seq-length (input sequence length), micro-batch-size (number of samples in a single training step), and train-iters (number of training iterations); (2) parallel training parameters, including but not limited to nproc-per-node (number of processes per node, i.e., the number of GPUs (Graphics Processing Units) invoked), tensor-model-parallel-size (tensor model parallelism, i.e., the degree to which model parameters are split evenly across GPUs), pipeline-model-parallel-size (pipeline parallelism), and sequence-parallel (sequence parallelism). In addition, the user also needs to provide a parameter interval (a value range or list) for each parameter, such as seq-length = {1024, 2048, 4096}.
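For illustration only, such a parameter configuration file might look like the following sketch, written here as a Python dict (the patent does not prescribe a concrete file format, and the field names below are hypothetical):

```python
# A hypothetical parameter configuration for the search, shown as a
# Python dict purely for illustration; a real file might be YAML or JSON.
param_config = {
    "framework": "Megatron-LM",   # large model training framework name
    "total_iters": 10,            # iterations per trial training run
    "top_k": 3,                   # number of optimal combinations to retain
    "params": {
        # model structure parameters and their intervals (value lists)
        "num-layers": [24, 32],
        "hidden-size": [2048, 4096],
        "seq-length": [1024, 2048, 4096],
        "micro-batch-size": [1, 2, 4],
        # parallel training parameters and their intervals
        "tensor-model-parallel-size": [1, 2, 4],
        "pipeline-model-parallel-size": [1, 2],
    },
}
```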
In step S102, a target model training framework is determined according to the large model training framework name, and training processes of all configuration combinations are determined according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter.
That is, once the large model training framework name is provided in step S101, the target model training framework can be determined from that name, and the training processes for all configuration combinations can be determined based on the target model training framework, the plurality of parameters, and the parameter interval of each parameter; the target training framework can automatically enumerate the configuration combinations of all parameters, each combination serving as the configuration of a single training run.
For further understanding, the following details how the training processes of all configuration combinations are determined from the target model training framework, the plurality of parameters, and the parameter interval of each parameter.
As one possible implementation, determining the training processes of all configuration combinations according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter includes: acquiring the number of iterations of each training run from the parameter configuration file; determining the configuration combinations of all parameters according to the plurality of parameters and the parameter interval of each parameter; and determining the training processes of all configuration combinations based on the number of iterations and the configuration combinations of all parameters.
Specifically, the number of iterations of each training run is given by the total-iters (total training iterations) parameter in the parameter configuration file, and the configuration combinations of all parameters can be determined from the plurality of parameters and the parameter interval of each parameter. For example, assume the parameter interval of seq-length is {1024, 2048, 4096}: when the configuration combinations involving seq-length are enumerated, the three values 1024, 2048 and 4096 are combined, in order, with the values of the other parameters, yielding combinations led by seq-length such as (1024, num-layers), (1024, hidden-size), ..., (4096, pipeline-model-parallel-size), (4096, sequence-parallel), and so on. After one training run with seq-length as the leading parameter finishes, the combinations led by the next parameter are enumerated in the same way and the next training run is started. In this way the configuration combinations of all parameters are determined from the plurality of parameters and the parameter interval of each parameter; once they are obtained, the training processes of all configuration combinations are determined from the number of iterations and the configuration combinations of all parameters, until every configuration combination has been enumerated.
It should be noted that the total number of training runs is the cardinality of the Cartesian product (i.e., the number of elements in the Cartesian product) of the parameter intervals of all parameters. The Cartesian product, also called the direct product, is defined in set theory as follows: given two sets X and Y, each element of X paired with each element of Y forms an ordered pair, and the set of all such ordered pairs is the Cartesian product of X and Y. For example, if X = {a, b} and Y = {0, 1, 2}, their Cartesian product is {(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}. The Cartesian product of three or more sets is defined in the same way, so, to avoid redundancy, it is not described further here.
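A minimal sketch of this enumeration step is given below, assuming the hypothetical param_config layout sketched earlier; the helper name enumerate_combinations is illustrative, not part of the claimed method:

```python
from itertools import product

def enumerate_combinations(params: dict) -> list:
    """Enumerate the Cartesian product of all parameter intervals.

    The number of combinations returned is the cardinality of the
    Cartesian product, i.e. the product of the interval sizes.
    """
    names = list(params)
    return [dict(zip(names, values)) for values in product(*params.values())]

# The set-theory example from the text, X = {a, b} and Y = {0, 1, 2}:
combos = enumerate_combinations({"X": ["a", "b"], "Y": [0, 1, 2]})
print(len(combos))  # 6 ordered pairs, as in {(a,0), (a,1), ..., (b,2)}
```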
In step S103, the training processes of all configuration combinations are launched, and the optimal parameter combination for large model training is determined from the training results of the training processes of all configuration combinations based on the evaluation indexes.
That is, based on the provided parameters and the configuration combinations of all parameters, the training work can be split into a plurality of parallel tasks, each corresponding to one specific configuration combination; the configuration combinations are distributed to a plurality of computing nodes for parallel processing, and each computing node trains the large model with its assigned configuration combination. After every node has finished training, the training results of the training processes of all configuration combinations are obtained, and the optimal parameter combination for large model training can be determined by comprehensively evaluating all the training results with suitable evaluation indexes (such as accuracy or cross-validation loss), as sketched below.
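As a rough single-machine sketch of how the enumerated runs might be launched and launch failures recorded (a real system could instead distribute the combinations across computing nodes as described above); the script name pretrain.py, the flag spelling, and the helper name are assumptions rather than the interface of any particular framework:

```python
import subprocess

def launch_all(combinations: list, total_iters: int):
    """Launch one trial training run per configuration combination and
    record the combinations whose training fails to start."""
    finished, failed = [], []
    for combo in combinations:
        flags = [f"--{name}={value}" for name, value in combo.items()]
        cmd = ["python", "pretrain.py", f"--train-iters={total_iters}", *flags]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:    # e.g. out of GPU memory at startup
            failed.append(combo)      # kept so it can be filtered out later
        else:
            finished.append((combo, result.stdout))
    return finished, failed
```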
How the optimal parameter combination for large model training is determined from the training results of the training processes of all configuration combinations based on the evaluation indexes is described in detail below.
As one possible implementation, determining the optimal parameter combination for large model training from the training results of the training processes of all configuration combinations based on the evaluation indexes includes: acquiring, from the parameter configuration file, the number of optimal parameter combinations to retain; acquiring an evaluation index value of each configuration combination based on the training results; and determining the optimal parameter combination based on the number of optimal parameter combinations to retain and the evaluation index value of each configuration combination.
Specifically, as illustrated in FIG. 2, each time a training instance is started, the indexes recorded during training can be written into a report file through the report module of the target model training framework; the report file records the training results of the configuration combinations of all parameters provided by the user, measured by the evaluation indexes the user specified in the parameter configuration file. When performing model training, the target model training framework can start a process at the last step of training and obtain the evaluation index values of each configuration combination (such as training time, memory usage, and MFU (Model FLOPs Utilization)) from the standard output via logs and regular expressions. Some evaluation index values, such as GPU memory usage, can be obtained directly through monitoring tools such as NVTOP (an NVIDIA GPU monitoring tool); when multiple GPUs are used, the average over the GPUs must be computed. Evaluation index values that cannot be obtained directly, such as TFLOPS (tera floating-point operations per second, i.e., 10^12 floating-point operations per second) and MFU, can be calculated by the target model training framework. For user-defined evaluation indexes (e.g., defined in the parameter configuration file), the evaluation index value can be calculated by calling an index-calculation routine implemented through the target model training framework.
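A minimal sketch of this log-scraping step is shown below; the log line format, the metric names, and the regular expressions are hypothetical, since every framework prints its own format:

```python
import re

# Assumed log line, e.g.:
#   "iteration 10 | elapsed time per iteration (ms): 512.3 | mem (GB): 37.2"
PATTERNS = {
    "step_time_ms": re.compile(r"elapsed time per iteration \(ms\):\s*([\d.]+)"),
    "mem_gb": re.compile(r"mem \(GB\):\s*([\d.]+)"),
}

def parse_metrics(stdout: str) -> dict:
    """Pull the last reported value of each metric out of a run's stdout."""
    metrics = {}
    for name, pattern in PATTERNS.items():
        matches = pattern.findall(stdout)
        if matches:
            metrics[name] = float(matches[-1])  # keep the final reading
    return metrics
```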
It will be appreciated that after all the parameter configuration combinations have been enumerated, a report of the training results must be output to the user. If the training results of the training processes of all configuration combinations were output directly, the report would be far too long, and it would be difficult to pick out the optimal parameter combination for large model training from so many results; therefore, the number of optimal parameter combinations to retain must also be determined. Based on that number, the target model training framework can sort and filter the training results of the training processes of all configuration combinations according to the evaluation indexes the user cares about, obtain a screened set of optimal configuration combinations, and output a report of the training results of that set to the user, so that the user can finally determine the optimal parameter combination from the number of retained combinations and the evaluation index value of each configuration combination.
For example, the training results of the training processes of all configuration combinations may be sorted by the training time of one step, and further filtered by memory usage or hardware utilization, so as to screen out the training results of configuration combinations that meet the requirements and thereby determine the optimal parameter combination, as in the sketch below. For user-defined indexes, the user can implement filtering and sorting of the training results together with the index-calculation routine.
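The sorting and filtering could then look like the following sketch, which orders results by per-step training time and filters by a GPU memory budget; the threshold, key names, and helper name are illustrative only:

```python
def select_best(results: list, top_k: int, mem_budget_gb: float = 80.0):
    """Keep the top_k fastest configurations that fit the memory budget.

    results: (configuration combination, parsed metrics dict) pairs, as
    produced by launch_all() and parse_metrics() above.
    """
    feasible = [
        (combo, metrics) for combo, metrics in results
        if metrics.get("mem_gb", float("inf")) <= mem_budget_gb
    ]
    feasible.sort(key=lambda item: item[1]["step_time_ms"])  # fastest first
    return feasible[:top_k]
```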
Furthermore, in some embodiments, when the training processes of all configuration combinations are launched, the method further includes: recording configuration combinations whose training fails to start.
That is, if the training process of a configuration combination fails to start, for example because too many parameters or an unreasonable parallel policy configuration leaves the machine with insufficient GPU memory, the failed configuration combination can be recorded. This facilitates subsequent configuration screening: the parameters that cause insufficient GPU memory or other training failures can be removed directly.
Further, in some embodiments, after the parameter configuration file is acquired, the method further includes: identifying a target parameter, among the plurality of parameters in the parameter configuration file, for which no parameter interval is given; and acquiring a default parameter interval of the target parameter and using it as the parameter interval of the target parameter.
That is, if the user does not provide a parameter interval for some parameter, or does not list an interval for it in the parameter configuration file, the target parameters without a given interval can be identified among the plurality of parameters, the default parameter interval of each such target parameter at training start-up can be obtained, and that default interval can be used as the parameter interval of the target parameter, as in the sketch below.
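A sketch of this default-filling step, assuming a hypothetical DEFAULT_INTERVALS table; a real framework would ship its own defaults:

```python
# Hypothetical default intervals for parameters the user left unspecified.
DEFAULT_INTERVALS = {
    "seq-length": [2048],
    "micro-batch-size": [1],
    "tensor-model-parallel-size": [1],
    "pipeline-model-parallel-size": [1],
}

def fill_default_intervals(params: dict) -> dict:
    """Replace any missing or empty parameter interval with its default."""
    return {
        name: interval if interval else DEFAULT_INTERVALS.get(name, [])
        for name, interval in params.items()
    }
```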
Further, in some embodiments, after the target model training framework is determined according to the large model training framework name, the method further includes: checking, by means of the target model training framework, whether the plurality of parameters contain incompatible parameters that do not meet preset compatibility conditions; and if so, raising an error alert for the incompatible parameters.
It can be understood that, in order to be compatible with multiple mainstream large model training frameworks, after the target model training framework is determined according to the large model training framework name, the target model training framework can be used to check whether the plurality of parameters contain incompatible parameters that do not meet preset compatibility conditions. When defining and implementing the preset compatibility conditions, it should be ensured that they are reasonable and necessary, and they may be adjusted and updated at any time as technical and business requirements change; this is not specifically limited here. When the plurality of parameters contain incompatible parameters that do not meet the preset compatibility conditions, an error alert can be raised for those parameters so that the user learns of the situation in time.
For example, when the zero-stage parameter (the stage of the zero redundancy optimizer ZeRO) is 2, it is incompatible with the pipeline-parallelism parameter pipeline-model-parallel-size; when such incompatible parameters are set, an error should be raised in advance at the framework level to alert the user, as in the sketch below.
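A minimal compatibility-check sketch based on this ZeRO example is shown below; the rule encoded is the one stated in the text, while the function name and parameter keys are illustrative:

```python
def check_compatibility(combo: dict) -> list:
    """Return error messages for settings that violate preset
    compatibility conditions; an empty list means the combination is OK."""
    errors = []
    # Rule from the text: ZeRO stage 2 is incompatible with pipeline
    # parallelism, so the pipeline parallelism degree must stay at 0.
    if combo.get("zero-stage") == 2 and combo.get("pipeline-model-parallel-size", 0) > 0:
        errors.append(
            "zero-stage=2 is incompatible with pipeline-model-parallel-size > 0"
        )
    return errors
```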
Further, in some embodiments, after the error alert is raised for the incompatible parameters, the method further includes: receiving a parameter modification instruction fed back by the user for the incompatible parameters, and modifying the incompatible parameters based on the parameter modification instruction.
That is, after the error alert is raised for the incompatible parameters, the user can issue a parameter modification instruction for them, modifying the incompatible parameters and the desired indexes before enumeration training starts. For example, if the zero-stage parameter of the zero redundancy optimizer is set to 2, the pipeline parallelism parameter pipeline-parallel-size may only be set to 0, indicating that pipeline parallelism is not enabled.
With the automatic parameter searching method for large model training provided by the embodiment of the invention, a parameter configuration file comprising a large model training framework name, a plurality of parameters, and a parameter interval of each parameter is acquired; a target model training framework is determined according to the framework name; training processes of all configuration combinations are determined according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter; the training processes of all configuration combinations are launched; and the optimal parameter combination for large model training is determined from their training results based on evaluation indexes. By enumerating and training the parameter configuration combinations through the target model training framework, the optimal parameter configuration combination can be obtained automatically, which solves the problem that the current process of determining an optimal parameter configuration is tedious and time-consuming and therefore lengthens the model development cycle, improves the efficiency with which users determine the optimal parameter configuration, and reduces development cost.
The automatic parameter searching device for large model training according to an embodiment of the invention is described next with reference to the accompanying drawings.
FIG. 3 is a block schematic diagram of an automatic parameter search apparatus for large model training according to one embodiment of the present invention.
As shown in FIG. 3, the automatic parameter searching device 10 for large model training includes: the acquisition module 100, the first determination module 200, and the second determination module 300.
The acquisition module 100 is configured to acquire a parameter configuration file, where the parameter configuration file includes a large model training framework name, a plurality of parameters for permutation and combination, and a parameter interval of each parameter, and the parameters include model structure parameters and parallel training parameters;
the first determining module 200 is configured to determine a target model training framework according to the large model training framework name, and determine training processes of all configuration combinations according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter;
the second determining module 300 is configured to launch the training processes of all configuration combinations, and determine the optimal parameter combination for large model training from the training results of the training processes of all configuration combinations based on the evaluation indexes.
Further, in some embodiments, after the parameter configuration file is acquired, the acquisition module 100 is further configured to:
identify a target parameter, among the plurality of parameters in the parameter configuration file, for which no parameter interval is given; and
acquire a default parameter interval of the target parameter, and use the default parameter interval as the parameter interval of the target parameter.
Further, in some embodiments, after the target model training framework is determined according to the large model training framework name, the first determining module 200 further includes:
a verification unit configured to check, by means of the target model training framework, whether the plurality of parameters contain incompatible parameters that do not meet preset compatibility conditions; and
an error reporting unit configured to raise an error alert for the incompatible parameters when the plurality of parameters contain incompatible parameters that do not meet the preset compatibility conditions.
Further, in some embodiments, after the error alert is raised for the incompatible parameters, the error reporting unit is further configured to:
receive a parameter modification instruction fed back by a user for the incompatible parameters; and
modify the incompatible parameters based on the parameter modification instruction.
Further, in some embodiments, the first determining module 200 is specifically configured to:
acquire the number of iterations of each training run from the parameter configuration file;
determine the configuration combinations of all parameters according to the plurality of parameters and the parameter interval of each parameter; and
determine the training processes of all configuration combinations based on the number of iterations and the configuration combinations of all parameters.
Further, in some embodiments, the second determining module 300 is specifically configured to:
acquire, from the parameter configuration file, the number of optimal parameter combinations to retain;
acquire an evaluation index value of each configuration combination based on the training results; and
determine the optimal parameter combination based on the number of optimal parameter combinations to retain and the evaluation index value of each configuration combination.
Further, in some embodiments, when the training processes of all configuration combinations are launched, the second determining module 300 is further configured to:
record configuration combinations whose training fails to start.
It should be noted that the foregoing explanation of the embodiments of the automatic parameter searching method for large model training also applies to the automatic parameter searching device of this embodiment and is not repeated here.
With the automatic parameter searching device for large model training provided by the embodiment of the invention, a parameter configuration file comprising a large model training framework name, a plurality of parameters, and a parameter interval of each parameter is acquired; a target model training framework is determined according to the framework name; training processes of all configuration combinations are determined according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter; the training processes of all configuration combinations are launched; and the optimal parameter combination for large model training is determined from their training results based on evaluation indexes. By enumerating and training the parameter configuration combinations through the target model training framework, the optimal parameter configuration combination can be obtained automatically, which solves the problem that the current process of determining an optimal parameter configuration is tedious and time-consuming and therefore lengthens the model development cycle, improves the efficiency with which users determine the optimal parameter configuration, and reduces development cost.
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device may include:
Memory 401, processor 402, and a computer program stored on memory 401 and executable on processor 402.
The processor 402, when executing the program, implements the automatic parameter searching method for large model training provided in the above embodiments.
Further, the electronic device further includes:
A communication interface 403 for communication between the memory 401 and the processor 402.
A memory 401 for storing a computer program executable on the processor 402.
The memory 401 may include high-speed RAM (Random Access Memory) and may also include non-volatile memory, such as at least one disk memory.
If the memory 401, the processor 402, and the communication interface 403 are implemented independently, the communication interface 403, the memory 401, and the processor 402 may be connected to one another by a bus and communicate with one another. The bus may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one thick line is shown in FIG. 4, but this does not mean there is only one bus or only one type of bus.
Alternatively, in a specific implementation, if the memory 401, the processor 402, and the communication interface 403 are integrated on a chip, the memory 401, the processor 402, and the communication interface 403 may perform communication with each other through internal interfaces.
The processor 402 may be a CPU (Central Processing Unit) or an ASIC (Application-Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of the present invention.
The embodiment of the invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the automatic parameter searching method for large model training described above.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (10)

1. An automatic parameter searching method for large model training, characterized by comprising the following steps:
acquiring a parameter configuration file, wherein the parameter configuration file comprises a large model training framework name, a plurality of parameters for permutation and combination, and a parameter interval of each parameter, and the parameters comprise model structure parameters and parallel training parameters;
determining a target model training framework according to the large model training framework name, and determining training processes of all configuration combinations according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter; and
launching the training processes of all the configuration combinations, and determining an optimal parameter combination for large model training from the training results of the training processes of all the configuration combinations based on evaluation indexes.
2. The automatic parameter searching method for large model training according to claim 1, further comprising, after acquiring the parameter configuration file:
identifying a target parameter, among the plurality of parameters in the parameter configuration file, for which no parameter interval is given; and
acquiring a default parameter interval of the target parameter, and using the default parameter interval as the parameter interval of the target parameter.
3. The automatic parameter searching method for large model training according to claim 1, further comprising, after determining the target model training framework according to the large model training framework name:
checking, by means of the target model training framework, whether the plurality of parameters contain incompatible parameters that do not meet preset compatibility conditions; and
if the plurality of parameters contain incompatible parameters that do not meet the preset compatibility conditions, raising an error alert for the incompatible parameters.
4. The automatic parameter searching method for large model training according to claim 3, further comprising, after raising the error alert for the incompatible parameters:
receiving a parameter modification instruction fed back by a user for the incompatible parameters; and
modifying the incompatible parameters based on the parameter modification instruction.
5. The automatic parameter searching method for large model training according to any one of claims 1 to 4, wherein determining the training processes of all configuration combinations according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter comprises:
acquiring the number of iterations of each training run from the parameter configuration file;
determining the configuration combinations of all parameters according to the plurality of parameters and the parameter interval of each parameter; and
determining the training processes of all configuration combinations based on the number of iterations and the configuration combinations of all parameters.
6. The automatic parameter searching method for large model training according to any one of claims 1 to 4, wherein determining the optimal parameter combination for large model training from the training results of the training processes of all configuration combinations based on the evaluation indexes comprises:
acquiring, from the parameter configuration file, the number of optimal parameter combinations to retain;
acquiring an evaluation index value of each configuration combination based on the training results; and
determining the optimal parameter combination based on the number of optimal parameter combinations to retain and the evaluation index value of each configuration combination.
7. The automatic parameter searching method for large model training according to any one of claims 1 to 4, further comprising, when launching the training processes of all configuration combinations:
recording configuration combinations whose training fails to start.
8. An automatic parameter searching device for large model training, comprising:
an acquisition module configured to acquire a parameter configuration file, wherein the parameter configuration file comprises a large model training framework name, a plurality of parameters for permutation and combination, and a parameter interval of each parameter, and the parameters comprise model structure parameters and parallel training parameters;
a first determining module configured to determine a target model training framework according to the large model training framework name, and determine training processes of all configuration combinations according to the target model training framework, the plurality of parameters, and the parameter interval of each parameter; and
a second determining module configured to launch the training processes of all the configuration combinations, and determine an optimal parameter combination for large model training from the training results of the training processes of all the configuration combinations based on evaluation indexes.
9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the automatic parameter searching method for large model training according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, wherein the program is executed by a processor to implement the automatic parameter searching method for large model training according to any one of claims 1 to 7.
CN202410438532.7A 2024-04-12 2024-04-12 Automatic parameter searching method and device for large model training and electronic equipment Active CN118051779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410438532.7A CN118051779B (en) 2024-04-12 2024-04-12 Automatic parameter searching method and device for large model training and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410438532.7A CN118051779B (en) 2024-04-12 2024-04-12 Automatic parameter searching method and device for large model training and electronic equipment

Publications (2)

Publication Number Publication Date
CN118051779A 2024-05-17
CN118051779B CN118051779B (en) 2024-07-16

Family

ID=91045125

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410438532.7A Active CN118051779B (en) 2024-04-12 2024-04-12 Automatic parameter searching method and device for large model training and electronic equipment

Country Status (1)

Country Link
CN (1) CN118051779B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109032671A (en) * 2018-06-25 2018-12-18 电子科技大学 A kind of distributed deep learning method and system based on data parallel strategy
CN110991658A (en) * 2019-11-28 2020-04-10 重庆紫光华山智安科技有限公司 Model training method and device, electronic equipment and computer readable storage medium
US20230076967A1 (en) * 2021-09-06 2023-03-09 Institute For Information Industry Automatic optimization method and automatic optimization system of diagnosis model
CN117396851A (en) * 2021-12-30 2024-01-12 华为技术有限公司 Method, device and system for determining distributed training algorithm framework configuration
CN115996173A (en) * 2022-11-14 2023-04-21 中国科学技术大学 Communication optimization method and system for parallel training of distributed deep learning operator
CN116128019A (en) * 2022-11-17 2023-05-16 北京大学 Parallel training method and device for transducer model
CN116629352A (en) * 2023-04-10 2023-08-22 苏州互微智速科技有限公司 Hundred million-level parameter optimizing platform
CN116956991A (en) * 2023-09-21 2023-10-27 牛津大学(苏州)科技有限公司 Multi-layer perceptron model generation method, device, computer equipment and storage medium
CN117093871A (en) * 2023-10-16 2023-11-21 之江实验室 Deep learning-oriented distributed training evaluation method and system
CN117407713A (en) * 2023-10-17 2024-01-16 支付宝(杭州)信息技术有限公司 Training management method and related device for distributed model training

Also Published As

Publication number Publication date
CN118051779B (en) 2024-07-16

Similar Documents

Publication Publication Date Title
CN107807982B (en) Consistency checking method and device for heterogeneous database
EP4369180A2 (en) Callpath finder
CN110647999A (en) Method and device for improving deep learning training speed based on topological structure
CN105700956A (en) Distributed job processing method and system
CN110891000B (en) GPU bandwidth performance detection method, system and related device
US11663113B2 (en) Real time fault localization using combinatorial test design techniques and test case priority selection
WO2020259516A1 (en) Unit testing system and unit testing method
CN115576834A (en) Software test multiplexing method, system, terminal and medium for supporting fault recovery
CN118051779B (en) Automatic parameter searching method and device for large model training and electronic equipment
US20210263838A1 (en) Assignment of test case priorities based on combinatorial test design model analysis
CN116166967B (en) Data processing method, equipment and storage medium based on meta learning and residual error network
CN110704620B (en) Method and device for identifying same entity based on knowledge graph
CN115827636B (en) Method for storing and reading simulation data of logic system design from waveform database
CN116009889A (en) Deep learning model deployment method and device, electronic equipment and storage medium
CN110362294A (en) Development task executes method, apparatus, electronic equipment and storage medium
CN111027196B (en) Simulation analysis task processing method and device for power equipment and storage medium
CN114443141A (en) Method and device for determining cyclic constraint fault of measurement and control instruction
CN111782641A (en) Data error repairing method and system
US10552760B2 (en) Training set creation for classifying features of a system under agile development
CN112685275B (en) Algorithm policy search method and device, electronic equipment and storage medium
CN113076237B (en) Memory performance testing method and system and computer readable storage medium
CN116955342B (en) Service data consistency rate verification method and device
US9818078B1 (en) Converting a non-workflow program to a workflow program using workflow inferencing
JP6916327B1 (en) Derived test equipment, derived test methods, and derived test programs
CN117743070A (en) Method, device, equipment and medium for testing register bit flash

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant