WO2023093375A1 - Computing resource acquisition method and apparatus, electronic device, and storage medium - Google Patents

Computing resource acquisition method and apparatus, electronic device, and storage medium

Info

Publication number
WO2023093375A1
WO2023093375A1 PCT/CN2022/125905 CN2022125905W WO2023093375A1 WO 2023093375 A1 WO2023093375 A1 WO 2023093375A1 CN 2022125905 W CN2022125905 W CN 2022125905W WO 2023093375 A1 WO2023093375 A1 WO 2023093375A1
Authority
WO
WIPO (PCT)
Prior art keywords
resource
role
information
operator
framework
Prior art date
Application number
PCT/CN2022/125905
Other languages
French (fr)
Chinese (zh)
Inventor
李明
路明奎
方磊
Original Assignee
北京九章云极科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京九章云极科技有限公司
Publication of WO2023093375A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present application relates to the technical field of machine learning, and in particular to a computing resource acquisition method, device, electronic equipment and storage medium.
  • the algorithm model based on machine learning has the advantages of short prediction time after sufficient training and high prediction accuracy, and is widely used in various fields.
  • the computing resources required for different algorithm model training will change accordingly.
  • the above computing resources for model training need to be manually configured by the user.
  • the deviation between the computing resources required for algorithm model training and the computing resources configured by the user is large. If the configured computing resources are insufficient, memory overflow may easily occur; if they exceed the required resources, resources are wasted.
  • the purpose of the embodiments of the present application is to provide a computing resource acquisition method, device, electronic device, and storage medium, which are used to solve the problem of large deviation between computing resources required for algorithm model training and computing resources configured by users.
  • the embodiment of the present application provides a computing resource acquisition method, including:
  • the obtaining operator resource information of each operator included in the topology information according to the topology information includes:
  • the resource framework information corresponding to each operator
  • calculating the operator resource information of each operator according to the resource framework information corresponding to each operator includes:
  • the calculating the operator resource information of each operator according to at least one framework role information corresponding to each operator includes:
  • the operator calculation method to calculate the role resource corresponding to the framework role included in the operator, including:
  • the framework role information includes default computing resource information, sample parameter information and batch import parameter information;
  • when the training sample size is greater than a first threshold and smaller than a second threshold, a role resource corresponding to the framework role is obtained according to the sample parameter information and the training sample size;
  • the role resource corresponding to the framework role is obtained according to the batch import parameter information.
  • the resource framework information includes a framework identifier
  • the framework role information includes a role identifier
  • the obtaining target resource information for model training according to the operator resource information of each operator includes:
  • the sample parameter information includes first basic computing resource data and sample coefficients
  • obtaining the role resource corresponding to the framework role according to the sample parameter information and the training sample size includes:
  • RR is the role resource corresponding to the framework role
  • BC₁ is the first basic computing resource data
  • CR is the sample coefficient
  • DS is the training sample size.
  • the batch import parameter information includes the second basic computing resource data and the amount of sample data imported in a single batch;
  • obtaining the role resource corresponding to the framework role according to the batch import parameter information includes:
  • the role resource corresponding to the framework role is obtained according to the following expression
  • RR is the role resource corresponding to the framework role
  • BC₂ is the second basic computing resource data
  • BS is the amount of sample data imported in a single batch.
  • the determining topology information for model training according to scene information and training sample data includes:
  • the embodiment of the present application provides a computing resource acquisition device, including:
  • a topology acquisition module configured to determine topology information for model training according to scene information and training sample data
  • An operator acquiring module configured to acquire operator resource information of each operator included in the topology information according to the topology information
  • a resource obtaining module configured to obtain target resource information for model training according to the operator resource information of each operator.
  • the operator acquisition module includes:
  • a framework acquisition submodule configured to determine resource framework information corresponding to each operator according to the topology information
  • the operator acquisition sub-module is configured to calculate the operator resource information of each operator according to the resource framework information corresponding to each operator.
  • the operator acquisition submodule includes:
  • the role acquisition unit is configured to determine at least one piece of framework role information corresponding to each operator according to the resource framework information corresponding to each operator;
  • the calculation unit is configured to calculate operator resource information of each operator according to at least one framework role information corresponding to each operator.
  • the calculation unit includes:
  • the first calculation subunit is configured to determine the resource calculation mode of each operator according to the training sample size of the training sample data
  • the second calculation subunit is configured to calculate the role resource corresponding to the framework role included in the operator by using the resource calculation method according to at least one framework role information corresponding to the operator;
  • the third computing subunit is configured to obtain the operator resource information of the operator according to the role resource corresponding to the framework role included in the operator.
  • the second calculation subunit includes:
  • the framework role information includes default computing resource information, sample parameter information and batch import parameter information;
  • when the training sample size is greater than a first threshold and smaller than a second threshold, a role resource corresponding to the framework role is obtained according to the sample parameter information and the training sample size;
  • the role resource corresponding to the framework role is obtained according to the batch import parameter information.
  • the resource framework information includes a framework identifier
  • the framework role information includes a role identifier
  • the resource acquisition module includes:
  • the sample parameter information includes first basic computing resource data and sample coefficients
  • the second computing subunit includes:
  • RR is the role resource corresponding to the framework role
  • BC₁ is the first basic computing resource data
  • CR is the sample coefficient
  • DS is the training sample size.
  • the batch import parameter information includes the second basic computing resource data and the amount of sample data imported in a single batch;
  • the second computing subunit includes:
  • the role resource corresponding to the framework role is obtained according to the following expression
  • RR is the role resource corresponding to the framework role
  • BC₂ is the second basic computing resource data
  • BS is the amount of sample data imported in a single batch.
  • the topology acquisition module includes:
  • the embodiment of the present application provides an electronic device, including:
  • a processor, a memory, and a program or instruction stored in the memory and executable on the processor.
  • when the program or instruction is executed by the processor, the steps of the computing resource acquisition method described in the first aspect above are implemented.
  • embodiments of the present application provide a readable storage medium on which programs or instructions are stored; when the programs or instructions are executed by a processor, the steps of the computing resource acquisition method described in the first aspect above are implemented.
  • a computer program product includes a computer program stored on a readable storage medium; the computer program includes program instructions, and when the program instructions are executed by a computer, the steps of the computing resource acquisition method described in the first aspect above are implemented.
  • the computing resource acquisition method determines the topology information used for model training based on the scene information and training sample data input by the user, and then calculates the operator resource information of each operator included in the topology information and the target resource information used for model training. This assists users in configuring the computing resources for model training, reduces the interference caused by human factors, and reduces the deviation between the computing resources required for model training and the computing resources configured by users.
  • FIG. 1 is a flowchart of a computing resource acquisition method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a model training workflow provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a computing resource acquisition device provided in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 is a flowchart of a computing resource acquisition method provided in the embodiment of the present application. As shown in FIG. 1, the above computing resource acquisition method includes:
  • Step 101: determine topology information for model training according to scene information and training sample data.
  • Step 102: obtain, according to the topology information, operator resource information of each operator included in the topology information.
  • Step 103: obtain, according to the operator resource information of each operator, target resource information for model training.
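  • Steps 101 to 103 can be read as a three-stage pipeline. The sketch below is only an illustration under stated assumptions: the function name `acquire_target_resources`, the two callables and the dictionary keys are hypothetical, and step 103 is simplified to a per-parameter maximum that ignores the grouping by framework role described later in the text.

```python
from typing import Callable, Dict, List

# Hypothetical type aliases; the source does not define concrete data structures.
Topology = List[str]          # operator (step) names of the training workflow
Resources = Dict[str, float]  # e.g. {"memory_gb": 2.0, "cpu_cores": 1.0}


def acquire_target_resources(
    scene: str,
    sample_size: float,
    determine_topology: Callable[[str, float], Topology],
    operator_resources: Callable[[str, float], Resources],
) -> Resources:
    """Chain steps 101-103; both callables are stand-ins supplied by the caller."""
    # Step 101: topology information from scene information and sample data.
    topology = determine_topology(scene, sample_size)
    # Step 102: operator resource information for every operator in the topology.
    per_operator = [operator_resources(op, sample_size) for op in topology]
    # Step 103 (simplified assumption): keep the largest value of each parameter.
    target: Resources = {}
    for resources in per_operator:
        for name, value in resources.items():
            target[name] = max(target.get(name, value), value)
    return target
```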
  • the topology information used for model training is determined based on the scene information and training sample data input by the user, and the operator resource information of each operator included in the topology information, as well as the target resource information used for model training, is subsequently calculated. This assists users in configuring the computing resources for model training, reduces the interference caused by human factors (such as insufficient work experience or poor working status), and reduces the deviation between the computing resources required for model training and the computing resources configured by users.
  • the target resource information is displayed to the user through information push, so that the user can complete the configuration of computing resources for model training based on the displayed target resource information.
  • the user can adaptively adjust the target resource information (including but not limited to increasing or decreasing the value of some parameters in the target resource information); that is, the computing resources corresponding to the target resource information and the computing resources actually configured by the user for model training may be the same or different, which is not limited in this embodiment of the present application.
  • the information push can be completed in the form of a pop-up window on the computing resource configuration interface used for model training; it can also be completed by automatically filling in each parameter included in the target resource information.
  • the user can choose any one of the above two methods to complete the information push of the target resource information, or choose other methods to complete the information push of the target resource information, which is not limited in this embodiment of the present application.
  • the training sample data includes at least the storage address of the training set used for model training and the training sample size of the training sample data (that is, the computer storage capacity occupied by the training set). For example, when the training set used for model training is an image set, the training sample data includes at least the storage address of the image set and the computer storage capacity occupied by the image set (such as 30M, 300M or 1G).
  • the scene information can be understood as the usage scenario of machine learning, which includes but is not limited to feature processing, classification, regression, image recognition, outlier detection and natural language processing (NLP).
  • users can also adaptively add other information for model training, such as algorithm information, to adapt to model training requirements in different scenarios.
  • the algorithm information includes but is not limited to the linear regression algorithm, the support vector machine (SVM) algorithm, the k-nearest neighbors (KNN) algorithm, the logistic regression algorithm, Decision Tree, K-Means, Random Forest, Naive Bayes and dimensionality reduction.
  • the topology information can be understood as the flow chart information of the workflow corresponding to the model training. The flow chart corresponding to the flow chart information is a directed acyclic graph, and each operator can be understood as a step in that flow chart.
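  • A workflow of this kind can be represented as a dependency mapping and checked for acyclicity. The sketch below uses Python's standard `graphlib` module; the step names are borrowed from the worked example later in the description, while the strictly linear ordering of the edges and the helper names are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is an operator (workflow step); its value is the set of operators
# that must finish first. The linear chain of edges is an assumed ordering.
workflow = {
    "image_classification": set(),
    "pipeline_initialization": {"image_classification"},
    "dataset_splitting": {"pipeline_initialization"},
    "xception": {"dataset_splitting"},
    "multi_classification_evaluation_xception": {"xception"},
    "generate_pipeline_xception": {"multi_classification_evaluation_xception"},
}


def execution_order(graph):
    """Return the operators in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def is_dag(graph):
    """Return True if the graph has no cycle, i.e. it is a valid workflow."""
    try:
        execution_order(graph)
        return True
    except CycleError:
        return False
```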
  • the determining topology information for model training according to scene information and training sample data includes:
  • in this way, the acquisition process of the topology information avoids the interference of human factors and ensures the accuracy of the target resource information obtained later, further reducing the deviation between the computing resources required for model training and the computing resources configured by users.
  • the user can obtain the flow chart information either by retrieval (which requires the user to pre-store the corresponding topology information in an information database) or by the workflow calculation method; this is not limited in the embodiment of the present application.
  • the training sample size and the scene information are input into the pre-acquired workflow calculation model, and the process of obtaining the topology information may be:
  • the topology information is generated based on the flow information.
  • the process of obtaining topology information may also be:
  • the user can be understood as a software user and/or a software program developer who needs to build a workflow
  • the simple code can be understood as code that conforms to general or self-defined code analysis rules and code writing rules, and the simple code includes at least the training sample size of the training sample data and the scene information
  • the workflow construction instruction can be issued after the simple code input is completed, and the simple code starts to be parsed after receiving the workflow construction instruction.
  • the obtaining operator resource information of each operator included in the topology information according to the topology information includes:
  • the resource framework information corresponding to each operator
  • calculating the operator resource information of each operator according to the resource framework information corresponding to each operator includes:
  • the above topology information includes a plurality of operators and resource framework information corresponding to each operator; each resource framework information includes at least one framework role information, and each framework role information corresponds to a framework role.
  • calculating the operator resource information of each operator can be understood as determining at least one framework role corresponding to each operator and calculating the role resources corresponding to each framework role. It should be noted that the framework roles corresponding to one operator all belong to the same resource framework.
  • at least two framework roles belonging to the same resource framework can complete the execution of an operator (that is, a step) through coordination and cooperation. In that case, these framework roles correspond to the operator, and the resource framework to which they belong is the resource framework corresponding to the operator.
  • a resource framework is a type of component library that includes resource framework roles (that is, the aforementioned framework roles).
  • a resource framework includes several types of resource framework roles (Resource Framework Role Type); the roles and tasks that each type assumes during the execution of a specific operator are independent of one another.
  • a resource framework role can be responsible for management, merging and other scheduling tasks (such as Driver or Client), or it can be responsible for executing work tasks (such as Executor or Worker).
  • resource frameworks include but are not limited to stand-alone resource frameworks, PySpark distributed resource frameworks, Dask distributed resource frameworks, TensorFlow2 distributed resource frameworks, and PyTorch distributed resource frameworks.
  • the stand-alone resource framework includes at least the Worker resource framework role
  • the PySpark distributed resource framework includes at least the Driver resource framework role and/or the Executor resource framework role
  • the Dask distributed resource framework includes at least the Client resource framework role and/or the Worker resource framework role
  • the TensorFlow2 distributed resource framework includes at least the Worker resource framework role
  • the PyTorch distributed resource framework includes at least the Worker resource framework role.
  • the resource framework roles included in different resource frameworks are different: resource framework roles with the same name that belong to different resource frameworks are not the same, and the number of resource framework roles may also differ between resource frameworks. For example, the TensorFlow2 resource framework includes the Worker resource framework role, and the Dask resource framework includes the Client resource framework role and the Worker resource framework role; the Worker role in the TensorFlow2 resource framework and the Worker role in the Dask resource framework are different roles.
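  • One way to keep same-named roles apart is to identify a role by the pair of framework and role name. The sketch below lists the roles named in the description; the dictionary keys and the helper `role_key` are illustrative names, not identifiers from the source.

```python
# Roles named in the description for each resource framework. The lowercase
# keys and the helper below are hypothetical, chosen for illustration only.
FRAMEWORK_ROLES = {
    "standalone": ("Worker",),
    "pyspark": ("Driver", "Executor"),
    "dask": ("Client", "Worker"),
    "tensorflow2": ("Worker",),
    "pytorch": ("Worker",),
}


def role_key(framework: str, role: str) -> tuple:
    """Identify a role by (framework, role) so same-named roles stay distinct."""
    if role not in FRAMEWORK_ROLES[framework]:
        raise ValueError(f"{role!r} is not a role of {framework!r}")
    return (framework, role)
```

  • With this keying, `role_key("dask", "Worker")` and `role_key("tensorflow2", "Worker")` compare as different roles, matching the distinction drawn in the text.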
  • the calculating the operator resource information of each operator according to at least one framework role information corresponding to each operator includes:
  • the operator calculation method to calculate the role resource corresponding to the framework role included in the operator, including:
  • the framework role information includes default computing resource information, sample parameter information and batch import parameter information;
  • when the training sample size is greater than a first threshold and smaller than a second threshold, a role resource corresponding to the framework role is obtained according to the sample parameter information and the training sample size;
  • the role resource corresponding to the framework role is obtained according to the batch import parameter information.
  • by comparing the training sample size with the first threshold and the second threshold, the calculation method for the role resources corresponding to a framework role is adaptively adjusted, so that the role resources corresponding to each framework role in each operator are optimal.
  • optimal here can be understood as meaning that the parameters included in the role resource corresponding to a framework role in a given operator take the smallest values.
  • if the training sample size does not affect the application of the algorithm indicated by the algorithm information (for example, the algorithm imports the training set in batches during the training phase, and the data volume imported in each batch is less than the first threshold), the role resource corresponding to each piece of framework role information is obtained according to its default computing resource information; if the training sample size does affect the application of the algorithm, the training sample size is compared with the first threshold and the second threshold, and the corresponding one of the above three calculation methods is selected to calculate the role resource corresponding to the framework role.
  • when the training sample size is less than or equal to the first threshold, the training sample size can be understood as too small; when the training sample size is greater than or equal to the second threshold, it can be understood as too large.
  • batch importing (splitting training sample data whose data volume is too large into training subsets with a small data volume and importing the subsets one by one) ensures that the model training process proceeds normally.
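  • The selection among the three calculation methods can be sketched as a simple comparison. This is a reading of the text under stated assumptions: the names `Mode` and `select_mode` are hypothetical, the mapping of "too small" to the default resources and "too large" to batch importing is inferred from the surrounding description, and no threshold values are given in the source.

```python
from enum import Enum


class Mode(Enum):
    DEFAULT = "default"  # use the default computing resource information
    SAMPLE = "sample"    # use the sample parameter information and sample size
    BATCH = "batch"      # use the batch import parameter information


def select_mode(
    sample_size: float,
    first_threshold: float,
    second_threshold: float,
    size_affects_algorithm: bool = True,
) -> Mode:
    """Pick a calculation method from the training sample size (assumed mapping)."""
    if first_threshold >= second_threshold:
        raise ValueError("first_threshold must be smaller than second_threshold")
    # If the sample size does not affect the algorithm, defaults are used.
    if not size_affects_algorithm or sample_size <= first_threshold:
        return Mode.DEFAULT
    if sample_size >= second_threshold:
        return Mode.BATCH
    return Mode.SAMPLE
```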
  • the role resource corresponding to the framework role includes but is not limited to memory, central processing unit (CPU) and graphics processing unit (GPU) resources.
  • the sample parameter information includes first basic computing resource data and sample coefficients
  • obtaining the role resource corresponding to the framework role includes:
  • RR is the role resource corresponding to the framework role
  • BC₁ is the first basic computing resource data
  • CR is the sample coefficient
  • DS is the training sample size.
  • the batch import parameter information includes the second basic computing resource data and the amount of sample data imported in a single batch;
  • obtaining the role resource corresponding to the framework role according to the batch import parameter information includes:
  • the role resource corresponding to the framework role is obtained according to the following expression
  • RR is the role resource corresponding to the framework role
  • BC₂ is the second basic computing resource data
  • BS is the amount of sample data imported in a single batch.
  • the first basic computing resource data and the second basic computing resource data may be the same or different, and the user may adaptively adjust these two values, which is not limited in this embodiment of the present application.
  • the process of obtaining the role resource corresponding to the framework role according to the default computing resource information may include setting the default computing resource information as the role resource corresponding to the framework role.
  • for example, if the default computing resource information is 2G memory, 1-core CPU and 1-core GPU, then the role resource corresponding to the framework role obtained according to the default computing resource information is also 2G memory, 1-core CPU and 1-core GPU.
  • the resource framework information includes a framework identifier
  • the framework role information includes a role identifier
  • the obtaining target resource information for model training according to the operator resource information of each operator includes:
  • the process of determining the maximum role resource from all obtained role resources may be:
  • the PySpark distributed resource framework includes the Driver resource framework role (role ID) and Executor resource framework role (role ID).
  • the role resources of the Driver resource framework role corresponding to the No. 1 operator are set to 2G memory, 1-core CPU and 2-core GPU
  • the role resources of the Driver resource framework role corresponding to the No. 2 operator are 3G memory, 2-core CPU and 1-core GPU.
  • the recommended computing resources for the Driver resource framework role in the PySpark distributed resource framework are then 3G memory, 2-core CPU and 2-core GPU (that is, the maximum role resource corresponding to the Driver role in the PySpark framework, taken parameter by parameter).
  • the recommended computing resources for the Executor role in the PySpark distributed resource framework are 2G memory, 2-core CPU and 2-core GPU (that is, the maximum role resources corresponding to the Executor role in the PySpark framework).
  • the recommended computing resources corresponding to the PySpark distributed resource framework in the target resource information are:
  • Driver resource framework role: 3G memory, 2-core CPU and 2-core GPU;
  • Executor resource framework role: 2G memory, 2-core CPU and 2-core GPU.
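  • The Driver figures in this example are consistent with taking the maximum of each parameter across operators, grouped by framework and role. The sketch below illustrates that reading; the function name and dictionary keys are hypothetical, and the GPU value for operator No. 2 is an assumption based on the truncated "1 core" in the text.

```python
from collections import defaultdict


def max_role_resources(operator_role_resources):
    """Group by (framework, role) and keep the largest value of each parameter.

    operator_role_resources: iterable of (framework, role, {parameter: value}).
    """
    result = defaultdict(dict)
    for framework, role, resources in operator_role_resources:
        current = result[(framework, role)]
        for parameter, value in resources.items():
            current[parameter] = max(current.get(parameter, value), value)
    return dict(result)


driver_inputs = [
    # Operator No. 1: 2G memory, 1-core CPU, 2-core GPU.
    ("PySpark", "Driver", {"memory_gb": 2, "cpu_cores": 1, "gpu_cores": 2}),
    # Operator No. 2: 3G memory, 2-core CPU; the 1-core GPU is an assumed reading.
    ("PySpark", "Driver", {"memory_gb": 3, "cpu_cores": 2, "gpu_cores": 1}),
]
recommended = max_role_resources(driver_inputs)
```

  • Here `recommended[("PySpark", "Driver")]` holds 3G memory, 2 CPU cores and 2 GPU cores, the same Driver recommendation stated in the example.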
  • the computer storage capacity that the training sample occupies is set as 30M
  • the scene information of the model is classification
  • the algorithm information for training the model is the perception network (Xception) in the TensorFlow2 distributed algorithm.
  • based on the training sample data, the topology information shown in FIG. 2 is obtained; the topology information includes an image classification step, a pipeline (Pipeline) initialization step, a data set splitting step, an Xception step, a multi-classification evaluation Xception step and a generate pipeline_Xception step.
  • the image classification step, the pipeline initialization step, the data set splitting step, the multi-classification evaluation step based on the perception network, and the step of generating the target perception network all belong to the stand-alone resource framework, while the perception network training step belongs to the TensorFlow2 distributed resource framework; that is, the target resource information includes the framework computing resource information of the stand-alone resource framework and the framework computing resource information of the TensorFlow2 distributed resource framework.
  • each step is set to correspond to only one resource framework role, and steps that correspond to the same resource framework have the same resource framework role.
  • the role configuration information of the resource framework role included in the image classification step is shown in Table 1:
  • the role configuration information of the resource framework role included in the pipeline initialization step is shown in Table 2:
  • the role configuration information of the resource framework role included in the dataset splitting step is shown in Table 3:
  • the role configuration information of the resource framework role included in the multi-classification evaluation step based on the perception network is shown in Table 4:
  • the role configuration information of the resource framework role included in the training step of the perception network is shown in Table 6:
  • the target resource information is:
  • Stand-alone resource framework: 1-core CPU, 3G memory, quantity 1;
  • TensorFlow2 distributed resource framework: 1-core CPU, 8G memory, quantity 3.
  • each step can be executed by only one framework role.
  • the parameter useDefaultFlag being 1 can be understood as the training sample size being less than or equal to the first threshold
  • the parameter useBatchSizeRatioFlag being 1 can be understood as the training sample size being greater than or equal to the second threshold
  • the parameter useDefaultFlag being 0 and the parameter useBatchSizeRatioFlag being 0 can be understood as the training sample size being greater than the first threshold and less than the second threshold.
  • the data obtained by dividing the parameter batchSize by the parameter batchSizeRatio can be understood as the amount of sample data imported in a single batch, and the parameter baseCapacity can be understood as the aforementioned first basic computing resource data and second basic computing resource data.
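  • The role configuration parameters described here can be interpreted mechanically. The sketch below is an illustration only: the parameter names come from the text, but the function name, the returned keys and the idea of passing the configuration as a dictionary are assumptions, and the tables that hold the actual values are not reproduced in this text.

```python
def interpret_role_config(config: dict) -> dict:
    """Map the role configuration flags to a calculation method (assumed layout)."""
    use_default = config.get("useDefaultFlag", 0) == 1
    use_batch = config.get("useBatchSizeRatioFlag", 0) == 1
    if use_default and use_batch:
        raise ValueError("useDefaultFlag and useBatchSizeRatioFlag cannot both be 1")

    if use_default:
        mode = "default"  # sample size <= first threshold
    elif use_batch:
        mode = "batch"    # sample size >= second threshold
    else:
        mode = "sample"   # first threshold < sample size < second threshold

    result = {"mode": mode, "base_capacity": config.get("baseCapacity")}
    if mode == "batch":
        # Amount of sample data imported in a single batch, as described above.
        result["single_batch_amount"] = config["batchSize"] / config["batchSizeRatio"]
    return result
```

  • For instance, a configuration with useBatchSizeRatioFlag set to 1, batchSize 64 and batchSizeRatio 4 (values chosen arbitrarily for illustration) yields a single-batch amount of 16.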
  • the target resource information can be understood as multiple resource frameworks, the framework roles included in each resource framework, and the maximum role resource corresponding to each framework role.
  • the topology acquisition module 201 is configured to determine topology information for model training according to scene information and training sample data;
  • An operator obtaining module 202 configured to obtain operator resource information of each operator included in the topology information according to the topology information;
  • the resource obtaining module 203 is configured to obtain target resource information for model training according to the operator resource information of each operator.
  • the operator acquisition module 202 includes:
  • a framework acquisition submodule configured to determine resource framework information corresponding to each operator according to the topology information
  • the operator acquisition sub-module is configured to calculate the operator resource information of each operator according to the resource framework information corresponding to each operator.
  • the operator acquisition submodule includes:
  • the role acquisition unit is configured to determine at least one piece of framework role information corresponding to each operator according to the resource framework information corresponding to each operator;
  • the calculation unit is configured to calculate operator resource information of each operator according to at least one framework role information corresponding to each operator.
  • the calculation unit includes:
  • the first calculation subunit is configured to determine the resource calculation method of each operator according to the training sample size of the training sample data;
  • the second calculation subunit is configured to calculate the role resource corresponding to the framework role included in the operator by using the resource calculation method according to at least one framework role information corresponding to the operator;
  • the third calculation subunit is configured to obtain operator resource information of the operator according to the role resource corresponding to the framework role included in the operator.
  • the second calculation subunit includes:
  • the framework role information includes default computing resource information, sample parameter information and batch import parameter information;
  • when the training sample size is greater than a first threshold and smaller than a second threshold, the role resource corresponding to the framework role is obtained according to the sample parameter information and the training sample size;
  • when the training sample size is greater than or equal to the second threshold, the role resource corresponding to the framework role is obtained according to the batch import parameter information.
  • the resource framework information includes a framework identifier
  • the framework role information includes a role identifier
  • the resource acquisition module 203 includes:
  • the sample parameter information includes first basic computing resource data and sample coefficients
  • the second computing subunit includes:
  • the role resource corresponding to the framework role is obtained according to the following expression:
  • RR is the role resource corresponding to the framework role
  • BC 1 is the first basic computing resource data
  • CR is the sample coefficient
  • DS is the training sample size.
  • the batch import parameter information includes the second basic computing resource data and the amount of sample data imported in a single batch;
  • the second computing subunit includes:
  • the role resource corresponding to the framework role is obtained according to the following expression:
  • RR is the role resource corresponding to the framework role
  • BC 2 is the second basic computing resource data
  • BS is the amount of sample data imported in a single batch.
  • the topology acquisition module 201 includes:
  • the computing resource acquisition apparatus 200 in the embodiment of the present application may be an apparatus, or may be a component, an integrated circuit, or a chip in an electronic device.
  • FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device includes: a bus 301, a transceiver 302, an antenna 303, a bus interface 304, a processor 305, and a memory 306.
  • the processor 305 can implement the various processes of the above embodiments of the method for obtaining computing resources, and can achieve the same technical effect. To avoid repetition, details are not repeated here.
  • the bus 301 may include any number of interconnected buses and bridges, and links together various circuits including one or more processors represented by the processor 305 and memory represented by the memory 306.
  • the bus 301 may also link together various other circuits such as peripherals, voltage regulators, and power management circuits, etc., which are well known in the art and thus will not be further described herein.
  • the bus interface 304 provides an interface between the bus 301 and the transceiver 302.
  • Transceiver 302 may be a single element or multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other devices over a transmission medium.
  • the data processed by the processor 305 is transmitted over a wireless medium through the antenna 303; the antenna 303 also receives data and passes it to the processor 305.
  • the processor 305 is responsible for managing the bus 301 and general processing, and may also provide various functions including timing, peripheral interfacing, voltage regulation, power management, and other control functions. The memory 306 may be used to store data used by the processor 305 when performing operations.
  • the processor 305 can be a CPU, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a complex programmable logic device (CPLD).
  • the embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, each process of the above-mentioned method embodiment can be realized, and the same technical effect can be achieved. To avoid repetition, details are not repeated here.
  • the computer-readable storage medium is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
  • the embodiment of the present application further provides a computer program product; the computer program product includes a computer program stored on a readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the steps of the above computing resource acquisition method are implemented.
  • the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes over the related art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions to enable a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a second terminal device, etc.) to execute the methods described in the embodiments of the present application.

Abstract

The present application provides a computing resource acquisition method and apparatus, an electronic device, and a storage medium. The method comprises: determining, according to scenario information and training sample data, topology information for model training; obtaining, according to the topology information, operator resource information of each operator included in the topology information; and obtaining, according to the operator resource information of each operator, target resource information for model training.

Description

A computing resource acquisition method and apparatus, electronic device, and storage medium
Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 202111411238.X, filed in China on November 25, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of machine learning, and in particular to a computing resource acquisition method and apparatus, an electronic device, and a storage medium.
Background
Algorithm models based on machine learning offer short prediction time and high prediction accuracy once sufficiently trained, and are widely used in various fields.
During training, the computing resources required differ from one algorithm model to another. These computing resources must be configured manually by the user, and human factors can cause a large deviation between the computing resources the training actually requires and those the user configures. If too few computing resources are configured, memory overflow is likely; if more are configured than required, resources are wasted.
Summary
The purpose of the embodiments of the present application is to provide a computing resource acquisition method and apparatus, an electronic device, and a storage medium that solve the problem of a large deviation between the computing resources required for algorithm model training and the computing resources configured by the user.
In a first aspect, an embodiment of the present application provides a computing resource acquisition method, including:
determining, according to scene information and training sample data, topology information for model training;
obtaining, according to the topology information, operator resource information of each operator included in the topology information; and
obtaining, according to the operator resource information of each operator, target resource information for model training.
Optionally, obtaining, according to the topology information, the operator resource information of each operator included in the topology information includes:
determining, according to the topology information, resource framework information corresponding to each operator; and
calculating the operator resource information of each operator according to the resource framework information corresponding to each operator.
Optionally, calculating the operator resource information of each operator according to the resource framework information corresponding to each operator includes:
determining, according to the resource framework information corresponding to each operator, at least one piece of framework role information corresponding to each operator; and
calculating the operator resource information of each operator according to the at least one piece of framework role information corresponding to each operator.
Optionally, calculating the operator resource information of each operator according to the at least one piece of framework role information corresponding to each operator includes:
determining the resource calculation method of each operator according to the training sample size of the training sample data;
calculating, according to the at least one piece of framework role information corresponding to an operator and using the resource calculation method, the role resource corresponding to each framework role included in the operator; and
obtaining the operator resource information of the operator according to the role resources corresponding to the framework roles included in the operator.
Optionally, calculating, according to the at least one piece of framework role information corresponding to the operator and using the resource calculation method, the role resource corresponding to each framework role included in the operator includes the following, where the framework role information includes default computing resource information, sample parameter information, and batch import parameter information:
when the training sample size is less than or equal to a first threshold, obtaining the role resource corresponding to the framework role according to the default computing resource information;
when the training sample size is greater than the first threshold and less than a second threshold, obtaining the role resource corresponding to the framework role according to the sample parameter information and the training sample size;
when the training sample size is greater than or equal to the second threshold, obtaining the role resource corresponding to the framework role according to the batch import parameter information.
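The three cases above amount to choosing a calculation mode from the training sample size. The following is a minimal sketch of that selection only, not the applicant's implementation: the names `ResourceMode` and `select_resource_mode` are hypothetical, the threshold values are supplied by the caller, and the expressions used within each mode are deliberately left out.

```python
from enum import Enum


class ResourceMode(Enum):
    """Hypothetical labels for the three calculation modes described in the text."""
    DEFAULT = "default"            # use the default computing resource information
    SAMPLE_PARAMS = "sample"       # use the sample parameter information and the sample size
    BATCH_IMPORT = "batch_import"  # use the batch import parameter information


def select_resource_mode(sample_size: float,
                         first_threshold: float,
                         second_threshold: float) -> ResourceMode:
    """Pick the calculation mode from the training sample size."""
    if first_threshold >= second_threshold:
        raise ValueError("first_threshold must be smaller than second_threshold")
    if sample_size <= first_threshold:
        return ResourceMode.DEFAULT
    if sample_size < second_threshold:
        return ResourceMode.SAMPLE_PARAMS
    return ResourceMode.BATCH_IMPORT
```

Note that the boundaries follow the text: a sample size equal to the first threshold selects the default mode, and a sample size equal to the second threshold selects the batch import mode.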
Optionally, the resource framework information includes a framework identifier, and the framework role information includes a role identifier;
obtaining, according to the operator resource information of each operator, the target resource information for model training includes:
obtaining all role resources that share the same framework identifier and the same role identifier;
determining the maximum role resource among the obtained role resources; and
obtaining the target resource information for model training based on the maximum role resource.
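The grouping and maximum selection described above can be sketched as follows. This is an illustration under stated assumptions: each role resource is represented as a hypothetical `(framework_id, role_id, amount)` tuple with a single numeric amount, and the function name `max_role_resources` is not taken from the application.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record: (framework identifier, role identifier, resource amount).
RoleResource = Tuple[str, str, float]


def max_role_resources(role_resources: Iterable[RoleResource]) -> Dict[Tuple[str, str], float]:
    """Keep, for every (framework_id, role_id) pair, the largest resource amount seen."""
    target: Dict[Tuple[str, str], float] = {}
    for framework_id, role_id, amount in role_resources:
        key = (framework_id, role_id)
        if key not in target or amount > target[key]:
            target[key] = amount
    return target
```

Taking the maximum per pair means that a role shared by several operators is sized for its most demanding operator.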
Optionally, the sample parameter information includes first basic computing resource data and a sample coefficient;
obtaining, when the training sample size is greater than the first threshold and less than the second threshold, the role resource corresponding to the framework role according to the sample parameter information and the training sample size includes:
when the training sample size is greater than the first threshold and less than the second threshold, obtaining the role resource corresponding to the framework role according to the following expression:
Figure PCTCN2022125905-appb-000001
where RR is the role resource corresponding to the framework role, BC 1 is the first basic computing resource data, CR is the sample coefficient, and DS is the training sample size.
Optionally, the batch import parameter information includes second basic computing resource data and the amount of sample data imported in a single batch;
obtaining, when the training sample size is greater than or equal to the second threshold, the role resource corresponding to the framework role according to the batch import parameter information includes:
when the training sample size is greater than or equal to the second threshold, obtaining the role resource corresponding to the framework role according to the following expression:
Figure PCTCN2022125905-appb-000002
where RR is the role resource corresponding to the framework role, BC 2 is the second basic computing resource data, and BS is the amount of sample data imported in a single batch.
Optionally, determining, according to the scene information and the training sample data, the topology information for model training includes:
searching an information base using the training sample size of the training sample data and the scene information as retrieval conditions to obtain the topology information;
or,
inputting the training sample size and the scene information into a pre-acquired workflow calculation model to obtain the topology information.
In a second aspect, an embodiment of the present application provides a computing resource acquisition apparatus, including:
a topology acquisition module, configured to determine topology information for model training according to scene information and training sample data;
an operator acquisition module, configured to obtain, according to the topology information, operator resource information of each operator included in the topology information; and
a resource acquisition module, configured to obtain target resource information for model training according to the operator resource information of each operator.
Optionally, the operator acquisition module includes:
a framework acquisition submodule, configured to determine, according to the topology information, resource framework information corresponding to each operator; and
an operator acquisition submodule, configured to calculate the operator resource information of each operator according to the resource framework information corresponding to each operator.
Optionally, the operator acquisition submodule includes:
a role acquisition unit, configured to determine, according to the resource framework information corresponding to each operator, at least one piece of framework role information corresponding to each operator; and
a calculation unit, configured to calculate the operator resource information of each operator according to the at least one piece of framework role information corresponding to each operator.
Optionally, the calculation unit includes:
a first calculation subunit, configured to determine the resource calculation method of each operator according to the training sample size of the training sample data;
a second calculation subunit, configured to calculate, according to the at least one piece of framework role information corresponding to an operator and using the resource calculation method, the role resource corresponding to each framework role included in the operator; and
a third calculation subunit, configured to obtain the operator resource information of the operator according to the role resources corresponding to the framework roles included in the operator.
Optionally, the framework role information includes default computing resource information, sample parameter information, and batch import parameter information, and the second calculation subunit is configured to:
when the training sample size is less than or equal to a first threshold, obtain the role resource corresponding to the framework role according to the default computing resource information;
when the training sample size is greater than the first threshold and less than a second threshold, obtain the role resource corresponding to the framework role according to the sample parameter information and the training sample size;
when the training sample size is greater than or equal to the second threshold, obtain the role resource corresponding to the framework role according to the batch import parameter information.
Optionally, the resource framework information includes a framework identifier, and the framework role information includes a role identifier;
the resource acquisition module is configured to:
obtain all role resources that share the same framework identifier and the same role identifier;
determine the maximum role resource among the obtained role resources; and
obtain the target resource information for model training based on the maximum role resource.
Optionally, the sample parameter information includes first basic computing resource data and a sample coefficient;
the second calculation subunit is configured to:
when the training sample size is greater than the first threshold and less than the second threshold, obtain the role resource corresponding to the framework role according to the following expression:
Figure PCTCN2022125905-appb-000003
where RR is the role resource corresponding to the framework role, BC 1 is the first basic computing resource data, CR is the sample coefficient, and DS is the training sample size.
Optionally, the batch import parameter information includes second basic computing resource data and the amount of sample data imported in a single batch;
the second calculation subunit is configured to:
when the training sample size is greater than or equal to the second threshold, obtain the role resource corresponding to the framework role according to the following expression:
Figure PCTCN2022125905-appb-000004
where RR is the role resource corresponding to the framework role, BC 2 is the second basic computing resource data, and BS is the amount of sample data imported in a single batch.
Optionally, the topology acquisition module is configured to:
search an information base using the training sample size of the training sample data and the scene information as retrieval conditions to obtain the topology information;
or,
input the training sample size and the scene information into a pre-acquired workflow calculation model to obtain the topology information.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a processor, a memory, and a program or instructions stored in the memory and executable on the processor, where the program or instructions, when executed by the processor, implement the steps of the computing resource acquisition method described in the first aspect.
In a fourth aspect, an embodiment of the present application provides a readable storage medium storing a program or instructions that, when executed by a processor, implement the steps of the computing resource acquisition method described in the first aspect.
In a fifth aspect, a computer program product is provided. The computer program product includes a computer program stored on a readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the steps of the computing resource acquisition method described in the first aspect are implemented.
The computing resource acquisition method provided by the embodiments of the present application determines the topology information for model training from the scene information and training sample data input by the user, and then calculates the operator resource information of each operator included in the topology information and the target resource information for model training. This assists the user in configuring the computing resources for model training, reduces interference from human factors, and reduces the deviation between the computing resources required for model training and those configured by the user.
Brief Description of the Drawings
FIG. 1 is a flowchart of a computing resource acquisition method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a model training workflow provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a computing resource acquisition apparatus provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, rather than all, of the embodiments of the present application. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments in this application without creative effort fall within the scope of protection of this application.
Referring to FIG. 1, FIG. 1 is a flowchart of a computing resource acquisition method provided by an embodiment of the present application. As shown in FIG. 1, the computing resource acquisition method includes:
Step 101: determine topology information for model training according to scene information and training sample data.
Step 102: obtain, according to the topology information, operator resource information of each operator included in the topology information.
Step 103: obtain target resource information for model training according to the operator resource information of each operator.
As described above, the topology information for model training is determined from the scene information and training sample data input by the user; the operator resource information of each operator included in the topology information and the target resource information for model training are then calculated. This assists the user in configuring the computing resources for model training, reduces interference from human factors (such as insufficient work experience or poor working condition), and reduces the deviation between the computing resources required for model training and those configured by the user. In other words, while ensuring that the configured computing resources are fully used, it improves the efficiency of model training and avoids the situation in which computing resources overflow during training and the user has to reconfigure them.
It should be noted that, after the target resource information is determined, it is presented to the user by information push so that the user can configure the computing resources for model training based on it. In practice, the user may adjust the target resource information as appropriate (including but not limited to increasing or decreasing the values of some of its parameters); that is, the computing resources the user actually configures for model training may be the same as or different from those corresponding to the target resource information, which is not limited in the embodiments of the present application.
For example, the information push may take the form of a pop-up window in the computing resource configuration interface for model training, or the parameters included in the target resource information may be filled in automatically in that interface. In practice, the user may choose either of these two approaches, or another approach, to push the target resource information, which is not limited in the embodiments of the present application.
The training sample data includes at least the storage address of the training set used for model training and the training sample size of the training sample data (that is, the computer storage capacity occupied by the training set). For example, when the training set is an image set composed of multiple images, the training sample data includes at least the storage address of the image set and the computer storage capacity it occupies (such as 30M, 300M, or 1G).
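Because the training sample size is defined as the storage capacity occupied by the training set, it can be measured directly from the storage address. The sketch below assumes the storage address is a path to a regular file or directory on a local file system; the function name `training_sample_size_bytes` is hypothetical and not part of the application.

```python
import os


def training_sample_size_bytes(storage_address: str) -> int:
    """Return the number of bytes occupied by a training set stored at a local path."""
    if os.path.isfile(storage_address):
        return os.path.getsize(storage_address)
    total = 0
    # Walk the directory tree and add up the size of every file found.
    for root, _dirs, files in os.walk(storage_address):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total
```

A training set held in remote or object storage would need that storage system's own size query instead.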
The scene information can be understood as the machine learning usage scenario, which includes but is not limited to feature processing, classification, regression, image recognition, outlier detection, and natural language processing (NLP).
Optionally, when determining the topology information for model training according to the scene information and training sample data, the user may also add other information for model training to suit the training requirements of different scenarios. For example, algorithm information may be added, including but not limited to linear regression, support vector machine (SVM), k-nearest neighbors (KNN), logistic regression, decision tree, k-means, random forest, naive Bayes, and dimensionality reduction algorithms.
需要注意的是,上述拓扑信息可理解为模型训练对应的工作流的流程图信息,该流程图信息所对应的流程图为有向无环图,上述算子可理解为流程图信息对应的流程图中的步 骤。It should be noted that the above topology information can be understood as the flow chart information of the workflow corresponding to the model training, the flow chart corresponding to the flow chart information is a directed acyclic graph, and the above operator can be understood as the process corresponding to the flow chart information steps in the diagram.
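The relationship between topology information, the directed acyclic graph, and operators can be sketched as follows. This is only an illustration: the operator names and the dictionary layout are assumptions made for the example, not part of the application, and the standard-library `graphlib` module (Python 3.9 or later) is used merely as a convenient way to order a directed acyclic graph.

```python
from graphlib import TopologicalSorter

# Hypothetical operator names, chosen only for illustration.
# Each key is an operator (a workflow step); each value is the set of
# operators that must finish before that operator can run.
topology = {
    "load_data": set(),
    "split_dataset": {"load_data"},
    "train_model": {"split_dataset"},
    "evaluate_model": {"train_model", "split_dataset"},
}


def execution_order(graph):
    """Return one valid operator order.

    Raises graphlib.CycleError if the graph contains a cycle, that is,
    if it is not a directed acyclic graph.
    """
    return list(TopologicalSorter(graph).static_order())


order = execution_order(topology)
print(order)
# ['load_data', 'split_dataset', 'train_model', 'evaluate_model']
```

Because every operator in this small graph depends on the one before it, only one execution order is possible; a workflow with independent branches would admit several.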
Optionally, determining the topology information for model training according to the scene information and the training sample data includes:
searching an information base, using the training sample size of the training sample data and the scene information as retrieval conditions, to obtain the topology information;
or,
inputting the training sample size and the scene information into a pre-acquired workflow calculation model to obtain the topology information.
With the above arrangement, the process of obtaining the topology information avoids interference from human factors, which ensures the accuracy of the subsequently obtained target resource information and further reduces the deviation between the computing resources required for model training and the computing resources configured by the user. In practice, the user may obtain the flowchart information either by retrieval (in which case the corresponding topology information must be stored in the information base in advance) or by the workflow calculation model; the embodiments of the present application do not limit this.
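The retrieval option can be pictured as a keyed lookup. The following is a minimal sketch under stated assumptions: the in-memory dictionary standing in for the information base, the size buckets, the 100 MB boundary, and the operator names are all hypothetical, since the application does not specify how the information base is organized.

```python
from typing import List, Optional

# Hypothetical information base: (scene, size bucket) -> operators of a
# pre-stored topology. A real information base could be a database table.
INFO_BASE = {
    ("classification", "small"): ["load_data", "train_model", "evaluate_model"],
    ("classification", "large"): [
        "load_data", "split_dataset", "train_model", "evaluate_model",
    ],
}

SMALL_LIMIT_MB = 100  # assumed bucket boundary, not taken from the application


def size_bucket(sample_size_mb: float) -> str:
    """Map a training sample size to a coarse bucket used as a retrieval key."""
    return "small" if sample_size_mb <= SMALL_LIMIT_MB else "large"


def lookup_topology(scene: str, sample_size_mb: float) -> Optional[List[str]]:
    """Return the pre-stored topology for the given conditions, or None."""
    return INFO_BASE.get((scene, size_bucket(sample_size_mb)))


print(lookup_topology("classification", 30))
```

A `None` result corresponds to the case where no topology was stored in advance for the given conditions, in which case the workflow calculation model would be the remaining option.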
Exemplarily, the process of inputting the training sample size and the scene information into the pre-acquired workflow calculation model to obtain the topology information may be:
obtaining the training sample size and the scene information input by the user;
processing the training sample size and the scene information based on the workflow calculation model to construct the process information of model training;
generating the topology information based on the process information.
Exemplarily, the process of obtaining the topology information may also be:
in response to a workflow construction instruction, parsing the simple code input by the user to obtain the identifiers and parameters of the called operators; and calling the operators according to those identifiers and parameters to construct the workflow and obtain the topology information corresponding to the workflow.
Here, the user can be understood as a software user and/or a software developer who needs to construct a workflow. The simple code can be understood as code that conforms to general or custom code parsing rules and code writing rules, and it includes at least the training sample size of the training sample data and the scene information. The workflow construction instruction may be issued after input of the simple code is complete; parsing of the simple code begins once the workflow construction instruction is received.
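The parsing step described above can be sketched as follows. The application does not define the syntax of the simple code, so the line format used here (`<operator_id> key=value ...`), the operator names, and the rule that each operator depends on the previous one are all assumptions made for illustration.

```python
import shlex


def parse_simple_code(code):
    """Parse lines of the assumed form '<operator_id> key=value ...'.

    Returns a list of (called operator identifier, called operator parameters).
    """
    calls = []
    for raw in code.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        operator_id, *pairs = shlex.split(line)
        params = dict(pair.split("=", 1) for pair in pairs)
        calls.append((operator_id, params))
    return calls


def build_workflow(calls):
    """Chain the called operators in order; each step depends on the previous one."""
    topology = {}
    previous = None
    for operator_id, params in calls:
        topology[operator_id] = {
            "params": params,
            "depends_on": [previous] if previous else [],
        }
        previous = operator_id
    return topology


simple_code = """
# hypothetical simple code
dataset scene=classification sample_size=30M
xception epochs=10
"""
workflow = build_workflow(parse_simple_code(simple_code))
print(workflow["xception"]["depends_on"])  # ['dataset']
```

In this sketch, calling `build_workflow` plays the part of acting on the workflow construction instruction: nothing is parsed until the user has finished entering the simple code.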
Optionally, obtaining, according to the topology information, the operator resource information of each operator included in the topology information includes:
determining, according to the topology information, the resource framework information corresponding to each operator;
calculating the operator resource information of each operator according to the resource framework information corresponding to that operator.
Optionally, calculating the operator resource information of each operator according to the resource framework information corresponding to each operator includes:
determining at least one piece of framework role information corresponding to each operator according to the resource framework information corresponding to that operator;
calculating the operator resource information of each operator according to the at least one piece of framework role information corresponding to that operator.
The topology information includes multiple operators and the resource framework information corresponding to each operator. Each piece of resource framework information includes at least one piece of framework role information, and each piece of framework role information corresponds to one framework role.
As described above, calculating the operator resource information of each operator can be understood as determining the at least one framework role corresponding to each operator and calculating the role resource corresponding to each framework role. It should be noted that the framework roles corresponding to a given operator all belong to the same resource framework.
When at least two framework roles belonging to the same resource framework cooperate to complete the execution of an operator (that is, a step), those framework roles correspond to that operator, and the resource framework to which they belong is the resource framework corresponding to that operator.
It should be noted that an operator may also be executed by a single framework role.
A resource framework is a type of component library that includes resource framework roles (that is, the aforementioned framework roles). A resource framework includes several groups of resource framework roles (Resource Framework Role Type), and the role tasks undertaken by each group during the execution of a specific operator are independent of one another. For example, a resource framework role may be one responsible for scheduling work such as management, scheduling, and merging (such as a Driver or a Client), or one responsible for task execution (such as an Executor or a Worker).
Exemplarily, the resource frameworks include but are not limited to a stand-alone resource framework, the PySpark distributed resource framework, the Dask distributed resource framework, the TensorFlow2 distributed resource framework, and the PyTorch distributed resource framework.
The stand-alone resource framework includes at least the Worker resource framework role; the PySpark distributed resource framework (PySpark) includes at least the Driver resource framework role and/or the Executor resource framework role; the Dask distributed resource framework includes at least the Client resource framework role and/or the Worker resource framework role; the TensorFlow2 distributed resource framework includes at least the Worker resource framework role; and the PyTorch distributed resource framework includes at least the Worker resource framework role.
The resource framework roles included in different resource frameworks differ: resource framework roles with the same name that belong to different resource frameworks are not the same role, and the number of resource framework roles also varies from one resource framework to another. For example, the TensorFlow2 resource framework includes the Worker resource framework role, and the Dask resource framework includes the Client and Worker resource framework roles; the Worker role in the TensorFlow2 resource framework is different from the Worker role in the Dask resource framework.
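One way to capture the point that same-named roles in different frameworks are distinct is to identify a role by the pair (framework, role name). The sketch below assumes that representation; the dictionary layout and the `role_key` helper are illustrative, while the framework and role names are those listed above.

```python
# Resource frameworks and the roles listed for them in the text above.
FRAMEWORK_ROLES = {
    "standalone": ("Worker",),
    "PySpark": ("Driver", "Executor"),
    "Dask": ("Client", "Worker"),
    "TensorFlow2": ("Worker",),
    "PyTorch": ("Worker",),
}


def role_key(framework, role):
    """Identify a framework role by (framework, role name).

    Using the pair as the key keeps same-named roles of different
    frameworks apart.
    """
    if role not in FRAMEWORK_ROLES[framework]:
        raise ValueError(f"{framework} has no role named {role}")
    return (framework, role)


# The Worker role of TensorFlow2 and the Worker role of Dask are different roles.
print(role_key("TensorFlow2", "Worker") == role_key("Dask", "Worker"))  # False
```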
Optionally, calculating the operator resource information of each operator according to the at least one piece of framework role information corresponding to each operator includes:
determining the resource calculation method of each operator according to the training sample size of the training sample data;
calculating, according to the at least one piece of framework role information corresponding to the operator and using the resource calculation method, the role resource corresponding to each framework role included in the operator;
obtaining the operator resource information of the operator according to the role resources corresponding to the framework roles included in the operator.
Optionally, calculating, according to the at least one piece of framework role information corresponding to the operator and using the resource calculation method, the role resource corresponding to each framework role included in the operator includes the following, where:
the framework role information includes default computing resource information, sample parameter information, and batch import parameter information;
when the training sample size is less than or equal to a first threshold, the role resource corresponding to the framework role is obtained according to the default computing resource information;
when the training sample size is greater than the first threshold and less than a second threshold, the role resource corresponding to the framework role is obtained according to the sample parameter information and the training sample size;
when the training sample size is greater than or equal to the second threshold, the role resource corresponding to the framework role is obtained according to the batch import parameter information.
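The three cases above amount to a selection on the training sample size. The following is a minimal sketch of that selection only; the threshold values in the usage line are hypothetical, and the formulas applied in the second and third cases are not reproduced here.

```python
from enum import Enum


class Method(Enum):
    """Which part of the framework role information drives the calculation."""
    DEFAULT = "default computing resource information"
    SAMPLE = "sample parameter information"
    BATCH = "batch import parameter information"


def select_method(sample_size, first_threshold, second_threshold):
    """Pick the resource calculation method from the training sample size."""
    if first_threshold >= second_threshold:
        raise ValueError("first threshold must be smaller than second threshold")
    if sample_size <= first_threshold:
        return Method.DEFAULT
    if sample_size < second_threshold:
        return Method.SAMPLE
    return Method.BATCH


# Hypothetical thresholds of 100 and 1000 (for example, in megabytes).
print(select_method(30, 100, 1000).name)  # DEFAULT
```

Note the boundary handling: a sample size equal to the first threshold uses the default information, and a sample size equal to the second threshold uses the batch import information, matching the inequalities in the text.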
As described above, by setting the first threshold and the second threshold, the calculation method for the role resource corresponding to a framework role is adapted to the value of the training sample size, so that the role resource corresponding to each framework role in each operator is optimal. Here, optimal means that the role resource corresponding to a framework role in an operator takes the smallest value for each parameter it includes while still satisfying the computing resources required to execute that operator.
It should be noted that, as described above, when the topology information for model training is determined based on the scene information, the training sample data, and the algorithm information: if the value of the training sample size does not affect the application of the algorithm indicated by the algorithm information (for example, the algorithm imports the training set in batches during the training phase, and the amount of data in each imported batch is less than the first threshold), the role resource corresponding to each framework role is obtained according to the default computing resource information in that framework role's information; if the value of the training sample size does affect the application of the algorithm, one of the three calculation methods above is selected by comparing the training sample size with the first threshold and the second threshold, and that method is used to calculate the role resource corresponding to the framework role.
The case where the training sample size is less than or equal to the first threshold can be understood as the training sample size being too small; the case where it is greater than or equal to the second threshold can be understood as the training sample size being too large. In the latter case, batch import (splitting the oversized training sample data into smaller training subsets and importing those subsets one by one) allows the allocated computing resources to be fully used while keeping the model training process running normally. For example, if the training sample data occupies 10 units of computer storage capacity, 1 unit of training samples is imported at a time into the operator corresponding to the framework role for execution, and this is repeated 10 times until all training samples have been imported and executed.
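The batch import example (10 units imported 1 unit at a time) can be sketched as a simple generator. The function name and the handling of a final partial batch are assumptions for illustration.

```python
def split_into_batches(total_units, batch_units):
    """Yield the size of each batch until all units have been imported."""
    if batch_units <= 0:
        raise ValueError("batch size must be positive")
    remaining = total_units
    while remaining > 0:
        size = min(batch_units, remaining)  # the last batch may be smaller
        yield size
        remaining -= size


batches = list(split_into_batches(10, 1))
print(len(batches))  # 10
```

With 10 units and a batch size of 1 unit, the operator is executed 10 times, as in the example above. If the total were not a multiple of the batch size, this sketch would import a smaller final batch.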
The role resource corresponding to a framework role includes but is not limited to memory, central processing unit (Central Processing Unit, CPU) resources, and graphics processing unit (Graphics Processing Unit, GPU) resources.
Optionally, the sample parameter information includes first basic computing resource data and a sample coefficient;
obtaining the role resource corresponding to the framework role according to the sample parameter information and the training sample size, when the training sample size is greater than the first threshold and less than the second threshold, includes:
when the training sample size is greater than the first threshold and less than the second threshold, obtaining the role resource corresponding to the framework role according to the following expression:
Figure PCTCN2022125905-appb-000005
where RR is the role resource corresponding to the framework role, BC₁ is the first basic computing resource data, CR is the sample coefficient, and DS is the training sample size.
Optionally, the batch import parameter information includes second basic computing resource data and the amount of sample data imported in a single batch;
obtaining the role resource corresponding to the framework role according to the batch import parameter information, when the training sample size is greater than or equal to the second threshold, includes:
when the training sample size is greater than or equal to the second threshold, obtaining the role resource corresponding to the framework role according to the following expression:
Figure PCTCN2022125905-appb-000006
where RR is the role resource corresponding to the framework role, BC₂ is the second basic computing resource data, and BS is the amount of sample data imported in a single batch.
It should be noted that the first basic computing resource data and the second basic computing resource data may be the same or different, and the user may adjust both values as needed; the embodiments of the present application do not limit this.
Exemplarily, the process of obtaining the role resource corresponding to the framework role according to the default computing resource information may be to use the default computing resource information directly as the role resource corresponding to the framework role. For example, if the default computing resource information is 2G of memory, a 1-core CPU, and a 1-core GPU, then the role resource obtained from it is likewise 2G of memory, a 1-core CPU, and a 1-core GPU.
Optionally, the resource framework information includes a framework identifier, and the framework role information includes a role identifier;
obtaining the target resource information for model training according to the operator resource information of each operator includes:
obtaining all role resources that share the same framework identifier and the same role identifier;
determining the maximum role resource from the obtained role resources;
obtaining the target resource information for model training based on the maximum role resource.
Exemplarily, the process of determining the maximum role resource from the obtained role resources may be as follows.
Suppose the topology information includes operator No. 1 and operator No. 2, both of which correspond to the PySpark distributed resource framework (the framework identifier), and the PySpark distributed resource framework includes the Driver resource framework role (a role identifier) and the Executor resource framework role (a role identifier).
If the role resource of the Driver resource framework role for operator No. 1 is 2G of memory, a 1-core CPU, and a 2-core GPU, and the role resource of the Driver resource framework role for operator No. 2 is 3G of memory, a 2-core CPU, and a 1-core GPU, then the recommended computing resource for the Driver resource framework role in the PySpark distributed resource framework is 3G of memory, a 2-core CPU, and a 2-core GPU (that is, the maximum role resource corresponding to the Driver role in the PySpark framework).
If the role resource of the Executor resource framework role for operator No. 1 is 1G of memory, a 1-core CPU, and a 1-core GPU, and the role resource of the Executor resource framework role for operator No. 2 is 2G of memory, a 2-core CPU, and a 2-core GPU, then, in the same way, the recommended computing resource for the Executor resource framework role in the PySpark distributed resource framework is 2G of memory, a 2-core CPU, and a 2-core GPU (that is, the maximum role resource corresponding to the Executor role in the PySpark framework).
In this case, the recommended computing resources for the PySpark distributed resource framework in the target resource information are:
Driver resource framework role: 3G of memory, a 2-core CPU, and a 2-core GPU;
Executor resource framework role: 2G of memory, a 2-core CPU, and a 2-core GPU.
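The example above takes the maximum of each resource dimension separately (the Driver result combines 3G of memory from operator No. 2 with the 2-core GPU from operator No. 1). A minimal sketch of that aggregation follows, using the numbers from the example; the tuple layout and function name are assumptions for illustration.

```python
from collections import defaultdict

# (framework id, role id, memory in G, CPU cores, GPU cores), from the example.
role_resources = [
    ("PySpark", "Driver", 2, 1, 2),    # operator No. 1
    ("PySpark", "Driver", 3, 2, 1),    # operator No. 2
    ("PySpark", "Executor", 1, 1, 1),  # operator No. 1
    ("PySpark", "Executor", 2, 2, 2),  # operator No. 2
]


def max_role_resources(rows):
    """Group by (framework id, role id) and take the maximum per dimension."""
    grouped = defaultdict(lambda: (0, 0, 0))
    for framework, role, *resources in rows:
        current = grouped[(framework, role)]
        grouped[(framework, role)] = tuple(
            max(a, b) for a, b in zip(current, resources)
        )
    return dict(grouped)


target = max_role_resources(role_resources)
print(target[("PySpark", "Driver")])    # (3, 2, 2)
print(target[("PySpark", "Executor")])  # (2, 2, 2)
```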
Exemplarily, suppose the training samples occupy 30M of computer storage capacity, the scene information of the model is classification, and the algorithm information for training the model is the perception network (Xception) among the TensorFlow2 distributed algorithms. The topology information obtained according to the scene information and the training sample data is shown in FIG. 2. It includes an image classification step, a pipeline (Pipeline) initialization step, a dataset splitting step, an Xception step, a multi-class evaluation Xception step, and a generate pipesline_Xception step.
Among these, the image classification step, the pipeline initialization step, the dataset splitting step, the perception-network-based multi-class evaluation step, and the step of generating the target perception network (that is, the trained perception network) all belong to the stand-alone resource framework, while the perception network training step belongs to the TensorFlow2 distributed resource framework. That is, the target resource information includes the framework computing resource information of the stand-alone resource framework and that of the TensorFlow2 distributed resource framework. It is assumed that each step corresponds to only one resource framework role, and that steps corresponding to the same resource framework correspond to the same resource framework role.
The role configuration information of the resource framework role included in the image classification step is shown in Table 1:
Figure PCTCN2022125905-appb-000007
Figure PCTCN2022125905-appb-000008
Table 1
The role configuration information of the resource framework role included in the pipeline initialization step is shown in Table 2:
Figure PCTCN2022125905-appb-000009
Table 2
The role configuration information of the resource framework role included in the dataset splitting step is shown in Table 3:
Figure PCTCN2022125905-appb-000010
Table 3
The role configuration information of the resource framework role included in the perception-network-based multi-class evaluation step is shown in Table 4:
Figure PCTCN2022125905-appb-000011
Figure PCTCN2022125905-appb-000012
Table 4
The role configuration information of the resource framework role included in the step of generating the target perception network is shown in Table 5:
Figure PCTCN2022125905-appb-000013
Figure PCTCN2022125905-appb-000014
Table 5
The role configuration information of the resource framework role included in the perception network training step is shown in Table 6:
Figure PCTCN2022125905-appb-000015
Table 6
As shown by the data in Tables 1 to 6, the target resource information is:
Stand-alone resource framework: 1-core CPU, 3G of memory, quantity 1;
TensorFlow2 distributed resource framework: 1-core CPU, 8G of memory, quantity 3.
As described above, to simplify the example, each step (operator) is assumed to be executable by a single framework role. A useDefaultFlag value of 1 means that the training sample size is less than or equal to the first threshold; a useBatchSizeRatioFlag value of 1 means that the training sample size is greater than or equal to the second threshold; and useDefaultFlag and useBatchSizeRatioFlag both being 0 means that the training sample size is greater than the first threshold and less than the second threshold.
The value obtained by dividing the parameter batchSize by the parameter batchSizeRatio can be understood as the aforementioned amount of sample data imported in a single batch, and the parameter baseCapacity can be understood as the aforementioned first basic computing resource data and second basic computing resource data.
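The meaning of these configuration parameters can be sketched as follows. The parameter names useDefaultFlag, useBatchSizeRatioFlag, batchSize, batchSizeRatio, and baseCapacity come from the text; the dictionary representation, the sample values, and the rule that the two flags cannot both be 1 are assumptions, since the contents of Tables 1 to 6 are not reproduced here.

```python
def interpret_role_config(config):
    """Translate the role configuration flags into the case they represent."""
    use_default = config["useDefaultFlag"] == 1
    use_batch = config["useBatchSizeRatioFlag"] == 1
    if use_default and use_batch:
        # Assumed: a sample size cannot be both too small and too large.
        raise ValueError("useDefaultFlag and useBatchSizeRatioFlag are both 1")
    if use_default:
        return {"case": "sample size <= first threshold"}
    if use_batch:
        return {
            "case": "sample size >= second threshold",
            # amount of sample data imported in a single batch
            "single_batch_size": config["batchSize"] / config["batchSizeRatio"],
            "base_capacity": config["baseCapacity"],
        }
    return {
        "case": "first threshold < sample size < second threshold",
        "base_capacity": config["baseCapacity"],
    }


# Hypothetical values, not taken from the tables.
example = {
    "useDefaultFlag": 0,
    "useBatchSizeRatioFlag": 1,
    "batchSize": 64,
    "batchSizeRatio": 2,
    "baseCapacity": 1,
}
print(interpret_role_config(example)["single_batch_size"])  # 32.0
```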
As described above, the target resource information can be understood as multiple resource frameworks, the framework roles included in each resource framework, and the maximum role resource corresponding to each framework role.
Please refer to FIG. 3, which is a schematic structural diagram of a computing resource acquisition apparatus 200 provided by an embodiment of the present application. As shown in FIG. 3, the computing resource acquisition apparatus 200 includes:
a topology acquisition module 201, configured to determine the topology information for model training according to the scene information and the training sample data;
an operator acquisition module 202, configured to obtain, according to the topology information, the operator resource information of each operator included in the topology information;
a resource acquisition module 203, configured to obtain the target resource information for model training according to the operator resource information of each operator.
Optionally, the operator acquisition module 202 includes:
a framework acquisition submodule, configured to determine, according to the topology information, the resource framework information corresponding to each operator;
an operator acquisition submodule, configured to calculate the operator resource information of each operator according to the resource framework information corresponding to that operator.
Optionally, the operator acquisition submodule includes:
a role acquisition unit, configured to determine at least one piece of framework role information corresponding to each operator according to the resource framework information corresponding to that operator;
a calculation unit, configured to calculate the operator resource information of each operator according to the at least one piece of framework role information corresponding to that operator.
Optionally, the calculation unit includes:
a first calculation subunit, configured to determine the resource calculation method of each operator according to the training sample size of the training sample data;
a second calculation subunit, configured to calculate, according to the at least one piece of framework role information corresponding to the operator and using the resource calculation method, the role resource corresponding to each framework role included in the operator;
a third calculation subunit, configured to obtain the operator resource information of the operator according to the role resources corresponding to the framework roles included in the operator.
Optionally, the framework role information includes default computing resource information, sample parameter information, and batch import parameter information, and the second calculation subunit is configured to:
when the training sample size is less than or equal to the first threshold, obtain the role resource corresponding to the framework role according to the default computing resource information;
when the training sample size is greater than the first threshold and less than the second threshold, obtain the role resource corresponding to the framework role according to the sample parameter information and the training sample size;
when the training sample size is greater than or equal to the second threshold, obtain the role resource corresponding to the framework role according to the batch import parameter information.
Optionally, the resource framework information includes a framework identifier, and the framework role information includes a role identifier;
the resource acquisition module 203 is configured to:
obtain all role resources that share the same framework identifier and the same role identifier;
determine the maximum role resource from the obtained role resources;
obtain the target resource information for model training based on the maximum role resource.
Optionally, the sample parameter information includes first basic computing resource data and a sample coefficient;
the second calculation subunit is configured to:
when the training sample size is greater than the first threshold and less than the second threshold, obtain the role resource corresponding to the framework role according to the following expression:
Figure PCTCN2022125905-appb-000016
where RR is the role resource corresponding to the framework role, BC₁ is the first basic computing resource data, CR is the sample coefficient, and DS is the training sample size.
可选的,所述批量导入参数信息包括第二基础计算资源数据和单批导入的样本数据量;Optionally, the batch import parameter information includes the second basic computing resource data and the amount of sample data imported in a single batch;
所述第二计算子单元包括:The second computing subunit includes:
在所述训练样本量大于或等于第二阈值的情况下,根据如下表达式获得所述框架角色对应的角色资源:In the case that the training sample size is greater than or equal to the second threshold, the role resource corresponding to the framework role is obtained according to the following expression:
Figure PCTCN2022125905-appb-000017
Figure PCTCN2022125905-appb-000017
其中,RR为所述框架角色对应的角色资源,BC 2为所述第二基础计算资源数据,BS为所述单批导入的样本数据量。 Wherein, RR is the role resource corresponding to the framework role, BC 2 is the second basic computing resource data, and BS is the amount of sample data imported in a single batch.
Optionally, the topology acquisition module 201 is configured to:
search an information base, using the training sample size of the training sample data and the scene information as retrieval conditions, to obtain the topology information;
or,
input the training sample size and the scene information into a pre-acquired workflow calculation model to obtain the topology information.
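The two ways of obtaining the topology information can be illustrated with the sketch below. Everything concrete in it is an assumption made for illustration: the text does not describe how the information base is organized, so `TopologyEntry` stores a scene name plus a half-open sample-size range, and the workflow calculation model is represented only as a callable taking the sample size and the scene. The text presents retrieval and the model as alternatives; using the model as a fallback when retrieval finds nothing is a design choice of this sketch, not something the text states.

```python
from typing import Callable, List, NamedTuple, Optional, Sequence


class TopologyEntry(NamedTuple):
    """Hypothetical information-base record."""
    scene: str
    min_samples: int      # inclusive lower bound of the sample-size range
    max_samples: int      # exclusive upper bound of the sample-size range
    topology: List[str]   # ordered operator names


def lookup_topology(info_base: Sequence[TopologyEntry],
                    scene: str,
                    sample_size: int) -> Optional[List[str]]:
    """Retrieve a topology using scene and sample size as search conditions."""
    for entry in info_base:
        if entry.scene == scene and entry.min_samples <= sample_size < entry.max_samples:
            return entry.topology
    return None


def get_topology(info_base: Sequence[TopologyEntry],
                 scene: str,
                 sample_size: int,
                 model: Optional[Callable[[int, str], List[str]]] = None) -> List[str]:
    """Try retrieval first; fall back to the model if one is supplied."""
    topology = lookup_topology(info_base, scene, sample_size)
    if topology is None and model is not None:
        topology = model(sample_size, scene)
    if topology is None:
        raise LookupError(f"no topology for scene={scene!r}, sample_size={sample_size}")
    return topology
```

A caller that only wants one of the two options can call `lookup_topology` directly or pass an empty information base together with a model.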
It should be noted that the computing resource acquisition apparatus 200 in the embodiments of the present application may be a standalone apparatus, or may be a component, an integrated circuit, or a chip in an electronic device.
Please refer to FIG. 4, which is a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in FIG. 4, the electronic device includes a bus 301, a transceiver 302, an antenna 303, a bus interface 304, a processor 305, and a memory 306. The processor 305 can implement each process of the foregoing computing resource acquisition method embodiments and achieve the same technical effects; to avoid repetition, details are not repeated here.
In FIG. 4, the bus architecture is represented by the bus 301. The bus 301 may include any number of interconnected buses and bridges, and links together various circuits including one or more processors, represented by the processor 305, and memory, represented by the memory 306. The bus 301 may also link together various other circuits such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and are therefore not described further herein. The bus interface 304 provides an interface between the bus 301 and the transceiver 302. The transceiver 302 may be a single element or multiple elements, such as multiple receivers and transmitters, and provides a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor 305 is transmitted over a wireless medium through the antenna 303; the antenna 303 also receives data and passes it to the processor 305.
The processor 305 is responsible for managing the bus 301 and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 306 may be used to store data used by the processor 305 when performing operations.
Optionally, the processor 305 may be a CPU, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a complex programmable logic device (CPLD).
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the foregoing method embodiments and achieves the same technical effects; to avoid repetition, details are not repeated here. The computer-readable storage medium is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
An embodiment of the present application further provides a computer program product. The computer program product includes a computer program stored on a readable storage medium, and the computer program includes program instructions. When executed by a computer, the program instructions implement each process of the foregoing computing resource acquisition method embodiments and achieve the same technical effects; to avoid repetition, details are not repeated here.
It should be noted that, in this document, the terms "comprise", "include", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes that element.
From the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, and may of course also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the related art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a second terminal device, etc.) to execute the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific implementations described above, which are merely illustrative and not restrictive. Inspired by the present application, those of ordinary skill in the art may devise many other forms without departing from the purpose of the present application and the scope protected by the claims, all of which fall within the protection of the present application.

Claims (19)

  1. A computing resource acquisition method, comprising:
    determining topology information for model training according to scene information and training sample data;
    obtaining, according to the topology information, operator resource information of each operator included in the topology information;
    obtaining target resource information for model training according to the operator resource information of each operator;
    wherein the obtaining, according to the topology information, operator resource information of each operator included in the topology information comprises:
    determining resource framework information corresponding to each operator according to the topology information, wherein a resource framework is a component library including resource framework roles;
    calculating the operator resource information of each operator according to the resource framework information corresponding to each operator.
  2. The method according to claim 1, wherein the calculating the operator resource information of each operator according to the resource framework information corresponding to each operator comprises:
    determining at least one piece of framework role information corresponding to each operator according to the resource framework information corresponding to each operator;
    calculating the operator resource information of each operator according to the at least one piece of framework role information corresponding to each operator.
  3. The method according to claim 2, wherein the calculating the operator resource information of each operator according to the at least one piece of framework role information corresponding to each operator comprises:
    determining a resource calculation method of each operator according to the training sample size of the training sample data;
    calculating, according to the at least one piece of framework role information corresponding to an operator and using the resource calculation method, the role resource corresponding to the framework role included in the operator;
    obtaining the operator resource information of the operator according to the role resource corresponding to the framework role included in the operator.
  4. The method according to claim 3, wherein the calculating, according to the at least one piece of framework role information corresponding to the operator and using the operator calculation method, the role resource corresponding to the framework role included in the operator comprises:
    the framework role information including default computing resource information, sample parameter information, and batch import parameter information;
    in a case where the training sample size is less than or equal to a first threshold, obtaining the role resource corresponding to the framework role according to the default computing resource information;
    in a case where the training sample size is greater than the first threshold and less than a second threshold, obtaining the role resource corresponding to the framework role according to the sample parameter information and the training sample size;
    in a case where the training sample size is greater than or equal to the second threshold, obtaining the role resource corresponding to the framework role according to the batch import parameter information.
  5. The method according to claim 3, wherein the resource framework information includes a framework identifier, and the framework role information includes a role identifier;
    the obtaining target resource information for model training according to the operator resource information of each operator comprises:
    obtaining all role resources whose framework identifier and role identifier are both the same;
    determining the maximum role resource among all the obtained role resources;
    obtaining the target resource information for model training based on the maximum role resource.
  6. The method according to claim 4, wherein the sample parameter information includes first basic computing resource data and a sample coefficient;
    the obtaining, in a case where the training sample size is greater than the first threshold and less than the second threshold, the role resource corresponding to the framework role according to the sample parameter information and the training sample size comprises:
    in a case where the training sample size is greater than the first threshold and less than the second threshold, obtaining the role resource corresponding to the framework role according to the following expression:
    Figure PCTCN2022125905-appb-100001
    where RR is the role resource corresponding to the framework role, BC₁ is the first basic computing resource data, CR is the sample coefficient, and DS is the training sample size.
  7. The method according to claim 4, wherein the batch import parameter information includes second basic computing resource data and the amount of sample data imported in a single batch;
    the obtaining, in a case where the training sample size is greater than or equal to the second threshold, the role resource corresponding to the framework role according to the batch import parameter information comprises:
    in a case where the training sample size is greater than or equal to the second threshold, obtaining the role resource corresponding to the framework role according to the following expression:
    Figure PCTCN2022125905-appb-100002
    where RR is the role resource corresponding to the framework role, BC₂ is the second basic computing resource data, and BS is the amount of sample data imported in a single batch.
  8. The method according to claim 1, wherein the determining topology information for model training according to scene information and training sample data comprises:
    searching an information base, using the training sample size of the training sample data and the scene information as retrieval conditions, to obtain the topology information;
    or,
    inputting the training sample size and the scene information into a pre-acquired workflow calculation model to obtain the topology information.
  9. A computing resource acquisition apparatus, comprising:
    a topology acquisition module, configured to determine topology information for model training according to scene information and training sample data;
    an operator acquisition module, configured to obtain, according to the topology information, operator resource information of each operator included in the topology information;
    a resource acquisition module, configured to obtain target resource information for model training according to the operator resource information of each operator;
    wherein the operator acquisition module comprises:
    a framework acquisition submodule, configured to determine resource framework information corresponding to each operator according to the topology information, wherein a resource framework is a component library including resource framework roles;
    an operator acquisition submodule, configured to calculate the operator resource information of each operator according to the resource framework information corresponding to each operator.
  10. The apparatus according to claim 9, wherein the operator acquisition submodule comprises:
    a role acquisition unit, configured to determine at least one piece of framework role information corresponding to each operator according to the resource framework information corresponding to each operator;
    a calculation unit, configured to calculate the operator resource information of each operator according to the at least one piece of framework role information corresponding to each operator.
  11. The apparatus according to claim 10, wherein the calculation unit comprises:
    a first calculation subunit, configured to determine a resource calculation method of each operator according to the training sample size of the training sample data;
    a second calculation subunit, configured to calculate, according to the at least one piece of framework role information corresponding to an operator and using the resource calculation method, the role resource corresponding to the framework role included in the operator;
    a third calculation subunit, configured to obtain the operator resource information of the operator according to the role resource corresponding to the framework role included in the operator.
  12. The apparatus according to claim 11, wherein the framework role information includes default computing resource information, sample parameter information, and batch import parameter information, and the second calculation subunit is configured to:
    in a case where the training sample size is less than or equal to a first threshold, obtain the role resource corresponding to the framework role according to the default computing resource information;
    in a case where the training sample size is greater than the first threshold and less than a second threshold, obtain the role resource corresponding to the framework role according to the sample parameter information and the training sample size;
    in a case where the training sample size is greater than or equal to the second threshold, obtain the role resource corresponding to the framework role according to the batch import parameter information.
  13. The apparatus according to claim 11, wherein the resource framework information includes a framework identifier, and the framework role information includes a role identifier;
    the resource acquisition module is configured to:
    obtain all role resources whose framework identifier and role identifier are both the same;
    determine the maximum role resource among all the obtained role resources;
    obtain the target resource information for model training based on the maximum role resource.
  14. The apparatus according to claim 12, wherein the sample parameter information includes first basic computing resource data and a sample coefficient;
    the second calculation subunit is configured to:
    in a case where the training sample size is greater than the first threshold and less than the second threshold, obtain the role resource corresponding to the framework role according to the following expression:
    Figure PCTCN2022125905-appb-100003
    where RR is the role resource corresponding to the framework role, BC₁ is the first basic computing resource data, CR is the sample coefficient, and DS is the training sample size.
  15. The apparatus according to claim 12, wherein the batch import parameter information includes second basic computing resource data and the amount of sample data imported in a single batch;
    the second calculation subunit is configured to:
    in a case where the training sample size is greater than or equal to the second threshold, obtain the role resource corresponding to the framework role according to the following expression:
    Figure PCTCN2022125905-appb-100004
    where RR is the role resource corresponding to the framework role, BC₂ is the second basic computing resource data, and BS is the amount of sample data imported in a single batch.
  16. The apparatus according to claim 9, wherein the topology acquisition module is configured to:
    search an information base, using the training sample size of the training sample data and the scene information as retrieval conditions, to obtain the topology information;
    or,
    input the training sample size and the scene information into a pre-acquired workflow calculation model to obtain the topology information.
  17. An electronic device, comprising a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the method according to any one of claims 1 to 8.
  18. A readable storage medium storing a program or instructions, wherein the program or instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 8.
  19. A computer program product, comprising a computer program stored on a readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 8.
PCT/CN2022/125905 2021-11-25 2022-10-18 Computing resource acquisition method and apparatus, electronic device, and storage medium WO2023093375A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111411238.XA CN114091688B (en) 2021-11-25 2021-11-25 Computing resource obtaining method and device, electronic equipment and storage medium
CN202111411238.X 2021-11-25

Publications (1)

Publication Number Publication Date
WO2023093375A1 (en) 2023-06-01

Family

ID=80304371

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/125905 WO2023093375A1 (en) 2021-11-25 2022-10-18 Computing resource acquisition method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114091688B (en)
WO (1) WO2023093375A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091688B (en) * 2021-11-25 2022-05-20 北京九章云极科技有限公司 Computing resource obtaining method and device, electronic equipment and storage medium
CN114898175B (en) * 2022-04-29 2023-03-28 北京九章云极科技有限公司 Target detection method, device and related equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104052811A (en) * 2014-06-17 2014-09-17 华为技术有限公司 Service scheduling method and device and system
CN111930524A (en) * 2020-10-10 2020-11-13 上海兴容信息技术有限公司 Method and system for distributing computing resources
WO2021040584A1 (en) * 2019-08-26 2021-03-04 Telefonaktiebolaget Lm Ericsson (Publ) Entity and method performed therein for handling computational resources
CN112753016A (en) * 2018-09-30 2021-05-04 华为技术有限公司 Management method and device for computing resources in data preprocessing stage in neural network
CN113467922A (en) * 2020-03-30 2021-10-01 阿里巴巴集团控股有限公司 Resource management method, device, equipment and storage medium
CN114091688A (en) * 2021-11-25 2022-02-25 北京九章云极科技有限公司 Computing resource obtaining method and device, electronic equipment and storage medium

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512162B (en) * 2015-09-28 2019-04-16 杭州圆橙科技有限公司 A kind of flow data real-time intelligentization processing frame based on Storm
CN107480717A (en) * 2017-08-16 2017-12-15 北京奇虎科技有限公司 Train job processing method and system, computing device, computer-readable storage medium
CN110018817A (en) * 2018-01-05 2019-07-16 中兴通讯股份有限公司 The distributed operation method and device of data, storage medium and processor
CN108510081A (en) * 2018-03-23 2018-09-07 北京京东尚科信息技术有限公司 machine learning method and platform
CN108665072A (en) * 2018-05-23 2018-10-16 中国电力科学研究院有限公司 A kind of machine learning algorithm overall process training method and system based on cloud framework
CN108874487B (en) * 2018-06-13 2020-01-10 北京九章云极科技有限公司 Data analysis processing method, system, device and storage medium based on workflow
CN109298940B (en) * 2018-09-28 2019-12-31 考拉征信服务有限公司 Computing task allocation method and device, electronic equipment and computer storage medium
CN111435315A (en) * 2019-01-14 2020-07-21 北京沃东天骏信息技术有限公司 Method, apparatus, device and computer readable medium for allocating resources
US11620510B2 (en) * 2019-01-23 2023-04-04 Samsung Electronics Co., Ltd. Platform for concurrent execution of GPU operations
CN109933306B (en) * 2019-02-11 2020-07-14 山东大学 Self-adaptive hybrid cloud computing framework generation method based on operation type recognition
CN110618870B (en) * 2019-09-20 2021-11-19 广东浪潮大数据研究有限公司 Working method and device for deep learning training task
CN110889492B (en) * 2019-11-25 2022-03-08 北京百度网讯科技有限公司 Method and apparatus for training deep learning models
CN111104214B (en) * 2019-12-26 2020-12-15 北京九章云极科技有限公司 Workflow application method and device
CN111190741B (en) * 2020-01-03 2023-05-12 深圳鲲云信息科技有限公司 Scheduling method, equipment and storage medium based on deep learning node calculation
CN111222046B (en) * 2020-01-03 2022-09-20 腾讯科技(深圳)有限公司 Service configuration method, client for service configuration, equipment and electronic equipment
CN113095474A (en) * 2020-01-09 2021-07-09 微软技术许可有限责任公司 Resource usage prediction for deep learning models
CN111309479B (en) * 2020-02-14 2023-06-06 北京百度网讯科技有限公司 Method, device, equipment and medium for realizing task parallel processing
CN111444019B (en) * 2020-03-31 2024-01-26 中国科学院自动化研究所 Cloud collaborative deep learning model distributed training method and system
CN111611240A (en) * 2020-04-17 2020-09-01 第四范式(北京)技术有限公司 Method, apparatus and device for executing automatic machine learning process
CN111611087B (en) * 2020-06-30 2023-03-03 中国人民解放军国防科技大学 Resource scheduling method, device and system
CN112799850A (en) * 2021-02-26 2021-05-14 重庆度小满优扬科技有限公司 Model training method, model prediction method, and model control system
CN113065843A (en) * 2021-03-15 2021-07-02 腾讯科技(深圳)有限公司 Model processing method and device, electronic equipment and storage medium
CN112882696B (en) * 2021-03-24 2024-02-02 国家超级计算天津中心 Full-element model training system based on supercomputer
CN113569987A (en) * 2021-08-19 2021-10-29 北京沃东天骏信息技术有限公司 Model training method and device

Also Published As

Publication number Publication date
CN114091688B (en) 2022-05-20
CN114091688A (en) 2022-02-25

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22897458

Country of ref document: EP

Kind code of ref document: A1