CN109086134A - Method and device for running deep learning jobs - Google Patents
Method and device for running deep learning jobs
- Publication number
- CN109086134A CN109086134A CN201810793520.0A CN201810793520A CN109086134A CN 109086134 A CN109086134 A CN 109086134A CN 201810793520 A CN201810793520 A CN 201810793520A CN 109086134 A CN109086134 A CN 109086134A
- Authority
- CN
- China
- Prior art keywords
- deep learning
- compute node
- docker
- image
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a method and device for running deep learning jobs. The method comprises: receiving a user's selection of the resources required to run a deep learning job and of the Docker image used to submit the job; scheduling the deep learning job according to the usage and load of the compute nodes; when the job is scheduled onto compute nodes, pushing the user-selected Docker image from an image repository and creating a Docker container on each compute node in the cluster; and mapping the hardware resources that the compute nodes allocate to the job into the Docker image, then running the job using the mapped hardware resources and the Docker containers. In this way, the time and effort a user spends running deep learning jobs on a cluster are reduced.
Description
Technical field
The present invention relates to computer clusters, and in particular to a method and device for running deep learning jobs.
Background technique
The concept of deep learning originates from research on artificial neural networks: a multilayer perceptron with several hidden layers is one kind of deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of data. Mainstream deep learning frameworks today include TensorFlow, Caffe, PyTorch, and MXNet. A cluster is a group of computer systems interconnected by a high-performance network or local area network that presents the image of a single system; such computer clusters offer high performance, good scalability, and a high performance-to-price ratio. As cluster systems are widely applied in scientific computing, commercial operations, and other fields, the role they play grows ever more important, and they have become indispensable tools in those fields.
When a cluster is applied to deep learning, a large amount of computation must be performed, so the cluster system needs many compute nodes to supply abundant hardware resources (for example, GPU (Graphics Processing Unit) resources). However, because a cluster contains a large number of nodes, it is difficult to schedule their hardware resources in a unified way: utilization of the cluster's hardware resources is low, and scheduling the nodes' hardware resources costs users a great deal of time and effort. In addition, different deep learning frameworks have different framework dependencies, so before model training a user must configure a different training environment for each framework, which also consumes considerable time and effort.
Summary of the invention
To solve the above technical problems, the present invention provides a method and device for running deep learning jobs that can reduce the time and effort a user spends running deep learning jobs on a cluster.
To achieve this goal, in one aspect, an embodiment of the invention provides a method for running deep learning jobs, comprising:
receiving a user's selection of the resources required to run a deep learning job and of the Docker image used to submit the deep learning job;
scheduling the deep learning job according to the usage and load of the compute nodes;
when the deep learning job is scheduled onto compute nodes, pushing the user-selected Docker image from an image repository and creating a Docker container on each compute node in the cluster;
mapping the hardware resources that the compute nodes allocate to the deep learning job into the Docker image, and running the deep learning job using the mapped hardware resources and the Docker containers.
Further, in an optional embodiment, the required resources include: the CPU resources, GPU resources, framework type, and queue information used for training the deep learning task.
Further, in an optional embodiment, the compute nodes and the management node in the cluster share stored files via the Network File System (NFS);
after the step of running the deep learning job using the mapped hardware resources and the Docker containers, the method further includes:
storing the model file produced by training the deep learning task on a compute node, so that the compute node shares the model file with the management node.
Further, in an optional embodiment, after the step of creating a Docker container on each compute node in the cluster, the method further includes:
configuring the cluster using the overlay network tool flannel.
To achieve the above goal, in another aspect, an embodiment of the invention provides a device for running deep learning jobs, comprising: a user selection receiving module, a job scheduling module, a container creation module, and a job running module; wherein,
the user selection receiving module is configured to receive a user's selection of the resources required to run a deep learning job and of the Docker image used to submit the deep learning job;
the job scheduling module is configured to schedule the deep learning job according to the usage and load of the compute nodes;
the container creation module is configured to, when the deep learning job is scheduled onto compute nodes, push the user-selected Docker image from the image repository and create a Docker container on each compute node in the cluster;
the job running module is configured to map the hardware resources that the compute nodes allocate to the deep learning job into the Docker image, and to run the deep learning job using the mapped hardware resources and the Docker containers.
Further, in an optional embodiment, the required resources include: the CPU resources, GPU resources, framework type, and queue information used for training the deep learning task.
Further, in an optional embodiment, the compute nodes and the management node in the cluster share stored files via the Network File System (NFS);
the device further includes a model file storage module configured to: after the job running module runs the deep learning job using the mapped hardware resources and the Docker containers, store the model file produced by training the deep learning task on a compute node, so that the compute node shares the model file with the management node.
Further, in an optional embodiment, the device further includes a cluster configuration module configured to: after the container creation module creates a Docker container on each compute node in the cluster, configure the cluster using the overlay network tool flannel.
The beneficial effect of the embodiments of the invention is that the user can select, through a client, the hardware resources required to run a deep learning job, and scheduler software dynamically allocates the CPU and GPU resources among them. This ensures high utilization of the cluster's hardware resources and reduces the time and effort the user spends scheduling them. Different deep learning frameworks can run conveniently and efficiently on the whole cluster, sparing the user from configuring a separate framework environment for each framework; and because the jobs run in Docker containers at the bottom layer, dependency conflicts between different frameworks are avoided, reducing the time and effort the user spends configuring environments.
Other features and advantages of the invention will be set forth in the description that follows and will in part become apparent from the description or be understood by practicing the invention. The objectives and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Brief description of the drawings
The accompanying drawings provide a further understanding of the technical solution of the invention and form part of the specification; together with the embodiments of the application they serve to explain the technical solution and do not limit it.
Fig. 1 is a flowchart of a method for running deep learning jobs provided by an embodiment of the invention;
Fig. 2 is a block diagram of a device for running deep learning jobs provided by an embodiment of the invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the invention clearer, embodiments of the invention are described in detail below with reference to the accompanying drawings. Note that, provided there is no conflict, the embodiments in this application and the features within them may be combined with one another in any way.
The steps shown in the flowchart of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one given here.
In one aspect, an embodiment of the invention provides a method for running deep learning jobs. As shown in Fig. 1, the method includes steps S101-S107.
Step S101: receive a user's selection of the resources required to run a deep learning job and of the Docker image used to submit the deep learning job.
Here, the user selects, through a web page of the client, the resources required to run the deep learning job and the job's Docker image, and may also select or enter a training script. The client uses the B/S (Browser/Server) architecture, a network structure that became common with the rise of the web, in which the web browser is the main application software on the client side. This mode unifies the clients and concentrates the core of the system's functionality on the server, simplifying development, maintenance, and use: the client only needs a browser installed, such as Netscape Navigator or Internet Explorer, while the server runs a database such as SQL Server, Oracle, or MySQL, and the browser exchanges data with the database through a web server.
After this, the client sends a request to the management node in the cluster; the request may be an HTTP (HyperText Transfer Protocol) request. Upon receiving the request, the management node forwards it to the scheduler software.
Step S103: schedule the deep learning job according to the usage and load of the compute nodes.
Here, the resource management software TORQUE, coordinating with the job scheduler Maui, schedules the deep learning job according to each compute node's own usage and load, assigning the job to compute nodes; each compute node provides the hardware resources needed to run the job.
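As a generic illustration of load-aware dispatch (this is a sketch of the idea of assigning work to the least-loaded nodes, not Maui's actual scheduling algorithm):

```python
def pick_nodes(nodes, gpus_needed):
    """Greedily assign a job to the nodes with the most free GPUs,
    breaking ties by lower load. `nodes` maps node name -> (free_gpus, load).
    Returns a list of (node, gpus_taken) pairs."""
    # Rank nodes: most free GPUs first, then lowest load.
    ranked = sorted(nodes.items(), key=lambda kv: (-kv[1][0], kv[1][1]))
    chosen, remaining = [], gpus_needed
    for name, (free, _load) in ranked:
        if remaining <= 0:
            break
        if free > 0:
            take = min(free, remaining)
            chosen.append((name, take))
            remaining -= take
    if remaining > 0:
        raise RuntimeError("not enough free GPUs in the cluster")
    return chosen

# Example cluster state: node name -> (free GPUs, current load)
cluster_state = {"node1": (2, 0.9), "node2": (4, 0.3), "node3": (1, 0.1)}
placement = pick_nodes(cluster_state, 5)  # spreads the job over node2 and node1
```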
Step S105: when the deep learning job is scheduled onto compute nodes, push the user-selected Docker image from the image repository to each compute node in the cluster, and create a Docker container on each of those compute nodes.
Here, the user-selected Docker image is pushed to each compute node so that a Docker container can be created on every compute node that will execute the deep learning job.
Step S107: map the hardware resources that the compute nodes allocate to the deep learning job into the Docker image, and run the job using the mapped hardware resources and the Docker containers.
Docker is an open-source application container engine that lets developers package an application (here, a deep learning job) and its dependencies into a portable container, which can then be deployed on any popular Linux machine; it can also provide virtualization. Containers use a sandbox mechanism and expose no interfaces to one another. Because Docker does not depend on any particular language, framework, or system, running deep learning jobs in Docker at the bottom layer avoids conflicts between the framework dependencies (dependency packages) of different deep learning frameworks.
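Here is a sketch of the container launch implied by steps S105-S107, expressed as a `docker run` argument list that maps allocated GPU devices and the job's working directory into the container. The image name, device paths, and mount points are assumptions, and real deployments would more likely use nvidia-docker or Docker's `--gpus` flag than raw device mappings.

```python
def docker_run_args(image, gpu_ids, workdir):
    """Build an argument list for `docker run` that maps the GPUs
    allocated to the job and its working directory into the container."""
    args = ["docker", "run", "-d"]
    for i in gpu_ids:
        # map each allocated GPU device node into the container
        args += ["--device", f"/dev/nvidia{i}:/dev/nvidia{i}"]
    # control and unified-memory devices needed alongside the per-GPU nodes
    args += ["--device", "/dev/nvidiactl", "--device", "/dev/nvidia-uvm"]
    # map the job's working directory on the Docker host into the container
    args += ["-v", f"{workdir}:/workspace", "-w", "/workspace", image]
    return args

cmd = docker_run_args("registry.example.com/dl/tensorflow:1.8",
                      [0, 1], "/home/user/job42")
```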
The beneficial effect of the embodiments of the invention is that the user can select, through a client, the hardware resources required to run a deep learning job, and scheduler software dynamically allocates the CPU and GPU resources among them. This ensures high utilization of the cluster's hardware resources and reduces the time and effort the user spends scheduling them. Different deep learning frameworks can run conveniently and efficiently on the whole cluster, sparing the user from configuring a separate framework environment for each framework; and because the jobs run in Docker containers at the bottom layer, dependency conflicts between different frameworks are avoided, reducing the time and effort the user spends configuring environments.
Further, in an optional embodiment, the required resources include: the CPU resources, GPU resources, framework type, and queue information used for training the deep learning task.
Further, in an optional embodiment, the compute nodes and the management node in the cluster share stored files via NFS (Network File System). NFS is one of the file systems supported by FreeBSD; it allows computers on a network to share resources over a TCP/IP network.
After step S107, the method further includes: storing the model file produced by training the deep learning task on a compute node, so that the compute node shares the model file with the management node; the user can then obtain the model file from the management node.
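The sharing step can be sketched as writing the trained model under a directory that is NFS-mounted on both the compute nodes and the management node; the mount point below is an assumption for illustration.

```python
import os

def save_model_to_shared_dir(model_bytes, job_id, shared_root="/mnt/nfs_share"):
    """Store a trained model file under an NFS-shared directory so that
    the management node, which mounts the same export, can read it.

    `shared_root` is a hypothetical mount point, not a path from the patent.
    """
    job_dir = os.path.join(shared_root, job_id)
    os.makedirs(job_dir, exist_ok=True)  # one subdirectory per job
    path = os.path.join(job_dir, "model.bin")
    with open(path, "wb") as f:
        f.write(model_bytes)
    return path
```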
Further, in one embodiment, after step S105 the method further includes: configuring the cluster using the overlay network tool flannel.
When Docker containers are created on the compute nodes, the nature of Docker containers means that containers on two different compute nodes cannot communicate with each other directly. The cluster is therefore configured by deploying the overlay network tool flannel, which plans the IP addresses of the Docker containers, so that containers on different compute nodes can communicate. The working directory is mapped into the compute node acting as the Docker host, the GPU resource mapping is set up, and the GPU runtime environment is configured.
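flannel takes its network configuration from a key-value store and carves a per-node subnet out of a cluster-wide address range, which is what gives containers on different nodes non-overlapping, routable IPs. A sketch of such a configuration follows; the subnet range and backend type are illustrative choices, not values from the patent.

```python
import json

# Illustrative flannel network configuration: one overlay range for the
# whole cluster, from which each compute node's Docker bridge is given
# its own /24 slice, so container IPs never collide across nodes.
flannel_config = {
    "Network": "10.5.0.0/16",      # cluster-wide overlay address range
    "SubnetLen": 24,               # size of each node's subnet
    "Backend": {"Type": "vxlan"},  # encapsulation used between nodes
}

config_json = json.dumps(flannel_config)
# This JSON is conventionally stored under flannel's etcd config key
# (/coreos.com/network/config) before flanneld is started on each node.
```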
In another aspect, an embodiment of the invention provides a device for running deep learning jobs. As shown in Fig. 2, the device includes: a user selection receiving module 201, a job scheduling module 203, a container creation module 205, and a job running module 207; wherein,
the user selection receiving module 201 is configured to receive a user's selection of the resources required to run a deep learning job and of the Docker image used to submit the deep learning job;
the job scheduling module 203 is configured to schedule the deep learning job according to the usage and load of the compute nodes;
the container creation module 205 is configured to, when the deep learning job is scheduled onto compute nodes, push the user-selected Docker image from the image repository and create a Docker container on each compute node in the cluster;
the job running module 207 is configured to map the hardware resources that the compute nodes allocate to the deep learning job into the Docker image, and to run the deep learning job using the mapped hardware resources and the Docker containers.
The beneficial effect of the embodiments of the invention is that the user can select, through a client, the hardware resources required to run a deep learning job, and scheduler software dynamically allocates the CPU and GPU resources among them. This ensures high utilization of the cluster's hardware resources and reduces the time and effort the user spends scheduling them. Different deep learning frameworks can run conveniently and efficiently on the whole cluster, sparing the user from configuring a separate framework environment for each framework; and because the jobs run in Docker containers at the bottom layer, dependency conflicts between different frameworks are avoided, reducing the time and effort the user spends configuring environments.
Further, in an optional embodiment, the required resources include: the CPU resources, GPU resources, framework type, and queue information used for training the deep learning task.
Further, in an optional embodiment, the compute nodes and the management node in the cluster share stored files via the Network File System (NFS);
the device further includes a model file storage module configured to: after the job running module 207 runs the deep learning job using the mapped hardware resources and the Docker containers, store the model file produced by training the deep learning task on a compute node, so that the compute node shares the model file with the management node.
Further, in an optional embodiment, the device further includes a cluster configuration module configured to: after the container creation module creates a Docker container on each compute node in the cluster, configure the cluster using the overlay network tool flannel.
Although the embodiments disclosed herein are as described above, their content is offered only as an aid to understanding the invention and is not intended to limit it. Any person skilled in the art to which the invention pertains may make modifications and variations in the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention shall still be subject to the scope defined by the appended claims.
Claims (8)
1. A method for running deep learning jobs, characterized by comprising:
receiving a user's selection of the resources required to run a deep learning job and of the Docker image used to submit the deep learning job;
scheduling the deep learning job according to the usage and load of compute nodes;
when the deep learning job is scheduled onto compute nodes, pushing the user-selected Docker image from an image repository and creating a Docker container on each compute node in the cluster;
mapping the hardware resources that the compute nodes allocate to the deep learning job into the Docker image, and running the deep learning job using the mapped hardware resources and the Docker containers.
2. The method according to claim 1, wherein the required resources include: the CPU resources, GPU resources, framework type, and queue information used for training the deep learning task.
3. The method according to claim 1, wherein the compute nodes and a management node in the cluster share stored files via the Network File System (NFS);
after the step of running the deep learning job using the mapped hardware resources and the Docker containers, the method further comprises:
storing the model file produced by training the deep learning task on the compute node, so that the compute node shares the model file with the management node.
4. The method according to claim 1, wherein after the step of creating a Docker container on each compute node in the cluster, the method further comprises:
configuring the cluster using the overlay network tool flannel.
5. A device for running deep learning jobs, characterized by comprising: a user selection receiving module, a job scheduling module, a container creation module, and a job running module; wherein,
the user selection receiving module is configured to receive a user's selection of the resources required to run a deep learning job and of the Docker image used to submit the deep learning job;
the job scheduling module is configured to schedule the deep learning job according to the usage and load of compute nodes;
the container creation module is configured to, when the deep learning job is scheduled onto compute nodes, push the user-selected Docker image from an image repository and create a Docker container on each compute node in the cluster;
the job running module is configured to map the hardware resources that the compute nodes allocate to the deep learning job into the Docker image, and to run the deep learning job using the mapped hardware resources and the Docker containers.
6. The device according to claim 5, wherein the required resources include: the CPU resources, GPU resources, framework type, and queue information used for training the deep learning task.
7. The device according to claim 5, wherein the compute nodes and a management node in the cluster share stored files via the Network File System (NFS);
the device further comprises a model file storage module configured to: after the job running module runs the deep learning job using the mapped hardware resources and the Docker containers, store the model file produced by training the deep learning task on the compute node, so that the compute node shares the model file with the management node.
8. The device according to claim 5, characterized in that the device further comprises a cluster configuration module configured to: after the container creation module creates a Docker container on each compute node in the cluster, configure the cluster using the overlay network tool flannel.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810793520.0A CN109086134A (en) | 2018-07-19 | 2018-07-19 | Method and device for running deep learning jobs |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109086134A (en) | 2018-12-25 |
Family
ID=64837778
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810793520.0A Pending CN109086134A (en) | 2018-07-19 | 2018-07-19 | Method and device for running deep learning jobs |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109086134A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109508238A (en) * | 2019-01-05 | 2019-03-22 | 咪付(广西)网络技术有限公司 | A kind of resource management system and method for deep learning |
CN109857475A (en) * | 2018-12-27 | 2019-06-07 | 深圳云天励飞技术有限公司 | A kind of method and device of frame management |
CN110688230A (en) * | 2019-10-17 | 2020-01-14 | 广州文远知行科技有限公司 | Synchronous training method and device, computer equipment and storage medium |
CN111090456A (en) * | 2019-12-06 | 2020-05-01 | 浪潮(北京)电子信息产业有限公司 | Construction method, device, equipment and medium for deep learning development environment |
CN111190713A (en) * | 2019-12-26 | 2020-05-22 | 曙光信息产业(北京)有限公司 | Job scheduling management method and device |
CN112114931A (en) * | 2019-06-21 | 2020-12-22 | 鸿富锦精密电子(天津)有限公司 | Deep learning program configuration method and device, electronic equipment and storage medium |
CN112148348A (en) * | 2019-06-28 | 2020-12-29 | 杭州海康威视数字技术股份有限公司 | Task processing method and device and storage medium |
CN112203291A (en) * | 2020-12-03 | 2021-01-08 | 中国科学院自动化研究所 | Cluster control method for area coverage and connectivity maintenance based on knowledge embedding |
CN112416585A (en) * | 2020-11-20 | 2021-02-26 | 南京大学 | GPU resource management and intelligent scheduling method for deep learning |
CN112422651A (en) * | 2020-11-06 | 2021-02-26 | 电子科技大学 | Cloud resource scheduling performance bottleneck prediction method based on reinforcement learning |
WO2021155667A1 (en) * | 2020-02-05 | 2021-08-12 | 北京百度网讯科技有限公司 | Model training method and apparatus, and clustering system |
CN113542352A (en) * | 2021-06-08 | 2021-10-22 | 支付宝(杭州)信息技术有限公司 | Node joint modeling method and node |
US11249749B2 (en) | 2020-03-26 | 2022-02-15 | Red Hat, Inc. | Automatic generation of configuration files |
CN114090183A (en) * | 2021-11-25 | 2022-02-25 | 北京字节跳动网络技术有限公司 | Application starting method and device, computer equipment and storage medium |
WO2023174163A1 (en) * | 2022-03-15 | 2023-09-21 | 之江实验室 | Neural model storage system for brain-inspired computer operating system, and method |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120317447A1 (en) * | 2011-06-07 | 2012-12-13 | Microsoft Corporation | Propagating unobserved exceptions in distributed execution environments |
CN102880832A (en) * | 2012-08-28 | 2013-01-16 | 曙光信息产业(北京)有限公司 | Method for implementing mass data management system under colony |
CN105224256A (en) * | 2015-10-13 | 2016-01-06 | 浪潮(北京)电子信息产业有限公司 | A kind of storage system |
CN106708622A (en) * | 2016-07-18 | 2017-05-24 | 腾讯科技(深圳)有限公司 | Cluster resource processing method and system, and resource processing cluster |
CN106790660A (en) * | 2017-01-18 | 2017-05-31 | 咪咕视讯科技有限公司 | A kind of dispositions method and device for realizing distributed memory system |
CN107135257A (en) * | 2017-04-28 | 2017-09-05 | 东方网力科技股份有限公司 | Task is distributed in a kind of node cluster method, node and system |
CN107450961A (en) * | 2017-09-22 | 2017-12-08 | 济南浚达信息技术有限公司 | A kind of distributed deep learning system and its building method, method of work based on Docker containers |
CN107733977A (en) * | 2017-08-31 | 2018-02-23 | 北京百度网讯科技有限公司 | A kind of cluster management method and device based on Docker |
CN107783818A (en) * | 2017-10-13 | 2018-03-09 | 北京百度网讯科技有限公司 | Deep learning task processing method, device, equipment and storage medium |
CN108062246A (en) * | 2018-01-25 | 2018-05-22 | 北京百度网讯科技有限公司 | For the resource regulating method and device of deep learning frame |
- 2018
- 2018-07-19: CN application CN201810793520.0A filed; published as CN109086134A (en), status Pending
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109857475A (en) * | 2018-12-27 | 2019-06-07 | 深圳云天励飞技术有限公司 | A kind of method and device of frame management |
CN109857475B (en) * | 2018-12-27 | 2020-06-16 | 深圳云天励飞技术有限公司 | Framework management method and device |
CN109508238A (en) * | 2019-01-05 | 2019-03-22 | 咪付(广西)网络技术有限公司 | A kind of resource management system and method for deep learning |
CN112114931B (en) * | 2019-06-21 | 2023-12-26 | 富联精密电子(天津)有限公司 | Deep learning program configuration method and device, electronic equipment and storage medium |
CN112114931A (en) * | 2019-06-21 | 2020-12-22 | 鸿富锦精密电子(天津)有限公司 | Deep learning program configuration method and device, electronic equipment and storage medium |
CN112148348A (en) * | 2019-06-28 | 2020-12-29 | 杭州海康威视数字技术股份有限公司 | Task processing method and device and storage medium |
CN112148348B (en) * | 2019-06-28 | 2023-10-20 | 杭州海康威视数字技术股份有限公司 | Task processing method, device and storage medium |
CN110688230B (en) * | 2019-10-17 | 2022-06-24 | 广州文远知行科技有限公司 | Synchronous training method and device, computer equipment and storage medium |
CN110688230A (en) * | 2019-10-17 | 2020-01-14 | 广州文远知行科技有限公司 | Synchronous training method and device, computer equipment and storage medium |
CN111090456A (en) * | 2019-12-06 | 2020-05-01 | 浪潮(北京)电子信息产业有限公司 | Construction method, device, equipment and medium for deep learning development environment |
CN111190713A (en) * | 2019-12-26 | 2020-05-22 | 曙光信息产业(北京)有限公司 | Job scheduling management method and device |
WO2021155667A1 (en) * | 2020-02-05 | 2021-08-12 | 北京百度网讯科技有限公司 | Model training method and apparatus, and clustering system |
US11249749B2 (en) | 2020-03-26 | 2022-02-15 | Red Hat, Inc. | Automatic generation of configuration files |
CN112422651A (en) * | 2020-11-06 | 2021-02-26 | 电子科技大学 | Cloud resource scheduling performance bottleneck prediction method based on reinforcement learning |
CN112416585A (en) * | 2020-11-20 | 2021-02-26 | 南京大学 | GPU resource management and intelligent scheduling method for deep learning |
CN112416585B (en) * | 2020-11-20 | 2024-03-15 | 南京大学 | Deep learning-oriented GPU resource management and intelligent scheduling method |
CN112203291A (en) * | 2020-12-03 | 2021-01-08 | 中国科学院自动化研究所 | Cluster control method for area coverage and connectivity maintenance based on knowledge embedding |
CN113542352A (en) * | 2021-06-08 | 2021-10-22 | 支付宝(杭州)信息技术有限公司 | Node joint modeling method and node |
CN113542352B (en) * | 2021-06-08 | 2024-04-09 | 支付宝(杭州)信息技术有限公司 | Node joint modeling method and node |
CN114090183A (en) * | 2021-11-25 | 2022-02-25 | 北京字节跳动网络技术有限公司 | Application starting method and device, computer equipment and storage medium |
WO2023174163A1 (en) * | 2022-03-15 | 2023-09-21 | 之江实验室 | Neural model storage system for brain-inspired computer operating system, and method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109086134A (en) | A kind of operation method and device of deep learning operation | |
US11481616B2 (en) | Framework for providing recommendations for migration of a database to a cloud computing system | |
CN103516777B (en) | For carrying out the method and system supplied in cloud computer environment | |
JP6659544B2 (en) | Automated experimental platform | |
CN105593813B (en) | For visualizing the presentation interpreter of the data provided from constrained environment container | |
CN108958892A (en) | A kind of method and apparatus creating the container for deep learning operation | |
US8719833B2 (en) | Adaptive demand-driven load balancing | |
CN102760074B (en) | Method and its system for high load capacity operation flow scalability | |
CN110083455B (en) | Graph calculation processing method, graph calculation processing device, graph calculation processing medium and electronic equipment | |
US10178163B2 (en) | Server-processor hybrid system for processing data | |
EP3032442B1 (en) | Modeling and simulation of infrastructure architecture for big data | |
CN103093034B (en) | Based on the Collaborative Design method of cloud computing | |
CN103425529A (en) | System and method for migrating virtual machines between networked computing environments based on resource utilization | |
Kale | Guide to cloud computing for business and technology managers: from distributed computing to cloudware applications | |
CN102291445A (en) | Cloud computing management system based on virtual resources | |
CN109117252A (en) | Method, system and the container cluster management system of task processing based on container | |
CN115202729A (en) | Container service-based mirror image generation method, device, equipment and medium | |
US20090132582A1 (en) | Processor-server hybrid system for processing data | |
JP5822414B2 (en) | General-purpose simulation system using social network interface | |
Chen et al. | Web-FEM: An internet-based finite-element analysis framework with 3D graphics and parallel computing environment | |
Willis et al. | Container-based analysis environments for low-barrier access to research data | |
Tahboub et al. | Novel Approach for Remote Energy Meter Reading Using Mobile Agents | |
AU2015101031A4 (en) | System and a method for modelling the performance of information systems | |
CN115361382A (en) | Data processing method, device, equipment and storage medium based on data group | |
Wu et al. | Optimizing network performance of computing pipelines in distributed environments |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |
RJ01 | Rejection of invention patent application after publication | | Application publication date: 2018-12-25 |