CN107450961A

CN107450961A - A distributed deep learning system based on Docker containers, and its construction method and working method

Info

Publication number: CN107450961A
Application number: CN201710866197.0A
Authority: CN
Inventors: 张舒; 吴大雷; 张秀真
Original assignee: Ji'nan Junda Information Technology Co Ltd
Current assignee: Ji'nan Junda Information Technology Co Ltd
Priority date: 2017-09-22
Filing date: 2017-09-22
Publication date: 2017-12-08
Anticipated expiration: 2037-09-22
Also published as: CN107450961B

Abstract

The present invention relates to a distributed deep learning system based on Docker containers, together with its construction method and working method. The system comprises a server host, a first distributed deep learning platform, and a second distributed deep learning platform. The invention uses Docker containerization to run multiple distributed deep learning systems simultaneously on a single server host. Its improvements are mainly threefold: first, the whole system can be realized on one server host, so multiple hosts are not needed and cost is saved; second, containers are created from a template image, so the process is simple, the setup does not have to be repeated, and errors and wasted time are avoided; third, the server's CPU can be used to the fullest, so hardware resources are no longer wasted.

Description

A distributed deep learning system based on Docker containers, and its construction method and working method

Technical field

The present invention relates to a distributed deep learning system based on Docker containers and to its construction method and working method, and belongs to the technical field of cloud computing virtualization.

Background art

In essence, cloud computing means that a user terminal obtains storage, computing, and database resources over a remote connection. Virtualization is one of the core components of cloud computing; it is the key technology for fully integrating and efficiently using computing and storage resources, and includes server virtualization and desktop virtualization. Docker is an emerging lightweight virtualization technology. Compared with a traditional VM it is lighter and starts faster, and hundreds or even thousands of containers can run simultaneously on a single machine, which makes it well suited to scaling out during traffic peaks by starting a large number of containers.

At present, deep learning platforms mostly run on a single machine, and distributed deep learning platforms are seldom used, mainly because a distributed platform is more complicated to build and requires more hardware. Compared with a single-machine platform, however, a distributed deep learning platform can compute faster.

Current technology on the market generally has the following problems:

1) A single-machine deep learning platform built on a server has ample CPU computing capacity, but that capacity cannot be fully used, so resources are wasted.

2) A distributed deep learning platform has to be built from multiple hosts. Because the CPU computing capacity of each host is limited, building a large-scale platform requires many hosts and is costly.

3) Building a distributed deep learning platform is tedious. With the host-by-host approach, the same steps must be carried out on every host, and different errors can occur while repeating them, which slows the build.

Chinese patent document CN106657248A discloses a network load balancing system based on Docker containers, together with its construction method and working method. It uses Docker container technology as the basis of the system. Because Docker containers are economical with hardware resources, a large number of them can be created on one server host, so a complete network load balancing system is realized on a single server host. Because Docker containers can be created from an image within seconds, and containers created from the same image are guaranteed to be identical, Web servers can conveniently be added through the container image to spread the access load or data traffic. That patent has the following defect, however: when the image is created with a Dockerfile, it is not possible to inspect visually and test whether particular files in the image have been configured successfully.

Summary of the invention

In view of the shortcomings of the prior art, the present invention provides a distributed deep learning system based on Docker containers.

The present invention also provides a construction method and a working method for the above distributed deep learning system.

The present invention uses Docker containerization to run multiple distributed deep learning systems simultaneously on a single server host. Its improvements are mainly fourfold: first, a configured container is turned into an image with the docker commit command, which makes it possible to inspect and test whether particular files in the image have been configured successfully; second, the whole system can be realized on one server host, so multiple hosts are not needed and cost is saved; third, containers are created from a template image, so the process is simple, the setup does not have to be repeated, and errors and wasted time are avoided; fourth, the server's CPU can be used to the fullest, so hardware resources are no longer wasted.

Explanation of terms:

1. Hadoop distributed platform: a distributed system infrastructure developed by the Apache Software Foundation. Users can develop distributed programs without understanding the low-level details of distribution, and make full use of the power of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and is designed to be deployed on inexpensive hardware; it provides high-throughput access to application data and suits applications with very large data sets. HDFS relaxes some POSIX requirements so that file system data can be accessed as a stream.

2. Spark: a general-purpose parallel computing framework, similar to Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab. Spark implements distributed computing based on the MapReduce model and has the advantages of Hadoop MapReduce. Unlike MapReduce, however, the intermediate output and results of a job can be kept in memory, so HDFS no longer has to be read and written between steps. Spark is therefore better suited to iterative MapReduce-style algorithms such as those used in data mining and machine learning.

3. NameNode: manages the namespace of the file system. It maintains the file system tree and all the files and directories in that tree. This information is stored persistently on the local disk in two files: the namespace image file and the edit log file. The NameNode also records the DataNodes on which each block of each file is located, but it does not persist block locations, because that information is rebuilt from the DataNodes when the system starts.

The technical solution of the present invention is as follows:

A distributed deep learning system based on Docker containers comprises one host and multiple Docker containers. The Hadoop distributed platform and Spark are installed on the host, together with either the first distributed deep learning platform or the second distributed deep learning platform. The Hadoop distributed platform and Spark are likewise installed in each Docker container, together with either the first or the second distributed deep learning platform.

The server host serves as the host and provides the hardware support for the whole platform. The first and second distributed deep learning platforms are two currently available distributed deep learning platforms; both were open-sourced by Yahoo and are mainstream distributed deep learning platforms at present.

The first and second distributed deep learning platforms are tools that assist in carrying out deep learning and differ from single-machine deep learning platforms. As the hardware foundation of the whole distributed deep learning system, the server host must meet requirements such as high processing capability, stability, and reliability.

Preferably according to the present invention, the host is a DELL PowerEdge R730, the first distributed deep learning platform is CaffeOnSpark, and the second distributed deep learning platform is TensorFlowOnSpark.

The DELL PowerEdge R730 server is configured with a 48-core CPU, 96 GB of memory, and 8 TB of local disk. Caffe and TensorFlow are currently the two most popular single-machine deep learning platforms; CaffeOnSpark and TensorFlowOnSpark, which build on Caffe and TensorFlow respectively, are the Hadoop/Spark-based distributed deep learning platforms open-sourced by Yahoo.

The construction method of the above distributed deep learning system based on Docker containers comprises the following specific steps:

(1) Prepare the host, which is the server host, and install the Ubuntu 14.04 operating system. Ubuntu 14.04 is a relatively stable release among the Linux operating systems that support Docker, and the Docker environment can be installed and configured on it directly from the command line;

(2) Create, under the root directory of the host, the main folder needed by the Docker containers. The main folder contains a mountable folder that stores the training model, training data set, test data set, code, and configuration files required for deep learning;

(3) Install the Hadoop distributed platform and Spark on the host to support the CaffeOnSpark or TensorFlowOnSpark distributed deep learning platform. Test whether the Hadoop distributed platform and Spark were installed successfully; if so, go to step (4); otherwise, repeat step (3);

(4) Install on the host either the first distributed deep learning platform, CaffeOnSpark, or the second distributed deep learning platform, TensorFlowOnSpark, and configure the IP address of the master node. The host acts as the master node when the system runs;

(5) Create an empty container on the host;

(6) Install the Hadoop distributed platform and Spark in the empty container;

(7) Install, in the container resulting from step (6), either the first distributed deep learning platform, CaffeOnSpark, or the second distributed deep learning platform, TensorFlowOnSpark, and configure the IP address of the slave node. The container acts as a slave node when the system runs;

(8) Use the docker commit command to create an image, taking the container resulting from step (7) as the template;

(9) Use the image created in step (8) to create multiple Docker containers, and configure the IP address of each Docker container.
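
The container-related steps (5), (8) and (9) can be sketched as a list of docker command lines. This is only an illustrative sketch, not the procedure of the invention itself: the function name, the container names (`slave1`, `slave2`, ...), the `ubuntu:14.04` base image, the use of a user-defined network with `--subnet` and `--ip` to assign addresses, and mounting the shared folder with `-v` are all assumptions; the source does not say how the IP addresses are configured or which base image the empty container uses.

```python
import ipaddress
from itertools import islice
from typing import List


def build_cluster_commands(template: str, image: str, network: str,
                           subnet: str, shared_dir: str,
                           count: int) -> List[List[str]]:
    """Return docker CLI argument lists for steps (5), (8) and (9).

    All names passed in are hypothetical; nothing is executed here.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    # Skip the first usable address, which Docker normally assigns to the
    # network gateway, then take one address per slave container.
    ips = list(islice(ipaddress.ip_network(subnet).hosts(), 1, count + 1))
    if len(ips) < count:
        raise ValueError("subnet too small for the requested container count")

    mount = f"{shared_dir}:{shared_dir}"  # assumed: same path inside the container
    commands = [
        # Step (5): create an empty container (base image is an assumption).
        ["docker", "run", "-dit", "--name", template, "-v", mount,
         "ubuntu:14.04", "/bin/bash"],
        # Steps (6) and (7) are carried out by hand inside that container.
        # Step (8): turn the configured container into a template image.
        ["docker", "commit", template, image],
        # Assumed: a user-defined network so each container gets a fixed IP.
        ["docker", "network", "create", "--subnet", subnet, network],
    ]
    # Step (9): create the slave containers from the template image.
    for index, ip in enumerate(ips, start=1):
        commands.append(
            ["docker", "run", "-dit", "--name", f"slave{index}",
             "--network", network, "--ip", str(ip), "-v", mount,
             image, "/bin/bash"])
    return commands
```

Each returned list could be passed to `subprocess.run` once the manual installation steps have been completed in the template container.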

To test whether the Hadoop distributed platform was installed successfully, proceed as follows. Format the NameNode. On success, the messages "successfully formatted" and "Exitting with status 0" appear; "Exitting with status 1" indicates an error. If this step reports "Error: JAVA_HOME is not set and could not be found.", the JAVA_HOME environment variable was not set correctly earlier; set JAVA_HOME first, otherwise the subsequent steps cannot proceed. Then start the NameNode and DataNode daemons; if an SSH prompt appears, enter yes.
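
The check described above can be automated by scanning the captured output of the format command for the quoted messages. The following is a minimal sketch; the function names are hypothetical, and it assumes the command output has already been collected into a string.

```python
def namenode_format_succeeded(output: str) -> bool:
    """Return True if the format output contains both success messages."""
    if "Exitting with status 1" in output:
        # The text treats this message as an error.
        return False
    return ("successfully formatted" in output
            and "Exitting with status 0" in output)


def java_home_missing(output: str) -> bool:
    """Return True if the output reports that JAVA_HOME is not set."""
    return "JAVA_HOME is not set" in output
```

If `java_home_missing` returns True, JAVA_HOME has to be set before formatting is attempted again.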

To test whether Spark was installed successfully, proceed as follows. The spark/examples/src/main directory contains Spark example programs in Scala, Java, Python, R, and other languages. Run the example program SparkPi, which computes an approximation of π. A great deal of runtime information is printed during execution, so the result is hard to spot; filter the output with the grep command, and the filtered result gives an approximation of π to 5 decimal places.
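
The grep-style filtering can also be done programmatically. The sketch below assumes that the SparkPi example reports its result on a line of the form "Pi is roughly <value>"; that line format and the function name are assumptions not stated in the source, and the log text is assumed to have been captured into a string beforehand.

```python
import re
from typing import Optional

# Assumed result line format of the SparkPi example program.
PI_LINE = re.compile(r"Pi is roughly (\d+\.\d+)")


def extract_pi(output: str) -> Optional[float]:
    """Return the approximation of pi found in the output, or None."""
    for line in output.splitlines():
        match = PI_LINE.search(line)
        if match:
            return float(match.group(1))
    return None
```

A return value of None means that no result line was found, which suggests that the example did not run to completion.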

The working method of the above distributed deep learning system based on Docker containers comprises the following specific steps:

(1) Start the Hadoop platform and Spark on the host, which acts as the master node of the whole distributed deep learning system, and start the Hadoop platform and Spark in several of the Docker containers, which act as the slave nodes of the whole distributed deep learning system;

(2) Store the training model, training data set, test data set, code, and configuration files required for deep learning training in the mountable folder on the host;

(3) Start deep learning training with a script; the master node assigns the deep learning training task to each slave node for parallel training.
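
The launch script of step (3) is not given in the source. As a rough sketch of what such a script might assemble, the function below builds a `spark-submit` command line. It assumes that Spark runs in standalone mode with the master listening on the default port 7077; the function name, the script path, and the core count are hypothetical, and a real CaffeOnSpark or TensorFlowOnSpark job would need further framework-specific arguments that are omitted here.

```python
from typing import List


def build_submit_command(master_ip: str, script: str, total_cores: int,
                         port: int = 7077) -> List[str]:
    """Build a spark-submit argument list for a standalone Spark master.

    Assumes standalone mode; the source does not name the cluster manager.
    """
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return [
        "spark-submit",
        "--master", f"spark://{master_ip}:{port}",
        # Upper bound on the cores used across all slave nodes.
        "--total-executor-cores", str(total_cores),
        script,
    ]
```

The master node then schedules the submitted job across the slave containers registered with it.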

The beneficial effects of the present invention are:

1. The present invention can set up a distributed deep learning platform using only one server host.

2. When more distributed nodes are needed, containers can be started quickly and, once configured, added as nodes.

3. The CPU computing resources of the server are fully utilized.

Brief description of the drawings

Fig. 1 is a structural block diagram of the distributed deep learning system based on Docker containers according to the present invention.

Embodiment

The present invention is further described below with reference to the drawings and embodiments, but is not limited thereto.

Embodiment 1

A distributed deep learning system based on Docker containers, as shown in Fig. 1, comprises one host and multiple Docker containers. The Hadoop distributed platform and Spark are installed on the host, together with either the first distributed deep learning platform or the second distributed deep learning platform. The Hadoop distributed platform and Spark are likewise installed in each Docker container, together with either the first or the second distributed deep learning platform.

The host is a DELL PowerEdge R730, the first distributed deep learning platform is CaffeOnSpark, and the second distributed deep learning platform is TensorFlowOnSpark.

Embodiment 2

The construction method of the distributed deep learning system based on Docker containers described in Embodiment 1 comprises the following specific steps:

(1) Prepare the host, which is the server host, and install the Ubuntu 14.04 operating system. Ubuntu 14.04 is a relatively stable release among the Linux operating systems that support Docker, and the Docker environment can be installed and configured on it directly from the command line;

(5) Create an empty container on the host;

(6) Install the Hadoop distributed platform and Spark in the empty container;

Embodiment 3

The working method of the distributed deep learning system based on Docker containers described in Embodiment 1 comprises the following specific steps:

Claims

1. A distributed deep learning system based on Docker containers, characterized in that it comprises one host and multiple Docker containers; a Hadoop distributed platform and Spark are installed on the host, and a first distributed deep learning platform or a second distributed deep learning platform is also installed on the host; a Hadoop distributed platform and Spark are installed in each Docker container, and the first distributed deep learning platform or the second distributed deep learning platform is also installed in each Docker container.

2. The distributed deep learning system based on Docker containers according to claim 1, characterized in that the host is a DELL PowerEdge R730, the first distributed deep learning platform is CaffeOnSpark, and the second distributed deep learning platform is TensorFlowOnSpark.

3. A construction method of the distributed deep learning system based on Docker containers according to claim 1 or 2, characterized in that the specific steps comprise:

(1) preparing the host, the host being a server host;

(2) creating, under the root directory of the host, the main folder needed by the Docker containers, the main folder containing a mountable folder for storing the training model, training data set, test data set, code, and configuration files required for deep learning;

(3) installing the Hadoop distributed platform and Spark on the host; testing whether the Hadoop distributed platform and Spark were installed successfully; if so, going to step (4); otherwise, repeating step (3);

(4) installing on the host either the first distributed deep learning platform, CaffeOnSpark, or the second distributed deep learning platform, TensorFlowOnSpark, and configuring the IP address of the master node;

(5) creating an empty container on the host;

(6) installing the Hadoop distributed platform and Spark in the empty container;

(7) installing, in the container resulting from step (6), either the first distributed deep learning platform, CaffeOnSpark, or the second distributed deep learning platform, TensorFlowOnSpark, and configuring the IP address of the slave node;

(9) using the image created in step (8) to create multiple Docker containers, and configuring the IP address of each Docker container.

4. A working method of the distributed deep learning system based on Docker containers according to claim 1 or 2, characterized in that the specific steps comprise:

(1) starting the Hadoop platform and Spark on the host, the host acting as the master node of the whole distributed deep learning system, and starting the Hadoop platform and Spark in several of the Docker containers, the several Docker containers acting as the slave nodes of the whole distributed deep learning system;

(2) storing the training model, training data set, test data set, code, and configuration files required for deep learning training in the mountable folder on the host;

(3) starting deep learning training with a script, the master node assigning the deep learning training task to each slave node for parallel training.