CN111459506A - Deployment method, device, medium and electronic equipment of deep learning platform cluster - Google Patents

Deployment method, device, medium and electronic equipment of deep learning platform cluster Download PDF

Info

Publication number
CN111459506A
Authority
CN
China
Prior art keywords
deep learning
installation package
learning platform
installation
storage node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010136850.XA
Other languages
Chinese (zh)
Other versions
CN111459506B (en)
Inventor
钟孝勋
贺波
万书武
李均
蒋英明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010136850.XA priority Critical patent/CN111459506B/en
Publication of CN111459506A publication Critical patent/CN111459506A/en
Application granted granted Critical
Publication of CN111459506B publication Critical patent/CN111459506B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/60 Software deployment
    • G06F 8/61 Installation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Stored Programmes (AREA)

Abstract

The disclosure relates to the field of process optimization, and discloses a deployment method, apparatus, medium and electronic device for a deep learning platform cluster. The method comprises the following steps: setting a storage node; uploading a deep learning platform installation package, a deep learning calculation installation package corresponding to the deep learning platform installation package, a system installation package on which the deep learning platform depends to run on a target system, and an installation program of the deep learning platform to the storage node for storage; and issuing a download instruction to a target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation program of the deep learning platform according to the download instruction, the target device downloads and installs, from the storage node by running the installation program, the installation package of the deep learning platform matched with the target device, the system installation package on which the deep learning platform depends, and/or the deep learning calculation installation package. With this method, the deployment efficiency of the deep learning platform, and in particular of a deep learning platform cluster, is improved, and the deployment cost is reduced.

Description

Deployment method, device, medium and electronic equipment of deep learning platform cluster
Technical Field
The present disclosure relates to the field of process optimization technologies, and in particular, to a deployment method, an apparatus, a medium, and an electronic device for a deep learning platform cluster.
Background
The wave of artificial intelligence is sweeping the world, and with the development of artificial intelligence, platforms such as deep learning frameworks that provide services for deep learning keep emerging. At present, however, installing a deep learning platform requires considering a variety of factors, such as the version of the deep learning platform itself, the version of the operating system used by the device to be installed, the versions of the various installation packages in the environment on which the deep learning platform depends, and the hardware configuration of the device to be installed. As a result, deploying a deep learning platform on a device currently requires professional skills and cumbersome operations from the relevant personnel, and especially when a cluster of deep learning platforms is to be deployed, problems such as low deployment efficiency and high deployment cost exist.
Disclosure of Invention
In the technical field of process optimization, in order to solve the technical problems, the present disclosure aims to provide a deployment method, an apparatus, a medium, and an electronic device for a deep learning platform cluster.
According to an aspect of the present application, there is provided a deployment method of a deep learning platform cluster, the method including:
setting a storage node;
uploading an installation package of at least one deep learning platform, a deep learning calculation installation package corresponding to the installation package of the at least one deep learning platform, and a system installation package on which the deep learning platform depends to run on a target system to the storage node for storage, wherein the installation packages of all the deep learning platforms are installation packages of the same deep learning platform;
uploading the installation program of the deep learning platform to the storage node for storage;
and issuing a download instruction to a target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation program of the deep learning platform as instructed by the download instruction, the target device downloads and installs, from the storage node by running the installation program of the deep learning platform, the installation package of the deep learning platform matched with the target device, the system installation package on which the deep learning platform depends to run, and/or the deep learning calculation installation package.
According to another aspect of the present application, there is provided a deployment apparatus for a deep learning platform cluster, the apparatus comprising:
a setting module configured to set a storage node;
a first uploading module configured to upload an installation package of at least one deep learning platform, a deep learning calculation installation package corresponding to the installation package of the at least one deep learning platform, and a system installation package on which the deep learning platform depends to run on a target system to the storage node for storage, wherein the installation packages of all the deep learning platforms are installation packages of the same deep learning platform;
the second uploading module is configured to upload the installation program of the deep learning platform to the storage node for storage;
and an instruction issuing module configured to issue a download instruction to a target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation program of the deep learning platform as instructed by the download instruction, the target device downloads and installs, from the storage node by running the installation program, the installation package of the deep learning platform matched with the target device, the system installation package on which the deep learning platform depends to run, and/or the deep learning calculation installation package.
According to another aspect of the present application, there is provided a computer readable program medium storing computer program instructions which, when executed by a computer, cause the computer to perform the method as previously described.
According to another aspect of the present application, there is provided an electronic device including:
a processor;
a memory having computer readable instructions stored thereon which, when executed by the processor, implement the method as previously described.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
the deployment method of the deep learning platform cluster comprises the following steps: setting a storage node; uploading an installation package of at least one deep learning platform, a deep learning calculation installation package corresponding to the installation package of at least one deep learning platform and a system installation package which is depended by the deep learning platform and runs on a target system to the storage node for storage, wherein the installation packages of all the deep learning platforms are installation packages of the same deep learning platform; uploading the installation program of the deep learning platform to the storage node for storage; and issuing a downloading instruction to target equipment to be deployed with the deep learning platform, so that after the target equipment downloads the installation program of the deep learning platform according to the instruction of the downloading instruction, the target equipment downloads and installs the installation package of the deep learning platform matched with the target equipment, and runs the system installation package and/or the deep learning calculation installation package on which the deep learning platform depends from the storage node by running the installation program of the deep learning platform.
In this method, after a storage node is set up, the installation packages required to install and run the deep learning platform, namely the installation package of the deep learning platform, the deep learning calculation installation package, the installation program of the deep learning platform, and the system installation package on which the deep learning platform depends to run on the target system, are uploaded to the storage node. When the deep learning platform is to be deployed on target devices, an instruction is issued to the corresponding target devices; each target device can automatically download the installation program of the deep learning platform from the storage node according to the download instruction and, by installing and running that program, automatically download from the storage node the installation packages required to install and run the deep learning platform that are adapted to that device. The installation packages required to install and run the deep learning platform are thus automatically adapted, downloaded and installed, so the deployment efficiency of the deep learning platform, and in particular of a deep learning platform cluster, is improved, and the deployment cost is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a system architecture diagram illustrating a method of deployment of a deep learning platform cluster in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a method for deployment of a deep learning platform cluster in accordance with an exemplary embodiment;
FIG. 3 is a flowchart illustrating details of step 220 and step 240 according to one embodiment based on the embodiment shown in FIG. 2;
FIG. 4 is a flowchart illustrating steps performed when a target device runs an installation script of a deep learning platform in accordance with an exemplary embodiment;
FIG. 5 is a block diagram illustrating a deployment apparatus of a deep learning platform cluster, according to an example embodiment;
FIG. 6 is a block diagram illustrating an example of an electronic device implementing the deployment method of the deep learning platform cluster described above, according to an example embodiment;
FIG. 7 illustrates a computer-readable storage medium for implementing the deployment method of the deep learning platform cluster described above, according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
The disclosure first provides a deployment method of a deep learning platform cluster. A deep learning platform may also be referred to as a deep learning framework, and may provide classes, functions, and API (Application Programming Interface) interfaces for programming various deep learning algorithms. These deep learning algorithms may include deep Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs), and the like. Common deep learning platforms include TensorFlow, Keras, PyTorch, Baidu PaddlePaddle, and the like. Deployment of a deep learning platform cluster refers to deploying deep learning platforms on a plurality of terminals or devices respectively.
The implementation terminal of the present disclosure may be any device having computing and processing functions, which may be connected to an external device for receiving or sending data, and specifically may be a portable mobile device, such as a smart phone, a tablet computer, a notebook computer, a PDA (Personal Digital Assistant), etc., or may be a fixed device, such as a computer device, a field terminal, a desktop computer, a server, a workstation, etc., or may be a set of multiple devices, such as the physical infrastructure of cloud computing.
Preferably, the implementation terminal of the present disclosure may be a server or the physical infrastructure of cloud computing.
Fig. 1 is a system architecture diagram illustrating a deployment method of a deep learning platform cluster according to an exemplary embodiment. As shown in fig. 1, the system architecture includes a file server 120, a user terminal 130, and a device cluster 110 to be deployed with a deep learning platform, where the device cluster 110 includes a plurality of devices, and the user terminal 130 and each device, each device and the file server 120, and the file server 120 and the user terminal 130 are connected through communication links. In this embodiment, the user terminal 130 is an execution terminal of the present application, and target devices in the target device cluster 110 may be organized by a preset architecture to implement data interaction. When the deployment method of the deep learning platform cluster provided by the present disclosure is applied to the system architecture shown in fig. 1, a specific process may be as follows: first, the user terminal 130 configures the file server 120 as a storage node, so that the file server 120 can receive and store data sent by the user terminal 130; then, the user terminal 130 may upload the installation package of the deep learning platform, the corresponding deep learning calculation installation package, the system installation package on which the deep learning platform depends and the installation program of the deep learning platform to the storage node for storage; then, the user terminal 130 issues a download instruction to a target device to be deployed with the deep learning platform among the multiple devices of the device cluster 110 through a communication link, the target device may download an installation program of the deep learning platform according to the instruction of the download instruction, and then the target device downloads and installs an installation package of the deep learning platform adapted to the target device, a system installation package on which the deep learning platform depends, and/or a deep learning calculation installation package from a storage node by running the installation program, thereby implementing deployment of the deep learning platform on the target device, and the deep learning platforms deployed on each target device form a deep learning platform cluster.
It should be noted that the embodiment shown in fig. 1 is only one embodiment of the present application. Although in this embodiment the devices in the device cluster are different from the execution terminal of the present application and the storage node storing the installation program and the various installation packages is a single node, in other embodiments or in specific applications a device in the device cluster may itself be the execution terminal of the present application, the execution terminal may be a device different from the target devices, and the storage node storing the installation program and the various installation packages may be a plurality of nodes, such as a storage node cluster.
FIG. 2 is a flow diagram illustrating a method for deployment of a deep learning platform cluster, according to an example embodiment. The present embodiment may be executed by the aforementioned server or desktop computer, as shown in fig. 2, and may include the following steps:
step 210, a storage node is set.
The storage node may be any device having storage and communication functions, and it may be the same type of device as the execution terminal of the present application, or may be a different type of device. The number of the set storage nodes may be one, or may be multiple, for example, when multiple storage nodes are set, the multiple storage nodes may be a server cluster.
The storage node may be set in various ways.
In one embodiment, setting the storage node comprises:
and sending an installation package of the storage management system to a target node so that the target node becomes a storage node after receiving and installing the installation package.
In one embodiment, the target node is pre-installed with a client, and the local terminal is pre-installed with a server corresponding to the client, and the setting storage node includes:
and configuring a configuration file corresponding to the client of the target node at a server so as to set the target node as a storage node.
In one embodiment, the storage node is a file server.
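For illustration only, a file server acting as the storage node could be realized as a plain HTTP file server that target devices later read with wget; the directory, host and port below are hypothetical assumptions and are not specified by this disclosure:

# Illustrative only: expose the package directory over HTTP so that target
# devices can later fetch the installer and packages with wget.
mkdir -p /data/dl-packages
cp install-tensorflow.sh /data/dl-packages/   # installer script; platform, CUDA/cuDNN, gcc/glibc packages go here too
cd /data/dl-packages
python3 -m http.server 8080                   # built-in file server; a production setup could use nginx or a dedicated file server instead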
Step 220, uploading the installation package of at least one deep learning platform, the deep learning calculation installation package corresponding to the installation package of the at least one deep learning platform, and the system installation package on which the deep learning platform depends to run on the target system to the storage node for storage.
All the installation packages of the deep learning platforms are installation packages of the same deep learning platform.
As previously mentioned, the deep learning platform may be a variety of software system architectures that support running deep learning models or algorithms, such as TensorFlow.
When a plurality of installation packages of the deep learning platform are uploaded to the storage node for storage, they may be installation packages of the same platform but of different versions.
The target system may be any of various operating systems, such as the Linux-based RedHat operating system, the Linux-based Community Enterprise Operating System (CentOS), the Linux-based Ubuntu operating system, and so on.
The deep learning calculation installation package is an installation package of a platform that assists deep learning training and inference. Detailed descriptions of the installation package of the deep learning platform, the deep learning calculation installation package, and the system installation package are set forth in the explanation of step 240 and are not detailed here.
Step 230, uploading the installation program of the deep learning platform to the storage node for storage.
The installation program may be any program entity available for installing the deep learning platform, such as software, a module, a component, or a script.
Step 240, issuing a download instruction to the target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation program of the deep learning platform as instructed by the download instruction, the installation package of the deep learning platform matched with the target device, the system installation package on which the deep learning platform depends, and/or the deep learning calculation installation package are downloaded and installed from the storage node by running the installation program of the deep learning platform.
The download instruction is an instruction for instructing the target device to download the installer of the deep learning platform.
The target device downloads and installs, from the storage node by running the installer of the deep learning platform, the installation package of the deep learning platform matched with the target device, the system installation package on which the deep learning platform depends to run, and/or the deep learning calculation installation package. This means the target device does not necessarily download and install all three kinds of installation packages from the storage node at once; for example, it may download and install only two of them, namely the installation package of the deep learning platform matched with the target device and the system installation package on which the deep learning platform depends to run.
In one embodiment, the Ansible automated operations and maintenance tool is used to issue the download instruction to the target device on which the deep learning platform is to be deployed.
Ansible is an automated operations and maintenance tool; it does not require a client or agent to be installed on the remote host, and it communicates with the remote host over ssh.
For example, if the target system is the RedHat operating system and the deep learning platform is TensorFlow, the download instruction issued to the target device on which the deep learning platform is to be deployed may be wget install-tensorflow.sh, where install-tensorflow.sh is a script-type installer.
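As an illustrative sketch only, such a download instruction could be pushed to every node with an Ansible ad-hoc command; the host group dl_cluster, the inventory file hosts.ini, and the file-server address fileserver.example.com:8080 below are hypothetical and not taken from this disclosure:

# Illustrative only: push the download instruction to all target devices over ssh with Ansible.
ansible dl_cluster -i hosts.ini -m shell -a \
  "wget -O /tmp/install-tensorflow.sh http://fileserver.example.com:8080/install-tensorflow.sh && bash /tmp/install-tensorflow.sh"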
In this step, by issuing a download instruction to the target device on which the deep learning platform is to be deployed, the target device that receives the download instruction can complete, as instructed, the construction of the deep learning platform and of the environment on which it depends to run.
In one embodiment, the deep learning calculation installation package includes a general parallel computing architecture installation package and a deep neural network library installation package, the system installation package on which the deep learning platform depends to run on the target system includes an installation package of a compiler suite and a system core library installation package, the installation package of the at least one deep learning platform includes an installation package of a graphics processing version of the at least one deep learning platform, and the deep learning calculation installation package corresponds to the installation package of the graphics processing version of the deep learning platform. The specific sub-steps of step 220 and step 240 may then be as shown in fig. 3.
Fig. 3 is a flowchart illustrating details of step 220 and step 240 according to one embodiment illustrated in a corresponding embodiment of fig. 2. As shown in fig. 3, the method comprises the following steps:
step 220', uploading the installation package of at least one deep learning platform, the general parallel computing architecture installation package and the deep neural network library installation package corresponding to the installation package of the graphics processing version of the deep learning platform, and the installation package of the compiler suite and the system core library installation package matched with the installation package of the deep learning platform to the storage node for storage.
In this embodiment, to install the deep learning platform, in addition to the installation package of the deep learning platform itself, other dependent installation packages are also required, including an installation package of a general parallel computing architecture, an installation package of a deep neural network library, an installation package of a compiler suite, and an installation package of a system core library.
Correspondence between installation packages means that the versions of the two installation packages correspond, so that once installed and running the two packages are mutually compatible and can work together.
Since one version of an installation package is often developed against a specific version of another kind of installation package and a specific operating system version, a running installation package is typically only compatible with that corresponding version of the other package and operating system. For example, to install a given version of the graphics processing version of a deep learning platform on a device and make good use of the resulting deep learning platform, the general parallel computing architecture installation package, deep neural network library installation package, and so on of the versions corresponding to that deep learning installation package must also be installed.
In one embodiment, the installation package of the deep learning platform is an Anaconda package embedding a TensorFlow-gpu package and/or a TensorFlow-cpu package, the installation package of the general parallel computing architecture corresponding to the installation package of the graphics processing version of the deep learning platform is a CUDA package corresponding to TensorFlow-gpu, the installation package of the compiler suite is a GCC package, and the installation package of the system core library is a GLIBC package.
For example, when the version number of the installed TensorFlow-gpu package is 1.11.0, the version number of the cuDNN package corresponding to the TensorFlow-gpu package should be 7.0, and the version number of the CUDA package corresponding to the TensorFlow-gpu package should be 9.0.
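For illustration only, the installation script could encode such a correspondence as a simple lookup; the sketch below records just the pairing given in the example above (further entries, the variable names, and the error handling are assumptions, not part of this disclosure):

# Illustrative only: one way the installation script could record the version correspondence.
TF_VERSION="1.11.0"
case "$TF_VERSION" in
  1.11.0)
    CUDA_VERSION="9.0"      # CUDA package version matching TensorFlow-gpu 1.11.0
    CUDNN_VERSION="7.0"     # cuDNN package version matching TensorFlow-gpu 1.11.0
    ;;
  *)
    echo "no recorded CUDA/cuDNN correspondence for TensorFlow-gpu $TF_VERSION" >&2
    exit 1
    ;;
esac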
CUDA (Compute Unified Device Architecture) is a computing platform introduced by the graphics card vendor NVIDIA.
The deep neural network library (NVIDIA CUDA Deep Neural Network library, cuDNN) is a GPU-accelerated library of deep neural network primitives.
The GNU C Library (glibc) project provides the core libraries for the GNU system and GNU/Linux systems, as well as many other systems that use Linux as a kernel. These libraries provide key APIs (Application Programming Interfaces), including ISO C11, POSIX.1-2008, BSD, OS-specific APIs, and so forth.
GNU is a free operating system whose component software is distributed entirely under the GPL.
GCC is the GNU Compiler suite (GNU Compiler Collection).
Step 240', issuing a download instruction to the target device on which the deep learning platform is to be deployed, so that after downloading the installation program of the deep learning platform as instructed by the download instruction, the target device, by running the installation program of the deep learning platform, downloads and installs from the storage node either the installation package of the deep learning platform matched with the target device, the general parallel computing architecture installation package and deep neural network library installation package corresponding to the installation package of the graphics processing version of the deep learning platform, and the installation package of the compiler suite and the system core library installation package matched with the installation package of the deep learning platform; or the installation package of the deep learning platform matched with the target device, the installation package of the compiler suite matched with the installation package of the deep learning platform, and the system core library installation package.
In this embodiment, by running the installation program of the deep learning platform, different types of target devices may download and install different numbers and types of installation packages of the deep learning platform and of the packages it depends on, matched to the target device. For example, for the installation package of the graphics processing version of the deep learning platform, the corresponding general parallel computing architecture installation package and deep neural network library installation package are also downloaded and installed. This ensures that the various installation packages required to deploy the deep learning platform that the target device downloads are matched to the target device, and deployment of the deep learning platform on the target device can be completed automatically, thereby improving the deployment efficiency of the deep learning platform, particularly of a deep learning platform cluster, and reducing the deployment cost.
In one embodiment, setting the storage node comprises:
setting a plurality of storage nodes;
the uploading of the installation package of at least one deep learning platform, the deep learning calculation installation package corresponding to the installation package of the at least one deep learning platform, and the system installation package on which the deep learning platform depends to run on a target system to the storage node for storage includes:
respectively uploading the installation package of at least one deep learning platform, the deep learning calculation installation package corresponding to the installation package of the at least one deep learning platform, and the system installation package on which the deep learning platform depends to run on a target system to the plurality of storage nodes for storage;
the issuing of a download instruction to the target device on which the deep learning platform is to be deployed, so that after downloading the installation program of the deep learning platform as instructed by the download instruction, the target device downloads and installs, from the storage node by running the installation program of the deep learning platform, the installation package of the deep learning platform matched with the target device, the system installation package on which the deep learning platform depends, and/or the deep learning calculation installation package, includes:
issuing a download instruction to the target device on which the deep learning platform is to be deployed, so that after downloading the installation program of the deep learning platform as instructed by the download instruction, the target device downloads and installs, from the storage node closest to the target device by running the installation program of the deep learning platform, the installation package of the deep learning platform matched with the target device, the system installation package on which the deep learning platform depends, and/or the deep learning calculation installation package.
In this embodiment, by setting a plurality of storage nodes and having the target device download the installation package of the deep learning platform and the packages it depends on from the storage node closest to it when deploying the deep learning platform, the resource download rate during deployment is improved to a certain extent and the resource download delay is reduced, so the deployment efficiency of the deep learning platform is improved; in addition, because the installation package of the deep learning platform, the deep learning calculation installation package and the system installation package are uploaded to a plurality of storage nodes, the load on any single storage node is reduced.
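This disclosure does not say how the closest storage node is determined; one simple possibility, sketched below purely for illustration with hypothetical node addresses, is to probe each candidate node and keep the one with the lowest round-trip time:

# Illustrative only: choose the storage node with the lowest average ping round-trip time.
NODES="storage-a.example.com storage-b.example.com storage-c.example.com"
best_node=$(
  for node in $NODES; do
    rtt=$(ping -c 3 -q "$node" | awk -F'/' 'END {print $5}')   # average RTT in ms
    [ -n "$rtt" ] && echo "$rtt $node"
  done | sort -n | head -n1 | awk '{print $2}'
)
echo "downloading installation packages from $best_node"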
In one embodiment, the uploading the installation program of the deep learning platform to the storage node for storage includes:
uploading the installation script of the deep learning platform to the storage node for storage;
the issuing of a download instruction to the target device on which the deep learning platform is to be deployed, so that after downloading the installation program of the deep learning platform as instructed by the download instruction, the target device downloads and installs, from the storage node by running the installation program of the deep learning platform, the installation package of the deep learning platform matched with the target device, the system installation package on which the deep learning platform depends, and/or the deep learning calculation installation package, includes:
issuing a download instruction to the target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation script of the deep learning platform as instructed by the download instruction, the target device downloads and installs, from the storage node by running the installation script of the deep learning platform, the installation package of the deep learning platform matched with the target device, the system installation package on which the deep learning platform depends, and/or the deep learning calculation installation package.
For example, the target system may be the Linux-based RedHat operating system, the deep learning platform may be TensorFlow, and the installation script may be named install-tensorflow.sh.
In one embodiment, the deep learning calculation installation package includes a general parallel computing architecture installation package and a deep neural network library installation package, the system installation package on which the deep learning platform depends to run on the target system includes an installation package of a compiler suite and a system core library installation package, the installation package of the at least one deep learning platform includes an installation package of a graphics processing version of the at least one deep learning platform, correspondence between installation packages is correspondence of their version numbers, and the deep learning calculation installation package corresponds to the installation package of the graphics processing version of the deep learning platform. The steps performed when the installation script of the deep learning platform is run by the target device may then be as shown in fig. 4.
FIG. 4 is a flowchart illustrating steps performed when a target device runs an installation script for a deep learning platform in accordance with an exemplary embodiment. As shown in fig. 4, the method comprises the following steps:
Information acquisition step 410: acquire the system kernel version information of the target device.
The cat /etc/redhat-release command can be used to obtain the system kernel version information.
Kernel version determination step 420: compare the system kernel version information with preset version information; if they are consistent, go to the system installation package version judgment step, and if they are inconsistent, go to the system installation package upgrading step.
For example, if the preset version information is 7.4 and the obtained system kernel version information is 6.7, the two are inconsistent, and the process goes to step 430.
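A minimal sketch of these two steps inside the installation script follows; the preset value is the example above, while the way the release string is parsed is an assumption, not something this disclosure specifies:

# Illustrative only: information acquisition (step 410) and kernel version determination (step 420).
PRESET_VERSION="7.4"
release=$(cat /etc/redhat-release)                                # e.g. "CentOS Linux release 7.4.1708 (Core)"
current=$(echo "$release" | grep -oE '[0-9]+\.[0-9]+' | head -n1)
if [ "$current" = "$PRESET_VERSION" ]; then
  echo "system version $current matches the preset version, go to step 440"
else
  echo "system version $current differs from $PRESET_VERSION, go to step 430"
fi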
System installation package upgrading step 430: download and install, from the storage node, the installation package of the compiler suite and the installation package of the system core library corresponding to the preset version information.
The specific process of this step may be as follows: after an installation directory of the compiler suite and an installation directory of the system core library are created, the installation directory of the compiler suite is entered, and the compiler suite corresponding to the preset version information is downloaded from the storage node into that directory and installed; then the installation directory of the system core library is entered, and the system core library corresponding to the preset version information is downloaded from the storage node into that directory and installed. For example, the compiler suite may be gcc and the system core library may be glibc: the installation directory of gcc is entered with a cd command, the gcc corresponding to the preset version information is downloaded from the storage node with a wget command and installed in that directory; the installation directory of glibc is entered with a cd command, and the glibc corresponding to the preset version information is downloaded from the storage node with a wget command and installed in that directory.
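A sketch of this upgrading step follows, under the assumption that the storage node serves package archives over HTTP; the storage node address, package file names, versions and directories are hypothetical, and the actual build or install commands depend on how the packages were produced:

# Illustrative only: system installation package upgrading (step 430).
STORAGE="http://fileserver.example.com:8080"
mkdir -p /opt/gcc-install && cd /opt/gcc-install
wget "$STORAGE/gcc-7.3.0.tar.gz"           # compiler suite matching the preset version
tar xzf gcc-7.3.0.tar.gz                   # install steps depend on the package format
mkdir -p /opt/glibc-install && cd /opt/glibc-install
wget "$STORAGE/glibc-2.17.tar.gz"          # system core library matching the preset version
tar xzf glibc-2.17.tar.gz                  # e.g. configure && make && make install for a source package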
System installation package version judgment step 440: acquire the version of the compiler suite and the version of the system core library currently installed on the target device to determine whether both are the preset versions; if so, go to the device type determination step, and if not, go to the system installation package upgrading step.
For example, the compiler suite may be gcc and the system core library may be glibc, and the commands for obtaining the version of the compiler suite and the version of the system core library currently installed on the target device may be gcc --version and ldd --version, respectively.
In this step, the version of the compiler suite and the version of the system core library are checked a second time.
Device type determination step 450: judge whether the target device is a graphics processing device; if so, go to the deep learning calculation installation package installation step, and if not, go to the platform installation package installation step.
For example, if a graphics processing (GPU) device is a device equipped with an NVIDIA GPU, whether the target device is a GPU device may be determined with an lspci | grep -i nvidia command; when the target device is a GPU device, information about the NVIDIA GPU is returned by the command.
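A minimal sketch of this determination step, using the command just described (the echoed messages are illustrative only):

# Illustrative only: device type determination (step 450).
if lspci | grep -i nvidia > /dev/null; then
  echo "NVIDIA GPU detected: graphics processing device, go to step 460"
else
  echo "no NVIDIA GPU detected: go to the platform installation package installation step 470"
fi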
Deep learning calculation installation package installation step 460: download and install the general parallel computing architecture installation package and the deep neural network library installation package of the corresponding versions from the storage node.
Platform installation package installation step 470: obtain the device type of the target device and the version number of the general parallel computing architecture installed on the target device; if the target device is a graphics processing device, download and install from the storage node the installation package of the graphics processing version of the deep learning platform corresponding to the version number of the general parallel computing architecture; if the target device is not a graphics processing device, obtain the version number of the compiler suite and the version number of the system core library installed on the target device, and then download and install from the storage node the installation package of the non-graphics-processing version of the deep learning platform corresponding to those version numbers.
By executing an nvidia-smi | grep CUDA command, the device type of the target device and the version number of the general parallel computing architecture installed on it can be obtained. If the target device returns relevant information for the nvidia-smi | grep CUDA command, the target device is determined to be a GPU device and the returned information contains the CUDA version number; a directory is then created and entered with a cd command, the tensorflow-gpu installation package corresponding to that CUDA version number is downloaded from the storage node with a wget command, and the tensorflow-gpu installation package is installed. If the target device returns no relevant information for the nvidia-smi | grep CUDA command, the target device is determined to be a CPU device; the version numbers of the gcc and glibc installed on the target device are obtained, the tensorflow-cpu installation package corresponding to those version numbers is downloaded from the storage node with a wget command, and finally the tensorflow-cpu installation package is installed.
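A sketch of step 470 along these lines follows; the storage node address, the package file-name scheme on the storage node, and the directories are hypothetical assumptions, not part of this disclosure:

# Illustrative only: platform installation package installation (step 470).
STORAGE="http://fileserver.example.com:8080"
if nvidia-smi | grep -q CUDA; then
  cuda_ver=$(nvidia-smi | grep CUDA | grep -oE '[0-9]+\.[0-9]+' | head -n1)   # CUDA version reported by the driver
  mkdir -p /opt/tensorflow && cd /opt/tensorflow
  wget "$STORAGE/tensorflow-gpu-cuda${cuda_ver}.tar.gz"                       # GPU build matching the installed CUDA version
else
  gcc_ver=$(gcc -dumpversion)
  glibc_ver=$(ldd --version | head -n1 | grep -oE '[0-9]+\.[0-9]+$')
  mkdir -p /opt/tensorflow && cd /opt/tensorflow
  wget "$STORAGE/tensorflow-cpu-gcc${gcc_ver}-glibc${glibc_ver}.tar.gz"       # CPU build matching gcc/glibc
fi
# installation of the downloaded package (e.g. via the bundled Anaconda installer) would follow here.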
In one embodiment, the determining whether the target device is a graphics processing device includes:
after a graphics processing information acquisition instruction is run, judging whether the target device is a graphics processing device based on the graphics processing information returned by the target device for the instruction, wherein the graphics processing information includes the model of the graphics processor of the target device;
the storage node further stores a driver corresponding to each graphics processor model, and the downloading and installing of the general parallel computing architecture installation package and the deep neural network library installation package of the corresponding versions from the storage node includes:
downloading and installing, from the storage node, the driver corresponding to the model of the graphics processor of the target device;
downloading and installing the general parallel computing architecture installation package and the deep neural network library installation package of the corresponding versions from the storage node;
and writing the environment configuration corresponding to the general parallel computing architecture and the deep neural network library into a preset directory.
In this embodiment, not only are the general parallel computing architecture and the deep neural network library downloaded and installed, but the driver and environment configuration are also completed.
The graphics processor of the target device may be any of a variety of models, for example an NVIDIA GPU model.
The environment configuration can be written to the preset directory in various ways. For example, the permissions of /usr/local/cuda may be modified with a chown command, the environment configuration is then written to /etc/profile, and source /etc/profile is executed so that the saved environment configuration takes effect.
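For illustration, this environment configuration might look like the following sketch; the exact variables written to /etc/profile are an assumption, since this disclosure only says the configuration is written there and then sourced:

# Illustrative only: driver/environment configuration after installing CUDA and cuDNN.
chown -R root:root /usr/local/cuda                              # adjust ownership/permissions of the CUDA directory
cat >> /etc/profile <<'EOF'
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
EOF
source /etc/profile                                             # make the saved environment configuration take effect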
In one embodiment, the preset version information is a version number, the name of each installation package stored by the storage node includes the version number of the installation package, and each corresponding installation package is correspondingly stored in the storage node.
In one embodiment, the preset version information is a version number, the name of each installation package stored by the storage node includes the version number of the installation package, the storage node further stores a correspondence table of the version numbers of the installation packages, and by querying the table, another installation package corresponding to one installation package can be determined.
In one embodiment, test code is also uploaded to the storage node for storage, and after the platform installation package installation step, the following step is further performed when the installation script of the deep learning platform is run by the target device:
the testing steps are as follows: and downloading and running the test code from the storage node to verify whether the deployment of the deep learning platform is correct.
For example, if the test code is test.py and the deep learning platform is TensorFlow managed by Anaconda, verification can be performed by first entering a specific path under the Anaconda installation directory, downloading test.py from the storage node with a wget command, and then running test.py under that path. For example, that the deep learning platform has been built correctly can be verified by running /appcom/anaconda3/bin/python test.py.
The advantage of this embodiment is that, after the deep learning platform and the various installation packages required for its operation have been set up, further testing by running the test code ensures the reliability of the deployed deep learning platform and its environment.
In one embodiment, the test code may be:
import tensorflow as tf
sess=tf.Session(config=tf.ConfigProto(log_device_placement=True))
the disclosure also provides a deployment device of the deep learning platform cluster, and the following is an embodiment of the device disclosed herein.
FIG. 5 is a block diagram illustrating a deployment apparatus for a deep learning platform cluster, according to an example embodiment. As shown in fig. 5, the apparatus 500 includes:
a setup module 510 configured to setup a storage node;
a first uploading module 520 configured to upload an installation package of at least one deep learning platform, a deep learning calculation installation package corresponding to the installation package of the at least one deep learning platform, and a system installation package on which the deep learning platform depends to run on a target system to the storage node for storage, where all the installation packages of the deep learning platforms are installation packages of the same deep learning platform;
a second uploading module 530 configured to upload the installation program of the deep learning platform to the storage node for storage;
and an instruction issuing module 540 configured to issue a download instruction to a target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation program of the deep learning platform as instructed by the download instruction, the target device downloads and installs, from the storage node by running the installation program of the deep learning platform, the installation package of the deep learning platform matched with the target device, the system installation package on which the deep learning platform depends, and/or the deep learning calculation installation package.
According to a third aspect of the present disclosure, there is also provided an electronic device capable of implementing the above method.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: the at least one processing unit 610, the at least one memory unit 620, and a bus 630 that couples the various system components including the memory unit 620 and the processing unit 610.
Wherein the storage unit stores program code that is executable by the processing unit 610 such that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention as described in the section "example methods" above in this specification.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)621 and/or a cache memory unit 622, and may further include a read only memory unit (ROM) 623.
The storage unit 620 may also include a program/utility 624 having a set (at least one) of program modules 625, such program modules 625 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
Electronic device 600 may also communicate with one or more external devices 800 (e.g., keyboard, pointing device, Bluetooth device, etc.), and also with one or more devices that enable a user to interact with electronic device 600, and/or with any device (e.g., router, modem, etc.) that enables electronic device 600 to communicate with one or more other computing devices.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
According to a fourth aspect of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-mentioned method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.
Referring to fig. 7, a program product 700 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A deployment method of a deep learning platform cluster, the method comprising:
setting a storage node;
uploading, to the storage node for storage, an installation package of at least one deep learning platform, a deep learning calculation installation package corresponding to the installation package of the at least one deep learning platform, and a system installation package on which the deep learning platform depends to run on a target system, wherein all of the uploaded deep learning platform installation packages are installation packages of the same deep learning platform;
uploading an installation program of the deep learning platform to the storage node for storage;
and issuing a download instruction to a target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation program of the deep learning platform according to the download instruction, the target device, by running the installation program of the deep learning platform, downloads and installs from the storage node the installation package of the deep learning platform matched with the target device and the system installation package and/or the deep learning calculation installation package on which the deep learning platform depends.
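By way of non-limiting illustration, the following minimal Python sketch shows one way the flow of claim 1 could be orchestrated in practice. The storage-node mount path, package file names, target host list, download URL, and the SSH-based download instruction are all assumptions made for this example and are not part of the claimed method.

```python
# Illustrative sketch only: paths, hosts, URLs, and file names are hypothetical.
import shutil
import subprocess
from pathlib import Path

STORAGE_NODE = Path("/mnt/storage_node/dl_platform")   # assumed NFS-mounted storage node
PACKAGES = [
    "dl_platform_gpu.tar.gz",   # installation package of the deep learning platform
    "cuda_toolkit.run",         # deep learning calculation package: parallel computing architecture
    "cudnn.tar.gz",             # deep learning calculation package: deep neural network library
    "gcc_suite.rpm",            # system installation package: compiler suite
    "glibc_core.rpm",           # system installation package: system core library
    "install_dl_platform.sh",   # installation program (script) of the deep learning platform
]
TARGET_HOSTS = ["gpu-node-01", "cpu-node-02"]           # devices to be deployed


def upload_to_storage_node(local_dir: Path) -> None:
    """Upload the installation packages and the installer to the storage node."""
    STORAGE_NODE.mkdir(parents=True, exist_ok=True)
    for name in PACKAGES:
        shutil.copy2(local_dir / name, STORAGE_NODE / name)


def issue_download_instruction(host: str) -> None:
    """Tell a target device to fetch the installer and run it; the installer then pulls what it needs."""
    cmd = (
        "curl -fsSL http://storage-node/dl_platform/install_dl_platform.sh -o /tmp/install_dl_platform.sh"
        " && bash /tmp/install_dl_platform.sh"
    )
    subprocess.run(["ssh", host, cmd], check=True)


if __name__ == "__main__":
    upload_to_storage_node(Path("./packages"))
    for host in TARGET_HOSTS:
        issue_download_instruction(host)
```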
2. The method of claim 1, wherein the deep learning calculation installation package comprises a general parallel computing architecture installation package and a deep neural network library installation package, the system installation package on which the deep learning platform depends to run on the target system comprises an installation package of a compiler suite and a system core library installation package, the installation package of the at least one deep learning platform comprises an installation package of a graphics processing version of the at least one deep learning platform, and the deep learning calculation installation package corresponds to the installation package of the graphics processing version of the deep learning platform, wherein the uploading, to the storage node for storage, of the installation package of the at least one deep learning platform, the deep learning calculation installation package corresponding to the installation package of the at least one deep learning platform, and the system installation package on which the deep learning platform depends to run on the target system comprises:
uploading, to the storage node for storage, the installation package of the at least one deep learning platform, the general parallel computing architecture installation package and the deep neural network library installation package corresponding to the installation package of the graphics processing version of the deep learning platform, and the installation package of the compiler suite and the system core library installation package matched with the installation package of the deep learning platform;
and the issuing of a download instruction to a target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation program of the deep learning platform according to the download instruction, the target device, by running the installation program of the deep learning platform, downloads and installs from the storage node the installation package of the deep learning platform matched with the target device and the system installation package and/or the deep learning calculation installation package on which the deep learning platform depends, comprises:
issuing a download instruction to the target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation program of the deep learning platform according to the download instruction, the target device, by running the installation program of the deep learning platform, downloads and installs from the storage node either the installation package of the graphics processing version of the deep learning platform matched with the target device, the general parallel computing architecture installation package and the deep neural network library installation package corresponding to that installation package, and the installation package of the compiler suite and the system core library installation package matched with the installation package of the deep learning platform, or the installation package of the deep learning platform matched with the target device, the installation package of the compiler suite matched with the installation package of the deep learning platform, and the system core library installation package.
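Purely as an illustrative aid to claim 2 (not a limitation of it), the short Python sketch below maps the two alternatives onto package lists; every file name is a hypothetical placeholder.

```python
# Illustrative sketch only: all package file names are hypothetical placeholders.
def packages_for_device(is_graphics_device: bool) -> list:
    """Select the package set a target device would fetch from the storage node."""
    common = ["gcc_suite.rpm", "glibc_core.rpm"]        # compiler suite + system core library
    if is_graphics_device:
        return ["dl_platform_gpu.tar.gz",               # graphics processing version of the platform
                "cuda_toolkit.run",                     # general parallel computing architecture
                "cudnn.tar.gz"] + common                # deep neural network library
    return ["dl_platform_cpu.tar.gz"] + common          # non-graphics version of the platform


print(packages_for_device(True))
print(packages_for_device(False))
```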
3. The method of claim 1, wherein the uploading an installation program of the deep learning platform to the storage node for storage comprises:
uploading an installation script of the deep learning platform to the storage node for storage;
and the issuing of a download instruction to a target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation program of the deep learning platform according to the download instruction, the target device, by running the installation program of the deep learning platform, downloads and installs from the storage node the installation package of the deep learning platform matched with the target device and the system installation package and/or the deep learning calculation installation package on which the deep learning platform depends, comprises:
issuing a download instruction to the target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation script of the deep learning platform according to the download instruction, the target device, by running the installation script of the deep learning platform, downloads and installs from the storage node the installation package of the deep learning platform matched with the target device and the system installation package and/or the deep learning calculation installation package on which the deep learning platform depends.
4. The method of claim 3, wherein the deep learning calculation installation package comprises a general parallel computing architecture installation package and a deep neural network library installation package, the system installation package on which the deep learning platform depends to run on the target system comprises an installation package of a compiler suite and a system core library installation package, the installation package of the at least one deep learning platform comprises an installation package of a graphics processing version of the at least one deep learning platform, a correspondence between installation packages is a correspondence between installation package versions, and the deep learning calculation installation package corresponds to the installation package of the graphics processing version of the deep learning platform, and wherein, when the installation script of the deep learning platform is executed by the target device, the following steps are implemented:
an information acquisition step: acquiring system kernel version information of the target device;
a kernel version judgment step: comparing the system kernel version information with preset version information; if they are consistent, turning to the system installation package version judgment step, and if they are inconsistent, turning to the system installation package upgrade step;
a system installation package upgrade step: downloading and installing, from the storage node, an installation package of the compiler suite and a system core library installation package corresponding to the preset version information;
a system installation package version judgment step: acquiring the version of the compiler suite and the version of the system core library currently installed on the target device to determine whether both are the preset versions; if so, turning to the device type judgment step, and if not, turning to the system installation package upgrade step;
a device type judgment step: judging whether the target device is a graphics processing device; if so, turning to the deep learning calculation installation package installation step, and if not, turning to the platform installation package installation step;
a deep learning calculation installation package installation step: downloading and installing, from the storage node, a general parallel computing architecture installation package and a deep neural network library installation package of the corresponding versions;
a platform installation package installation step: acquiring the device type of the target device and the version number of the general parallel computing architecture installed on the target device; if the target device is a graphics processing device, downloading and installing, from the storage node, the installation package of the graphics processing version of the deep learning platform corresponding to the version number of the general parallel computing architecture; and if the target device is not a graphics processing device, acquiring the version number of the compiler suite and the version number of the system core library installed on the target device, and then downloading and installing, from the storage node, an installation package of a non-graphics-processing version of the deep learning platform corresponding to the version number of the compiler suite and the version number of the system core library.
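For readability only, the following Python sketch walks through the script flow recited in claim 4 under stated assumptions: the expected kernel prefix, compiler and core-library versions, the storage-node URL, the package names, and the use of nvidia-smi as the device-type probe are all hypothetical choices, not requirements of the claim. Actual installation of each downloaded package is left out for brevity.

```python
# Illustrative sketch of the claim-4 script flow; version strings, URLs, and
# package names are assumed, not prescribed by the claims.
import platform
import shutil
import subprocess

STORAGE_URL = "http://storage-node/dl_platform"   # hypothetical storage node
EXPECTED_KERNEL_PREFIX = "3.10."                  # assumed preset kernel version
EXPECTED_GCC = "7.3.0"                            # assumed preset compiler suite version
EXPECTED_GLIBC = "2.27"                           # assumed preset system core library version


def fetch(package: str) -> None:
    """Download one package from the storage node into the current directory."""
    subprocess.run(["curl", "-fsSLO", f"{STORAGE_URL}/{package}"], check=True)


def gcc_version() -> str:
    """Return the installed compiler suite version, or '' if gcc is absent."""
    try:
        out = subprocess.run(["gcc", "-dumpversion"], capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (FileNotFoundError, subprocess.CalledProcessError):
        return ""


def main() -> None:
    # Information acquisition step: system kernel version of the target device.
    kernel = platform.release()

    # Kernel version judgment step -> system installation package upgrade step if inconsistent.
    if not kernel.startswith(EXPECTED_KERNEL_PREFIX):
        fetch("gcc_suite.rpm")
        fetch("glibc_core.rpm")

    # System installation package version judgment step.
    if gcc_version() != EXPECTED_GCC or platform.libc_ver()[1] != EXPECTED_GLIBC:
        fetch("gcc_suite.rpm")
        fetch("glibc_core.rpm")

    # Device type judgment step: treat the presence of nvidia-smi as "graphics processing device".
    if shutil.which("nvidia-smi"):
        # Deep learning calculation installation package installation step.
        fetch("cuda_toolkit.run")
        fetch("cudnn.tar.gz")
        # Platform installation package installation step (graphics processing version).
        fetch("dl_platform_gpu.tar.gz")
    else:
        # Platform installation package installation step (non-graphics version).
        fetch("dl_platform_cpu.tar.gz")


if __name__ == "__main__":
    main()
```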
5. The method of claim 4, wherein the judging whether the target device is a graphics processing device comprises:
after running a graphics processing information acquisition instruction, judging whether the target device is a graphics processing device based on graphics processing information returned by the target device in response to the instruction, wherein the graphics processing information comprises the model of a graphics processor of the target device;
and the storage node further stores a driver corresponding to the model of each graphics processor, and the downloading and installing, from the storage node, of a general parallel computing architecture installation package and a deep neural network library installation package of the corresponding versions comprises:
downloading and installing, from the storage node, a driver corresponding to the model of the graphics processor of the target device;
downloading and installing, from the storage node, a general parallel computing architecture installation package and a deep neural network library installation package of the corresponding versions;
and writing a configuration environment corresponding to the general parallel computing architecture and the deep neural network library into a preset directory.
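As a non-limiting illustration of claim 5, the sketch below queries the graphics processor model, looks up a driver, and writes an environment file; the nvidia-smi query, the model-to-driver map, and the /etc/profile.d target path are assumptions for this example only.

```python
# Illustrative sketch only: driver file names, the query command, and target paths are assumptions.
import subprocess
from pathlib import Path

DRIVER_BY_GPU_MODEL = {                       # hypothetical model-to-driver mapping on the storage node
    "Tesla V100": "nvidia-driver-v100.run",
    "Tesla P40": "nvidia-driver-p40.run",
}


def gpu_model() -> str:
    """Return the graphics processor model reported by the device, or '' for non-GPU devices."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.splitlines()[0].strip()
    except (FileNotFoundError, subprocess.CalledProcessError, IndexError):
        return ""


def write_env_config(profile_dir: Path = Path("/etc/profile.d")) -> None:
    """Write the parallel computing architecture / neural network library environment into a preset directory."""
    (profile_dir / "dl_platform.sh").write_text(
        "export PATH=/usr/local/cuda/bin:$PATH\n"
        "export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH\n"
    )


if __name__ == "__main__":
    model = gpu_model()
    driver = DRIVER_BY_GPU_MODEL.get(model)
    print(f"GPU model: {model or 'none'}; driver to fetch: {driver or 'n/a'}")
```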
6. The method of claim 4, wherein test code is further uploaded to the storage node for storage, and wherein, after the platform installation package installation step, when the installation script of the deep learning platform is executed by the target device, the following step is further implemented:
a testing step: downloading the test code from the storage node and running it to verify whether the deep learning platform has been deployed correctly.
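One possible shape of such a verification step is sketched below; the test-code file name, the storage-node URL, and the zero-exit-code convention are illustrative assumptions.

```python
# Illustrative sketch only: the test-code URL and exit-code convention are assumptions.
import subprocess
import sys

STORAGE_URL = "http://storage-node/dl_platform"   # hypothetical storage node


def run_deployment_test() -> bool:
    """Download the test code from the storage node and run it to verify the deployment."""
    subprocess.run(["curl", "-fsSLO", f"{STORAGE_URL}/deploy_test.py"], check=True)
    result = subprocess.run([sys.executable, "deploy_test.py"])
    return result.returncode == 0          # a zero exit code means the platform was deployed correctly


if __name__ == "__main__":
    print("deployment OK" if run_deployment_test() else "deployment FAILED")
```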
7. The method of claim 1, wherein the setting a storage node comprises:
setting a plurality of storage nodes;
the uploading, to the storage node for storage, of the installation package of the at least one deep learning platform, the deep learning calculation installation package corresponding to the installation package of the at least one deep learning platform, and the system installation package on which the deep learning platform depends to run on the target system comprises:
uploading, to each of the plurality of storage nodes for storage, the installation package of the at least one deep learning platform, the deep learning calculation installation package corresponding to the installation package of the at least one deep learning platform, and the system installation package on which the deep learning platform depends to run on the target system;
and the issuing of a download instruction to a target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation program of the deep learning platform according to the download instruction, the target device, by running the installation program of the deep learning platform, downloads and installs from the storage node the installation package of the deep learning platform matched with the target device and the system installation package and/or the deep learning calculation installation package on which the deep learning platform depends, comprises:
issuing a download instruction to the target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation program of the deep learning platform according to the download instruction, the target device, by running the installation program of the deep learning platform, downloads and installs, from the storage node closest to the target device, the installation package of the deep learning platform matched with the target device and the system installation package and/or the deep learning calculation installation package on which the deep learning platform depends.
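To make the "closest storage node" selection of claim 7 concrete, here is a minimal sketch that approximates proximity by TCP connect latency; the node addresses, port, and the latency heuristic itself are assumptions, since the claim does not prescribe how closeness is measured.

```python
# Illustrative sketch only: node addresses and the latency heuristic are assumptions;
# "closest" is approximated here by the lowest TCP connect time.
import socket
import time

STORAGE_NODES = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]   # hypothetical storage node addresses


def connect_latency(host: str, port: int = 80, timeout: float = 1.0) -> float:
    """Measure how long a TCP connection to the node takes; unreachable nodes rank last."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return float("inf")


def closest_storage_node() -> str:
    return min(STORAGE_NODES, key=connect_latency)


if __name__ == "__main__":
    print("downloading from:", closest_storage_node())
```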
8. An apparatus for deploying a deep learning platform cluster, the apparatus comprising:
a setting module configured to set a storage node;
a first uploading module configured to upload, to the storage node for storage, an installation package of at least one deep learning platform, a deep learning calculation installation package corresponding to the installation package of the at least one deep learning platform, and a system installation package on which the deep learning platform depends to run on a target system, wherein all of the uploaded deep learning platform installation packages are installation packages of the same deep learning platform;
a second uploading module configured to upload an installation program of the deep learning platform to the storage node for storage;
and an instruction issuing module configured to issue a download instruction to a target device on which the deep learning platform is to be deployed, so that after the target device downloads the installation program of the deep learning platform according to the download instruction, the target device, by running the installation program, downloads and installs from the storage node the installation package of the deep learning platform matched with the target device and the system installation package and/or the deep learning calculation installation package on which the deep learning platform depends.
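A compact, purely illustrative Python sketch of how the four modules of claim 8 might be wired together; the class, method, and callable names are hypothetical and only mirror the module responsibilities.

```python
# Illustrative sketch only: names and wiring are hypothetical, mirroring claim 8's modules.
class DeploymentApparatus:
    def __init__(self, uploader, instruction_sender):
        self.uploader = uploader                # callable(storage_node, item): performs the upload
        self.sender = instruction_sender        # callable(device, instruction): delivers the instruction
        self.storage_node = None

    def setting_module(self, node_address):
        self.storage_node = node_address        # set the storage node

    def first_uploading_module(self, platform_pkgs, calc_pkgs, system_pkgs):
        for pkg in (*platform_pkgs, *calc_pkgs, *system_pkgs):
            self.uploader(self.storage_node, pkg)           # upload installation packages for storage

    def second_uploading_module(self, installer):
        self.uploader(self.storage_node, installer)         # upload the installation program

    def instruction_issuing_module(self, target_device):
        self.sender(target_device, "download installer from storage node and run it")


if __name__ == "__main__":
    apparatus = DeploymentApparatus(
        uploader=lambda node, item: print(f"upload {item} -> {node}"),
        instruction_sender=lambda dev, msg: print(f"instruct {dev}: {msg}"),
    )
    apparatus.setting_module("storage-node-01")
    apparatus.first_uploading_module(["dl_platform_gpu.tar.gz"], ["cuda_toolkit.run"], ["gcc_suite.rpm"])
    apparatus.second_uploading_module("install_dl_platform.sh")
    apparatus.instruction_issuing_module("gpu-node-01")
```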
9. A computer-readable program medium, characterized in that it stores computer program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 7.
10. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of any of claims 1 to 7.
CN202010136850.XA 2020-03-02 2020-03-02 Deep learning platform cluster deployment method and device, medium and electronic equipment Active CN111459506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010136850.XA CN111459506B (en) 2020-03-02 2020-03-02 Deep learning platform cluster deployment method and device, medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111459506A true CN111459506A (en) 2020-07-28
CN111459506B CN111459506B (en) 2023-10-13

Family

ID=71682449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010136850.XA Active CN111459506B (en) 2020-03-02 2020-03-02 Deep learning platform cluster deployment method and device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111459506B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729019A (en) * 2017-10-09 2018-02-23 平安普惠企业管理有限公司 Method, apparatus, equipment and the computer-readable storage medium of version deployment
CN108762768A (en) * 2018-05-17 2018-11-06 烽火通信科技股份有限公司 Network Intelligent Service dispositions method and system
CN108845816A (en) * 2018-06-22 2018-11-20 平安科技(深圳)有限公司 Application program update method, system, computer equipment and storage medium
CN109543829A (en) * 2018-10-15 2019-03-29 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Method and system for hybrid deployment of deep learning neural network on terminal and cloud
CN109582315A (en) * 2018-10-26 2019-04-05 北京百度网讯科技有限公司 Service privatization method, apparatus, computer equipment and storage medium
CN109460827A (en) * 2018-11-01 2019-03-12 郑州云海信息技术有限公司 A kind of deep learning environment is built and optimization method and system
CN109491669A (en) * 2018-12-29 2019-03-19 北京奇安信科技有限公司 Deployment installation method, equipment, system and the medium of data
CN110377314A (en) * 2019-07-19 2019-10-25 苏州浪潮智能科技有限公司 A kind of method for upgrading system of distributed memory system, device, equipment and medium
CN110750282A (en) * 2019-10-14 2020-02-04 支付宝(杭州)信息技术有限公司 Method and device for running application program and GPU node

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
熊伟 (Xiong Wei), "Deep learning deployment and computation optimization technology for mobile devices," 电子制作, no. 12 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084391A (en) * 2020-09-08 2020-12-15 中国平安人寿保险股份有限公司 Method, device, equipment and computer medium for acquiring dependency package information
CN112084391B (en) * 2020-09-08 2024-02-09 中国平安人寿保险股份有限公司 Method, device, equipment and computer medium for acquiring dependent package information
WO2022110861A1 (en) * 2020-11-27 2022-06-02 苏州浪潮智能科技有限公司 Method and apparatus for data set caching in network training, device, and storage medium
CN118151961A (en) * 2024-05-07 2024-06-07 龙芯中科(合肥)技术有限公司 Deployment method and device of running environment, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111459506B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
KR102414096B1 (en) Create and deploy packages for machine learning on end devices
US11210070B2 (en) System and a method for automating application development and deployment
US10936309B2 (en) Development project blueprint and package generation
US7886292B2 (en) Methodology of individualized software deployment for hardware-independent personal computer mass development
CN111459506B (en) Deep learning platform cluster deployment method and device, medium and electronic equipment
CN110096424B (en) Test processing method and device, electronic equipment and storage medium
CN106569794A (en) Application developing device
CN108958992A (en) test method and device
US11061739B2 (en) Dynamic infrastructure management and processing
CN110531962A (en) Development process method, equipment and the computer readable storage medium of small routine
CN108614767A (en) A kind of remote debugging method and device
CN109783355A (en) Page elements acquisition methods, system, computer equipment and readable storage medium storing program for executing
CN108351790A (en) Non-monotonic final convergence for expectation state configuration
US11243868B2 (en) Application containerization based on trace information
CN107193565B (en) Method for developing native APP (application) across mobile terminals
CN112631915B (en) Method, system, device and medium for PCIE device software simulation
CN115268964A (en) Data reinjection method and system, electronic device and readable storage medium
Chowhan Hands-on Serverless Computing: Build, Run and Orchestrate Serverless Applications Using AWS Lambda, Microsoft Azure Functions, and Google Cloud Functions
CN108334360A (en) Method, apparatus, storage medium and the computer equipment of application program dynamic load
US20180165136A1 (en) A system, method, computer program and data signal for hosting and executing a program on a mainframe
CN116248526A (en) Method and device for deploying container platform and electronic equipment
US10176062B2 (en) Cloud servers and methods for handling dysfunctional cloud services
CN114741294A (en) Page debugging method, device, equipment and storage medium
CN113961232A (en) Terminal, method and platform server for providing integrated development environment
CN112083939A (en) Batch upgrading method, device, system and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant