CN109426574A - Distributed computing system, data transmission method and device in distributed computing system - Google Patents

Distributed computing system, data transmission method and device in distributed computing system

Info

Publication number
CN109426574A
Authority
CN
China
Prior art keywords
flow diagram
data flow
parameter
mpi
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710769632.8A
Other languages
Chinese (zh)
Other versions
CN109426574B (en)
Inventor
林健
夏命榛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201710769632.8A priority Critical patent/CN109426574B/en
Priority to CN202210298513.XA priority patent/CN114880133A/en
Priority to PCT/CN2018/102919 priority patent/WO2019042312A1/en
Priority to EP18850809.7A priority patent/EP3667496B1/en
Publication of CN109426574A publication Critical patent/CN109426574A/en
Priority to US16/805,007 priority patent/US11010681B2/en
Application granted granted Critical
Publication of CN109426574B publication Critical patent/CN109426574B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 Computing arrangements based on specific mathematical models
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2455 Query execution
    • G06F 16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/10 Flow control; Congestion control
    • H04L 47/12 Avoiding congestion; Recovering from congestion
    • H04L 47/125 Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/10 Flow control; Congestion control
    • H04L 47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L 47/2441 Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching

Abstract

This application discloses a distributed computing system. A first compute node and a second compute node in the system each store the name, size, and communication-peer identifier of a first dataflow graph parameter of a dataflow graph, and the first dataflow graph parameter itself is stored on the first compute node. Using the same interface-parameter generation algorithm stored on each node, together with the foregoing information about the first dataflow graph parameter, the first compute node and the second compute node each generate a triple. The triples serve as the interface parameters of the Message Passing Interface (MPI) primitives used to transmit the first dataflow graph parameter between the two compute nodes. In this way, the first and second compute nodes can transmit the dataflow graph parameter using MPI primitives without negotiating with each other, which improves the efficiency of data communication between compute nodes and thereby the efficiency with which the distributed computing system computes the dataflow graph.

Description

Distributed computing system, data transmission method and device in distributed computing system
Technical field
The present invention relates to the computer field, and in particular to a distributed computing system and to a data transmission method and apparatus in a distributed computing system.
Background
With the development of the big-data industry and artificial-intelligence technology, various computing platforms have emerged, such as machine learning platforms, graph computing platforms, and stream computing platforms. These computing platforms are usually deployed in distributed computing systems to perform big-data computation. For example, a machine learning platform usually takes a dataflow graph as its computing object: the dataflow graph is split into multiple subgraphs or copies, and the multiple subgraphs or copies are deployed on multiple compute nodes of the distributed computing system, so that the multiple compute nodes can compute the dataflow graph cooperatively and improve computational efficiency. A compute node among these compute nodes may include multiple computing devices, such as a CPU (Central Processing Unit) and accelerator hardware installed in the host of the compute node, for example a GPU (Graphics Processing Unit).
While a distributed computing system computes a dataflow graph, the dataflow graph parameters must be communicated between nodes, and this communication directly affects the efficiency with which the machine learning platform computes the dataflow graph. In one existing scheme, the Message Passing Interface (MPI) library technology commonly applied in high-performance computing is introduced into the distributed computing system as an external plug-in to support data communication in the system. However, before the MPI library performs data communication, the two communicating parties must exchange information to learn about each other and negotiate the communication parameters used by the MPI primitives. The dynamic and random communication timing of a computing platform makes it difficult for the two parties of a data communication to identify their peers in time and carry out the negotiation, which increases the burden of data communication on the computing platform and thus degrades the efficiency of data transmission.
Summary of the invention
The embodiments of the present invention provide a data transmission method and apparatus in a distributed computing system, and a distributed computing system, which can simplify the process of applying MPI technology to the computation of dataflow graphs: no negotiation with the communication peer is needed before data transmission, so that MPI technology can better adapt to a computing platform deployed in a distributed manner. This improves the efficiency of data transmission in the distributed computing system and thereby the efficiency with which the distributed computing system computes dataflow graphs.
To achieve the foregoing objectives, the embodiments of the present invention adopt the following technical solutions.
According to a first aspect, this application provides a distributed computing system. The system includes a first compute node and a second compute node. A first graph data structure in the first compute node stores the name, size, and communication-peer identifier of a first dataflow graph parameter in a first dataflow graph, where the first dataflow graph parameter is a parameter carried by a connection edge of the first dataflow graph. A second graph data structure in the second compute node stores the name, size, and communication-peer identifier of the first dataflow graph parameter in a second dataflow graph. The communication-peer identifier of the first dataflow graph parameter in the first dataflow graph corresponds to the second compute node, and the communication-peer identifier of the first dataflow graph parameter in the second dataflow graph corresponds to the first compute node. The first compute node is configured to generate a first triple from the name, size, and communication-peer identifier of the first dataflow graph parameter in the first graph data structure by using a first interface-parameter generation algorithm. The first triple includes a message tag, a message size, and a destination process sequence number, where the message tag corresponds to the name of the first dataflow graph parameter, the message size corresponds to the size of the first dataflow graph parameter, and the destination process sequence number corresponds to the process on the second compute node that receives the first dataflow graph parameter. The second compute node is configured to generate a second triple from the name, size, and communication-peer identifier of the first dataflow graph parameter in the second graph data structure by using a second interface-parameter generation algorithm that is identical to the first interface-parameter generation algorithm. The second triple includes the message tag, the message size, and a source process sequence number, where the source process sequence number corresponds to the process on the first compute node that sends the first dataflow graph parameter. The first compute node is configured to invoke a Message Passing Interface (MPI) send primitive with the first triple as its interface parameters to send the first dataflow graph parameter to the second compute node. The second compute node is configured to invoke an MPI receive primitive according to the second triple to process the first dataflow graph parameter.
In this way, the MPI send primitive whose interface parameters are the first triple corresponds to the MPI receive primitive whose interface parameters are the second triple. Because the first graph data structure and the second graph data structure include the communication-peer identifier, the problem that the communication peer is unknowable while the dataflow graph is running is solved. Moreover, the two parties that need to transmit the first dataflow graph parameter generate their triples using the information stored in their own compute nodes' graph data structures and the same interface-parameter generation algorithm, so neither party needs to send its own information to the peer or negotiate the algorithm that generates the triples. The method can run independently on the data sender and the data receiver and generate matching triples without any interaction between the two parties, which simplifies communication with MPI primitives and improves the efficiency of data transmission in the distributed computing platform.
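By way of illustration only, the following minimal sketch (assuming mpi4py and NumPy are available; the graph-entry dictionaries, the peer-to-rank table, and the make_triple helper are hypothetical stand-ins for the graph data structures and the interface-parameter generation algorithm described above) shows how the sender and the receiver can derive matching MPI interface parameters independently, so that the send primitive and the receive primitive pair up without any prior handshake:

```python
from mpi4py import MPI
import hashlib
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Hypothetical graph-data-structure entries for the same edge, as stored locally
# on each compute node: parameter name, size, and that node's communication peer.
my_entry = (
    {"name": "layer1/weights_grad", "size": 4096, "peer": "node-B"}  # on the first node
    if rank == 0 else
    {"name": "layer1/weights_grad", "size": 4096, "peer": "node-A"}  # on the second node
)
peer_to_rank = {"node-A": 0, "node-B": 1}   # assumed peer-identifier -> process mapping

def make_triple(entry):
    # The same deterministic algorithm on both nodes yields matching triples.
    tag = int(hashlib.md5(entry["name"].encode()).hexdigest(), 16) % 32768
    return tag, entry["size"], peer_to_rank[entry["peer"]]

tag, size, peer = make_triple(my_entry)

if rank == 0:                                   # first compute node: sender
    param = np.zeros(size, dtype=np.uint8)      # the first dataflow graph parameter
    comm.Send([param, MPI.BYTE], dest=peer, tag=tag)
else:                                           # second compute node: receiver
    buf = np.empty(size, dtype=np.uint8)
    comm.Recv([buf, MPI.BYTE], source=peer, tag=tag)
```

Neither side exchanges any metadata beforehand; the pairing follows entirely from the locally stored information and the shared algorithm.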
It should be understood that invoking the MPI receive primitive according to the second triple to process the first dataflow graph parameter serves the purpose of enabling a process on the second compute node to compute the dataflow graph using the first dataflow graph parameter. The "processing" performed by the MPI receive primitive may correspond to different operations in different scenarios, which is not limited in this application. For example, it may be one or more of the following: invoking the MPI receive primitive to receive the first dataflow graph parameter into a data buffer in host memory; invoking the MPI receive primitive to modify the tag of the first dataflow graph parameter and providing the first dataflow graph parameter in host memory to the process that computes the dataflow graph; or storing the first dataflow graph parameter from the data buffer to its destination address.
The name of the first dataflow graph parameter may be a field in the first graph data structure that identifies the first dataflow graph parameter, or it may be information scattered across the first graph data structure. The size of the first dataflow graph parameter indicates the storage space occupied by the first dataflow graph parameter, that is, the amount of data of the dataflow graph parameter.
The communication-peer identifier of the first dataflow graph parameter in the first dataflow graph may be the identifier of the second compute node; or the identifier of the storage device where the destination address of the first dataflow graph parameter is located, the storage device being on the second compute node; or the identifier of the destination process corresponding to the first dataflow graph parameter, the destination process being on the second compute node; or other information indicating the receiving end of the first dataflow graph parameter.
Similarly, the communication-peer identifier of the first dataflow graph parameter in the second dataflow graph may be the identifier of the first compute node; or the identifier of the storage device where the source address of the first dataflow graph parameter is located, the storage device being on the first compute node; or the identifier of the process on the first compute node that sends the first dataflow graph parameter; or other information indicating the sending end of the first dataflow graph parameter.
That the first graph data structure in the first compute node stores the name, size, and communication-peer identifier of the first dataflow graph parameter in the first dataflow graph may mean that the first graph data structure contains fields carrying these three kinds of information, or that it stores information from which the name, size, or communication-peer identifier of the first dataflow graph parameter can be obtained. In other words, "stores" may mean that the information can be read directly from the first graph data structure, or that it can be obtained by analyzing information in the first graph data structure.
The second dataflow graph is stored on the second compute node. The second dataflow graph may be a copy of the first dataflow graph, or the first and second dataflow graphs may be two subgraphs of the same dataflow graph.
The message tag indicates the data sent by the MPI send primitive. The message size indicates the size of the information sent by the MPI send primitive. The source process sequence number is the sequence number of the process on the first compute node that executes the MPI send primitive, and the destination process sequence number is the sequence number of the process on the second compute node that executes the MPI receive primitive corresponding to the MPI send primitive. The term "triple" is used only to denote the three parameters it contains and does not restrict their order. The formats of the three parameters in the triple conform to the format requirements of the interface function parameters carried by the MPI send primitive. In addition, the interface parameters of the MPI send primitive include but are not limited to the first triple, and the interface parameters of the MPI receive primitive include but are not limited to the second triple.
In one implementation, in respect of invoking the Message Passing Interface (MPI) send primitive with the first triple as its interface parameters to send the first dataflow graph parameter to the second compute node, the first compute node is configured to read the first dataflow graph parameter from the host memory of the first compute node by using the MPI send primitive with the first triple as its interface parameters, so as to send the first dataflow graph parameter to the second compute node.
In this way, the MPI send primitive reads the first dataflow graph parameter directly from host memory, which improves the efficiency of reading data.
In one implementation, the first compute node further stores information about the storage device where the first dataflow graph parameter is located. The first compute node is further configured to, when the information about the storage device indicates another storage device, copy the first dataflow graph parameter from the other storage device to the host memory of the first compute node, the other storage device being a memory in the first compute node other than host memory.
The information about the storage device may be an identifier of the storage device, a number indicating the storage device from which the storage type of the device can be determined, information identifying the type of the storage device, or other information in another form that serves the same purpose.
In this way, before using the MPI send primitive, the first compute node first places the first dataflow graph parameter in the host memory of the first compute node, and the MPI send primitive only reads the parameter from host memory, without contending with the computing platform for access to other storage devices, which improves the execution efficiency of the MPI send primitive.
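A brief sketch of this staging step, under the same illustrative assumptions (the device field and the copy_to_host callback are hypothetical placeholders for whatever device-to-host copy the platform provides):

```python
def stage_to_host(entry, param, copy_to_host):
    """Place the dataflow graph parameter in host memory before the MPI send primitive runs.

    entry["device"] records where the parameter currently lives (an assumption for
    illustration); copy_to_host is the platform's device-to-host copy, e.g. a GPU memcpy.
    """
    if entry.get("device", "host") != "host":
        return copy_to_host(param)   # e.g. copy from GPU memory into a host buffer
    return param                     # already in host memory; the send primitive reads it directly
```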
In one implementation, the first interface-parameter generation algorithm includes a first algorithm, a second algorithm, and a third algorithm. In respect of generating the first triple from the name, size, and communication-peer identifier of the first dataflow graph parameter in the first graph data structure by using the first interface-parameter generation algorithm, the first compute node is configured to: determine the message tag in the first triple from the name of the first dataflow graph parameter in the first graph data structure and the first algorithm; determine the message size in the first triple from the size of the first dataflow graph parameter in the first graph data structure and the second algorithm; and determine the destination process sequence number in the first triple from the communication-peer identifier of the first dataflow graph parameter in the first graph data structure and the third algorithm. Correspondingly, in respect of generating the second triple from the name, size, and communication-peer identifier of the first dataflow graph parameter in the second graph data structure by using the second interface-parameter generation algorithm, the second compute node is configured to: determine the message tag in the second triple from the name of the first dataflow graph parameter in the second graph data structure and the first algorithm in the second interface-parameter generation algorithm; determine the message size in the second triple from the size of the first dataflow graph parameter in the second graph data structure and the second algorithm in the second interface-parameter generation algorithm; and determine the source process sequence number in the second triple from the communication-peer identifier of the first dataflow graph parameter in the second graph data structure and the third algorithm in the second interface-parameter generation algorithm.
That the first interface-parameter generation algorithm is identical to the second interface-parameter generation algorithm, as described above, means that the first interface-parameter generation algorithm includes the first algorithm, the second algorithm, and the third algorithm, and that the second interface-parameter generation algorithm includes a first algorithm, a second algorithm, and a third algorithm that are identical or corresponding to those of the first interface-parameter generation algorithm.
The first algorithm may be an algorithm that converts a value of arbitrary binary length into a value of fixed binary length, such as a hash algorithm, or any other algorithm that converts the name of the first dataflow graph parameter into the format of the message tag in the interface parameters of an MPI primitive. For the second algorithm, in one implementation the value of the message-size field may be set equal to the value of the size of the dataflow graph parameter, i.e., size; in another implementation, it may be set equal to the value of the size of the dataflow graph parameter plus one. The third algorithm is a mapping between process sequence numbers and communication-peer identifiers: the first compute node holds the mapping between destination process sequence numbers and communication-peer identifiers, and the second compute node holds the mapping between source process sequence numbers and communication-peer identifiers. The third algorithm may be a functional relation or a mapping table maintained on the compute node, which is not limited in this application.
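For example, one possible, purely hypothetical concrete choice of the three algorithms is sketched below: a hash of the parameter name truncated into the MPI tag range as the first algorithm, the parameter size passed through unchanged as the second algorithm, and a lookup table from communication-peer identifiers to process sequence numbers as the third algorithm.

```python
import hashlib

TAG_SPACE = 32768   # the MPI standard guarantees that tags 0..32767 are valid

def first_algorithm(name: str) -> int:
    """Map an arbitrary-length parameter name to a fixed-size message tag."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % TAG_SPACE

def second_algorithm(size: int) -> int:
    """Message size; this variant reuses the parameter size as-is (another variant adds one)."""
    return size

# Third algorithm: a mapping maintained on each compute node between
# communication-peer identifiers and process sequence numbers (an assumed table).
PEER_TO_PROCESS = {"node-A": 0, "node-B": 1}

def third_algorithm(peer_id: str) -> int:
    return PEER_TO_PROCESS[peer_id]

def interface_parameters(name: str, size: int, peer_id: str):
    """Apply the three algorithms to the locally stored name, size, and peer identifier."""
    return first_algorithm(name), second_algorithm(size), third_algorithm(peer_id)
```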
In one implementation, in respect of invoking the MPI receive primitive according to the second triple to process the first dataflow graph parameter, the second compute node is configured to detect, by using an MPI detection primitive, a data buffer in the host memory of the second compute node to obtain the second triple of the first dataflow graph parameter, the data buffer being dedicated to storing data handled by MPI primitives, and to invoke the MPI receive primitive to process the first dataflow graph parameter, the interface parameters of the MPI receive primitive including the second triple.
In this way, data can be processed by the receive primitive more promptly, and other pending send primitives on the first compute node can execute sooner, which improves data transmission efficiency. Furthermore, by providing the dedicated data buffer and a polling thread, the send primitive can transmit the data and return immediately after the data has been sent even when the receive primitive has not yet been invoked and the final destination address of the message is unknown. The buffer temporarily holds the data for the subsequent receive primitive, so the send primitive need not be synchronized with the receive primitive, removing the inherent timing constraint between the two. The sender does not need to wait synchronously, which saves execution time and helps improve performance.
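A rough sketch of this receiver-side mechanism under the same mpi4py assumptions (the dedicated data buffer is modeled as a plain dictionary, and the patent's detection of that buffer is approximated here by MPI's message probe): a polling loop drains incoming parameters into the buffer as soon as they arrive, and each parameter is copied to its destination address later.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
data_buffer = {}   # dedicated buffer: (source, tag) -> received bytes (hypothetical layout)

def polling_loop(stop_flag):
    """Receive incoming dataflow graph parameters before their destination is known.

    stop_flag is assumed to be a threading.Event that is set on shutdown.
    """
    status = MPI.Status()
    while not stop_flag.is_set():
        # Probe for any incoming message (standing in for the MPI detection primitive).
        if comm.Iprobe(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status):
            src, tag = status.Get_source(), status.Get_tag()
            size = status.Get_count(MPI.BYTE)
            buf = bytearray(size)
            comm.Recv([buf, MPI.BYTE], source=src, tag=tag)   # the MPI receive primitive
            data_buffer[(src, tag)] = buf   # parked in the dedicated buffer for now

def deliver(src, tag, destination):
    """Later, store a parked parameter from the data buffer to its destination address."""
    destination[:] = data_buffer.pop((src, tag))
```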
In one implementation, the receive primitive for the first dataflow graph parameter carries the destination address of the first dataflow graph parameter. In respect of processing the first dataflow graph parameter by the receive primitive for the first dataflow graph parameter, the second compute node is configured to invoke the MPI receive primitive with the second triple as the interface parameters of the MPI receive primitive, and to store the first dataflow graph parameter from the data buffer to the destination address, for example a user memory space in host memory.
According to a second aspect, an embodiment of the present invention provides a data transmission method in a distributed computing system. The distributed computing system includes a first compute node and a second compute node. The method includes: determining, from a first graph data structure in the first compute node, the name, size, and communication-peer identifier of a first dataflow graph parameter of a first dataflow graph, where the first dataflow graph parameter is a parameter carried by a connection edge of the first dataflow graph and the communication-peer identifier corresponds to the second compute node; generating a first triple from the name, size, and communication-peer identifier of the first dataflow graph parameter in the first graph data structure by using a first interface-parameter generation algorithm, where the first triple includes a message tag, a message size, and a destination process sequence number, the message tag corresponds to the name of the first dataflow graph parameter, the message size corresponds to the size of the first dataflow graph parameter, and the destination process sequence number corresponds to the process on the second compute node that receives the first dataflow graph parameter; and invoking a Message Passing Interface (MPI) send primitive with the first triple as its interface parameters to send the first dataflow graph parameter to the second compute node, so that the second compute node invokes an MPI receive primitive with a second triple corresponding to the first triple as its interface parameters to process the first dataflow graph parameter, the second triple being generated from a second graph data structure in the second compute node by using a second interface-parameter generation algorithm that is identical to the first interface-parameter generation algorithm.
In one implementation, in respect of invoking the MPI send primitive with the first triple as its interface parameters to send the first dataflow graph parameter to the second compute node, the method includes: reading the first dataflow graph parameter from the host memory of the first compute node by using the MPI send primitive with the first triple as its interface parameters, so as to send the first dataflow graph parameter to the second compute node.
In one implementation, the first compute node further stores information about the storage device where the first dataflow graph parameter is located, and the method further includes: when the information about the storage device indicates another storage device, copying the first dataflow graph parameter from the other storage device to the host memory of the first compute node, the other storage device being a memory in the first compute node other than host memory.
According to a third aspect, this application provides a data transmission apparatus in a distributed computing system. The distributed computing system includes a first compute node and a second compute node, and the data transmission apparatus is located on the first compute node. The data transmission apparatus includes: a determining module, configured to determine, from a first graph data structure in the first compute node, the name, size, and communication-peer identifier of a first dataflow graph parameter of a first dataflow graph, where the first dataflow graph parameter is a parameter carried by a connection edge of the first dataflow graph and the communication-peer identifier corresponds to the second compute node; a generation module, configured to generate a first triple from the name, size, and communication-peer identifier of the first dataflow graph parameter in the first graph data structure by using a first interface-parameter generation algorithm, where the first triple includes a message tag, a message size, and a destination process sequence number, the message tag corresponds to the name of the first dataflow graph parameter, the message size corresponds to the size of the first dataflow graph parameter, and the destination process sequence number corresponds to the process on the second compute node that receives the first dataflow graph parameter; and a communication module, configured to invoke a Message Passing Interface (MPI) send primitive with the first triple as its interface parameters to send the first dataflow graph parameter to the second compute node, so that the second compute node invokes an MPI receive primitive with a second triple corresponding to the first triple as its interface parameters to process the first dataflow graph parameter, the second triple being generated from a second graph data structure in the second compute node by using a second interface-parameter generation algorithm that is identical to the first interface-parameter generation algorithm.
According to a fourth aspect, this application provides a physical machine. The physical machine includes at least one processor and a non-transitory computer-readable medium storing executable code, for running the first compute node in a distributed computing system, the distributed computing system including the first compute node and a second compute node. When the executable code is executed by a processor of the at least one processor, the processor is configured to perform any of the methods performed by the first compute node in the foregoing system.
It can be seen that the fourth aspect and the third aspect are apparatuses corresponding to the method of the second aspect. The method of the second aspect is performed by the first compute node, which in some cases is the first compute node in the system of the first aspect. The explanations of the steps, terms, implementations, and beneficial effects in the second, third, and fourth aspects are the same as the discussion of the first compute node in the system of the first aspect; refer to the related content in the first aspect, and details are not repeated here.
According to a fifth aspect, this application provides a data transmission method in a distributed computing system. The distributed computing system includes a first compute node and a second compute node. The method includes: determining, from a second graph data structure in the second compute node, the name, size, and communication-peer identifier of a first dataflow graph parameter in a second dataflow graph, where the communication-peer identifier of the first dataflow graph parameter in the second dataflow graph corresponds to the first compute node; generating a second triple from the name, size, and communication-peer identifier of the first dataflow graph parameter in the second graph data structure by using a second interface-parameter generation algorithm, where the second triple includes a message tag, a message size, and a source process sequence number, the message tag corresponds to the name of the first dataflow graph parameter, the message size corresponds to the size of the first dataflow graph parameter, and the source process sequence number corresponds to the process on the first compute node that sends the first dataflow graph parameter; and invoking, according to the second triple, a Message Passing Interface (MPI) receive primitive to process the first dataflow graph parameter from the first compute node, where the first dataflow graph parameter is sent by the first compute node by using an MPI send primitive whose interface parameters include a first triple corresponding to the second triple, the first triple being generated by the first compute node from a first graph data structure in the first compute node by using a first interface-parameter generation algorithm, and the second interface-parameter generation algorithm being identical to the first interface-parameter generation algorithm.
In one implementation, the second compute node invokes the Message Passing Interface (MPI) receive primitive with the second triple as its interface parameters to receive the first dataflow graph parameter, so that the second compute node computes the dataflow graph using the first dataflow graph parameter.
In one implementation, the second compute node runs a first thread and a second thread, and the host memory of the second compute node includes a data buffer that is dedicated to storing data handled by MPI primitives. In respect of invoking the MPI receive primitive with the second triple as its interface parameters to process the first dataflow graph parameter from the first compute node, the method includes: the first thread detects, by using an MPI detection primitive, the data buffer in the host memory to obtain the second triple; the first thread invokes a first MPI receive primitive according to the second triple in the data buffer to process the first dataflow graph parameter, where the second triple in the data buffer is obtained by the second compute node according to the MPI send primitive; and after determining that the first dataflow graph parameter has been processed by the first MPI receive primitive, the second thread changes a second MPI receive primitive into an MPI wait primitive, where the second MPI receive primitive is the receive primitive corresponding to the first dataflow graph parameter that has not yet been executed by the second thread, the interface parameters of the second MPI receive primitive include the second triple generated by the second compute node, and the MPI wait primitive waits for the first MPI receive primitive to finish executing.
The second triple may be obtained from the interface parameters of the received MPI send primitive, or may be obtained by analyzing the interface parameters and the data transmitted by the MPI send primitive, which is not limited in this application.
That is, the second compute node may start a dedicated thread (which may be called a polling thread) to execute the MPI detection primitive so as to detect the buffers in the host memory of the second compute node, these buffers including the data buffer described above; in this way, data that has not yet been processed by an MPI receive primitive can be found.
In one implementation, in respect of the first thread invoking the first MPI receive primitive according to the second triple in the data buffer to process the first dataflow graph parameter, the method includes: when the destination address of the first dataflow graph parameter corresponds to memory space allocated for user use in the host memory of the second compute node, the first thread invokes the first MPI receive primitive with the second triple in the data buffer as the interface parameters of the first MPI receive primitive, and stores the first dataflow graph parameter from the data buffer to the destination address of the first dataflow graph parameter.
In one implementation, when the destination address of the first dataflow graph parameter corresponds to another storage device, the second compute node stores the first dataflow graph parameter from the host memory to the destination address, the other storage device being a memory in the second compute node other than host memory.
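The interplay between the first (polling) thread and the second (worker) thread could be approximated as in the following sketch (assumptions only: the event table stands in for the replacement of the second MPI receive primitive by an MPI wait primitive, and the data buffer is the dictionary from the earlier sketch):

```python
import threading

data_buffer = {}      # dedicated buffer filled by the polling thread, as sketched above
completion = {}       # (source, tag) -> threading.Event, one per expected parameter
table_lock = threading.Lock()

def mark_received(src, tag):
    """Called by the first (polling) thread once its MPI receive primitive has finished."""
    with table_lock:
        event = completion.setdefault((src, tag), threading.Event())
    event.set()

def obtain_parameter(src, tag, destination):
    """Called by the second (worker) thread in place of its own pending receive primitive.

    If the polling thread has already handled the message, this reduces to a wait
    (standing in for the MPI wait primitive); the parameter is then taken from the buffer.
    """
    with table_lock:
        event = completion.setdefault((src, tag), threading.Event())
    event.wait()                                   # wait for the first receive primitive to finish
    destination[:] = data_buffer.pop((src, tag))   # store to the destination address
```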
According to a sixth aspect, this application further provides a data transmission apparatus in a distributed computing system. The distributed computing system includes a first compute node and a second compute node, and the data transmission apparatus is located on the second compute node. The data transmission apparatus includes: a determining module, configured to determine, from a second graph data structure in the second compute node, the name, size, and communication-peer identifier of a first dataflow graph parameter in a second dataflow graph, where the communication-peer identifier of the first dataflow graph parameter in the second dataflow graph corresponds to the first compute node; a generation module, configured to generate a second triple from the name, size, and communication-peer identifier of the first dataflow graph parameter in the second graph data structure by using a second interface-parameter generation algorithm, where the second triple includes a message tag, a message size, and a source process sequence number, the message tag corresponds to the name of the first dataflow graph parameter, the message size corresponds to the size of the first dataflow graph parameter, and the source process sequence number corresponds to the process on the first compute node that sends the first dataflow graph parameter; and a communication module, configured to invoke, according to the second triple, a Message Passing Interface (MPI) receive primitive to process the first dataflow graph parameter from the first compute node, where the first dataflow graph parameter is sent by the first compute node by using an MPI send primitive whose interface parameters include a first triple corresponding to the second triple, the first triple being generated by the first compute node from a first graph data structure in the first compute node by using a first interface-parameter generation algorithm, and the second interface-parameter generation algorithm being identical to the first interface-parameter generation algorithm.
According to a seventh aspect, this application further provides a physical machine. The physical machine includes at least one processor and a non-transitory computer-readable medium storing executable code, for running the second compute node in a distributed computing system, the distributed computing system including a first compute node and the second compute node. When the executable code is executed by a processor of the at least one processor, the processor is configured to perform any of the foregoing methods performed by the second compute node.
It can be seen that the sixth aspect and the seventh aspect are apparatuses corresponding to the method of the fifth aspect. The method of the fifth aspect is performed by the second compute node, which in some cases is the second compute node in the system of the first aspect. The explanations of the steps, terms, implementations, and beneficial effects in the fifth, sixth, and seventh aspects are the same as the discussion of the second compute node in the system of the first aspect; refer to the related content in the first aspect, and details are not repeated here.
According to an eighth aspect, this application provides a non-transitory computer-readable medium storing an executable program. The executable program is used to perform any of the methods performed by the first compute node or the second compute node in the foregoing system. The explanations of the steps, terms, implementations, and beneficial effects involved in the eighth aspect are the same as in the foregoing discussion; refer to the related content above, and details are not repeated here.
Detailed description of the invention
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly describes the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention.
Fig. 1 is a schematic diagram of a dataflow graph according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a computing platform according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the network architecture of a distributed computing platform according to an embodiment of the present invention;
Fig. 4 is a sequence diagram of a method for inter-process communication using MPI technology according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a data transmission method according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of partitioning a dataflow graph according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the architecture of the TensorFlow machine learning platform according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of a data transmission apparatus according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of a physical machine for performing the methods of this application according to an embodiment of the present invention.
Specific embodiments
Character "/" herein, typicallys represent the relationship that forward-backward correlation object is a kind of "or".For example, A/B is understood that For A perhaps B then and/or can be understood as and or.
Term " first " and " second " in description and claims of this specification etc. are not intended to description object Particular order, but for distinguishing different objects, in the case where having certain illustrated, " first " and " second " can describe identical Object.For example, the first process and the second process are different process in the case where without certain illustrated.
In the description of the present invention, unless otherwise indicated, the meaning of " plurality " is refer to two or more.For example, more A equipment refers to two or more equipment.
In addition, the term " includes " being previously mentioned in description of the invention and " having " and their any deformation, it is intended that It is to cover and non-exclusive includes.Such as the process, method, system, product or equipment for containing a series of steps or units do not have It is defined in listed step or unit, but optionally further comprising the step of other are not listed or unit, or optionally It further include the other step or units intrinsic for these process, methods, product or equipment.
Be described below this application involves some nouns.
Message passing: in a computer system, a general term for data communication methods between processes or between software components. The data to be communicated is abstracted and encapsulated as a "message", and the two or more parties participating in the communication transfer the message between processes or components by invoking primitives such as message send and message receive, thereby completing the data communication.
Primitive: a segment of code consisting of several instructions and used to complete a function or a process. The execution of a primitive should be continuous.
A compute node may be a physical machine, or a host, a virtual machine, or a container running on a physical machine. It should be understood that virtual machines and containers are deployed on physical machines. That is, the first compute node and the second compute node described below may be the same physical machine or different physical machines; for example, they may be virtual machines or containers deployed on the same physical machine. Likewise, the sending-end physical machine and the receiving-end physical machine described in the embodiments below may be the same physical machine or different physical machines.
It should be noted that in this specification, a compute node and a node in a dataflow graph are terms of different natures with different meanings.
Device: hardware in a physical machine, for example hardware that supports the running of a virtual machine, a container, a process, or a thread on the physical machine, such as a computing device or a storage device. A computing device is hardware in a physical machine used for computation; it may be a CPU, a GPU, an FPGA (Field-Programmable Gate Array), a MIC (Many Integrated Core), or another hardware device with computing capability. A storage device is hardware in a physical machine that can store data or code, such as the memory used by the above computing devices, for example host memory (also called CPU memory), GPU memory, or FPGA memory, or external storage such as a hard disk or an optical disc.
Host: the part of a computer hardware system that holds the mainboard and other main components (the mainframe). For example, it may include a CPU, memory, a hard disk, a power supply, and other input/output interfaces, such as any of a USB controller, a graphics card, a network interface card, or a sound card.
Dataflow graph: a graphical data structure that expresses the flow of data and the computational relationships in computation logic, reflecting the design principles and the implementation process of the computation logic. This application is described using a machine learning platform that commonly computes dataflow graphs as an example. It should be noted that a dataflow graph may be preloaded onto the platform before computation; the preloading process includes defining the nodes and edges of the dataflow graph and the parameters on the edges.
On a machine learning platform, the computation logic of an algorithm is usually expressed formally by a dataflow graph. To compute a dataflow graph with such a platform, the dataflow graph must first be described in code; this process may be called the definition of the dataflow graph. After the dataflow graph is defined, this part of the code is compiled; when the dataflow graph is computed, the compiled code is read and executed, rather than executed in the order in which the dataflow graph was defined.
A dataflow graph is a directed acyclic graph (Directed Acyclic Graph) consisting of several nodes and the connections between nodes (called "edges"); that is, an edge points from one node to another node.
Nodes and edges can be interpreted at two levels: dataflow-graph definition time and run time. At definition time, a node represents an operator or a variable used in the computation process. An operator is a symbol expressing an operation rule, such as addition (+), subtraction (-), multiplication (×), division (÷), integration (∫), differentiation, exponentiation, logarithm (log or ln), or other functional forms. In fact, a variable can also be regarded as a special operator with zero inputs and one output. An edge represents the operational relationship between operators and/or variables. At run time, a node represents the storage of data, each node corresponding to a storage location; for example, a node may be mapped to a physical or virtual address in a hard disk, memory, or a CPU register. The data may be a variable, the value assigned to that variable, or an operation result, where the operation result may take a mathematical expression form such as a variable or a constant. An edge represents the transfer of data, that is, the data of one node is transferred to the node that the edge points to.
Communication peer: of the two parties in a communication, when describing the communication process of one party, that party is the local end and the other party is the communication peer. For example, the communication peer of the end that sends data (the sender of the data) is the end that receives the data (the receiver of the data), and the communication peer of the end that receives the data is the end that sends the data. The two parties of a communication can be described at multiple granularities: physical machine, virtual machine, container, process, and thread. For example, if the end sending the data has a process or thread execute a send primitive to send the data, and the end receiving the data has a process or thread execute a receive primitive to receive the data, then the process or thread executing the receive primitive may be called the communication peer of the sender of the data, or the communication peer of the send primitive; likewise, the process or thread executing the send primitive may be called the communication peer of the receiver of the data, or the communication peer of the receive primitive. Such expressions used in the following description are not explained again.
Peer compute node: for a communication process in which data is transferred from one compute node to another, the compute node that sends the data by using a send primitive is the source compute node, and the compute node that receives the data by using a receive primitive is the destination compute node. For the source compute node of the data, the peer compute node is the destination compute node of the data; for the destination compute node of the data, the peer compute node is the source compute node of the data.
Dataflow graph parameter: in a dataflow graph, a parameter is data carried on an edge of the graph, to be processed by a compute node or fed back by a compute node. That is, a dataflow graph parameter is data to be transferred from one node (the source node of the edge) to the other node that the edge points to (the destination node of the edge). Clearly, for a dataflow graph, the transfer of a dataflow graph parameter is part of computing the dataflow graph. When the storage locations indicated by the nodes of a dataflow graph are in the same device (for example, the same CPU memory or the same GPU memory), the transfer of the dataflow graph parameter may be an in-process memory copy; when the storage locations indicated by the nodes of a dataflow graph span devices (for example, the CPU memory and GPU memory of the same host, or devices on different hosts), the transfer of the dataflow graph parameter may be an inter-process communication, and if the storage locations indicated by the source node and the destination node span hosts, network-based communication is required.
The source address of a dataflow graph parameter is the storage location of the dataflow graph parameter on the source compute node; the source address may be recorded in the source node of the edge carrying the dataflow graph parameter.
The destination address of a dataflow graph parameter is the storage location of the dataflow graph parameter on the destination compute node; the destination address may be recorded in the destination node of the edge carrying the dataflow graph parameter.
Node address: for a node in a dataflow graph, the physical or virtual address indicated by the node. Node addresses may be used in the communication of dataflow graph parameters.
Size: for example, the size of a dataflow graph parameter or the message size in this specification. It indicates the storage space occupied by a piece of data or a message, that is, the amount of data the data or message contains, generally measured in bytes (Byte), for example 2 KB or 0.5 MB.
With reference to Fig. 1, the following describes how a data flow graph, its nodes, its edges, and its data flow graph parameters express computational logic. Fig. 1 shows a data flow graph that expresses the computational logic "add two numbers and multiply the sum by a third number to obtain a result", which can be written as E = (A + B) × D. The data flow graph has five nodes A, B, C, D, E and four edges a, b, c, d. When the data flow graph is defined, nodes A, B, D each represent a variable, and nodes C and E represent the addition and multiplication operations respectively. Edges a and b indicate that the two addends of the addition in node C come from nodes A and B; edges c and d indicate that the two factors of the multiplication in node E come from nodes C and D.
When the data flow graph runs, nodes A, B, D indicate the storage locations of the input variables, node C indicates the storage location of the addition result, and node E indicates the storage location of the multiplication result. The storage locations represented by the nodes may be mapped to addresses used for storing data in physical devices such as a hard disk, a memory, or a CPU register. Edges a and b represent the process of transferring the data in the storage locations mapped by nodes A and B to the storage location mapped by node C, and edges c and d represent the process of transferring the data in the storage locations mapped by nodes C and D to the storage location mapped by node E. The data transfer represented by these edges may map to intra-process memory copies, for example when the storage locations indicated by the nodes connected by these edges are on the same host. The data transfer represented by these edges may also map to inter-process, network-based data communication, for example when the nodes connected by these edges are distributed across a distributed system. For example, as shown in the figure, 1 is input at A, 3 at B, and 5 at D; then the value transmitted on edge a is 1, the value on edge b is 3, and the value on edge d is 5; the value obtained at C is 4, the value transmitted on edge c is 4, and the value obtained at E is 20. The computation expressed by this data flow graph is (1 + 3) × 5 = 20.
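As a reading aid only, the following is a minimal sketch, not taken from this application, of the graph in Fig. 1 represented as named storage locations and edge transfers; the node names and the evaluation order are illustrative assumptions.
```cpp
// Toy representation of Fig. 1, E = (A + B) x D (illustrative only).
#include <iostream>
#include <map>
#include <string>

int main() {
    // Variable nodes A, B, D map to storage locations holding the inputs.
    std::map<std::string, int> storage = {{"A", 1}, {"B", 3}, {"D", 5}};

    // Edges a, b carry the data of A and B into the addition node C.
    storage["C"] = storage["A"] + storage["B"];   // C = 4

    // Edges c, d carry the data of C and D into the multiplication node E.
    storage["E"] = storage["C"] * storage["D"];   // E = 20

    std::cout << "E = " << storage["E"] << std::endl;  // prints E = 20
    return 0;
}
```
In a distributed deployment, each of the assignments above may instead be a cross-device or cross-host transfer of the corresponding data flow graph parameter.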
It should be understood that the computing platform to which this application relates may be deployed on one or more compute nodes, and a compute node may be a physical machine or a virtual machine; that is, the computing platform of this application is equally applicable in the virtual machine field. A physical machine may be a server or a terminal with computing capability, such as a personal computer or a laptop, which is not limited in this application. This application is described using the computation of a data flow graph as an example; each compute node may run one or more processes to compute subgraphs or copies of the data flow graph. The software and hardware runtime environment of the computing platform is described below by way of example. It should be understood that a compute node and a node in a data flow graph are different concepts in the art. A user loads the data flow graph to be computed onto the computing platform and computes the data flow graph using the computing platform.
The application scenario of this application is distributed computing platform software running on one or more compute nodes. For example, the computing platform may be a machine learning platform, such as a deep learning platform that performs machine learning on multi-layer neural networks. That is, the computing task in this application runs across devices; "across devices" means that the program code of the data flow graph is distributed across multiple computing devices on one or more servers, where a computing device may be a CPU, a GPU, an FPGA (Field-Programmable Gate Array), a MIC (Many Integrated Core), or another hardware device with computing capability. Such platform software includes, but is not limited to, TensorFlow, MXNet, and CNTK.
It should be understood that, in the above scenario of running the computing platform software across devices, the data flow graph to be computed may be partitioned into multiple subgraphs (each subgraph being a part of the data flow graph), or multiple copies of the data flow graph may be distributed across multiple devices.
On the computing platform, the multiple copies or subgraphs of the data flow graph are computed by multiple processes; for example, one process computes one subgraph or one copy, or one process computes the subgraphs or copies on multiple devices within one physical machine, or the subgraphs or copies on multiple devices within one physical machine are computed by two or more processes. Fig. 2 describes, from the perspective of processes, the software and hardware resources of the computing platform corresponding to the processes. Two processes used on the computing platform for computing the data flow graph are taken as an example; in the example of Fig. 2, whether the two processes are on the same compute node is not limited, and one process may compute the subgraphs or copies on multiple devices of one physical machine.
Process 1 shown in Fig. 2, that is, 1011, uses host memory 1012 and GPU memory 1016 while computing the data flow graph. Process 2, that is, 1021, uses host memory 1022 and GPU memory 1026 while computing the data flow graph. Host memories 1012 and 1022 may be host memory of the same physical machine or of different physical machines. GPU memories 1016 and 1026 may be GPU memories in different physical machines, or different GPU memories on the same physical machine. It should be understood that, when host memories 1012 and 1022 are host memory of the same physical machine, the memory address spaces allocated in the host memory to process 1 and process 2 are different address ranges. Platform runtime code 1013 and 1023 is loaded into and runs in host memories 1012 and 1022; the platform runtime code is the code of the computing platform's own system, used to run the software environment of the computing platform, and cannot be edited by users. The kernel function code in Fig. 2 is loaded into and runs in the host memory and GPU memory corresponding to each process. Kernel function code implements the various kernel functions that express local computational logic; it can be understood as a kernel function library containing various kernel functions. A kernel function expresses a relatively complex operation rule and can be invoked by a node in the data flow graph; for example, a kernel function may be a matrix operation such as a dot product or cross product, or a convolution computation, operations that require a relatively complex sequence of instructions to implement. In one implementation, the same kernel functions are deployed in the memories of devices of the same category (devices of the same model; GPUs of different models are not devices of the same category, and a GPU and a CPU are not devices of the same category), while the kernel functions deployed on devices of different categories may differ. The sets of kernel functions deployed on different categories of devices may intersect; for example, kernel functions A, B, C are deployed in one GPU memory and kernel functions B, C, D are deployed in the memory of a CPU; that is, one kernel function may be deployed on multiple devices. How kernel functions are deployed is determined by the computing platform, for example written into the computing platform's library, and is not described further in this specification. When a user uses the resources of the computing platform for computation, the platform can dynamically schedule the kernel functions on different devices according to the load of the devices and the distribution of the data flow graph.
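A minimal sketch of one way such a kernel function library might be organized is given below; the class and function names are assumptions for illustration and are not platform APIs. It shows one operation having distinct kernels per device category and a lookup used at run time.
```cpp
// Illustrative kernel registry keyed by (operation name, device category).
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

enum class DeviceType { CPU, GPU };
using Kernel = std::function<void(const void* in, void* out)>;

class KernelRegistry {
 public:
  void Register(const std::string& op, DeviceType dev, Kernel k) {
    table_[{op, dev}] = std::move(k);
  }
  // Dynamic dispatch: pick the kernel deployed for the device that holds
  // the storage location of the node being executed.
  const Kernel& Lookup(const std::string& op, DeviceType dev) const {
    auto it = table_.find({op, dev});
    if (it == table_.end()) throw std::runtime_error("no kernel for op/device");
    return it->second;
  }
 private:
  std::map<std::pair<std::string, DeviceType>, Kernel> table_;
};
```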
The copies or subgraphs of the data flow graph are loaded into and stored in host memory and GPU memory respectively. In Fig. 2, the circles in the data flow graph represent nodes, and the short arrowed lines between circles represent connecting edges; the circle at the start of a short line (the circle connected to the end without the arrowhead) represents the source node of the connecting edge, and the circle to which the arrowhead points is the destination node of the connecting edge. Source nodes and destination nodes may point to physical or virtual addresses in either host memory or GPU memory.
Specifically, in Fig. 2, kernel function code 1014 runs in host memory 1012, and data flow graph 1015 is stored in host memory 1012; kernel function code 1024 runs in host memory 1022, and data flow graph 1025 is stored in host memory 1022; kernel function code 1017 runs in GPU memory 1016, and data flow graph 1018 is stored in GPU memory 1016; kernel function code 1027 runs in GPU memory 1026, and data flow graph 1028 is stored in GPU memory 1026. Data flow graphs 1015, 1018, 1028 and 1025 are copies of the same data flow graph. For example, the address recorded in the source node of connecting edge 1019 points to an address in the host memory 1012 used by process 1 (that is, 1011), and the address recorded in its destination node points to an address in the GPU memory 1016 used by process 1 (1011); the computation along this edge from source node to destination node therefore requires intra-process, cross-device data flow graph parameter communication 1010. As another example, the address recorded in the source node of connecting edge 1029 points to an address in the GPU memory 1016 used by process 1 (that is, 1011), and the address recorded in its destination node points to an address in the GPU memory 1026 used by process 2 (that is, 1021); the computation along this edge from source node to destination node therefore requires inter-process, cross-device data flow graph parameter communication 1020. If process 1 and process 2 are located on different hosts, the inter-process cross-device data flow graph parameter communication 1020 is communication across physical machines.
With reference to Fig. 3, a software runtime environment and hardware architecture to which this application relates are further described. In Fig. 3, a machine learning platform 3011 is taken as an example; the machine learning platform runs on server 1 (3001), server 2 (3002), server 3 (3003), and server 4 (3004), and these four servers communicate with one another through a network switch 3031. Taking server 1 as an example, the software and hardware contained in a server are described in detail. In terms of hardware, servers 1 to 4 are equipped with a CPU (for example 3021 in server 1), host memory (for example 3022 in server 1), and a network interface card (for example 3024 in server 1). A server may include a GPU card (for example 3023 in server 1), and the GPU card encapsulates GPU memory (for example 3025 in server 1). In terms of software, servers 1 to 4 are deployed with machine learning platform software (for example 3011 in server 1), which includes modules such as a programming interface (for example 3012 in server 1), a runtime engine (for example 3013 in server 1), memory management (for example 3014 in server 1), and communication management (for example 3015 in server 1). The parameters on the data flow graph (for example 3016 in server 1) managed by the memory management module (for example 3014 in server 1) are stored in host memory (for example 3022 in server 1), and some parameters may also be stored in GPU memory (for example 3025 in server 1). The memory management module 3014 reads the data flow graph 3016 stored in host memory 3022 or GPU memory 3025 for computation; if data flow graph parameters need to be communicated with other servers during the computation, the reception and transmission of data can be implemented through the communication management module (for example 3015 in server 1) and the network interface card (for example 3024 in server 1). It should be understood that the machine learning platform 3011 runs as processes on server 1, and the programming interface, runtime engine, memory management, and other modules can be regarded as several sections of code with different functions.
It should be understood that the machine learning platform is analogous to an operating system running on computer hardware; like other operating systems, it can be divided into an application layer and a core layer. The application layer is used by users for editing or inputting data, and interfaces are provided between the application layer and the core layer, so that instructions edited or functions called by users through these interfaces are executed by the core layer.
It should be understood that host memory is the CPU memory available to a process that computes the data flow graph; therefore, even if the process is located in a virtual machine or container, the memory allocated to it, such as virtual CPU memory, may also be referred to as host memory in this application.
The scenario to which this application relates requires that multiple subgraphs or multiple copies of one data flow graph can be executed cooperatively across devices, including cooperative execution among multiple processes on the same server or on multiple servers, or between the CPU and accelerator hardware governed by the same process on the same server. However, cross-device computation of a data flow graph necessarily involves the exchange of data flow graph parameters, and the parameter exchange methods currently in use cannot satisfy a machine learning platform's requirements for computational efficiency.
On the other hand, in High Performance Computing (HPC), a Message Passing (MP) mechanism represented by the Message Passing Interface (MPI) technology is used; MPI includes a protocol and a semantic description, and this mechanism transmits parameters efficiently. High performance computing refers to using ultra-large electronic computers to carry out large amounts of complex computation, for example genetic testing in biology, missile trajectory computation, reactor simulation in the nuclear industry, aircraft trajectory computation in the aerospace field, or stellar orbit computation in astronomical observation. The hardware performance used for such computation is far higher than that of computers in general civilian or commercial scenarios, for example supercomputers. A supercomputer is an ultra-large electronic computer with very strong capabilities for computing and processing data; its main characteristics are high speed and large capacity, and it is equipped with numerous external and peripheral devices and a rich, high-performance software system. Most existing supercomputers can reach computing speeds above one trillion (Trillion) operations per second. To make the difference between the supercomputers used in high performance computing and the servers used in other fields easy to understand, some manufacturers stipulate, for example, that a computer whose average computing speed is ten million operations per second or more and whose storage capacity is above ten million bits qualifies as a supercomputer; examples include the ILLIAC-IV in the United States, NEC machines in Japan, Eugene in Europe, and the "Yinhe (Galaxy)" computer in China. Further examples are supercomputers suited to modern cluster architectures, such as BlueGene in the United States, the "K computer" in Japan, Piz Daint in Europe, and "Sunway" and "Tianhe" in China. If supercomputers are used as compute nodes in a distributed scenario, a high-performance communication network (hereinafter referred to as a high-performance network) is used between the compute nodes, such as InfiniBand, which uses dedicated network devices (routers, network interface cards, cables, switches) whose cost is several times that of comparable enterprise devices (for example, a router for InfiniBand costs about 4 times the price of a common router). Common TCP/IP protocols or Ethernet transport protocols can run over such dedicated network devices, and high-performance communication networks such as InfiniBand can also be used between common commercial or civilian servers, but the payload ratio is low and the performance of such dedicated network devices (for example bandwidth and throughput) cannot be fully utilized. Therefore, in the high-performance communication field, dedicated communication protocols are usually used between such dedicated network devices and supercomputers for data transmission. For example, in the high-performance communication field data is transmitted using MPI technology, which on a device manifests as an MPI library installed on the device.
It should be noted that an MPI library refers to an MPI development library, which includes multiple MPI interface functions (also called MPI instructions or MPI primitives) and the MPI communication protocol. From the perspective of software architecture, MPI interface functions can be understood as the interface between the application layer and the core layer; by analogy with a software architecture that uses TCP/IP technology, MPI interface functions correspond to socket interface functions (that is, sockets).
In one implementation, an interface function in the MPI function library is used for one data transmission. In that transmission, the source process is the process in the sending-side physical machine that sends the data using a send primitive, and the destination process is the process in the receiving-side physical machine that receives the data using a receive primitive. Send primitives and receive primitives in the MPI function library are used in pairs. For two communicating processes, the source process sends data to a destination process using a send primitive, and the destination process processes the data using a receive primitive; the interface parameters of this pair of primitives carry the same message size (size) and the same message tag (tag). In addition, the interface parameters of the send primitive carry an identifier of the destination process, which is used to indicate the destination process; for example, it may be the sequence number of the destination process, or the destination process may be identified by information in another form, which is not limited in this application. The interface parameters of the receive primitive carry an identifier of the source process, which is used to identify the source process; for example, it may be the sequence number of the source process, or the source process may be identified by information in another form, which is not limited in this application.
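The pairing described above can be illustrated with a minimal sketch using the standard MPI C API; the tag, element count, and rank values are assumptions chosen for illustration only.
```cpp
// Paired MPI send/receive: both sides agree on tag and message size, and
// each side names the peer's process sequence number (rank).
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int kTag = 42;        // same message tag on both sides
    const int kCount = 1024;    // same message size on both sides
    std::vector<float> buf(kCount, 0.0f);

    if (rank == 0) {            // source process: send primitive names rank 1
        MPI_Send(buf.data(), kCount, MPI_FLOAT, /*dest=*/1, kTag, MPI_COMM_WORLD);
    } else if (rank == 1) {     // destination process: receive primitive names rank 0
        MPI_Recv(buf.data(), kCount, MPI_FLOAT, /*source=*/0, kTag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```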
It should be understood that an MPI library can assign a unique process sequence number (the rank referred to below) to each process that can communicate. A host may maintain a mapping between the process sequence numbers of the MPI library and the physical machines where the processes are located, or a mapping between process sequence numbers and information about the devices corresponding to the processes, or a mapping among process sequence numbers, the physical machines where the processes are located, and the devices corresponding to the processes. In this way, it can be learned whether the communicating processes are on the same physical machine, or even whether they use the same device, so that an MPI send primitive can choose to send through the network interface card, through shared memory, or through the local loopback network device (that is, a kernel virtual device), among other options. Which way an MPI primitive is sent is related to whether the source process and the destination process are on the same physical host or in the same data center (Data Center, DC), and to the network communication technology used between the communicating processes (such as a high-performance network or Ethernet); this is not described further in this application.
An MPI library can be used together with RDMA (Remote Direct Memory Access) technology; in that case, even if the CPU of the host where the destination process is located has not executed a receive primitive, the source process can, by executing a send primitive, write the data and the information related to the data (the triple mentioned below) into the MPI buffer of the host where the destination process is located. Of course, an MPI library can also be used with the TCP/IP protocol; in that case, the host where the destination process is located may receive the data from the source process through an I/O (Input/Output) device and write the information related to the data (the triple mentioned below) into the MPI buffer of that host, so that the data is written into the MPI buffer of the host while the destination process executes the receive primitive. Which communication protocol is used together with MPI technology belongs to the different application scenarios covered by the technical solution of this application, and this application imposes no restriction. In the scenarios in which different communication protocols are used for data transmission, how an MPI primitive sends or receives data during data transfer belongs to the internal operating mechanism of the MPI library; please refer to the related technical documentation of the corresponding version of the MPI library, which is not described further in this application.
In MPI technology, the host memory of a host that uses the MPI library contains a dedicated address space for storing the data handled by MPI primitives, referred to as the MPI buffer. The MPI buffer generally has a fixed size, such as 64 KB or 1 MB. It should be understood that the data to be transmitted by a send primitive may be smaller or larger than the buffer size; when the data to be transmitted is larger than the buffer size, the source process may split the data to be transmitted while executing the send primitive. When the MPI buffer corresponding to the destination process is fully occupied, a send primitive cannot continue writing data into the MPI buffer of the destination process. In that case, the destination process needs to execute a receive primitive to write the received data to the destination address of the data. For example, the destination address may be located in a GPU chip, in the memory space used by a user (memory allocated in host memory to a user for storing that user's data), or in another storage device. The destination process executes a receive primitive to receive the data; the receiving process may include the receive primitive detecting that data sent by the source process has been written into the MPI buffer of the host where the destination process is located, and saving the data sent by the source process from the MPI buffer to the destination address of the data (for example, host memory or GPU memory). In this way, the destination process can use the data.
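The splitting behaviour mentioned above can be sketched as follows; the chunk size and the helper function are assumptions for illustration, not part of any MPI library.
```cpp
// Illustrative sketch: when the payload exceeds the fixed MPI buffer size,
// the source process splits it into buffer-sized chunks and sends them one
// after another; the destination posts matching receives and reassembles.
#include <mpi.h>
#include <algorithm>
#include <cstddef>

void SendInChunks(const char* data, std::size_t total_bytes, int dest, int tag) {
    const std::size_t kChunk = 64 * 1024;  // e.g. a 64 KB MPI buffer
    std::size_t offset = 0;
    while (offset < total_bytes) {
        std::size_t n = std::min(kChunk, total_bytes - offset);
        MPI_Send(data + offset, static_cast<int>(n), MPI_BYTE, dest, tag,
                 MPI_COMM_WORLD);
        offset += n;
    }
}
```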
Table 1 lists common MPI interface functions as examples.
Table 1
For example, the wait primitive in Table 1 is used to wait for a primitive to finish executing. Table 1 only illustrates several primitives; for example, the send primitive may also include other forms of MPI interface functions that implement the function of sending information, and the receive primitive may also include other forms of MPI interface functions that implement the function of receiving information.
In order to improve the data transmission performance of machine learning platforms in distributed network environments (the data being mainly data flow graph parameters), the industry has begun to try applying the MPI communication technology of the high performance computing field to computing platforms, for example the TensorFlow-Allreduce project developed by Baidu mentioned below. However, because the hardware conditions of computing platforms differ from those of the high performance computing field, existing ways of using MPI technology on computing platforms impair data transmission performance, which in turn places considerable limits on the computational efficiency of computing platforms.
With reference to Fig. 4, the solution proposed by the TensorFlow-Allreduce project developed by Baidu is briefly introduced; the project distributes the machine learning platform across a distributed system. Its technical solution is to use the message passing primitives of the MPI library in the machine learning platform to construct collective communication in which all processes participate jointly, so as to complete, during the collective communication, the reduction computation and parameter distribution of the data flow graph parameters in data flow graph computation, thereby improving the performance of data flow graph parameter communication in high-performance network environments such as InfiniBand. A collective communication step is a communication step in which multiple processes participate simultaneously and which includes a reduction computation; a reduction computation requires these processes to exchange their respective parameters. For example, taking a maximum, taking a minimum, or averaging are all reduction computations. Taking averaging as an example, multiple processes may each read a part of the values to be averaged, and the processes participating in the computation need to send their respective data to the process that executes the averaging algorithm; that is, parameter communication between processes is needed. Reduction computation is clearly a common operation in machine learning, so collective communication can be used to transmit data flow graph parameters.
Fig. 4 gives the runtime sequence diagram of this technical solution. In this solution, all processes participating in data flow graph computation (2011, 2021, 2031) each run one subgraph or one copy. Each process cyclically and alternately executes two kinds of routines. A routine is a set of functions that performs a certain function; for example, a system provides external interfaces or services in the form of routines: the APIs and services of an operating system are routines, and the standard functions and library functions provided by Delphi or C++ Builder are also routines. Routine A performs local parameter generation computation for the data flow graph (2012, 2015, 2022, 2025, 2032, 2035 in the figure are all routine A); routine B performs global parameter reduction computation for the data flow graph (2013, 2016, 2023, 2026, 2033, 2036 in the figure are all routine B). The technical implementation of routine B is parameter communication based on the MPI_Send and MPI_Recv primitives (2014, 2017, 2024, 2027, 2034, 2037). The send-receive chain of these parameter communications forms a ring, so that the reduction result of every process eventually reaches all other processes, completing one global collection and distribution of data flow graph parameters. This collective communication process constitutes one global barrier synchronization (2041, 2042): all processes must ensure that they are in the same iteration round before the global parameter reduction computation can be performed; that is, only after all processes have completed the same round of computation does the next iteration round begin. 2041 and 2042 denote two iteration rounds. With reference to Fig. 4, global barrier synchronization can be understood as follows: a process that reaches routine B first has to wait for the other processes to also reach routine B before routine B can be completed. For example, a flag or instruction can be set in the code so that, upon executing up to that flag or instruction, each process checks whether the other processes have also executed up to the same flag or instruction, and only when all processes have executed the same flag or instruction does execution continue with the next instruction.
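As a rough illustration of the ring-style send-receive chain in routine B, the sketch below circulates one partial sum per step instead of splitting tensors into chunks as a production Allreduce would; it is a simplified stand-in, not the project's actual code.
```cpp
// Simplified ring reduction: after size-1 steps every process holds the
// global sum of the gradients, i.e. the reduced data flow graph parameter.
#include <mpi.h>
#include <cstddef>
#include <vector>

void RingAllreduceSum(std::vector<float>& grad) {
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const int left  = (rank - 1 + size) % size;
    const int right = (rank + 1) % size;

    std::vector<float> outgoing = grad;           // what we forward this step
    std::vector<float> incoming(grad.size());
    for (int step = 0; step < size - 1; ++step) {
        // Each process sends to its right neighbour and receives from its
        // left neighbour in the same call, forming the ring.
        MPI_Sendrecv(outgoing.data(), static_cast<int>(outgoing.size()), MPI_FLOAT, right, 0,
                     incoming.data(), static_cast<int>(incoming.size()), MPI_FLOAT, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (std::size_t i = 0; i < grad.size(); ++i) grad[i] += incoming[i];
        outgoing = incoming;                      // pass the neighbour's data on
    }
    // All processes leave this function together, which acts as the global
    // barrier between iteration rounds described above.
}
```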
It should be noted that, in the solution shown in Fig. 4, before parameter communication is performed using MPI primitives, the source process and destination process participating in the communication need to interact first to obtain information about the communication peer; otherwise, data cannot be transmitted through MPI primitives.
However, the communication timing of computing platforms represented by TensorFlow is dynamic and random: information such as the source/target and size of the messages to be handled by an intra-platform communication primitive can only be learned at runtime, and the communication technology used by existing machine learning platforms does not require the source process and destination process to use paired primitives and interface parameters, whereas the message passing mechanism represented by the MPI library requires interface parameters such as the message source or target and the message size to be specified for a communication primitive at programming and development time, and the interface parameters of the paired primitives on the sending and receiving sides are tightly coupled. The TensorFlow-Allreduce solution does not resolve this contradiction: it adds a new group of programming interfaces, writes its solution using the new interfaces, and develops several instructions for collective communication that encapsulate MPI interfaces, changing TensorFlow's programming habits to adapt them to the requirements of the MPI library. As a result, users must learn the development library interfaces provided by this solution and rewrite or adapt their application code in order to obtain the performance advantage brought by message passing communication. Therefore, the ease of use and generality of this solution are insufficient. More importantly, the dynamic and random nature of the communication timing of computing platforms makes it difficult for the two sides of a data communication to identify the peer in time and negotiate; this negotiation process also increases the burden of data communication on the computing platform, thereby affecting data transmission efficiency.
On the other hand, the message passing mechanism represented by the MPI library requires communication operations to be ordered and synchronized, allowing a moderate amount of unmatched messages and asynchronous operations to be interleaved with the assistance of the library's internal buffering mechanism. However, machine learning and deep learning platforms represented by TensorFlow do not synchronize their computation processes, so their communication timing is also out of order and asynchronous, a large number of random communication operations are executed interleaved, and the processes are not required to be in the same iteration round when computing. The TensorFlow-Allreduce solution does not resolve this contradiction: it chooses to synchronize between computation iteration rounds in TensorFlow using a global barrier, avoiding the interleaving of communication from adjacent rounds so as to satisfy the constraints of the MPI library. This causes processes with faster computation rates to wait frequently, wasting computing resources. The time overhead of this synchronous waiting may also reduce or offset the performance advantage of message-passing communication, so that the overall computation rate of the entire machine learning platform depends on the slowest process, which affects the computation rate of the machine learning platform.
Baidu's TensorFlow-Allreduce solution does not make improvements inside the original TensorFlow system; rather, it is a development library packaged outside the original TensorFlow system. This development library is an external function library; it is relatively independent and accesses the system through TensorFlow's external expansion interfaces. The transmission mode it provides using collective communication (namely Allreduce) is another group of communication interfaces parallel to the interfaces provided by the original TensorFlow system. As an external function library, it does not modify the core-layer code of the TensorFlow platform. The TensorFlow-Allreduce solution is a separate set of code that can, from outside the TensorFlow platform, call the Application Programming Interfaces (API) provided by the TensorFlow platform. It should be understood that a machine learning platform can also be divided into an application layer and a core layer, where the application layer is used to receive the models input by users, the data to be trained or learned, and to run the algorithms or code written by users, while modules such as the runtime engine 3013, memory management 3014 and communication management 3015 described above with Fig. 3 can be considered to belong to the core layer. This development library cannot distinguish whether the physical location pointed to by the source address or destination address of a data flow graph parameter to be communicated is host memory or GPU memory, because the TensorFlow system shields this information from the development library. This requires the MPI library it uses to be capable of perceiving the physical location of the source or destination address of a data flow graph parameter by itself, so that the external mechanism can read and write the data correctly.
The TensorFlow-Allreduce solution uses a CUDA-aware (Compute Unified Device Architecture aware) MPI library. If such an MPI library is used with GPUs that support the CUDA programming interface, such as NVIDIA GPUs, then the send primitive of the MPI library can determine in which kind of memory the source address of the information to be handled by the send primitive is located, and the receive primitive of the MPI library can determine in which kind of memory the destination address of the information to be handled by the receive primitive is located. For example, if the data to be sent is located in GPU memory, the send primitive of such an MPI library copies the data located in GPU memory to the corresponding host memory and then sends it. In fact, not all MPI libraries are CUDA-aware, which limits the choice of MPI library when applying MPI technology to machine learning platforms.
On the other hand, in a variety of machine learning platforms including TensorFlow, the core layer often calls non-thread-safe CUDA driver-layer interfaces to access GPU memory. Therefore, when a CUDA-aware MPI library is used together with a machine learning platform such as TensorFlow that also uses the CUDA interface, they contend with each other for resources, leading to performance deficiencies. It can be seen that the MPI library and platforms such as TensorFlow access GPU memory through the same mechanism, the MPI library and the TensorFlow platform core layer access GPU memory using different threads, and multiple threads cannot access GPU memory concurrently; that is, while one thread occupies the CUDA driver-layer interface, other threads cannot use the interface and therefore cannot access GPU memory. This requires some scheduling scheme so that multiple threads can access GPU memory, for example a mutex lock or a stream synchronization mechanism. Because Baidu's solution is an external function library, it cannot perceive, during the execution of a function, the related sub-function relationships and the calling procedure. For example, for the send primitive and receive primitive of the MPI function library used in Baidu's solution, if the data to be transmitted is located in GPU memory, the thread executing the send primitive or receive primitive is locked during the entire execution of the primitive, or the computing platform treats the send primitive or receive primitive as an instruction managed by the stream synchronization mechanism. In fact, the execution of a send primitive or receive primitive includes multiple sub-processes, and not all sub-processes need to access GPU memory; this introduces additional waiting-time overhead and thus affects the efficiency of message passing. Taking the send primitive as an example, the process of executing a send primitive includes multiple sub-processes such as splitting the data, inserting pointers, and memory copying, and only the memory copy needs to access GPU memory.
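The contrast can be sketched as follows: instead of holding the GPU-access lock for the entire primitive, a platform that can see the sub-processes could hold it only around the memory copy. The mutex name and helper are assumptions for illustration, not any platform's real API.
```cpp
// Sketch: serialize only the sub-process that touches the non-thread-safe
// CUDA interface (the device-to-host copy), not the whole send primitive.
#include <cuda_runtime.h>
#include <cstddef>
#include <mutex>

std::mutex g_cuda_mutex;  // assumed to be shared with the platform core layer

void StageGpuParamToHost(void* host_dst, const void* gpu_src, std::size_t bytes) {
    std::lock_guard<std::mutex> lock(g_cuda_mutex);  // lock only this step
    cudaMemcpy(host_dst, gpu_src, bytes, cudaMemcpyDeviceToHost);
}
// Splitting the data, building the header, and the network send can then run
// without holding the CUDA lock, avoiding the extra waiting-time overhead.
```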
This application proposes a data transmission method in a distributed system, which can simplify the communication process of applying MPI technology to the computation of data flow graphs. The method can be implemented by software code that is included in the computing platform software and deployed in the distributed computing system. The following description takes the data flow graph as the object computed by the computing platform and the transmission of data flow graph parameters during computation as an example; this application does not limit the computation object of the computing platform, nor the type of data transmitted during computation. The multiple physical machines on which the computing platform is deployed store copies or subgraphs of the data flow graph to be trained. The distributed computing system includes a first compute node and a second compute node. In the embodiment corresponding to Fig. 5, the first compute node and the second compute node are different compute nodes. At runtime, the program code of the present invention runs in the host memory of a server, or in the host memory and the GPU memory. The description below is given with reference to Fig. 5. It should be understood that, unless otherwise stated, the numbering of S501 to S508 below does not represent the order in which the steps are executed; for example, the order in which S501 and S502 are executed is not specified.
S501: The first compute node generates a first triple from the name, size, and communication peer identifier of a first data flow graph parameter in a first graph data structure, using a first interface parameter generation algorithm. The first triple includes a message tag, a message size, and a destination process sequence number, where the message tag corresponds to the name of the first data flow graph parameter, the message size corresponds to the size of the first data flow graph parameter, and the destination process sequence number corresponds to the process in the second compute node that receives the first data flow graph parameter.
The first graph data structure in the first compute node stores the name, size, and communication peer identifier of the first data flow graph parameter in a first data flow graph, where the first data flow graph parameter is a parameter carried by a connecting edge of the first data flow graph. The communication peer identifier of the first data flow graph parameter in the first data flow graph corresponds to the second compute node.
The first graph data structure may be a different data structure on different computing platforms, which is not limited in this application. For example, on the TensorFlow platform it may be the Tensor data structure. It can be seen that the first data flow graph parameter recorded in the first data flow graph and the second data flow graph is a data flow graph parameter transmitted from the first compute node to the second compute node.
The name of the first data flow graph parameter may be a field in the first graph data structure that identifies the first data flow graph parameter, or it may be information dispersed within the first graph data structure; that is, the name of the first data flow graph parameter may be obtained by analyzing information in the first graph data structure. The specific implementation differs on different computing platforms; for TensorFlow, see the relevant paragraphs later in this application.
The size of the first data flow graph parameter indicates the storage space occupied by the first data flow graph parameter, that is, the amount of data of the data flow graph parameter. The size of the first data flow graph parameter may be obtained from a field in the first graph data structure, for example a field recording the numeric value of this size parameter in bytes, such as 3000 or 4000. It may also be indicated by information dispersed within the first graph data structure; for example, several substructures in the first graph data structure each identify part of the data volume of the first data flow graph parameter, and the size of the first data flow graph parameter can be computed from this information.
The communication peer identifier of the first data flow graph parameter in the first data flow graph may be an identifier of the second compute node; or an identifier of the storage device where the destination address of the first data flow graph parameter is located, the storage device being located in the second compute node; or an identifier of the process in the second compute node that receives the first data flow graph parameter; or other information used to indicate the receiving end of the first data flow graph parameter, which is not limited in this application.
In summary, "the first graph data structure in the first compute node stores the name, size, and communication peer identifier of the first data flow graph parameter in the first data flow graph" may mean that the first graph data structure includes fields carrying these three kinds of information, or that it stores information from which the name, size, or communication peer identifier of the first data flow graph parameter can be obtained. That is, "stores" may mean that the information can be read directly from the first graph data structure, or that it can be obtained by analyzing information in the first graph data structure.
Of course, the information of the first data flow graph parameter may also be stored in one or more data structures in the first compute node.
For example, on the TensorFlow platform, S501 and S503 may be implemented by adding a meta-information management module to the memory management modules (for example 3014 in Fig. 3) of the first compute node and the second compute node (specifically, by adding a section of code). The module stores the information of the data flow graph parameters on the edges of the data flow graphs in the first compute node and the second compute node into a data structure that includes the name, size, and communication peer identifier of those data flow graph parameters.
In addition, it should be understood that on a typical machine learning platform, because communication is out of order and random, a process executes the operations corresponding to various primitives; that is, one process may execute send operations and also execute receive operations, and in most cases there is no process dedicated to executing send primitives or dedicated to executing receive primitives.
S503: The second compute node generates a second triple from the name, size, and communication peer identifier of the first data flow graph parameter in a second graph data structure, using a second interface parameter generation algorithm, the second interface parameter generation algorithm being the same as the first interface parameter generation algorithm. The second triple includes the message tag, the message size, and a source process sequence number, where the source process sequence number corresponds to the process in the first compute node that sends the first data flow graph parameter.
The second graph data structure in the second compute node stores the name, size, and communication peer identifier of the first data flow graph parameter in a second data flow graph; the communication peer identifier of the first data flow graph parameter in the second data flow graph corresponds to the first compute node.
The second data flow graph is stored in the second compute node; the second data flow graph may be a copy of the first data flow graph, or the two may be two subgraphs of one data flow graph. For the explanation of the name, size, and communication peer identifier of the first data flow graph parameter in the second data flow graph, refer to the corresponding part of S501; details are not repeated here.
In this way, the interface function parameters needed by the MPI send primitive and the MPI receive primitive can be obtained without interacting with the communication peer or with the user.
It should be noted that the message tag, the message size, and the source (or destination) process sequence number are usually generated using different algorithms; that is, the first interface parameter generation algorithm includes a first algorithm, a second algorithm, and a third algorithm. The first algorithm, the second algorithm, and the third algorithm can convert the information in the first graph data structure and the second graph data structure into the above triples that conform to the MPI interface parameter format.
The statement above that the first interface parameter generation algorithm is the same as the second interface parameter generation algorithm means that the first interface parameter generation algorithm includes the first algorithm, the second algorithm, and the third algorithm, and the second interface parameter generation algorithm includes the same or corresponding first algorithm, second algorithm, and third algorithm.
Then, for S501, one implementation may be: determining the message tag in the first triple according to the name of the first data flow graph parameter in the first graph data structure and the first algorithm; determining the message size in the first triple according to the size of the first data flow graph parameter in the first graph data structure and the second algorithm; and determining the destination process sequence number in the first triple according to the communication peer identifier of the first data flow graph parameter in the first graph data structure and the third algorithm.
Correspondingly, for S503, one implementation is: determining the message tag in the second triple according to the name of the first data flow graph parameter in the second graph data structure and the first algorithm in the second interface parameter generation algorithm; determining the message size in the second triple according to the size of the first data flow graph parameter in the second graph data structure and the second algorithm in the second interface parameter generation algorithm; and determining the source process sequence number in the second triple according to the communication peer identifier of the first data flow graph parameter in the second graph data structure and the third algorithm in the second interface parameter generation algorithm.
The message tag is used to indicate the data sent by the MPI send primitive. The message tag may be obtained by processing the name of the first data flow graph parameter with the first algorithm; the first algorithm may be an algorithm that converts a value of any binary length into a value of fixed binary length, such as a hash algorithm, or another algorithm that can convert the name of the first data flow graph parameter into the format required of the message tag in the interface parameters of an MPI primitive.
The message size is used to indicate the size of the information sent by the MPI send primitive. For the second algorithm, in one implementation, the value of the message size field may be made equal to the value of the size of the above data flow graph parameter, that is, size. In another implementation, the value of the message size field may be made equal to the value of the size of the above data flow graph parameter plus an additional value; this added value is the size of the other information to be carried by the send primitive, such as the header length of the send primitive mentioned below. For example, if the information sent by the MPI send primitive includes the data to be sent and a header, the value of the message size is the size of the data to be sent plus the size of the header.
The source process sequence number is the sequence number of the process in the first compute node that executes the MPI send primitive, and the destination process sequence number is the sequence number of the process in the second compute node that executes the MPI receive primitive corresponding to the MPI send primitive. It should be understood that, because the first data flow graph and the second data flow graph of this application store the source node and destination node of the first data flow graph parameter, the storage device corresponding to the source address of the first data flow graph parameter and the storage device corresponding to the destination address of the first data flow graph parameter are known, and on the computing platform they are used for computing the data flow graph (the transmission of the data flow graph parameter being part of computing the data flow graph).
The third algorithm is a mapping relationship between process sequence numbers and communication peer identifiers; the first compute node includes a mapping relationship between destination process sequence numbers and communication peer identifiers, and the second compute node includes a mapping relationship between source process sequence numbers and communication peer identifiers. The third algorithm may be a functional relationship, or a mapping table maintained in the compute node, which is not limited in this application. For the specific implementations of the first algorithm, the second algorithm, and the third algorithm, refer to the detailed example of the TensorFlow platform below; the implementations can also be used on other computing platforms. For the description of process sequence numbers, refer to the detailed example of the TensorFlow platform below; the implementations can also be used on other computing platforms.
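One possible shape of such an interface parameter generation algorithm is sketched below; the hash, the size arithmetic, and the rank lookup table are illustrative assumptions, and a real deployment would need a hash function guaranteed to be identical across all processes.
```cpp
// Sketch: both compute nodes run the same code over their own graph data
// structures, so they derive matching MPI interface parameters without any
// peer negotiation.
#include <cstdint>
#include <functional>
#include <map>
#include <string>

struct MpiTriple {
  int tag;        // message tag  <- first algorithm (parameter name)
  int size;       // message size <- second algorithm (parameter size [+ header])
  int peer_rank;  // source or destination process sequence number <- third algorithm
};

MpiTriple GenerateTriple(const std::string& param_name,
                         int param_size_bytes,
                         const std::string& peer_id,
                         const std::map<std::string, int>& peer_to_rank,
                         int header_bytes = 0) {
  MpiTriple t;
  // First algorithm: map an arbitrary-length name to a fixed-length tag.
  // (std::hash is used here only for illustration; it is not guaranteed to be
  //  identical across processes or implementations.)
  t.tag = static_cast<int>(std::hash<std::string>{}(param_name) & 0x7fffffff);
  // Second algorithm: parameter size, optionally plus the header length.
  t.size = param_size_bytes + header_bytes;
  // Third algorithm: mapping from the communication peer identifier to a rank.
  t.peer_rank = peer_to_rank.at(peer_id);
  return t;
}
```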
Clearly, the MPI send primitive that uses the first triple as interface parameters corresponds to the MPI receive primitive that uses the second triple as interface parameters. In this way, because the communication peer identifier is included in the first graph data structure and the second graph data structure, the problem that the peer process is unknowable during data flow graph computation is solved; moreover, the two communicating parties that need to transmit the first data flow graph parameter generate the triples using the information in the data flow graphs stored on their respective compute nodes and the same interface parameter generation algorithm, so neither side needs to send its own information to the peer, and no negotiation of the triple-generating algorithm is needed. The method can run independently on the data sender and receiver, generating the corresponding triples without interaction between the two parties; this simplifies the process of communicating with MPI primitives and can improve the efficiency of data transmission on a distributed computing platform.
S505: The first compute node calls a message passing interface (MPI) send primitive, with the first triple as interface parameters, to send the first data flow graph parameter to the second compute node.
The concept of a triple described in this application is used only to denote the three parameters it contains, without limiting the order among the three parameters. The format of the three parameters in the triple conforms to the format requirements of the interface function parameters carried by an MPI send primitive. In addition, the interface parameters of the MPI send primitive include but are not limited to the first triple, and the interface parameters of the MPI receive primitive include but are not limited to the second triple.
In one implementation, in S505 the first compute node, with the first triple as interface parameters, reads the first data flow graph parameter from the host memory of the first compute node through the message passing interface (MPI) send primitive, in order to send the first data flow graph parameter to the second compute node.
In one implementation, the first compute node further stores information about the storage device holding the first data flow graph parameter in the first compute node, that is, the memory type described below. Then, before S505, the first compute node executes S504: when the information about the storage device indicates another storage device, the first compute node copies the first data flow graph parameter from that other storage device to the host memory of the first compute node, the other storage device being a memory in the first compute node other than the host memory.
The information about the storage device may be an identifier of the storage device, or a number indicating the storage device from which the storage type of the storage device can be determined, or information identifying the type of the storage device, or information in another form that achieves the above function, which is not limited in this application. For specific implementations, refer to the relevant paragraphs below.
For example, the other storage device may be GPU memory or the memory of another processing unit, such as an FPGA or a DSP. This step can be understood with reference to the specific implementation on the TensorFlow platform described below. The step can be understood as being implemented by the aforementioned communication management module of the computing platform, using the mechanism of the computing platform core layer for accessing the other storage device. For example, for GPU memory, functions provided by the CUDA programming interface of the platform may be used to copy the data to be sent to host memory. In this way, before the first compute node uses the MPI send primitive, the first data flow graph parameter has already been prepared in the host memory of the first compute node, and the MPI send primitive only reads the first data flow graph parameter from the host memory of the first compute node, without contending with the computing platform for the resources used to read the other storage device, which improves the execution efficiency of the MPI send primitive. In addition, the choice of MPI library can be more flexible: the MPI library is not required to support access to other storage devices, and no contention between the computing platform and the MPI library for access to the other storage device arises; the specific discussion can be found in the relevant paragraphs below. Of course, if an MPI library that supports access to GPU memory is selected, this step can also be performed by the MPI library.
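A minimal sketch of S504 followed by S505 on the sending side is given below; the function and variable names are assumptions, and in practice the staging copy would go through the platform's own device-access mechanism rather than a bare cudaMemcpy.
```cpp
// Sending side: stage a GPU-resident parameter into host memory (S504), then
// let the MPI send primitive read only host memory (S505).
#include <cuda_runtime.h>
#include <mpi.h>
#include <vector>

void SendParam(const void* src, int size_bytes, bool src_on_gpu,
               int dest_rank, int tag) {
    std::vector<char> host_buf;
    const void* send_ptr = src;
    if (src_on_gpu) {                     // memory type indicates another storage device
        host_buf.resize(size_bytes);
        cudaMemcpy(host_buf.data(), src, size_bytes, cudaMemcpyDeviceToHost);
        send_ptr = host_buf.data();
    }
    // The send primitive never touches GPU memory itself.
    MPI_Send(send_ptr, size_bytes, MPI_BYTE, dest_rank, tag, MPI_COMM_WORLD);
}
```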
It should be understood that the first data flow graph parameter may be stored in a buffer of host memory (for its description, refer to the relevant paragraphs of this application), or in memory space allocated to the user in host memory, which is not limited in this application. For example, when the MPI library is used with RDMA technology, data at any registered address in host memory can be used, whereas when the MPI library is used with TCP/IP technology, the first data flow graph parameter stored in the user's memory space needs to be copied into the MPI buffer or the data buffer (see below) before use.
That is, the data buffer mentioned below may be set up in both the source compute node and the destination compute node. In this way, the data buffer is used together with the original MPI buffer (the two may be collectively referred to as buffers); as long as the buffers are not fully occupied, the asynchronous progress of send and receive operations can be better tolerated, which suits the complex, asynchronous, and out-of-order send and receive operations over many pieces of data required on a learning platform.
S507: The second computing node calls an MPI receive primitive to process the first data flow graph parameter according to the second triple.
It should be understood that "processing" the first data flow graph parameter by calling the MPI receive primitive may correspond to different operations in different scenarios, and this application places no restriction on it. For example, it may be one or more of the following: calling the MPI receive primitive to receive the first data flow graph parameter into the data cache area of the host memory; calling the MPI receive primitive to modify a tag of the first data flow graph parameter so that the first data flow graph parameter in the host memory is made available to the process that computes the data flow graph; and storing the first data flow graph parameter from the data cache area to a destination address. For how the MPI receive primitive processes the first data flow graph parameter, further refer to the related paragraphs in the description below that takes the TensorFlow platform as an example.
In one implementation, the destination address of the first data flow graph parameter is carried in the receive primitive as an MPI interface parameter of that receive primitive. In that case S507 is implemented as: calling the MPI receive primitive with the second triple, and storing the first data flow graph parameter from the data cache area to the destination address. For example, the destination address is located in the user memory space of the host memory.
If the destination address is located in another storage device of the second computing node (that is, a storage device other than host memory, such as GPU memory), and the MPI library in use supports access to storage devices other than host memory, the first MPI receive primitive can itself store the data to the corresponding destination address in GPU memory. In another implementation, this case is handled after S507 by the computing platform's own mechanism for accessing other storage devices, that is, S508: when the destination address of the first data flow graph parameter corresponds to another storage device, the second computing node stores the first data flow graph parameter from the host memory to the destination address, where the other storage device is a memory in the second computing node other than the host memory.
S508 is similar to S504 performed by the first computing node described above; for its description and beneficial effects, refer to the paragraphs on that step and the related paragraphs below. S507 may be regarded as being executed by the MPI client mentioned above.
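A rough sketch of S507/S508 under stated assumptions (the helper name and the boolean flag are illustrative only, and a real implementation may use the non-blocking primitives and polling thread described later): the parameter is first received into the host-memory data cache area using the triple as interface parameters, and is then moved to a GPU destination address by the platform's own copy mechanism.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstddef>

// Receive one parameter described by the triple <src_rank, size, tag> into
// the host-memory cache 'host_buf', then place it at 'dst_addr' (S508) when
// the destination is GPU memory rather than host memory.
void recv_param(void* host_buf, std::size_t size, int src_rank, int tag,
                void* dst_addr, bool dst_is_gpu) {
    MPI_Status status;
    MPI_Recv(host_buf, static_cast<int>(size), MPI_BYTE,
             src_rank, tag, MPI_COMM_WORLD, &status);

    if (dst_is_gpu) {
        // The MPI library never touches GPU memory; the computing platform
        // moves the data across storage devices itself (S508).
        cudaMemcpy(dst_addr, host_buf, size, cudaMemcpyHostToDevice);
    }
    // If the destination is already in host memory, host_buf can simply be
    // handed to (or copied into) the destination address.
}
```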
It should be understood that every physical machine on which the distributed machine learning platform is deployed stores the data flow graph, and each runs the code of the machine learning platform in processes to train the data flow graph. Therefore, for the first data flow graph parameter in the data flow graph the first computing node is the sending end, while for another parameter in the data flow graph the first computing node may be the receiving end. The specific implementations of S505 and S507 may therefore refer to the related descriptions below.
Because the machine learning platform executes instructions out of order, data written into the host memory of the receiving end may not be processed promptly by an MPI receive statement, and the MPI buffer that comes with the MPI library is small and cannot satisfy the transfer of data of several megabytes, which is common in machine learning. Therefore, a data cache area can be set aside in the host memory of the second computing node; this data cache area is dedicated to storing data used by MPI primitives. For a detailed analysis, see the related paragraphs of this application.
In one implementation, S507 then includes: detecting, by an MPI probe primitive, the data cache area in the host memory of the second computing node to obtain the second triple of the first data flow graph parameter, the data cache area being dedicated to storing data processed by MPI primitives; and calling an MPI receive primitive whose interface parameters include the second triple, to process the first data flow graph parameter.
In one implementation, the second computing node runs a first thread and a second thread, and S507 includes:
the first thread detects, by a message passing interface MPI probe primitive, the data cache area in the host memory to obtain the second triple; the first thread calls a first MPI receive primitive according to the second triple in the data cache area, to process the first data flow graph parameter, where the second triple in the data cache area is obtained by the second computing node from the MPI send primitive; and the second thread, after determining that the first data flow graph parameter has been processed by the first MPI receive primitive, changes a second MPI receive primitive into an MPI wait primitive, where the second MPI receive primitive is the receive primitive corresponding to the first data flow graph parameter that has not yet been executed by the second thread, the interface parameters of the second MPI receive primitive include the second triple generated by the second computing node, and the MPI wait primitive is used to wait for the first MPI receive primitive to finish execution.
The second triple may be obtained from the interface parameters of the received MPI send primitive, or may be obtained by analyzing the interface parameters together with the data transmitted by the MPI send primitive; this application places no restriction on this.
That is, the second computing node can start an additional thread (which may be called a polling thread) to execute the MPI probe primitive to detect the buffer area of the host memory of the second computing node, where the buffer area includes the data cache area described above, which is usually larger than the MPI buffer that comes with the system (see the related paragraphs below). In this way, data that has not yet been processed in time by an MPI receive primitive can be discovered. The thread can execute the MPI probe primitive to detect the buffer area in a polling manner; once such data is found, it calls the MPI receive primitive corresponding to that data (referred to as the first MPI receive primitive for distinction) and changes the originally pending MPI primitive (referred to as the second MPI receive primitive for distinction) into an MPI wait primitive, which waits for the first MPI receive primitive to finish execution. Once the first MPI receive primitive has finished, the thread continues polling so as to keep processing data that is waiting to be handled by an MPI receive primitive.
In this way, data can be processed by receive primitives more promptly, and the other pending send primitives on the first computing node can be executed sooner, which improves data transmission efficiency.
The foregoing describes the data transmission method of the machine learning platform from the sending end to the receiving end. By using the local graph data structure and the interface parameter generation algorithm, the method obtains the interface function parameters needed to use MPI primitives, so the sending end and the receiving end do not need to match parameters before transmitting data, which improves the efficiency of data communication. Further, the method obtains the storage location of the data to be transmitted at the sending end and at the receiving end, so that when that storage location is not in host memory, the data is moved across storage devices within the physical machine by the mechanism of the machine learning platform before the data is sent and after the data is received. This widens the range of MPI libraries that can be chosen and avoids contention between the MPI library and the machine learning platform when moving data across storage devices. Moreover, by providing a dedicated data cache area and a polling thread, the message passing communication buffer area allows the message send primitive to transmit data and return immediately after the transmission completes, even when the message receive primitive has not yet been called and the final destination address of the message is unknown. The buffer area temporarily holds the data for the future message receive primitive, so that the message send primitive does not have to synchronize with the message receive primitive, removing the inherent timing constraint between the two. The sender does not need to wait synchronously, which saves execution time and helps improve performance.
These improvements enable the MPI library to adapt well to the characteristics of the machine learning platform and improve communication efficiency. Since MPI is a technology from the high-performance transmission field, the machine learning platform can thereby make full use of the resources of a high-performance transmission network, greatly improving communication efficiency and hence the computational efficiency of the computing platform.
For other technical details of the method for transmitting data in the machine learning platform corresponding to Fig. 5, and for detailed explanations of the terms and steps involved and the beneficial effects of each step, further refer to the other related paragraphs of this specification.
It should be understood that the idea described in the above method can be applied to a variety of computing platforms, for example the machine learning platform described in detail below, or a graph computing platform or a stream computing platform; this application places no restriction on this.
A process by which a computing platform involved in this application computes a data flow graph is described below. It should be understood that this process is presented only to explain how the computing platform computes a data flow graph; it is merely an example, and this application places no restriction on it. The process applies to machine learning platforms such as TensorFlow mentioned in this application. In general, it comprises data flow graph creation (also called "data flow graph definition") and data flow graph execution. In one implementation, data flow graph creation can be subdivided into sub-steps such as whole-graph construction, subgraph extraction, graph partitioning and graph optimization, and data flow graph execution can be subdivided into sub-steps such as input data filling, algorithm kernel function execution and output data acquisition. For example, steps S501 to S507 of the method proposed in this embodiment can be regarded as belonging to the algorithm kernel function execution sub-step, while writing information such as the name and size of a data flow graph parameter into the graph data structure before step S501 belongs to the data flow graph creation process.
Data flow graph creation: in this stage, the algorithm written by the user in a programming language is converted into a data flow graph structure that the computing platform can understand.
Specifically, this includes whole-graph construction, that is, converting all the algorithm code written by the user into a data flow graph structure. Afterwards, subgraph extraction is performed on the converted data flow graph structure, because the data flow graph often contains nodes and edges that are irrelevant to obtaining the final computation result. In one case of subgraph extraction, the computing platform therefore extracts from the whole graph the nodes and edges connected to the node where the final computation result resides, as the subgraph to be run. Other nodes and edges not connected to the final computation result are ignored and do not participate in the subsequent execution. The connection may be a direct connection to the node where the final computation result resides, or a connection to that node through several edges.
Next, the description proceeds with each device holding a part of the subgraph. The computing platform performs graph partitioning on the extracted subgraph, that is, cuts it into several local graphs, each corresponding to one device; for example, the partitioning can follow a device allocation strategy specified by the user. The algorithm logic corresponding to all nodes on one local graph is executed by the device on which that graph resides. It should be understood that graph partitioning does not split a single node onto two devices, but it may cut an edge. In that case, the computing platform automatically inserts paired data send operation nodes (SendOp) and data receive operation nodes (RecvOp) into the local graphs after partitioning. With the aid of these communication operations, the overall computation logic of the several local graphs split across different devices remains exactly the same as that of the subgraph before partitioning. It can be seen that, after the data flow graph is partitioned, multiple processes must transmit data flow graph parameters in order to complete the computation of these subgraphs.
That is, the various data and information in the data flow graph, such as data flow graph parameters and the information about data flow graph parameters, may be stored in a graph data structure.
It should be understood that the machine learning platform may instead distribute copies of one data flow graph to multiple devices, in which case no subgraph partitioning is performed. In this case data flow graph parameters still need to be transmitted, and nodes representing send operations and receive operations can likewise be inserted into the data flow graph. Since the data flow graph contains information about edges and nodes, the machine learning platform has multiple methods of inserting such nodes into the data flow graph, and this application places no restriction on them. For ease of understanding, graph partitioning is schematically illustrated below with reference to Fig. 6.
As shown in Fig. 6, before graph partitioning the data flow graph includes nodes a, b, c, w, y and x, and edges from a to b, from a to c, from w to y, and from x to y. Graph partitioning cuts the edges from a to b, from a to c, and from x to y; the computing platform can insert send operation nodes s1 and s2 into the subgraph where nodes a and x reside, and insert receive operation nodes r1 and r2 into the subgraph where nodes b, c and y reside. This establishes a one-to-one correspondence between s1 and r1 and between s2 and r2, ensuring that the overall computation logic of the two local graphs is exactly the same as before partitioning.
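The following C++ sketch illustrates, under stated assumptions (the Edge/Op types, the device_of map and the key format are illustrative, not TensorFlow's actual classes), the essence of this step: every edge whose endpoints land on different devices is replaced by a paired SendOp/RecvOp sharing a key, so the per-device local graphs keep the same overall logic as the original subgraph.

```cpp
#include <map>
#include <string>
#include <vector>

struct Edge { std::string src, dst; };           // directed edge src -> dst
struct Op   { std::string kind, node, key; };    // "Send"/"Recv" + shared key

// Insert paired Send/Recv operation nodes for every edge cut by partitioning.
void cut_graph(const std::vector<Edge>& edges,
               const std::map<std::string, std::string>& device_of,
               std::map<std::string, std::vector<Op>>& local_graphs) {
    for (const Edge& e : edges) {
        const std::string& dev_src = device_of.at(e.src);
        const std::string& dev_dst = device_of.at(e.dst);
        if (dev_src == dev_dst) continue;        // edge stays inside one device
        // A shared key ties the pair together, like s1<->r1, s2<->r2 in Fig. 6.
        std::string key = "Edge_" + e.src + "_" + e.dst + "_0";
        local_graphs[dev_src].push_back({"Send", e.src, key});
        local_graphs[dev_dst].push_back({"Recv", e.dst, key});
    }
}
```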
That is, graph partitioning is also a kind of graph placement. During graph partitioning or graph placement, the machine learning platform has necessarily determined the allocation plan of every node related to the node where the final result resides, that is, which device each node is assigned to is fixed, and therefore the source computing node and the destination computing node of the data carried on the edges between these nodes can also be determined. It can also be seen, based on the above description and Fig. 2, that these subgraphs are assigned to multiple processes of the computing platform for execution, so the process corresponding to each node in a subgraph is likewise determined. Based on this information, this application proposes that, in some embodiments, new fields can be added to the Tensor data structure to record the memory type and the communication peer identifier of the peer device.
In one implementation, graph optimization may also be performed after graph partitioning, that is, the subgraphs after partitioning are optimized so that the data flow graph runs faster in future execution without changing its computation logic.
The steps mentioned above belong to the computing platform's data flow graph creation stage.
Next the computing platform runs the data flow graph. In this stage, the computing platform schedules each device to execute the data flow graph and obtains the final computation result of the algorithm.
In one implementation, this includes input data filling, in which the computing platform reads the external data sets to be computed from storage devices and fills them into the variable nodes of the data flow graph, so that the operators in the compute nodes have input data. Then, multiple computation threads compute the data flow graph according to the subgraphs on their corresponding devices by executing the kernel functions related to each subgraph. Specifically, a computation thread queues the nodes on the subgraph corresponding to a device according to a certain scheduling strategy and executes the kernel functions corresponding to the operators on each node in turn, obtaining the intermediate computation results of the algorithm. The execution order of the nodes is determined dynamically by the scheduling strategy and the load at run time; common scheduling strategies include the fully synchronous strategy (Bulk Synchronous Parallel, BSP), semi-synchronous strategies such as SSP (Stale Synchronous Parallel), and the asynchronous strategy (Asynchronous Parallel, ASP). It should be noted that in the field of machine learning, synchronization among multiple computation threads is not required, so in most cases the computation of this kind of learning platform is asynchronous and random.
It should be noted that some of these pending nodes are communication operation nodes, that is, the data send operation nodes (SendOp) and data receive operation nodes (RecvOp) inserted during the earlier graph partitioning process, such as s1, s2, r1 and r2 in Fig. 6. The process of communicating with MPI primitives described in this application is precisely the data sending or data receiving operation that the computing platform performs when it executes these nodes. For example, in the existing TensorFlow platform, communication operations are implemented using the Remote Procedure Call Protocol (RPC) mechanism.
Finally, the computing platform completes the computation, outputs the computation result from the node representing the final computation result, and returns it to the user program.
With reference to Fig. 7, the process of implementing data flow graph parameter communication based on the method described in this application is described below, taking the open-source TensorFlow machine learning platform as an example. It should be understood that the implementation details of the following data flow graph parameter process are also applicable to other computing platforms and do not limit this application. Using the following communication process, in a common computing platform such as TensorFlow, applying MPI technology to the communication involved in computing a data flow graph is simplified: there is no need to negotiate peer information with the communication peer before data transmission, the use of the MPI interface is more flexible and matches the computation characteristics of the computing platform, and the communication capability of a high-performance network can be better exploited. In tests, using the following method in the hardware environment of the following embodiment improved communication efficiency by 50%, thereby greatly reducing the time the computing platform takes to compute the data flow graph.
It should be noted that the sender of the data involved in the following process may be regarded as the first computing node mentioned above and the receiver as the second computing node mentioned above. The sender and the receiver may be the same computing node or different computing nodes, and may be deployed on the same physical machine or on different physical machines; the data flow graph parameter communication process described below applies in all these cases.
In one example, the server on which the TensorFlow machine learning platform runs is configured with NVIDIA GPU cards and an InfiniBand network adapter, and the message passing communication library used by the server provides MPI interface functions. The NVIDIA GPU card provides computation acceleration through the CUDA programming interface, and the InfiniBand network adapter provides efficient communication through the RDMA protocol.
As shown in Fig. 7, the modules of the TensorFlow (5011) software framework involved in this embodiment include the Distributed Runtime module 5012, which is the runtime engine in TensorFlow, has the functions of the runtime engine described earlier in this application, and can perform the corresponding method steps; the Common Runtime module 5013, which implements memory management in TensorFlow, has the functions of the memory management module described earlier in this application, and can perform the corresponding method steps; and the Remote Rendezvous module 5014, which implements communication management in TensorFlow, has the functions of the communication management module described earlier in this application, and can perform the corresponding method steps. It should be understood that each module described in this embodiment is a section of code, and the code of one module can be considered to be written together contiguously. The Common Runtime module 5013 contains the data flow graph Graph 5016. The server further includes host memory 5021, GPU memory 5022 and an InfiniBand network adapter 5023.
The parts shown in dashed boxes in Fig. 7 are the improvements made by this embodiment on the basis of the existing TensorFlow software framework. Inside Distributed Runtime, this embodiment adds an MPI scheduling function. Inside Common Runtime, this embodiment adds a meta-information management function for managing the meta-information of data flow graph parameters, which includes the size, name and communication peer identifier of a data flow graph parameter referred to below and, in one implementation, also the storage location of the data flow graph parameter; management may be one or more of operations such as adding, deleting and modifying. Inside Remote Rendezvous, this embodiment adds an MPI client function; an MPI library is integrated into the TensorFlow platform; and in the host memory used by TensorFlow, a message passing communication buffer area (namely the buffer area described below) is also allocated for use by the instructions of the MPI library. It can be seen that these improvements all reside in the core layer of the TensorFlow platform, so that, for example, interface parameter generation and interface invocation are hidden inside TensorFlow's original data flow graph creation and execution processes rather than being exposed for the application developer to call. As a mechanism integrated inside the TensorFlow platform, these improvements do not change TensorFlow's original programming model and can accelerate existing application programs.
In the data structure of the data flow graph, the information required for message passing communication is saved, specifically the name, size and communication peer identifier of a data flow graph parameter mentioned above. This step is performed when the Common Runtime module 5013 of TensorFlow creates the data flow graph. The existing Graph 5016 module of TensorFlow uses a series of data structures such as Tensor to save the connection edges and the data flow graph parameters they carry. These data structures already contain information indicating the name and the size of a data flow graph parameter, and they are the data that the MPI primitives will transmit. The information contained in the existing Tensor data structure can satisfy the existing transmission of data flow graph parameters, which uses RPC communication. However, the existing Tensor data structure does not contain the peer process information that the MPI send and receive primitives must carry, nor the memory type of the parameter to be communicated. In this embodiment of the invention, a memory type field (for example denoted Dev type) and a communication peer identifier field for the data flow graph parameter are added to the Tensor data structure. It should be noted that, for the information mentioned above, fields representing that information can be defined in the Tensor data structure. For example, a dev_type field storing the memory type can be defined to hold the memory type corresponding to the local node of a data flow graph parameter; for the send primitive the local node is the source node, while for the receive primitive and the wait primitive executed at the receiving end the local node is the destination node. As another example, a memory field storing the memory type corresponding to the peer node may also be defined; for the send primitive the peer node is the destination node, while for the receive primitive and the wait primitive executed at the receiving end the peer node is the source node. In one form, multiple fields may also be defined to store, respectively, the memory types corresponding to the local node and the peer node of a data flow graph parameter. The information may also be carried in multiple sub-data structures within the Tensor data structure, in which case the parts related to the above information need to be parsed or assembled; this embodiment of the invention places no restriction on this, and the computing platform provides some auxiliary means to analyze the content of the Tensor data structure so as to obtain the four kinds of information mentioned above.
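A minimal sketch of the added meta-information is shown below, assuming illustrative names (the real Tensor data structure in TensorFlow is far more elaborate, and the field names here are hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Memory type of the node associated with the parameter; the encoding is
// free (e.g. 0 = host memory, 1 = GPU memory, as discussed below).
enum class DevType : std::uint8_t { HOST = 0, CUDA_GPU = 1 };

// Meta-information kept for one data flow graph parameter on a connection
// edge: the two existing fields (name, size) plus the two fields this
// embodiment adds (memory type and communication peer identifier).
struct TensorMeta {
    std::string name;        // name of the data flow graph parameter
    std::size_t size;        // size in bytes, e.g. 4000
    DevType     dev_type;    // memory type holding the parameter on this end
    std::string peer_dev;    // communication peer identifier, e.g. "Dev B"
};
```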
The size of a data flow graph parameter is the data volume of that parameter, that is, the memory space occupied by the data flow graph parameter, in bytes; the field records the numerical value of the size of the data flow graph parameter to be transmitted, such as 3000 or 4000.
While the Common Runtime module creates the data flow graph, it traverses at least the connection edges related to the node where the final result resides (for example, it may also traverse all connection edges in the data flow graph). For the data flow graph parameter carried on a connection edge, based on the result of the data flow graph partitioning it learns the information indicating the memory type of the data flow graph parameter carried on that edge, and writes this information into the parameter's Tensor data structure, for example by filling in the defined memory type field, or distributing it over several fields. In one implementation, the memory type refers to the memory type corresponding to the local node of the data flow graph parameter; for the send primitive the local node is the source node, and for the receive primitive and the wait primitive executed at the receiving end the local node is the destination node. Based on the peer device identifier contained in the edge name, it learns the identifier of the peer device corresponding to the parameter carried on that edge and writes that identifier into the Tensor data structure, specifically into its communication peer identifier field. For example, in a TensorFlow data flow graph, the format of a connection edge name is: [src_device];[src_incarnation];[dst_device];[tensor_name];[frame_id]:[iter_id]
The [dst_device] field indicates the device identifier of the destination node (i.e., the receiving end) of the connection edge, and the [src_device] field indicates the device identifier of the source node (i.e., the sending end) of the connection edge. These device identifiers are usually character strings. [src_device] may also be abbreviated as Src Dev, and [dst_device] as Dst Dev.
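Under the edge-name format just quoted, the source and destination device identifiers can be recovered by splitting the name on ';'. The sketch below is illustrative only (TensorFlow provides its own parsing helpers; the function names here are hypothetical):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split "src_device;src_incarnation;dst_device;tensor_name;frame_id:iter_id"
// into the fields separated by ';'.
std::vector<std::string> split_edge_name(const std::string& edge_name) {
    std::vector<std::string> fields;
    std::stringstream ss(edge_name);
    std::string field;
    while (std::getline(ss, field, ';')) fields.push_back(field);
    return fields;
}

// src_device is field 0 and dst_device is field 2; for a send primitive the
// destination (peer) device identifier is written into the Tensor structure.
std::string src_device(const std::string& edge_name) {
    return split_edge_name(edge_name)[0];
}
std::string dst_device(const std::string& edge_name) {
    return split_edge_name(edge_name)[2];
}
```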
For the memory type field, different enumerated values can be used to identify different memory types. For example, 01 may identify host memory and 10 GPU memory; or 0 may identify host memory and 1 GPU memory; or 001 may identify host memory, 010 GPU memory and 100 another hardware memory, and so on; this application places no restriction on this.
Before the Remote Rendezvous module of TensorFlow initiates the communication of a data flow graph parameter, the above information is made to be carried by the MPI interface functions. This step belongs to the process of running the data flow graph (i.e., computing the data flow graph) and can be regarded as the function of the MPI client mentioned above. It should be noted that, in one implementation, the above information must be processed before it can be carried by the MPI interface functions. For example, the name and size of the data flow graph parameter and the communication peer identifier corresponding to the data flow graph parameter are carried, after processing, as interface function parameters of the MPI primitives, while the memory type of the data flow graph parameter is transmitted as part of the data carried by the MPI interface function. In this way, existing general MPI interface functions can be used, which greatly improves versatility. As another example, the memory type of the data flow graph parameter to be transmitted can also be used as a function parameter of the MPI interface function; of course, in that case the definition and usage specification of the MPI interface function need to be modified, which is not described in detail in this application.
For ease of description, the MPI send primitive is used as an example. For the MPI receive primitive, reference can be made to the description of the MPI send primitive: the interface function parameters of the MPI receive primitive also carry a message size field, a message tag field and a target process sequence number field for a data flow graph parameter, and in one implementation the data carried by the MPI receive primitive includes the memory type corresponding to the destination address of the data flow graph parameter; details are not repeated here. The interface function parameters of the MPI send primitive include a message size field, which indicates the size of the information to be sent by the send primitive. In one implementation, the value of the message size field can be set equal to the value of the size of the data flow graph parameter, i.e., size. In another implementation, the value of the message size field can be set equal to the value of the size of the data flow graph parameter plus an added value, where the added value is the size of the other information to be carried by the send primitive, namely the packet header length of the send primitive mentioned below. In one implementation, the packet header includes the size of the data flow graph parameter, a tag (used to mark the data flow graph parameter, for example its name), the sequence number of the corresponding target process and the sequence number of the corresponding source process. The message tag (tag) carried by the MPI send primitive indicates the data carried by the send primitive, for example the name of the data flow graph parameter. Because the message tag carried by the MPI send primitive is a binary value of fixed length, the name of the data flow graph parameter can be converted by some algorithm, such as a hash function, into a format that satisfies the message tag, and the resulting value is used as the value of the message tag parameter in the MPI send primitive. It should be understood that the verification mechanism of the MPI interface functions can avoid the effect of hash collisions.
On the other hand, the host that executes the MPI send primitive finds, from a process mapping table and according to the communication peer identifier field of the data flow graph parameter described above, the peer process sequence number, i.e., rank, where the peer process is the process that executes the MPI receive primitive corresponding to this MPI send primitive. For example, the process sequence number can be a number such as 0, 1, 2, 3 or 28. The process mapping table includes the mapping relation between the identifier of a device in the computing platform and the sequence number of the process that uses that device. What is stored in the communication peer identifier field of a data flow graph parameter is the identifier of the receiving-end device of that data flow graph parameter. It should be understood that once the process for a device has been started, the correspondence between the process and the device does not change before the computation task being executed is completed, and the process knows the identifier of the device it uses, so the machine learning platform can generate this process mapping table. For example, during process start-up, the identifier of a device in the machine learning platform can be converted by some function into a sequence number, which is then used as the sequence number of the process corresponding to that device. Alternatively, after the process is started, the mapping relation between the sequence number of the process and the identifier of the device it uses can be recorded. It is also possible, when generating the triple, to process the communication peer identifier with some function to obtain the sequence number of the required process. It should be understood that these process sequence numbers are used by the primitives of the MPI library; the machine learning platform can also save the mapping relation between the process sequence numbers in the MPI library and the information of the physical machines where the processes reside, and the sequence number of each process is different.
Of course, in one implementation the memory type is not carried in the MPI interface parameters but in the data transmitted by MPI, for example by serializing the memory type into the byte stream as a field of the Tensor data structure. Clearly, after the MPI primitive transmits the data to the peer process, the peer process parses the received data and thus learns the memory type of the destination device corresponding to the received data flow graph parameter.
In this way, without negotiation between the two sides of the data transmission, with the program code of the sender and the receiver running independently and without interaction between the two sides, the triple <rank, size, tag> of function parameters required by the send primitive and the receive primitive of the MPI interface is obtained, where rank is the process sequence number of the communication peer, size is the size of the information to be transmitted, and tag is the message tag.
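Putting the pieces together, the following is a hedged sketch of the interface parameter generation; the hash choice, the header constant and the dev2rank table are assumptions for illustration, since the patent only requires some fixed-length tag conversion, a size field that may add the packet header length, and a device-to-rank mapping.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

struct Triple { int rank; int size; int tag; };

constexpr int LEN_HEADER = 64;   // assumed serialized packet-header length

// Message tag: convert the parameter name to a fixed-length value with a
// hash, kept within the minimum tag range guaranteed by the MPI standard.
int name_to_tag(const std::string& name) {
    return static_cast<int>(std::hash<std::string>{}(name) & 0x7fff);
}

// Process mapping table: device identifier -> process sequence number (rank).
int dev2rank(const std::map<std::string, int>& proc_map,
             const std::string& dev) {
    return proc_map.at(dev);
}

// Build the triple <rank, size, tag> from the meta-information recorded in
// the graph data structure, without any negotiation with the peer.
Triple make_triple(const std::string& param_name, int param_size,
                   const std::string& peer_dev,
                   const std::map<std::string, int>& proc_map) {
    return { dev2rank(proc_map, peer_dev),
             param_size + LEN_HEADER,
             name_to_tag(param_name) };
}
```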
It should be noted that the function parameters of the MPI receive primitive include the destination address of the data to be transmitted (for example, a data flow graph parameter); in other words, the MPI send primitive carries only the information of the peer device but not the destination address of the data to be transmitted. Therefore, the receiving end of the data to be transmitted can receive the data transmitted by the sending-end device (for example, store it in the host memory of the receiving end as described below), but the received data can be used for training the data flow graph only when the process on the receiving side calls the MPI receive primitive, or even only after it has done so.
Taking the partitioned data flow graph in Fig. 6 as an example, cross-device communication is required between source node s2 and destination node r2. Node r2 is located on device B and carries the following information: Name: Edge_x_y_0, size: 4000, Src Dev: Dev A, Dev Type: CUDA_GPU, where Dev Type: CUDA_GPU indicates the memory type of the peer device, that is, device A. Node s2 is located on device A and carries the following information: Name: Edge_x_y_0, size: 4000, Dst Dev: Dev B, Dev Type: CUDA_GPU, where Dev Type: CUDA_GPU indicates the memory type of the peer device, that is, device B. The send primitive of source node s2 can then be written as MPI_Isend(tag=hash"Edge_x_y_0", size=4000+LEN_HEADER, rank=dev2rank("Dev B")), that is, the send primitive carries the triple parameters mentioned above, while Dev Type: CUDA_GPU is carried in the data part of the send primitive; the data part of the send primitive may also include the name of the device where the source node resides, i.e., Dev A. The receive primitive of destination node r2 carries the interface information and can be written as MPI_Irecv(tag=hash"Edge_x_y_0", size=4000+LEN_HEADER, rank=dev2rank("Dev A")), that is, the receive primitive carries the triple parameters mentioned above, while Dev Type: CUDA_GPU is carried in the data part of the receive primitive; the data part of the receive primitive may also include the name of the device where the destination node resides, i.e., Dev B.
In this example, 4000 in size indicates the size of the data flow graph parameter carried by the MPI primitive, and LEN_HEADER indicates the length of the packet header described above (length of header). The data transmitted by the MPI primitive is the Tensor data structure after serialization, i.e., a group of bytes. Besides the data flow graph parameter to be transmitted, the Tensor data structure also contains some other information fields, which after serialization are called the "packet header", for example the name of the Tensor data structure. The length of the packet header is fixed, so a constant can simply be added.
In this way, during the computation of the data flow graph, the sending end and the receiving end can obtain the identifier of the peer device and similar information without interacting with each other, and generate the parameters of the MPI interface functions, which reduces the number of communications and the waiting between processes.
The description continues with the process by which a process sends a data flow graph parameter. As stated above, the process that executes the MPI send primitive must first obtain the data flow graph parameter to be sent that the MPI send primitive carries. This can be described as follows: it is determined whether the data flow graph parameter to be sent is located in host memory; if it is located in another storage device, the parameter is copied into host memory. For example, the other storage device may be a memory of the host other than the host memory, such as GPU memory. This is because a general MPI interface function can only use data in host memory directly.
In one implementation, the host memory includes an MPI buffer, which is memory space allocated for use by the MPI library. Some MPI libraries include such a buffer to store the data invoked by MPI primitives; the memory space may be, for example, 64 KB. Optionally, the MPI buffer may also be memory space allocated to a user that can be reused. The MPI buffer space is obviously small; although it can break the synchrony of communication through the MPI library to some extent, for example by accepting messages below 64 KB, it is easily exhausted and cannot meet the needs of machine learning scenarios, where data flow graph parameters of several hundred KB or several MB are common. This application therefore proposes another implementation: a data cache area is additionally set up in the memory of the host where the computing platform resides, the address space of the data cache area is larger than that of the MPI buffer, and the data cache area is dedicated to storing the data invoked by MPI primitives. For example, a data cache area of several MB or more than ten MB can be set up in memory, and it is even possible to set up a data cache area of hundreds of MB or several GB in the host memory. It should be understood that the physical machines on which a machine learning platform is deployed are configured with fairly large memory, which can satisfy the memory allocation requirement mentioned above. In this way, the data cache area is used together with the MPI buffer, expanding the capacity of the MPI buffer and further enhancing the ability to handle data used by MPI primitives. Used together with the polling thread described below, it accelerates the receiving end's processing of the data sent into its host memory, thereby accelerating the execution of the MPI send primitive and hence the data interaction in the machine learning platform. For ease of description, the MPI buffer and the data cache area mentioned above are collectively referred to as the buffer area; what they have in common is that they are dedicated to storing data invoked by MPI primitives.
On the other hand, when MPI technology uses the RDMA communication mechanism, data can be written remotely into the host memory of a host without the CPU of that host being aware of it; that is, even when the host that needs to receive the data has not yet executed the MPI receive primitive, the host sending the data can send the data to it with the MPI send primitive. Since the MPI send primitive does not carry the destination address of the transmitted data, which is carried only by the MPI receive primitive, the received data can be transferred to its destination address only when the peer executes the MPI receive primitive; before the MPI receive primitive is executed, the received data is first stored in these buffer areas. Consequently, if the buffer area of the host that needs to receive the data does not have enough free space to hold the data about to be sent, the MPI send primitive carrying that data cannot be executed. In other words, MPI technology has inherent synchronization and ordering constraints: for the sending end to execute MPI send primitives smoothly and continuously, the receiving end must, after receiving data, execute the MPI receive primitive corresponding to the received data as soon as possible. Adding the data cache area described above therefore greatly accommodates asynchronous, out-of-order send and receive operations without adding an extra message synchronization mechanism, thereby satisfying the needs of computing platforms such as TensorFlow. An MPI library that requires synchronization and ordering can thus be connected to TensorFlow, which is asynchronous and out of order, helping to improve the performance of parameter communication in the machine learning platform during data flow graph communication.
In the TensorFlow platform, the above step in which the sending-end process obtains the data flow graph parameter to be transmitted can be regarded as being executed by the Remote Rendezvous module before the MPI send primitive (for example MPI_Send or MPI_Isend) is executed. For example, the Remote Rendezvous module reads the memory type field of the data flow graph parameter to be communicated and judges whether the parameter is located in the host memory address space. If yes, this step ends. If not, for example the data flow graph parameter to be communicated is located in GPU memory, the cudaMemcpy function provided by the CUDA programming interface is executed to copy the data flow graph parameter from GPU memory into host memory. In this way, regardless of whether the selected MPI library supports access to the GPU, it can be used in the machine learning platform without accessing GPU memory through MPI interface functions, so the range of selectable MPI libraries is much larger, and the resource contention problem of accessing the GPU in the Baidu scheme mentioned above is also greatly alleviated. Moreover, since this step is executed by the Remote Rendezvous module in the TensorFlow platform, which belongs to the core layer of the TensorFlow platform, a thread does not need to hold a process-level lock on the GPU memory for the entire execution of the MPI send primitive or MPI receive primitive; only the step of copying the data to be sent from GPU memory into host memory, mentioned above, needs to be locked, which shortens the waiting time of other threads and greatly reduces the lock contention among different processes accessing GPU memory.
The computation of a data flow graph deployed across devices obviously includes receptions and transmissions of data flow graph parameters, which may be regarded as the function of the MPI client mentioned above. Before being sent and before being received (here "received" means that the data flow graph parameter is processed by the MPI receive primitive at the receiving end), the transmitted data flow graph parameter is stored in the buffer area of the corresponding host memory, that is, the address segment in memory allocated specifically for the MPI library, which may be the MPI buffer mentioned above or the data cache area.
The reception and transmission of data flow graph parameters are described below. In one implementation, it is detected whether any data flow graph parameter in the data flow graph needs cross-device communication. If so, it is determined whether the pending communication operation is a data send operation or a data receive operation; if it is a data send operation, the data (the data flow graph parameter) is sent using the MPI_Send or MPI_Isend primitive; if it is a data receive operation, the data is received using the MPI_Recv or MPI_Irecv primitive; afterwards, the Remote Rendezvous module uses the received data as the data flow graph parameter. Once a send or receive operation ends, it is detected again whether any parameter still needs cross-device communication, and this loop repeats. The loop may be executed by multiple threads running on a physical machine; these threads are controlled by some scheduling mechanism and execute different instructions according to it to complete different operations, sending data and receiving data being two kinds of operations that a thread can be dynamically scheduled to perform. For example, the scheduling mechanism defines which primitives are executed after which events occur, that is, what triggers the execution of which primitive. For instance, as mentioned above, when it is detected that a data flow graph parameter in the data flow graph that this host is responsible for computing needs cross-device communication, an MPI send primitive or an MPI receive primitive is executed according to the operation type. Machine learning platforms are often asynchronous and out of order, and this approach is relatively common in machine learning platforms.
It should be noted that the MPI receive primitive is also used to process the data in the buffer area so that the processes or threads used for computation in the machine learning platform can use the data in that buffer area. For example, it may process the meta-information of the data (such as confirming the state of the data, or confirming which of the data to be received has been received), it may perform synchronization processing of the data, for example informing the processes or threads used for computation in the machine learning platform that the data is ready, or it may store the data to the destination address. The program implementation contained in the MPI receive primitive may differ among MPI libraries.
Since one of the threads described above can execute only one MPI receive primitive at a time, and a machine learning platform deployed in a distributed manner involves the interaction of multiple physical machines, a physical machine on which the machine learning platform is deployed may receive, within a short time, data transmitted by many MPI send primitives, and the thread described above may be unable to process, with MPI receive primitives in time, the data sent into the physical machine's host memory. In another implementation, the physical machine on which the machine learning platform is deployed can allocate a dedicated thread that is exclusively used to detect data sent by MPI send primitives and to receive the detected data. This dedicated thread is a special-purpose thread and does not need to be controlled by the scheduling mechanism described earlier. For example, it may be implemented by the Distributed Runtime module described above in cooperation with the Remote Rendezvous module. This improves the timeliness of receiving information through MPI receive primitives and also reduces, at the receiving end, the time that the remaining MPI send primitives must wait before being executed.
The operation of this dedicated thread is described below. It detects, in a polling manner, whether the memory of the host contains triple information of data that has been sent, and processes the data corresponding to any triple it detects, thereby accelerating the processing, by MPI receive primitives, of the received data in the host memory. For convenience, this thread may be called the polling thread. Specifically, the polling thread polls the buffer area in the host memory, that is, the memory space mentioned above that is dedicated to storing data invoked by MPI primitives; for example it polls the MPI buffer, or, when the host memory contains both the MPI buffer and the data cache area, it polls both. This process may also be regarded as the MPI scheduling function described above, implemented in the host serving as the receiving end of data flow graph parameters. One round of the polling process is described below. The polling thread calls a probe primitive of the MPI library, such as MPI_Probe or MPI_Iprobe, to detect whether the host memory contains a sent data flow graph parameter, or the triple corresponding to such a parameter, that is waiting for its corresponding MPI receive primitive, that is, whether the MPI receive primitive corresponding to the sent data flow graph parameter has not yet been executed. If not, it continues to execute the probe primitive to poll the buffer area in the host memory. If so, it calls an MPI_Irecv primitive corresponding to the detected data flow graph parameter, so that the data flow graph parameter to be received can be received into local memory; the detection thread determines the interface parameters of the MPI_Irecv primitive that processes this data flow graph parameter from the triple <rank, size, tag> corresponding to the edge of the destination computing node it detected. Then, the MPI_Recv or MPI_Irecv primitive corresponding to this data flow graph parameter in the execution policy script to be executed by other threads is changed into an MPI_Wait primitive, which waits for the polling thread to complete, through the MPI_Irecv primitive, the processing of the data flow graph parameter (for example, placing it into the memory space corresponding to its destination address). Once the polling thread has completed the processing of the data flow graph parameter through the MPI_Irecv primitive, the current round of polling ends, and the polling thread continues to poll the buffer area in the host memory to detect whether the host memory contains a sent data flow graph parameter waiting to be processed by its corresponding MPI receive primitive. In fact, when the MPI_Irecv primitive of the polling thread starts to execute, return information is sent to the receive primitive in the executable policy script of the other threads, to trigger that receive primitive to be changed into an MPI_Wait primitive; for example, an MPI request (MPI_Request) object is returned, which may contain the detected triple, so that the MPI_Wait primitive originally intended to handle the data corresponding to this triple, and not yet executed, can be called according to that object. In this way, the data written into the buffer area of the receiving-end host memory can be processed quickly, the space occupied by data already received in that buffer area is freed, and other data from the sending end can be written in quickly; that is, the sending end can reduce the waiting time before executing other send primitives, the receiving end can process the data of more buffer areas within a shorter time, and the sending end can execute more send primitives within a shorter time.
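The following C++ sketch shows one round of the polling thread under stated assumptions: the buffer slot and the hand-off of the request handle to the thread that originally scheduled the receive are hypothetical simplifications of the platform's buffer management, while the MPI calls (MPI_Iprobe, MPI_Get_count, MPI_Irecv) are standard.

```cpp
#include <mpi.h>
#include <optional>

// Result of one polling round: the detected triple plus the request handle
// of the MPI_Irecv posted for it. The thread that had originally scheduled
// the matching MPI_Recv uses this handle to call MPI_Wait instead.
struct Posted {
    int rank, size, tag;
    MPI_Request req;
};

// One round of the polling thread. 'recv_buf' is a slot in the data cache
// area assumed to be large enough for the incoming parameter.
std::optional<Posted> poll_once(void* recv_buf) {
    int flag = 0;
    MPI_Status status;
    // MPI probe primitive: check, without blocking, whether some sent
    // parameter is still waiting for its receive primitive.
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
    if (!flag) return std::nullopt;          // nothing detected this round

    Posted p;
    p.rank = status.MPI_SOURCE;              // peer process sequence number
    p.tag  = status.MPI_TAG;                 // message tag (hashed name)
    MPI_Get_count(&status, MPI_BYTE, &p.size);

    // First MPI receive primitive: post a non-blocking receive for the
    // detected triple so the sender's data leaves the buffer area promptly.
    MPI_Irecv(recv_buf, p.size, MPI_BYTE, p.rank, p.tag, MPI_COMM_WORLD,
              &p.req);
    return p;  // the other thread later calls MPI_Wait(&p.req, ...)
}
```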
In one implementation, the detection can be performed continuously in an infinite loop; for example, the thread runs until the entire process of computing the data flow graph has been completed.
How the detection thread detects, in the host memory, data waiting to be processed by an MPI receive primitive is briefly described below. The MPI library maintains data structures that keep corresponding records: if a piece of data has not yet been processed by an MPI receive primitive, a record of it is stored in such a data structure, and the MPI probe primitive can detect that record and thereby determine that the data corresponding to the record has not been processed by an MPI receive primitive. If a piece of data has been received by an MPI receive primitive or by an MPI wait primitive, the record corresponding to that data is removed from the aforementioned data structure, so that it can no longer be detected by the MPI probe primitive. In one implementation, the MPI receive primitive (for example MPI_Recv) and the MPI wait primitive (for example MPI_Wait) mentioned above can be regarded as being executed by the Remote Rendezvous module of the TensorFlow platform.
In general, these data flow graph parameters are provided by the MPI receive primitive to the threads or processes at the receiving end that are used for computation, so as to carry out the machine learning computation; for example, the MPI receive primitive places a data flow graph parameter into a memory space in host memory belonging to a user, where the user is the one carrying out the machine learning. On the other hand, when it is determined that the destination address of the data flow graph parameter is not in host memory, a copy of the data flow graph parameter is stored into the device corresponding to the destination address. This step can also be regarded as part of the MPI client function in the Remote Rendezvous module. Taking GPU memory as the destination address as an example, the cudaMemcpy function provided by the CUDA programming interface is then called to copy the received data into GPU memory.
Because the primitives of the MPI library are used, data is first written into the buffer area of the host memory before being written to the destination address, and in the computation of the data flow graph the destination addresses of many data flow graph parameters are other devices such as GPUs. The received data is therefore written to the GPU using the CUDA programming interface in the computing platform, without relying on the MPI library to support access to GPU memory, which greatly expands the types of MPI libraries that can be used and likewise greatly alleviates the resource contention problem of accessing the GPU found in the Baidu scheme mentioned above. And since this step is executed by the Remote Rendezvous module in the TensorFlow platform, which belongs to the core layer of the TensorFlow platform, a process does not need to hold a lock on GPU memory for the entire execution of an MPI send primitive or MPI receive primitive; only the step of copying the data between host memory and GPU memory needs to be locked, which shortens the waiting time of other threads and greatly reduces the lock contention among different processes accessing GPU memory.
The methods corresponding to Fig. 5 to Fig. 7 can be run in the systems and servers shown in Fig. 3 and Fig. 7.
In summary, the data transmission method proposed in this application does not require negotiating peer information with the communication peer before data transmission. It eases the contradiction between the synchronous, order-constrained communication of the MPI library and the asynchronous, out-of-order communication of data flow diagrams, and alleviates the contention between the MPI library and the computing platform for access to the GPU, so that MPI technology can be better adapted to a computing platform with distributed deployment. Network transmission resources can be fully utilized and the efficiency of data transmission in the machine learning platform is improved, thereby improving the business processing speed of the machine learning platform.
In another aspect, an embodiment of the present invention provides a data transmission apparatus in a distributed computing system, as shown in Fig. 8. The distributed computing system includes a first calculate node and a second calculate node, and the data transmission apparatus is located at the first calculate node. The data transmission apparatus includes a determining module 801, where the determining module 801 is configured to determine, from a first graph data structure in the first calculate node, the name, size and Correspondent Node identifier of a first data flow diagram parameter of a first data flow diagram, the first data flow diagram parameter being a parameter carried on a connection edge of the first data flow diagram, and the Correspondent Node identifier corresponding to the second calculate node.
A generation module 802, where the generation module 802 is configured to generate a first triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the first graph data structure, using a first interface parameter generation algorithm. The first triple includes a message label, a message size and a purpose process sequence number, where the message label corresponds to the name of the first data flow diagram parameter, the message size corresponds to the size of the first data flow diagram parameter, and the purpose process sequence number corresponds to the process in the second calculate node that receives the first data flow diagram parameter;
A communication module 803, where the communication module 803 is configured to call a message passing interface (MPI) send primitive, with the first triple as interface parameters, to send the first data flow diagram parameter to the second calculate node, so that the second calculate node can call an MPI receive primitive, with a second triple corresponding to the first triple as interface parameters, to process the first data flow diagram parameter. The second triple is generated according to a second graph data structure in the second calculate node using a second interface parameter generation algorithm, and the second interface parameter generation algorithm is identical to the first interface parameter generation algorithm.
In one implementation, in the aspect of calling a message passing interface MPI send primitive, with the first triple as interface parameters, to send the first data flow diagram parameter to the second calculate node, the communication module 803 is configured to read, with the first triple as interface parameters, the first data flow diagram parameter from the host memory of the first calculate node through the MPI send primitive, so as to send the first data flow diagram parameter to the second calculate node.
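For illustration, the send path can be reduced to a single call of the standard MPI interface, assuming the triple has already been generated and the parameter has been staged into a host-memory buffer; the variable names are hypothetical, and a non-blocking MPI_Isend could be substituted.

```cpp
#include <mpi.h>

// Send one serialized data flow diagram parameter, using the first triple
// (message_tag, message_size, dest_rank) as the MPI interface parameters.
void SendParam(const void* host_buf, int message_size, int message_tag,
               int dest_rank, MPI_Comm comm) {
  MPI_Send(host_buf, message_size, MPI_BYTE, dest_rank, message_tag, comm);
}
```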
In one implementation, the first calculate node also stores information about the storage device where the first data flow diagram parameter is located, and the first calculate node further includes a read module 804. The read module 804 is configured to copy, in the case where the information of the storage device indicates another storage device, the first data flow diagram parameter from that other storage device to the host memory of the first calculate node, where the other storage device is a memory in the first calculate node other than the host memory.
In one implementation, the first interface parameter generation algorithm includes a first algorithm, a second algorithm and a third algorithm. In the aspect of generating the first triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the first graph data structure using the first interface parameter generation algorithm, the generation module 802 is configured to determine the message label in the first triple according to the name of the first data flow diagram parameter in the first graph data structure and the first algorithm, determine the message size in the first triple according to the size of the first data flow diagram parameter in the first graph data structure and the second algorithm, and determine the purpose process sequence number in the first triple according to the Correspondent Node identifier of the first data flow diagram parameter in the first graph data structure and the third algorithm.
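A sketch of one possible form of these three algorithms, purely as an assumption for illustration: the first algorithm hashes the parameter name into an MPI tag, the second takes the parameter size as the message size, and the third maps the Correspondent Node identifier to an MPI process rank through a cluster table. The names TagFromName and rank_of_node and the hashing scheme are hypothetical, since this application does not fix a concrete algorithm; what matters is that both ends apply the same algorithm to the same inputs.

```cpp
#include <functional>
#include <map>
#include <string>

struct Triple {
  int message_tag;    // derived from the parameter name (first algorithm)
  int message_size;   // derived from the parameter size (second algorithm)
  int process_rank;   // derived from the Correspondent Node identifier (third algorithm)
};

// Hypothetical first algorithm: fold a string hash of the name into the
// value range usable for MPI tags.
int TagFromName(const std::string& name, int max_tag = 32767) {
  return static_cast<int>(std::hash<std::string>{}(name) % max_tag);
}

Triple MakeTriple(const std::string& name, int size_bytes, const std::string& peer_id,
                  const std::map<std::string, int>& rank_of_node) {
  Triple t;
  t.message_tag = TagFromName(name);
  t.message_size = size_bytes;                 // second algorithm: identity mapping
  t.process_rank = rank_of_node.at(peer_id);   // third algorithm: node id -> MPI rank
  return t;
}
```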
It can be seen that, in the above implementations, the data transmission apparatus shown in Fig. 8 acts as the sending end of a data transmission. In some other implementations, the data transmission apparatus shown in Fig. 8 can perform the operations corresponding to those of the sending end, so as to act as the receiving end of a data transmission. That is, in some cases the data transmission apparatus shown in Fig. 8 may have the functions of both the sending end and the receiving end; in other words, the data transmission apparatus shown in Fig. 8 is the sending end in the transmission of some data and the receiving end in the transmission of other data.
The following describes the implementation in which the data transmission apparatus in the distributed computing system shown in Fig. 8 acts as the data receiving end. The distributed computing system includes a first calculate node and a second calculate node, and the data transmission apparatus is located at the second calculate node. The determining module 801 of the data transmission apparatus is configured to determine, from a second graph data structure of the second calculate node, the name, size and Correspondent Node identifier of a first data flow diagram parameter in a second data flow diagram, where the Correspondent Node identifier of the first data flow diagram parameter in the second data flow diagram corresponds to the first calculate node.
The generation module 802 is configured to generate a second triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the second graph data structure, using a second interface parameter generation algorithm. The second triple includes a message label, a message size and an originating process sequence number, where the message label corresponds to the name of the first data flow diagram parameter, the message size corresponds to the size of the first data flow diagram parameter, and the originating process sequence number corresponds to the process in the first calculate node that sends the first data flow diagram parameter.
The communication module 803 is configured to call a message passing interface MPI receive primitive, according to the second triple, to process the first data flow diagram parameter coming from the first calculate node, where the first data flow diagram parameter is sent by the first calculate node through an MPI send primitive, and the interface parameters of the MPI send primitive include a first triple corresponding to the second triple. The first triple is generated by the first calculate node according to the first graph data structure in the first calculate node, using a first interface parameter generation algorithm, and the second interface parameter generation algorithm is identical to the first interface parameter generation algorithm.
In one implementation, the communication module 803 includes a first thread and a second thread, and the host memory of the second calculate node includes a data buffer that is dedicated to storing data processed by MPI primitives. In the aspect of calling a message passing interface MPI receive primitive, with the second triple as interface parameters, to process the first data flow diagram parameter coming from the first calculate node: the first thread is configured to detect the data buffer in the host memory through an MPI probe primitive, so as to obtain the second triple; the first thread is configured to call a first MPI receive primitive according to the second triple in the data buffer, so as to process the first data flow diagram parameter, where the second triple in the data buffer is obtained by the second calculate node according to the MPI send primitive; and the second thread is configured to, after determining that the first data flow diagram parameter has been processed by the first MPI receive primitive, change a second MPI receive primitive into an MPI wait primitive, where the second MPI receive primitive is the receive primitive, corresponding to the first data flow diagram parameter, that has not yet been executed by the second thread, the interface parameters of the second MPI receive primitive include the second triple generated by the second calculate node, and the MPI wait primitive is used to wait for the first MPI receive primitive to finish execution.
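A sketch of the second thread's side of this hand-off, under the same assumed shared table as in the earlier polling sketch: if the first thread has already posted the receive for this triple, the second thread only waits on the stored MPI_Request instead of issuing its own receive; the table name and locking scheme are illustrative assumptions.

```cpp
#include <mpi.h>
#include <map>
#include <mutex>
#include <utility>

extern std::mutex table_mutex;                                        // shared with the polling thread
extern std::map<std::pair<int, int>, MPI_Request> pending_requests;   // hypothetical shared table

// Called by the second thread with the second triple (message_tag, src_rank).
// Returns true if the receive was already posted and has now completed.
bool WaitIfAlreadyReceived(int message_tag, int src_rank) {
  MPI_Request req;
  {
    std::lock_guard<std::mutex> guard(table_mutex);
    auto it = pending_requests.find({message_tag, src_rank});
    if (it == pending_requests.end()) return false;                   // fall back to a normal receive
    req = it->second;
    pending_requests.erase(it);
  }
  MPI_Wait(&req, MPI_STATUS_IGNORE);  // the second MPI receive primitive becomes an MPI wait
  return true;
}
```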
In this way, the processing of the data flow diagram parameters that are already prepared in the data buffer, as indicated by the triples, can be accelerated, which speeds up the receiving end's handling of received data and therefore speeds up the sending end's execution of send primitives. In addition, the data buffer enhances the adaptability of MPI primitives to out-of-order, asynchronous send and receive operations, so that they better fit the characteristics of data transmission in the computing platform.
In the above implementation, in the aspect of calling the first MPI receive primitive according to the second triple in the data buffer so as to process the first data flow diagram parameter, the first thread in the communication module 803 is configured to: in the case where the destination address of the first data flow diagram parameter corresponds to memory space allocated to a user in the host memory of the second calculate node, call the first MPI receive primitive with the second triple in the data buffer as the interface parameters of the first MPI receive primitive, and store the first data flow diagram parameter from the data buffer to the destination address of the first data flow diagram parameter.
In one implementation, the data transmission apparatus further includes a storage module 805, and the storage module 805 is configured to, in the case where the destination address of the first data flow diagram parameter corresponds to another storage device, store the first data flow diagram parameter in the host memory to the destination address, where the other storage device is a memory in the second calculate node other than the host memory.
In one implementation, the second interface parameter generation algorithm includes a first algorithm, a second algorithm and a third algorithm. In the aspect of generating the second triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the second graph data structure using the second interface parameter generation algorithm, the generation module 802 is configured to determine the message label in the second triple according to the name of the first data flow diagram parameter in the second graph data structure and the first algorithm in the second interface parameter generation algorithm; determine the message size in the second triple according to the size of the first data flow diagram parameter in the second graph data structure and the second algorithm in the second interface parameter generation algorithm; and determine the originating process sequence number in the second triple according to the Correspondent Node identifier of the first data flow diagram parameter in the second graph data structure and the third algorithm in the second interface parameter generation algorithm.
It can be seen that the data transmission apparatus corresponding to Fig. 8 corresponds, in some cases, to the first calculate node and, in other cases, to the second calculate node described above, and can perform the sender-side or receiver-side methods described above with reference to Fig. 5 to Fig. 7. Accordingly, for the various explanations relating to the data transmission apparatus corresponding to Fig. 8 and for the beneficial effects of the steps performed by each module, refer to the corresponding paragraphs above; details are not repeated here. In this way, applying MPI technology to the communication in data flow diagram computation is simplified, peer information does not need to be negotiated with the communication peer before data transmission, and MPI technology can be better adapted to the computing platform with distributed deployment, so that the efficiency of data transmission in the distributed computing system is improved and the efficiency of computing data flow diagrams in the distributed computing system is raised.
It should also be noted that the determining module 801, the generation module 802 and the communication module 803 in Fig. 8 may be different pieces of code, or processes or threads that run code, and the division shown in Fig. 8 is only an example; in some implementations these modules may be named or divided differently, for example several of them may form one module. The determining module 801 may correspond to the memory management module 3014 described above or to the Common Runtime module 5013 in the TensorFlow platform; the generation module 802 and the communication module 803 may correspond to the communication management module 3015 described above or to the Remote Rendezvous module 5014 in the TensorFlow platform. In the case where the communication module 803 includes the first thread and the second thread, the first thread may be regarded as belonging to the runtime engine module 3013 mentioned above or to the Distributed Runtime module 5012 in the TensorFlow platform, while the second thread belongs to the communication management module or to the Remote Rendezvous module 5014 in the TensorFlow platform. Therefore, the data transmission apparatus shown in Fig. 8 can implement the functions of the modules described above; for specific implementations, refer to the description above.
Through the above description of the embodiments, those skilled in the art can clearly understand that, for convenience and brevity of description, only the division into the above functional modules is used as an example; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. For the specific working processes of the apparatus and units described above, refer to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In another aspect, as shown in Fig. 9, Fig. 9 is a schematic block diagram of a physical machine provided by an embodiment of the present invention. The embodiment of the present invention provides a physical machine that includes a processor 40 and a memory 42, where the memory 42 is a non-transitory computer-readable medium storing executable code. The physical machine may be the first calculate node or the second calculate node described above. The distributed computing platform runs on the processor 40 by means of the program in the memory 42.
The physical machine can execute the executable code in the memory 42 by means of the processor 40, so as to perform the various methods described above. Clearly, Fig. 9 is a simpler representation than the servers shown in Fig. 3 or Fig. 7 that can run the various methods of this application. The data transmission apparatus shown in Fig. 8 may run in the architecture shown in Fig. 9, Fig. 3 or Fig. 7.
In one implementation, the non-transitory computer-readable medium storing the executable code is the memory 42, and the physical machine further includes an interface circuit 41 and a system bus 43.
The processor 40 may be implemented by a plurality of processors. The processor 40 may be a central processing unit (CPU). The processor 40 may also be another general-purpose processor, a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
For example, the physical machine and the corresponding embodiments in Fig. 5 to Fig. 7 are described by taking a physical machine that includes a CPU and a GPU as an example. If the processor includes a GPU, the GPU and the GPU memory are usually packaged in the same chip, that is, the processor 40 may include the memory of some of the processors.
The interface circuit 41 may specifically be a communication interface of the hardware in the physical machine. The communication interface may be a wireless communication interface and may further include a radio-frequency circuit such as an antenna; for example, the wireless communication interface may be a wireless module of the physical machine. The processor 40 sends data to and receives data from other devices, for example other physical machines, through the interface circuit 41. The network interface card 3024 and the InfiniBand network interface card 5023 shown in the physical machines in Fig. 3 and Fig. 7 are examples of an implementation of the communication interface.
The memory 42 may include a volatile memory, for example a random access memory (RAM); the memory 42 may also include a non-volatile memory, for example a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 42 may also include a combination of the above kinds of memory. There may be multiple memories 42 for use by the multiple processors 40 mentioned above, for example the host memory and the GPU memory described in the foregoing embodiments.
The memory 42 may include an underlying storage medium and a memory, where the memory is coupled to the underlying storage medium and serves as a cache of the underlying storage medium.
The system bus 43 may include a data bus, a power bus, a control bus, a signal status bus and the like. The system bus 43 is used to connect the processor 40, the memory 42 and the interface circuit 41 described above. For clarity of description in this embodiment, the various buses are all illustrated as the system bus 43 in Fig. 9.
The physical machine provided by this embodiment of the present invention can be used to perform any method described in this application, for example the methods performed by the first calculate node or the second calculate node corresponding to Fig. 5 to Fig. 7. Accordingly, for the explanations of the terms involved in the physical machine corresponding to Fig. 9, the descriptions of the methods, and the beneficial effects of each step of the methods, refer to the corresponding paragraphs above; details are not repeated here. In this way, applying MPI technology to the communication in data flow diagram computation is simplified, peer information does not need to be negotiated with the communication peer before data transmission, and MPI technology can be better adapted to the computing platform with distributed deployment, so that the efficiency of data transmission in the distributed computing system is improved and the efficiency of computing data flow diagrams in the distributed computing system is raised.
Optionally, this embodiment also provides a non-transitory computer-readable medium storing an executable program, where the executable program includes a program for performing any method described in this application. The non-transitory computer-readable medium can be installed in a physical machine; when the physical machine runs, the processor of the physical machine executes the computer-executable instructions, so that the physical machine performs a method described in this application. For the explanations of the terms involved in the methods stored in the above non-transitory computer-readable medium, the descriptions of those methods, and the beneficial effects of each step, refer to the corresponding paragraphs above; details are not repeated here.
Optionally, the non-transitory computer-readable medium in this embodiment may be the memory 42 shown in Fig. 9.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely exemplary; the division of the modules or units is only a logical function division, and there may be other division manners in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk.
The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (28)

1. A distributed computing system, the distributed computing system including a first calculate node and a second calculate node, wherein a first graph data structure in the first calculate node stores the name, size and Correspondent Node identifier of a first data flow diagram parameter in a first data flow diagram, the first data flow diagram parameter being a parameter carried on a connection edge of the first data flow diagram; a second graph data structure in the second calculate node stores the name, size and Correspondent Node identifier of the first data flow diagram parameter in a second data flow diagram; the Correspondent Node identifier of the first data flow diagram parameter in the first data flow diagram corresponds to the second calculate node, and the Correspondent Node identifier of the first data flow diagram parameter in the second data flow diagram corresponds to the first calculate node;
the first calculate node is configured to generate a first triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the first graph data structure, using a first interface parameter generation algorithm, where the first triple includes a message label, a message size and a purpose process sequence number, the message label corresponds to the name of the first data flow diagram parameter, the message size corresponds to the size of the first data flow diagram parameter, and the purpose process sequence number corresponds to the process in the second calculate node that receives the first data flow diagram parameter;
the second calculate node is configured to generate a second triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the second graph data structure, using a second interface parameter generation algorithm, where the second interface parameter generation algorithm is identical to the first interface parameter generation algorithm, the second triple includes the message label, the message size and an originating process sequence number, and the originating process sequence number corresponds to the process in the first calculate node that sends the first data flow diagram parameter;
the first calculate node is configured to call a message passing interface MPI send primitive, with the first triple as interface parameters, to send the first data flow diagram parameter to the second calculate node; and
the second calculate node is configured to call an MPI receive primitive, according to the second triple, to process the first data flow diagram parameter.
2. The system according to claim 1, wherein, in the aspect of calling a message passing interface MPI send primitive, with the first triple as interface parameters, to send the first data flow diagram parameter to the second calculate node, the first calculate node is configured to read, with the first triple as interface parameters, the first data flow diagram parameter from the host memory of the first calculate node through the MPI send primitive, so as to send the first data flow diagram parameter to the second calculate node.
3. The system according to claim 2, wherein the first calculate node also stores information about the storage device where the first data flow diagram parameter is located, and the first calculate node is further configured to, in the case where the information of the storage device indicates another storage device, copy the first data flow diagram parameter from the other storage device to the host memory of the first calculate node, where the other storage device is a memory in the first calculate node other than the host memory.
4. The system according to any one of claims 1 to 3, wherein the first interface parameter generation algorithm includes a first algorithm, a second algorithm and a third algorithm, and, in the aspect of generating the first triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the first graph data structure using the first interface parameter generation algorithm, the first calculate node is configured to:
determine the message label in the first triple according to the name of the first data flow diagram parameter in the first graph data structure and the first algorithm, determine the message size in the first triple according to the size of the first data flow diagram parameter in the first graph data structure and the second algorithm, and determine the purpose process sequence number in the first triple according to the Correspondent Node identifier of the first data flow diagram parameter in the first graph data structure and the third algorithm;
correspondingly, in the aspect of generating the second triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the second graph data structure using the second interface parameter generation algorithm, the second calculate node is configured to:
determine the message label in the second triple according to the name of the first data flow diagram parameter in the second graph data structure and the first algorithm in the second interface parameter generation algorithm; determine the message size in the second triple according to the size of the first data flow diagram parameter in the second graph data structure and the second algorithm in the second interface parameter generation algorithm; and determine the originating process sequence number in the second triple according to the Correspondent Node identifier of the first data flow diagram parameter in the second graph data structure and the third algorithm in the second interface parameter generation algorithm.
5. The system according to any one of claims 1 to 4, wherein, in the aspect of calling an MPI receive primitive according to the second triple to process the first data flow diagram parameter, the second calculate node is configured to detect a data buffer in the host memory of the second calculate node through an MPI probe primitive, so as to obtain the second triple of the first data flow diagram parameter, where the data buffer is dedicated to storing data processed by MPI primitives; and call the MPI receive primitive to process the first data flow diagram parameter, where the interface parameters of the MPI receive primitive include the second triple.
6. The system according to any one of claims 1 to 4, wherein the MPI receive primitive carries the destination address of the first data flow diagram parameter, and, in the aspect of processing the first data flow diagram parameter through the MPI receive primitive, the second calculate node is configured to call the MPI receive primitive, with the second triple as the interface parameters of the MPI receive primitive, so as to store the first data flow diagram parameter from the data buffer to the destination address.
7. A data transmission method in a distributed computing system, the distributed computing system including a first calculate node and a second calculate node, wherein the method includes:
determining, from a first graph data structure in the first calculate node, the name, size and Correspondent Node identifier of a first data flow diagram parameter of a first data flow diagram, where the first data flow diagram parameter is a parameter carried on a connection edge of the first data flow diagram, and the Correspondent Node identifier corresponds to the second calculate node;
generating a first triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the first graph data structure, using a first interface parameter generation algorithm, where the first triple includes a message label, a message size and a purpose process sequence number, the message label corresponds to the name of the first data flow diagram parameter, the message size corresponds to the size of the first data flow diagram parameter, and the purpose process sequence number corresponds to the process in the second calculate node that receives the first data flow diagram parameter; and
calling a message passing interface MPI send primitive, with the first triple as interface parameters, to send the first data flow diagram parameter to the second calculate node, so that the second calculate node can call an MPI receive primitive, with a second triple corresponding to the first triple as interface parameters, to process the first data flow diagram parameter, where the second triple is generated according to a second graph data structure in the second calculate node using a second interface parameter generation algorithm, and the second interface parameter generation algorithm is identical to the first interface parameter generation algorithm.
8. The method according to claim 7, wherein, in the aspect of calling a message passing interface MPI send primitive, with the first triple as interface parameters, to send the first data flow diagram parameter to the second calculate node, the method includes:
reading, with the first triple as interface parameters, the first data flow diagram parameter from the host memory of the first calculate node through the MPI send primitive, so as to send the first data flow diagram parameter to the second calculate node.
9. The method according to claim 8, wherein the first calculate node also stores information about the storage device where the first data flow diagram parameter is located, and the method further includes:
in the case where the information of the storage device indicates another storage device, copying the first data flow diagram parameter from the other storage device to the host memory of the first calculate node, where the other storage device is a memory in the first calculate node other than the host memory.
10. The method according to any one of claims 7 to 9, wherein the first interface parameter generation algorithm includes a first algorithm, a second algorithm and a third algorithm, and, in the aspect of generating the first triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the first graph data structure using the first interface parameter generation algorithm, the first calculate node is configured to:
determine the message label in the first triple according to the name of the first data flow diagram parameter in the first graph data structure and the first algorithm, determine the message size in the first triple according to the size of the first data flow diagram parameter in the first graph data structure and the second algorithm, and determine the purpose process sequence number in the first triple according to the Correspondent Node identifier of the first data flow diagram parameter in the first graph data structure and the third algorithm.
11. A data transmission device in a distributed computing system, the distributed computing system including a first calculate node and a second calculate node, wherein the data transmission device is located at the first calculate node and includes:
a determining module, configured to determine, from a first graph data structure in the first calculate node, the name, size and Correspondent Node identifier of a first data flow diagram parameter of a first data flow diagram, where the first data flow diagram parameter is a parameter carried on a connection edge of the first data flow diagram, and the Correspondent Node identifier corresponds to the second calculate node;
a generation module, configured to generate a first triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the first graph data structure, using a first interface parameter generation algorithm, where the first triple includes a message label, a message size and a purpose process sequence number, the message label corresponds to the name of the first data flow diagram parameter, the message size corresponds to the size of the first data flow diagram parameter, and the purpose process sequence number corresponds to the process in the second calculate node that receives the first data flow diagram parameter; and
a communication module, configured to call a message passing interface MPI send primitive, with the first triple as interface parameters, to send the first data flow diagram parameter to the second calculate node, so that the second calculate node can call an MPI receive primitive, with a second triple corresponding to the first triple as interface parameters, to process the first data flow diagram parameter, where the second triple is generated according to a second graph data structure in the second calculate node using a second interface parameter generation algorithm, and the second interface parameter generation algorithm is identical to the first interface parameter generation algorithm.
12. The device according to claim 11, wherein, in the aspect of calling a message passing interface MPI send primitive, with the first triple as interface parameters, to send the first data flow diagram parameter to the second calculate node, the communication module is configured to read, with the first triple as interface parameters, the first data flow diagram parameter from the host memory of the first calculate node through the MPI send primitive, so as to send the first data flow diagram parameter to the second calculate node.
13. The device according to claim 12, wherein the first calculate node also stores information about the storage device where the first data flow diagram parameter is located, the first calculate node further includes a read module, and the read module is configured to, in the case where the information of the storage device indicates another storage device, copy the first data flow diagram parameter from the other storage device to the host memory of the first calculate node, where the other storage device is a memory in the first calculate node other than the host memory.
14. The device according to any one of claims 11 to 13, wherein the first interface parameter generation algorithm includes a first algorithm, a second algorithm and a third algorithm, and, in the aspect of generating the first triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the first graph data structure using the first interface parameter generation algorithm, the generation module is configured to determine the message label in the first triple according to the name of the first data flow diagram parameter in the first graph data structure and the first algorithm, determine the message size in the first triple according to the size of the first data flow diagram parameter in the first graph data structure and the second algorithm, and determine the purpose process sequence number in the first triple according to the Correspondent Node identifier of the first data flow diagram parameter in the first graph data structure and the third algorithm.
15. A physical machine, comprising at least one processor and a non-transitory computer-readable medium storing executable code, for running the first calculate node in a distributed computing system, where the distributed computing system includes the first calculate node and a second calculate node; and the executable code, when executed by a processor of the at least one processor, is configured to perform the method according to any one of claims 7 to 10.
16. A non-transitory computer-readable medium storing an executable program, where the executable program includes a program for performing the method according to any one of claims 7 to 10.
17. A data transmission method in a distributed computing system, the distributed computing system including a first calculate node and a second calculate node, wherein the method includes:
determining, from a second graph data structure of the second calculate node, the name, size and Correspondent Node identifier of a first data flow diagram parameter in a second data flow diagram, where the Correspondent Node identifier of the first data flow diagram parameter in the second data flow diagram corresponds to the first calculate node;
generating a second triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the second graph data structure, using a second interface parameter generation algorithm, where the second triple includes a message label, a message size and an originating process sequence number, the message label corresponds to the name of the first data flow diagram parameter, the message size corresponds to the size of the first data flow diagram parameter, and the originating process sequence number corresponds to the process in the first calculate node that sends the first data flow diagram parameter; and
calling, according to the second triple, a message passing interface MPI receive primitive to process the first data flow diagram parameter coming from the first calculate node, where the first data flow diagram parameter is sent by the first calculate node through an MPI send primitive, the interface parameters of the MPI send primitive include a first triple corresponding to the second triple, the first triple is generated by the first calculate node according to a first graph data structure in the first calculate node using a first interface parameter generation algorithm, and the second interface parameter generation algorithm is identical to the first interface parameter generation algorithm.
18. The method according to claim 17, wherein the second calculate node runs a first thread and a second thread, the host memory of the second calculate node includes a data buffer, the data buffer is dedicated to storing data processed by MPI primitives, and, in the aspect of calling a message passing interface MPI receive primitive, with the second triple as interface parameters, to process the first data flow diagram parameter coming from the first calculate node, the method includes:
detecting, by the first thread, the data buffer in the host memory through an MPI probe primitive, so as to obtain the second triple;
calling, by the first thread, a first MPI receive primitive according to the second triple in the data buffer, so as to process the first data flow diagram parameter, where the second triple in the data buffer is obtained by the second calculate node according to the MPI send primitive; and
changing, by the second thread, a second MPI receive primitive into an MPI wait primitive after determining that the first data flow diagram parameter has been processed by the first MPI receive primitive, where the second MPI receive primitive is the receive primitive, corresponding to the first data flow diagram parameter, that has not been executed by the second thread, the interface parameters of the second MPI receive primitive include the second triple generated by the second calculate node, and the MPI wait primitive is used to wait for the first MPI receive primitive to finish execution.
19. The method according to claim 18, wherein, in the aspect of the first thread calling the first MPI receive primitive according to the second triple in the data buffer so as to process the first data flow diagram parameter, the method includes:
in the case where the destination address of the first data flow diagram parameter corresponds to memory space allocated to a user in the host memory of the second calculate node, calling, by the first thread, the first MPI receive primitive with the second triple in the data buffer as the interface parameters of the first MPI receive primitive, and storing the first data flow diagram parameter from the data buffer to the destination address of the first data flow diagram parameter.
20. The method according to claim 17, wherein the method further includes:
in the case where the destination address of the first data flow diagram parameter corresponds to another storage device, storing, by the second calculate node, the first data flow diagram parameter in the host memory to the destination address, where the other storage device is a memory in the second calculate node other than the host memory.
21. The method according to any one of claims 17 to 20, wherein the second interface parameter generation algorithm includes a first algorithm, a second algorithm and a third algorithm, and, in the aspect of generating the second triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the second graph data structure using the second interface parameter generation algorithm, the method includes:
determining the message label in the second triple according to the name of the first data flow diagram parameter in the second graph data structure and the first algorithm in the second interface parameter generation algorithm; determining the message size in the second triple according to the size of the first data flow diagram parameter in the second graph data structure and the second algorithm in the second interface parameter generation algorithm; and determining the originating process sequence number in the second triple according to the Correspondent Node identifier of the first data flow diagram parameter in the second graph data structure and the third algorithm in the second interface parameter generation algorithm.
22. A data transmission device in a distributed computing system, the distributed computing system including a first calculate node and a second calculate node, the data transmission device being located at the second calculate node, wherein the data transmission device includes:
a determining module, configured to determine, from a second graph data structure of the second calculate node, the name, size and Correspondent Node identifier of a first data flow diagram parameter in a second data flow diagram, where the Correspondent Node identifier of the first data flow diagram parameter in the second data flow diagram corresponds to the first calculate node;
a generation module, configured to generate a second triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the second graph data structure, using a second interface parameter generation algorithm, where the second triple includes a message label, a message size and an originating process sequence number, the message label corresponds to the name of the first data flow diagram parameter, the message size corresponds to the size of the first data flow diagram parameter, and the originating process sequence number corresponds to the process in the first calculate node that sends the first data flow diagram parameter; and
a communication module, configured to call a message passing interface MPI receive primitive, according to the second triple, to process the first data flow diagram parameter coming from the first calculate node, where the first data flow diagram parameter is sent by the first calculate node through an MPI send primitive, the interface parameters of the MPI send primitive include a first triple corresponding to the second triple, the first triple is generated by the first calculate node according to a first graph data structure in the first calculate node using a first interface parameter generation algorithm, and the second interface parameter generation algorithm is identical to the first interface parameter generation algorithm.
23. The data transmission device according to claim 22, wherein the communication module includes a first thread and a second thread, the host memory of the second calculate node includes a data buffer, the data buffer is dedicated to storing data processed by MPI primitives, and, in the aspect of calling a message passing interface MPI receive primitive, with the second triple as interface parameters, to process the first data flow diagram parameter coming from the first calculate node, the first thread is configured to detect the data buffer in the host memory through an MPI probe primitive, so as to obtain the second triple;
the first thread is configured to call a first MPI receive primitive according to the second triple in the data buffer, so as to process the first data flow diagram parameter, where the second triple in the data buffer is obtained by the second calculate node according to the MPI send primitive; and
the second thread is configured to, after determining that the first data flow diagram parameter has been processed by the first MPI receive primitive, change a second MPI receive primitive into an MPI wait primitive, where the second MPI receive primitive is the receive primitive, corresponding to the first data flow diagram parameter, that has not been executed by the second thread, the interface parameters of the second MPI receive primitive include the second triple generated by the second calculate node, and the MPI wait primitive is used to wait for the first MPI receive primitive to finish execution.
24. The data transmission device according to claim 23, wherein, in the aspect of calling the first MPI receive primitive according to the second triple in the data buffer so as to process the first data flow diagram parameter, the first thread is configured to, in the case where the destination address of the first data flow diagram parameter corresponds to memory space allocated to a user in the host memory of the second calculate node, call the first MPI receive primitive with the second triple in the data buffer as the interface parameters of the first MPI receive primitive, and store the first data flow diagram parameter from the data buffer to the destination address of the first data flow diagram parameter.
25. The data transmission device according to claim 22, wherein the data transmission device further includes a storage module, and the storage module is configured to, in the case where the destination address of the first data flow diagram parameter corresponds to another storage device, store the first data flow diagram parameter in the host memory to the destination address, where the other storage device is a memory in the second calculate node other than the host memory.
26. The data transmission device according to any one of claims 22 to 25, wherein the second interface parameter generation algorithm includes a first algorithm, a second algorithm and a third algorithm, and, in the aspect of generating the second triple according to the name, size and Correspondent Node identifier of the first data flow diagram parameter in the second graph data structure using the second interface parameter generation algorithm, the generation module is configured to: determine the message label in the second triple according to the name of the first data flow diagram parameter in the second graph data structure and the first algorithm in the second interface parameter generation algorithm; determine the message size in the second triple according to the size of the first data flow diagram parameter in the second graph data structure and the second algorithm in the second interface parameter generation algorithm; and determine the originating process sequence number in the second triple according to the Correspondent Node identifier of the first data flow diagram parameter in the second graph data structure and the third algorithm in the second interface parameter generation algorithm.
27. A physical machine, comprising at least one processor and a non-transitory computer-readable medium storing executable code, for running the second calculate node in a distributed computing system, where the distributed computing system includes a first calculate node and the second calculate node; and the executable code, when executed by a processor of the at least one processor, is configured to perform the method according to any one of claims 17 to 21.
28. A non-transitory computer-readable medium storing an executable program, wherein the executable program comprises a program used to perform the method according to any one of claims 17 to 21.
CN201710769632.8A 2017-08-31 2017-08-31 Distributed computing system, data transmission method and device in distributed computing system Active CN109426574B (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN201710769632.8A CN109426574B (en) 2017-08-31 2017-08-31 Distributed computing system, data transmission method and device in distributed computing system
CN202210298513.XA CN114880133A (en) 2017-08-31 2017-08-31 Distributed computing system, data transmission method and device in distributed computing system
PCT/CN2018/102919 WO2019042312A1 (en) 2017-08-31 2018-08-29 Distributed computing system, data transmission method and device in distributed computing system
EP18850809.7A EP3667496B1 (en) 2017-08-31 2018-08-29 Distributed computing system, data transmission method and device in distributed computing system
US16/805,007 US11010681B2 (en) 2017-08-31 2020-02-28 Distributed computing system, and data transmission method and apparatus in distributed computing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710769632.8A CN109426574B (en) 2017-08-31 2017-08-31 Distributed computing system, data transmission method and device in distributed computing system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202210298513.XA Division CN114880133A (en) 2017-08-31 2017-08-31 Distributed computing system, data transmission method and device in distributed computing system

Publications (2)

Publication Number Publication Date
CN109426574A true CN109426574A (en) 2019-03-05
CN109426574B CN109426574B (en) 2022-04-05

Family

ID=65505134

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210298513.XA Pending CN114880133A (en) 2017-08-31 2017-08-31 Distributed computing system, data transmission method and device in distributed computing system
CN201710769632.8A Active CN109426574B (en) 2017-08-31 2017-08-31 Distributed computing system, data transmission method and device in distributed computing system

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202210298513.XA Pending CN114880133A (en) 2017-08-31 2017-08-31 Distributed computing system, data transmission method and device in distributed computing system

Country Status (4)

Country Link
US (1) US11010681B2 (en)
EP (1) EP3667496B1 (en)
CN (2) CN114880133A (en)
WO (1) WO2019042312A1 (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11474974B2 (en) 2018-12-21 2022-10-18 Home Box Office, Inc. Coordinator for preloading time-based content selection graphs
US11474943B2 (en) 2018-12-21 2022-10-18 Home Box Office, Inc. Preloaded content selection graph for rapid retrieval
US11829294B2 (en) 2018-12-21 2023-11-28 Home Box Office, Inc. Preloaded content selection graph generation
US11475092B2 (en) 2018-12-21 2022-10-18 Home Box Office, Inc. Preloaded content selection graph validation
US11204924B2 (en) * 2018-12-21 2021-12-21 Home Box Office, Inc. Collection of timepoints and mapping preloaded graphs
US11269768B2 (en) 2018-12-21 2022-03-08 Home Box Office, Inc. Garbage collection of preloaded time-based graph data
EP3699770A1 (en) * 2019-02-25 2020-08-26 Mellanox Technologies TLV Ltd. Collective communication system and methods
US11204745B2 (en) * 2019-05-23 2021-12-21 Xilinx, Inc. Dataflow graph programming environment for a heterogenous processing system
US11836485B1 (en) * 2019-08-19 2023-12-05 Rapid7, Inc. Software code review
US11694075B2 (en) * 2019-09-05 2023-07-04 Alibaba Group Holding Limited Partitioning control dependency edge in computation graph
US11876885B2 (en) 2020-07-02 2024-01-16 Mellanox Technologies, Ltd. Clock queue with arming and/or self-arming features
US20200358721A1 (en) * 2020-07-30 2020-11-12 Intel Corporation Buffer allocation for parallel processing of data
US11461143B2 (en) * 2020-09-08 2022-10-04 Megh Computing, Inc. Computing resource allocation with subgraph isomorphism
US11556378B2 (en) 2020-12-14 2023-01-17 Mellanox Technologies, Ltd. Offloading execution of a multi-task parameter-dependent operation to a network device
CN112560184B (en) * 2020-12-22 2023-09-12 北京机电工程研究所 Parallel computing system and method for aircraft simulation model
US11755543B2 (en) 2020-12-29 2023-09-12 International Business Machines Corporation Optimization of workflows with dynamic file caching
US20230005096A1 (en) * 2021-06-23 2023-01-05 Nvidia Corporation Memory allocation using graphs
CN115525793A (en) * 2021-06-24 2022-12-27 平头哥(上海)半导体技术有限公司 Computer-implemented method, system, and storage medium
US11582326B1 (en) * 2021-08-05 2023-02-14 Paypal, Inc. Scalable messaging framework for providing machine learning services across multiple availability zones
CN113918351B (en) * 2021-12-08 2022-03-11 之江实验室 Method and device for adapting to distributed training in deep learning framework and AI acceleration card
US20230244391A1 (en) * 2022-01-31 2023-08-03 Nvidia Corporation Graph-based memory storage
CN114840322B (en) * 2022-05-17 2022-12-09 北京百度网讯科技有限公司 Task scheduling method and device, electronic equipment and storage
CN115118727B (en) * 2022-08-26 2022-11-29 北京数牍科技有限公司 Data transmission method, device, equipment and storage medium of distributed computing architecture
US11922237B1 (en) 2022-09-12 2024-03-05 Mellanox Technologies, Ltd. Single-step collective operations
CN115600671B (en) * 2022-10-20 2023-06-20 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium of deep learning framework
CN115729688B (en) * 2022-11-23 2023-09-12 北京百度网讯科技有限公司 Multithreading scheduling method and device for processor, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9201688B2 (en) * 2010-12-17 2015-12-01 Microsoft Technology Licensing, Llc Configuration of asynchronous message processing in dataflow networks
CN102708404B (en) 2012-02-23 2016-08-03 北京市计算中心 A kind of parameter prediction method during MPI optimized operation under multinuclear based on machine learning
CN106547522B (en) * 2015-09-17 2020-02-14 华为技术有限公司 Method and device for optimizing stream application
CN105224502A (en) 2015-09-28 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of degree of depth learning method based on GPU and system
CN105227669A (en) 2015-10-15 2016-01-06 浪潮(北京)电子信息产业有限公司 A kind of aggregated structure system of CPU and the GPU mixing towards degree of depth study
US10409560B1 (en) * 2015-11-18 2019-09-10 Amazon Technologies, Inc. Acceleration techniques for graph analysis programs

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066241A (en) * 2010-06-15 2017-08-18 起元技术有限责任公司 System and method for calculating of the dynamic load based on figure
US20160378560A1 (en) * 2014-02-28 2016-12-29 Pivotal Software, Inc. Executing a foreign program on a parallel computing system
CN103970580A (en) * 2014-05-05 2014-08-06 华中科技大学 Data flow compilation optimization method oriented to multi-core cluster
CN104683488A (en) * 2015-03-31 2015-06-03 百度在线网络技术(北京)有限公司 Flow-type calculation system as well as dispatching method and dispatching device of flow-type calculation system
CN106293892A (en) * 2015-06-26 2017-01-04 阿里巴巴集团控股有限公司 Distributed stream calculates system, method and apparatus
CN105843706A (en) * 2016-03-24 2016-08-10 华中科技大学 Dynamic group system for layered rollback recovery protocols based on MPI (Message Passing Interface) high performance computing
CN106506490A (en) * 2016-11-03 2017-03-15 深圳智高点知识产权运营有限公司 A kind of Distributed Calculation control method and distributed computing system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ABHINAV VISHNU ET AL: "Distributed TensorFlow with MPI", arXiv:1603.02339v1 [cs.CV] *
JIN PENG (靳鹏): "Fundamentals of Parallel Technology (《并行技术基础》)", 28 February 2011 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112448898A (en) * 2019-08-28 2021-03-05 无锡江南计算技术研究所 Message order-preserving method based on sequence number mechanism
WO2021175226A1 (en) * 2020-03-06 2021-09-10 华为技术有限公司 Fault recovery method for ring network, and physical node
CN113553039A (en) * 2020-04-23 2021-10-26 杭州海康威视数字技术股份有限公司 Method and device for generating executable code of operator
CN111612528A (en) * 2020-04-30 2020-09-01 ***通信集团江苏有限公司 Method, device and equipment for determining user classification model and storage medium
CN113746873A (en) * 2020-05-27 2021-12-03 华为技术有限公司 Abnormal node processing method in ring network and related equipment
CN112231113A (en) * 2020-09-18 2021-01-15 苏州浪潮智能科技有限公司 Message transmission method and device
CN112231113B (en) * 2020-09-18 2023-01-06 苏州浪潮智能科技有限公司 Message transmission method and device
CN112506677A (en) * 2020-12-09 2021-03-16 上海交通大学 TensorFlow distributed matrix calculation implementation method and system
CN112966279A (en) * 2021-02-08 2021-06-15 北京金山云网络技术有限公司 Distributed data processing method and system
CN112966279B (en) * 2021-02-08 2023-11-03 北京金山云网络技术有限公司 Distributed data processing method and system
CN113190528A (en) * 2021-04-21 2021-07-30 中国海洋大学 Parallel distributed big data architecture construction method and system
CN113190528B (en) * 2021-04-21 2022-12-06 中国海洋大学 Parallel distributed big data architecture construction method and system
CN113127491B (en) * 2021-04-28 2022-03-22 深圳市邦盛实时智能技术有限公司 Flow graph dividing system based on correlation characteristics
CN113127491A (en) * 2021-04-28 2021-07-16 深圳市邦盛实时智能技术有限公司 Flow graph dividing system based on correlation characteristics
CN114244755A (en) * 2021-12-15 2022-03-25 北京恒安嘉新安全技术有限公司 Asset detection method, device, equipment and storage medium
CN114244755B (en) * 2021-12-15 2023-11-14 北京恒安嘉新安全技术有限公司 Asset detection method, device, equipment and storage medium
WO2023184834A1 (en) * 2022-03-31 2023-10-05 深圳清华大学研究院 Collective communication optimization method for global high-degree vertices, and application
CN115934385A (en) * 2023-02-08 2023-04-07 苏州浪潮智能科技有限公司 Method, system, equipment and storage medium for communication among multiple cores
CN115809092A (en) * 2023-02-13 2023-03-17 湖南大学 Deep learning calculation library implementation method based on MT3000 heterogeneous processor

Also Published As

Publication number Publication date
EP3667496B1 (en) 2023-05-24
EP3667496A1 (en) 2020-06-17
US20200202246A1 (en) 2020-06-25
CN109426574B (en) 2022-04-05
WO2019042312A1 (en) 2019-03-07
CN114880133A (en) 2022-08-09
US11010681B2 (en) 2021-05-18
EP3667496A4 (en) 2020-09-02

Similar Documents

Publication Publication Date Title
CN109426574A (en) Distributed computing system, data transmission method and device in distributed computing system
US8527739B2 (en) Iterative process partner pairing scheme for global reduce operation
US11237880B1 (en) Dataflow all-reduce for reconfigurable processor systems
US11392740B2 (en) Dataflow function offload to reconfigurable processors
JP5479709B2 (en) Server-processor hybrid system and method for processing data
CN113642734A (en) Distributed training method and device for deep learning model and computing equipment
WO2022133047A1 (en) Dataflow function offload to reconfigurable processors
TWI442248B (en) Processor-server hybrid system for processing data
CN110324204A (en) A kind of high speed regular expression matching engine realized in FPGA and method
CN113312283A (en) Heterogeneous image learning system based on FPGA acceleration
Chu et al. Dynamic kernel fusion for bulk non-contiguous data transfer on GPU clusters
CN115277553B (en) Stream table storage method, device, equipment and computer readable storage medium
CN108140014A (en) Create and use the system and method for the data structure for multiple programming
CN110417860A (en) File transfer management method, apparatus, equipment and storage medium
TWI784845B (en) Dataflow function offload to reconfigurable processors
Brasilino et al. Data Distillation at the Network's Edge: Exposing Programmable Logic with InLocus
CN115687233A (en) Communication method, device, equipment and computer readable storage medium
US11467836B2 (en) Executing cross-core copy instructions in an accelerator to temporarily store an operand that cannot be accommodated by on-chip memory of a primary core into a secondary core
CN117614906B (en) Method, computer device and medium for multi-thread multi-representation oral package
Pickartz et al. Swift: A transparent and flexible communication layer for pcie-coupled accelerators and (co-) processors
CN115392471A (en) Data synchronization method of quantum circuit in cluster environment and quantum simulation system
Kalms et al. ArcvaVX: OpenVX Framework for Adaptive Reconfigurable Computer Vision Architectures
CN111860788A (en) Neural network computing system and method based on data flow architecture
WO2023147094A1 (en) Code compilation for dynamic peer-to-peer networked code execution
Pelosi CAST: a Declarative Language and its Execution Platform for Large-Scale Cloud Simulations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant