CN109426574A - Distributed computing system, data transmission method and device in distributed computing system - Google Patents
- Publication number
- CN109426574A (application number CN201710769632.8A)
- Authority
- CN
- China
- Prior art keywords
- flow diagram
- data flow
- parameter
- mpi
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F9/5061—Partitioning or combining of resources
- G06F9/546—Message passing systems or structures, e.g. queues
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
- G06F16/24568—Data stream processing; Continuous queries
- G06F16/9024—Graphs; Linked lists
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/54—Interprogram communication
- H04L47/125—Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering
- H04L47/2441—Traffic characterised by specific attributes, e.g. priority or QoS, relying on flow classification, e.g. using integrated services [IntServ]
- G06F9/4806—Task transfer initiation or dispatching
Abstract
This application discloses a distributed computing system in which a first compute node and a second compute node each store the name, size, and communication-peer identifier of a first dataflow-graph parameter in a dataflow graph; the first dataflow-graph parameter itself is stored on the first compute node. Using the identical interface-parameter generation algorithm stored on each node, together with the above information about the first dataflow-graph parameter, the two nodes generate their respective triples, which serve as the interface parameters of the Message Passing Interface (MPI) primitives used to transmit the first dataflow-graph parameter between the first and second compute nodes. In this way the two compute nodes can transmit dataflow-graph parameters with MPI primitives without negotiation, which improves the efficiency of data communication between compute nodes and thereby the efficiency with which the distributed computing system computes the dataflow graph.
Description
Technical field
The present invention relates to the field of computing, and in particular to a distributed computing system and to a data transmission method and device in a distributed computing system.
Background
With the development of the big-data industry and artificial-intelligence technology, computing platforms of all kinds keep emerging, such as machine-learning platforms, graph-computing platforms, and stream-computing platforms. These platforms are usually deployed on distributed computing systems to perform big-data computation. For example, a machine-learning platform usually takes a dataflow graph as its computing object: it splits the dataflow graph into multiple subgraphs or replicas and deploys them onto multiple compute nodes of the distributed computing system, so that the nodes can compute the dataflow graph cooperatively and improve computational efficiency. A compute node among these nodes may contain multiple computing devices, for example a CPU (Central Processing Unit) and accelerator hardware installed in the node's host, such as a GPU (Graphics Processing Unit).
While a dataflow graph is computed on a distributed computing system, dataflow-graph parameters must be communicated between nodes, and this communication directly affects how efficiently the machine-learning platform computes the dataflow graph. In one existing scheme, the Message Passing Interface (MPI) library, a technology usually applied in high-performance computing, is introduced into the distributed computing system as an external plug-in to support data communication within the system. Before MPI transfers data, however, the two communicating parties must exchange information to learn about their peers and thereby negotiate the communication parameters of the MPI primitives. Because the communication timing of the computing platform is dynamic and random, it is difficult for the two parties to identify their peers and negotiate in time, which adds to the burden of data communication on the computing platform and thus lowers data-transmission efficiency.
Summary of the invention
Embodiments of the present invention provide a data transmission method and device in a distributed computing system, and a distributed computing system, which simplify the communication procedure when MPI technology is applied to computing a dataflow graph: no negotiation with the communication peer is needed before data transmission, so MPI technology can better adapt to a computing platform deployed in a distributed manner. This improves the efficiency of data transmission in the distributed computing system and thereby the efficiency of computing the dataflow graph on it.

To achieve the above objectives, the embodiments of the present invention adopt the following technical solutions.
In a first aspect, the application provides a distributed computing system comprising a first compute node and a second compute node. A first graph data structure on the first compute node stores the name, size, and communication-peer identifier of a first dataflow-graph parameter in a first dataflow graph, where the first dataflow-graph parameter is a parameter carried on a connection edge of the first dataflow graph. A second graph data structure on the second compute node stores the name, size, and communication-peer identifier of the first dataflow-graph parameter in a second dataflow graph. The communication-peer identifier of the first dataflow-graph parameter in the first dataflow graph corresponds to the second compute node, and the communication-peer identifier of the first dataflow-graph parameter in the second dataflow graph corresponds to the first compute node. The first compute node is configured to generate a first triple from the name, size, and communication-peer identifier of the first dataflow-graph parameter in the first graph data structure, using a first interface-parameter generation algorithm. The first triple comprises a message tag, a message size, and a destination-process rank, where the message tag corresponds to the name of the first dataflow-graph parameter, the message size corresponds to its size, and the destination-process rank corresponds to the process on the second compute node that receives the parameter. The second compute node is configured to generate a second triple from the name, size, and communication-peer identifier of the first dataflow-graph parameter in the second graph data structure, using a second interface-parameter generation algorithm identical to the first. The second triple comprises the same message tag and message size together with a source-process rank, where the source-process rank corresponds to the process on the first compute node that sends the parameter. The first compute node is configured to call a Message Passing Interface (MPI) send primitive, with the first triple as its interface parameters, to send the first dataflow-graph parameter to the second compute node. The second compute node is configured to call an MPI receive primitive, according to the second triple, to process the first dataflow-graph parameter.
In this way, the MPI send primitive whose interface parameters are the first triple corresponds to the MPI receive primitive whose interface parameters are the second triple. Because the first and second graph data structures include the communication-peer identifier, the problem that the peer process is unknowable while the dataflow graph runs is solved. Moreover, the two parties that need to transmit the first dataflow-graph parameter each generate their triple from the information stored in their own compute node's dataflow graph and the same interface-parameter generation algorithm, so they need neither send the peer their own information nor negotiate the algorithm used to generate the triples. The method can run independently on the data sender and the data receiver, producing the corresponding triples without any interaction between the two, which simplifies communication with MPI primitives and improves the efficiency of data transmission on the distributed computing platform.
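The negotiation-free triple generation can be sketched as follows. This is a minimal illustration under assumed details: the parameter name, the rank mapping, and the use of a SHA-256 hash as the "first algorithm" are all assumptions for the sketch, not the patent's concrete implementation.

```python
import hashlib

def make_tag(param_name: str) -> int:
    """Map an arbitrary-length parameter name to a fixed-size MPI-style tag
    (an assumed 'first algorithm'; any fixed-length conversion would do)."""
    digest = hashlib.sha256(param_name.encode()).digest()
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF  # non-negative tag

def make_triple(name: str, size: int, peer_id: str, rank_of: dict) -> tuple:
    """Build (message tag, message size, process rank) from the fields each
    node's graph data structure already stores."""
    return (make_tag(name), size, rank_of[peer_id])

rank_of = {"node-1": 0, "node-2": 1}  # 'third algorithm' as a mapping table

# The sender's peer identifier names the receiver, and vice versa.
first_triple = make_triple("w_conv1", 4096, "node-2", rank_of)   # on node 1
second_triple = make_triple("w_conv1", 4096, "node-1", rank_of)  # on node 2

# Tag and size agree with no message exchange; the rank field points at the
# destination process on the sender and at the source process on the receiver.
assert first_triple[:2] == second_triple[:2]
assert first_triple[2] == 1 and second_triple[2] == 0
```

On the sender, `first_triple` would supply the tag, size, and destination rank of the MPI send primitive; on the receiver, `second_triple` would supply the matching arguments of the receive primitive.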
It should be understood that calling the MPI receive primitive according to the second triple to process the first dataflow-graph parameter serves to let a process on the second compute node use the parameter in computing the dataflow graph. What "processing" means differs by scenario, and this application does not limit it. For example, it may be one or more of the following operations: calling the MPI receive primitive to receive the first dataflow-graph parameter into a data buffer in host memory; calling the MPI receive primitive to modify the tag of the first dataflow-graph parameter and supply the parameter, already in host memory, to the process that computes the dataflow graph; or storing the first dataflow-graph parameter from the data buffer to its destination address.
The name of the first dataflow-graph parameter identifies that parameter; it may be a field in the first graph data structure, or information dispersed within that structure. The size of the first dataflow-graph parameter indicates the storage space the parameter occupies, that is, its data volume.
The communication-peer identifier of the first dataflow-graph parameter in the first dataflow graph may be the identifier of the second compute node; or the identifier of the storage device holding the parameter's destination address, that device being located on the second compute node; or the identifier of the destination process of the parameter, that process being located on the second compute node; or other information indicating the receiver of the parameter. Similarly, the communication-peer identifier of the first dataflow-graph parameter in the second dataflow graph may be the identifier of the first compute node; or the identifier of the storage device holding the parameter's source address, that device being located on the first compute node; or the identifier of the process on the first compute node that sends the parameter; or other information indicating the sender of the parameter.
That the first graph data structure on the first compute node stores the name, size, and communication-peer identifier of the first dataflow-graph parameter may mean that the structure contains fields carrying these three kinds of information, or that it stores information from which the name, size, or communication-peer identifier can be obtained. In other words, "stores" may mean that the values can be read directly from the first graph data structure, or that they can be derived by analyzing information in it.
The second dataflow graph is stored on the second compute node; it may be a replica of the first dataflow graph, or the two may be subgraphs of one dataflow graph.
Here the message tag identifies the data sent by the MPI send primitive, and the message size indicates the size of that data. The source-process rank is the rank of the process on the first compute node that executes the MPI send primitive, and the destination-process rank is the rank of the process on the second compute node that executes the MPI receive primitive corresponding to that send primitive. The term "triple" merely denotes the three parameters and imposes no ordering among them. The format of the three parameters meets the format requirements for the interface-function parameters carried by the MPI send primitive. In addition, the interface parameters of the MPI send primitive include but are not limited to the first triple, and the interface parameters of the MPI receive primitive include but are not limited to the second triple.
In one implementation, regarding calling the Message Passing Interface (MPI) send primitive with the first triple as its interface parameters to send the first dataflow-graph parameter to the second compute node, the first compute node is configured to use the first triple as the interface parameters and have the MPI send primitive read the first dataflow-graph parameter from the host memory of the first compute node, so as to send it to the second compute node.

Since the MPI send primitive reads the first dataflow-graph parameter directly from host memory, the efficiency of reading the data is improved.
In one implementation, the first compute node also stores information about the storage device holding the first dataflow-graph parameter, and is further configured, when that information indicates another storage device, to copy the first dataflow-graph parameter from the other storage device into the host memory of the first compute node; the other storage device is a memory on the first compute node other than host memory.

The information about the storage device may be the device's identifier, or a number representing the device from which its storage type can be determined, or information identifying the device's type, or other information in any form serving the same purpose.

In this way the first compute node stages the first dataflow-graph parameter into its host memory before invoking the MPI send primitive, and the send primitive reads the parameter only from host memory, without contending with the computing platform for access to other storage devices, which improves the execution efficiency of the MPI send primitive.
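The staging step can be sketched as below. The device names, the dictionary representation of a parameter, and the copy callback are assumptions made for illustration, not the patent's interfaces.

```python
# Sketch of staging a dataflow-graph parameter into host memory before the
# MPI send primitive runs, so the send only ever touches host memory.
# All names here (HOST, the dict fields, copy_to_host) are assumed.

HOST = "host_memory"

def stage_to_host(param: dict, copy_to_host) -> dict:
    """If the stored device info names something other than host memory
    (e.g. GPU memory), copy the parameter into host memory first."""
    if param["device"] != HOST:
        param["data"] = copy_to_host(param["data"])  # e.g. device-to-host copy
        param["device"] = HOST
    return param

param = {"name": "w_conv1", "device": "gpu0", "data": b"\x01\x02"}
staged = stage_to_host(param, copy_to_host=bytes)  # identity copy stands in
assert staged["device"] == HOST
```

A parameter already in host memory passes through unchanged, which matches the idea that the send primitive always finds its input in one known place.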
In one implementation, the first interface-parameter generation algorithm comprises a first algorithm, a second algorithm, and a third algorithm. Regarding generating the first triple from the name, size, and communication-peer identifier of the first dataflow-graph parameter in the first graph data structure, the first compute node is configured to: determine the message tag in the first triple from the name of the first dataflow-graph parameter in the first graph data structure and the first algorithm; determine the message size in the first triple from the size of the first dataflow-graph parameter in the first graph data structure and the second algorithm; and determine the destination-process rank in the first triple from the communication-peer identifier of the first dataflow-graph parameter in the first graph data structure and the third algorithm. Correspondingly, regarding generating the second triple, the second compute node is configured to: determine the message tag in the second triple from the name of the first dataflow-graph parameter in the second graph data structure and the first algorithm of the second interface-parameter generation algorithm; determine the message size in the second triple from the size of that parameter and the second algorithm of the second interface-parameter generation algorithm; and determine the source-process rank in the second triple from the communication-peer identifier of that parameter and the third algorithm of the second interface-parameter generation algorithm.
The statement above that the first interface-parameter generation algorithm is identical to the second means that the first comprises a first, a second, and a third algorithm, and the second comprises a first, a second, and a third algorithm identical or corresponding to those of the first.

The first algorithm may be an algorithm that converts a value of arbitrary binary length into a value of fixed binary length, for example a hash algorithm, or any other algorithm that converts the name of the first dataflow-graph parameter into the message-tag format required by the interface parameters of MPI primitives. As for the second algorithm, in one implementation the value of the message-size field equals the size of the dataflow-graph parameter, i.e., the size value; in another implementation it equals that size plus an additional value. The third algorithm is a mapping between process ranks and communication-peer identifiers: the first compute node holds a mapping between destination-process ranks and communication-peer identifiers, and the second compute node holds a mapping between source-process ranks and communication-peer identifiers. The third algorithm may be a functional relation or a mapping table maintained on the compute node; this application imposes no restriction.
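The two forms the third algorithm may take can be illustrated as follows. Both the table contents and the functional rule are assumptions, chosen here so that the two forms agree; the patent only requires some mapping between peer identifiers and process ranks.

```python
# Two interchangeable forms of the 'third algorithm' (communication-peer
# identifier -> process rank). Table contents and the rule are assumptions.

# Form 1: a mapping table maintained on the compute node.
peer_to_rank = {"node-1": 0, "node-2": 1, "node-3": 2}

def rank_from_table(peer_id: str) -> int:
    return peer_to_rank[peer_id]

# Form 2: a functional relation, e.g. when peer identifiers embed a node
# index and one MPI process runs per node.
def rank_from_rule(peer_id: str) -> int:
    return int(peer_id.rsplit("-", 1)[1]) - 1

# Both forms must yield the same rank for every peer.
assert all(rank_from_table(p) == rank_from_rule(p) for p in peer_to_rank)
```

Whichever form is used, both nodes must hold it, so that sender and receiver derive consistent ranks without communicating.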
In one implementation, regarding calling the MPI receive primitive according to the second triple to process the first dataflow-graph parameter, the second compute node is configured to detect, by an MPI probe primitive, a data buffer in the host memory of the second compute node in order to obtain the second triple of the first dataflow-graph parameter, the data buffer being dedicated to storing data handled by MPI primitives, and to call the MPI receive primitive, whose interface parameters include the second triple, to process the first dataflow-graph parameter.

In this way data can be handled by a receive primitive more promptly, and other pending send primitives on the first compute node can execute sooner, which improves data-transmission efficiency. Moreover, with a dedicated data buffer and a polling thread, even when the receive primitive has not yet been invoked and the final destination address of a message is still unknown, the send primitive can transmit the data and return immediately once the transfer completes. The buffer temporarily holds the data for a later receive primitive, so the send primitive need not synchronize with the receive primitive, removing the inherent timing constraint between the two. The sender need not wait synchronously, which saves execution time and helps improve performance.
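The decoupling provided by the dedicated buffer and the polling thread can be modeled with ordinary queues and threads. The structures below are a toy stand-in under assumed names, not MPI itself: the queue plays the host-memory buffer, and the polling thread plays the probe-plus-receive side.

```python
import queue
import threading

# Toy model: the 'send primitive' deposits (triple, payload) into the
# dedicated buffer and returns at once; a polling thread matches messages
# by their triple, with no sender/receiver synchronization.

data_buffer = queue.Queue()   # stands in for the data buffer in host memory
delivered = {}

def send_primitive(triple, payload):
    data_buffer.put((triple, payload))   # returns right after the copy

def poll_once():
    triple, payload = data_buffer.get()  # 'probe' the buffer for a triple
    delivered[triple] = payload          # later stored to its destination

poller = threading.Thread(target=poll_once)
poller.start()
send_primitive((42, 4, 0), b"\x00\x01\x02\x03")  # sender does not block
poller.join()
assert delivered[(42, 4, 0)] == b"\x00\x01\x02\x03"
```

The sender completes as soon as the data is in the buffer, mirroring the claim that the send primitive can return immediately even though the receive primitive has not yet been called.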
In one implementation, the receive primitive for the first dataflow-graph parameter carries the parameter's destination address. Regarding processing the first dataflow-graph parameter with that receive primitive, the second compute node is configured to call the MPI receive primitive, with the second triple as its interface parameters, to store the first dataflow-graph parameter from the data buffer to the destination address, for example a user memory space in host memory.
In a second aspect, an embodiment of the present invention records a data transmission method in a distributed computing system comprising a first compute node and a second compute node. The method comprises: determining, from a first graph data structure on the first compute node, the name, size, and communication-peer identifier of a first dataflow-graph parameter of a first dataflow graph, where the first dataflow-graph parameter is carried on a connection edge of the first dataflow graph and the communication-peer identifier corresponds to the second compute node; generating a first triple from that name, size, and communication-peer identifier using a first interface-parameter generation algorithm, the first triple comprising a message tag corresponding to the name of the first dataflow-graph parameter, a message size corresponding to its size, and a destination-process rank corresponding to the process on the second compute node that receives the parameter; and calling a Message Passing Interface (MPI) send primitive, with the first triple as its interface parameters, to send the first dataflow-graph parameter to the second compute node, so that the second compute node can call an MPI receive primitive, with a second triple corresponding to the first triple as its interface parameters, to process the parameter. The second triple is generated from a second graph data structure on the second compute node using a second interface-parameter generation algorithm identical to the first.
In one implementation, regarding calling the MPI send primitive with the first triple as its interface parameters to send the first dataflow-graph parameter to the second compute node, the method comprises: with the first triple as the interface parameters, having the MPI send primitive read the first dataflow-graph parameter from the host memory of the first compute node, so as to send it to the second compute node.
In one implementation, the first compute node also stores information about the storage device holding the first dataflow-graph parameter, and the method further comprises: when that information indicates another storage device, copying the first dataflow-graph parameter from the other storage device to the host memory of the first compute node, the other storage device being a memory on the first compute node other than host memory.
In a third aspect, the application records a data transmission device in a distributed computing system comprising a first compute node and a second compute node, the device being located on the first compute node. The device comprises: a determining module, configured to determine, from a first graph data structure on the first compute node, the name, size, and communication-peer identifier of a first dataflow-graph parameter of a first dataflow graph, where the first dataflow-graph parameter is carried on a connection edge of the first dataflow graph and the communication-peer identifier corresponds to the second compute node; a generation module, configured to generate a first triple from that name, size, and communication-peer identifier using a first interface-parameter generation algorithm, the first triple comprising a message tag corresponding to the name of the first dataflow-graph parameter, a message size corresponding to its size, and a destination-process rank corresponding to the process on the second compute node that receives the parameter; and a communication module, configured to call a Message Passing Interface (MPI) send primitive, with the first triple as its interface parameters, to send the first dataflow-graph parameter to the second compute node, so that the second compute node can call an MPI receive primitive, with a second triple corresponding to the first triple as its interface parameters, to process the parameter. The second triple is generated from a second graph data structure on the second compute node using a second interface-parameter generation algorithm identical to the first.
In a fourth aspect, the application records a physical machine comprising at least one processor and a non-transitory computer-readable medium storing executable code, for running the first compute node of a distributed computing system that comprises the first compute node and a second compute node. When the executable code is executed, a processor among the at least one processor is configured to perform any of the methods performed by the first compute node in the system described above.

As can be seen, the third and fourth aspects are devices corresponding to the method of the second aspect, which is executed by a first compute node; in some cases this is the first compute node in the system of the first aspect. For the explanation of the steps, the terminology, the various implementations, and the beneficial effects of the second, third, and fourth aspects, the discussion of the first compute node in the system of the first aspect applies equally; refer to the relevant content of the first aspect, which is not repeated here.
In a fifth aspect, this application provides a data transmission method in a distributed computing system, the distributed computing system including a first compute node and a second compute node. The method includes: determining, from the name, size, and communication-peer identifier of a first dataflow-graph parameter in a second graph data structure of the second compute node, that in a second dataflow graph the communication-peer identifier of the first dataflow-graph parameter corresponds to the first compute node; generating, according to the name, size, and communication-peer identifier of the first dataflow-graph parameter in the second graph data structure, a second triple using a second interface-parameter generation algorithm, the second triple including a message tag, a message size, and a source process rank, where the message tag corresponds to the name of the first dataflow-graph parameter, the message size corresponds to the size of the first dataflow-graph parameter, and the source process rank corresponds to the process in the first compute node that sends the first dataflow-graph parameter; and invoking, according to the second triple, a message passing interface (MPI) receive primitive to process the first dataflow-graph parameter from the first compute node. The first dataflow-graph parameter is sent by the first compute node through an MPI send primitive; the interface parameters of the MPI send primitive include a first triple corresponding to the second triple, the first triple being generated by the first compute node from a first graph data structure in the first compute node using a first interface-parameter generation algorithm; and the second interface-parameter generation algorithm is identical to the first interface-parameter generation algorithm.
In one implementation, the second compute node invokes the message passing interface (MPI) receive primitive, with the second triple as its interface parameters, to receive the first dataflow-graph parameter, so that the second compute node can use the first dataflow-graph parameter in the computation of the dataflow graph.
In one implementation, the second compute node runs a first thread and a second thread, and the host memory of the second compute node includes a data buffer dedicated to storing data handled by MPI primitives. In the aspect of invoking, with the second triple as interface parameters, the MPI receive primitive to process the first dataflow-graph parameter from the first compute node, the method includes: the first thread probes the data buffer in the host memory through an MPI probe primitive to obtain the second triple; the first thread invokes a first MPI receive primitive according to the second triple in the data buffer, to process the first dataflow-graph parameter, the second triple in the data buffer having been obtained by the second compute node from the MPI send primitive; and the second thread, after determining that the first dataflow-graph parameter has been processed by the first MPI receive primitive, modifies a second MPI receive primitive into an MPI wait primitive, where the second MPI receive primitive is the receive primitive corresponding to the first dataflow-graph parameter that has not yet been executed by the second thread and whose interface parameters include the second triple generated by the second compute node, and the MPI wait primitive is used to wait for the first MPI receive primitive to finish.
The second triple may be obtained from the interface parameters of the received MPI send primitive, or may be obtained by analyzing the interface parameters and the data transferred by the MPI send primitive; this application imposes no restriction.
That is, the second compute node may launch a dedicated thread (which may be called a polling thread) to execute the MPI probe primitive, so as to probe the buffer in the host memory of the second compute node, the buffer including the data buffer described above. In this way, data that an MPI receive primitive has not yet had time to process can be discovered.
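The polling pattern above can be sketched as follows. This is a simplified, runnable simulation: a `queue.Queue` stands in for the probed host-memory data buffer, and the "probe" and "receive" steps are plain Python instead of real `MPI_Probe`/`MPI_Recv` calls (those require an MPI runtime); all names here are illustrative assumptions.

```python
import threading
import queue

# Stand-in for the host-memory data buffer dedicated to MPI-handled data.
data_buffer = queue.Queue()
received = {}

def polling_thread():
    """Dedicated thread: probe the buffer, then issue the receive.

    A real implementation would loop on MPI_Iprobe to discover pending
    messages and then call the receive primitive with the discovered
    triple as its interface parameters.
    """
    while True:
        triple, payload = data_buffer.get()   # "probe" finds a message
        if triple is None:
            break                             # shutdown sentinel
        tag, size, src = triple
        received[tag] = payload[:size]        # "receive" using the triple

t = threading.Thread(target=polling_thread)
t.start()
# The sender side deposits a message described by its triple:
data_buffer.put(((17, 5, 0), b"hello world"))
data_buffer.put((None, None))                 # stop the polling thread
t.join()
print(received)  # {17: b'hello'}
```

The design point mirrored here is that the polling thread discovers work by probing, rather than blocking in a receive whose parameters it does not yet know.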
In one implementation, in the aspect of the first thread invoking the first MPI receive primitive according to the second triple in the data buffer to process the first dataflow-graph parameter, the method includes: in a case where the destination address of the first dataflow-graph parameter corresponds to memory space allocated for user use in the host memory of the second compute node, the first thread invokes the first MPI receive primitive, with the second triple in the data buffer as the interface parameters of the first MPI receive primitive, to store the first dataflow-graph parameter from the data buffer to the destination address of the first dataflow-graph parameter.
In one implementation, in a case where the destination address of the first dataflow-graph parameter corresponds to another storage device, the second compute node stores the first dataflow-graph parameter from the host memory to the destination address, the other storage device being a memory in the second compute node other than the host memory.
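The two destination cases described above can be summarized in a small sketch. Everything here is hypothetical scaffolding: plain dicts stand in for real host and device address spaces, and `deliver` is not a platform API, only an illustration of the branching logic.

```python
# Dicts standing in for real address spaces (assumption for the sketch).
host_memory = {}
gpu_memory = {}

def deliver(param_bytes, dest_addr, dest_device):
    """Place a received dataflow-graph parameter at its destination.

    If the destination is user-allocated host memory, the receive
    primitive can store the data there directly. If the destination is
    another storage device (e.g. GPU memory), the data lands in host
    memory first and is then copied on to the device.
    """
    if dest_device == "host":
        host_memory[dest_addr] = param_bytes                 # direct store
    else:
        host_memory[("staging", dest_addr)] = param_bytes    # host first
        gpu_memory[dest_addr] = host_memory.pop(("staging", dest_addr))

deliver(b"\x01\x02", 0x100, "host")
deliver(b"\x03\x04", 0x200, "gpu")
print(host_memory, gpu_memory)
```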
In a sixth aspect, this application further provides a data transmission apparatus in a distributed computing system, the distributed computing system including a first compute node and a second compute node, the data transmission apparatus being located at the second compute node. The data transmission apparatus includes: a determining module, configured to determine, from a second graph data structure of the second compute node, the name, size, and communication-peer identifier of a first dataflow-graph parameter in a second dataflow graph, where the communication-peer identifier of the first dataflow-graph parameter in the second dataflow graph corresponds to the first compute node; a generation module, configured to generate, according to the name, size, and communication-peer identifier of the first dataflow-graph parameter in the second graph data structure, a second triple using a second interface-parameter generation algorithm, the second triple including a message tag, a message size, and a source process rank, where the message tag corresponds to the name of the first dataflow-graph parameter, the message size corresponds to the size of the first dataflow-graph parameter, and the source process rank corresponds to the process in the first compute node that sends the first dataflow-graph parameter; and a communication module, configured to invoke, according to the second triple, a message passing interface (MPI) receive primitive to process the first dataflow-graph parameter from the first compute node, the first dataflow-graph parameter being sent by the first compute node through an MPI send primitive whose interface parameters include a first triple corresponding to the second triple, the first triple being generated by the first compute node from a first graph data structure in the first compute node using a first interface-parameter generation algorithm, and the second interface-parameter generation algorithm being identical to the first interface-parameter generation algorithm.
In a seventh aspect, this application further provides a physical machine. The physical machine includes at least one processor and a non-transitory computer-readable medium storing executable code, and runs the second compute node of a distributed computing system, the distributed computing system including a first compute node and the second compute node. When the executable code is executed, a processor of the at least one processor is configured to perform any of the methods performed by the second compute node described above.
It can be seen that the sixth and seventh aspects are apparatuses corresponding to the method of the fifth aspect. The method of the fifth aspect is performed by the second compute node, which in some cases is the second compute node in the system of the first aspect. For the explanation of the steps, the terminology, the various implementations, and the beneficial effects of the fifth, sixth, and seventh aspects, the discussion of the second compute node in the system of the first aspect applies equally; refer to the relevant content of the first aspect, which is not repeated here.
In an eighth aspect, this application provides a non-transitory computer-readable medium storing an executable program, the executable program being used to perform any of the methods performed by the first compute node or the second compute node in the system described above. For the explanation of the steps, the terminology, the various implementations, and the beneficial effects involved in the eighth aspect, the foregoing related discussion applies equally; refer to the relevant content above, which is not repeated here.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Evidently, the accompanying drawings in the following description show only some embodiments of the present invention.
Fig. 1 is a schematic diagram of a dataflow graph according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a computing platform according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the network architecture of a distributed computing platform according to an embodiment of the present invention;
Fig. 4 is a sequence diagram of a method for inter-process communication using MPI technology according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a data transmission method according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of partitioning a dataflow graph according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the architecture of the TensorFlow machine learning platform according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of a data transmission apparatus according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of a physical machine for performing the methods of this application according to an embodiment of the present invention.
Detailed description of embodiments
The character "/" herein generally indicates an "or" relationship between the associated objects; for example, A/B may be understood as A or B, and "and/or" may be understood as "and" or "or".
The terms "first" and "second" in the specification and claims of this application are not intended to describe a particular order of objects but to distinguish different objects; where specifically indicated, "first" and "second" may describe the same object. For example, unless specifically indicated otherwise, a first process and a second process are different processes.
In the description of the present invention, unless otherwise indicated, "multiple" means two or more. For example, multiple devices means two or more devices.
In addition, the terms "include" and "have" mentioned in the description of the present invention, and any variants thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product, or device.
Some terms involved in this application are described below.
Message passing: in computer systems, a general term for a class of data communication methods between processes or between software components. The data to be communicated is abstracted and encapsulated as a "message", and the two or more parties participating in the communication invoke primitives such as message send and message receive to convey the message between processes or components, thereby completing the data communication.
Primitive: a segment of code, consisting of several instructions, for completing a particular function or process. The execution of a primitive should be continuous.
A compute node may be a physical machine, or may be a host, a virtual machine (Virtual Machine), or a container running on a physical machine. It should be understood that virtual machines and containers must be deployed on physical machines. That is, the first compute node and the second compute node described below may be the same physical machine or different physical machines, for example virtual machines or containers deployed on the same physical machine. Likewise, the sending-end physical machine and the receiving-end physical machine described in the embodiments below may be the same physical machine or different physical machines.
It should be noted that, in this specification, a compute node and a node included in a dataflow graph are nouns of different natures with different semantics.
Device: hardware in a physical machine, for example hardware that can support the operation of a virtual machine, container, process, or thread in the physical machine, such as a computing device or a storage device. A computing device is hardware in a physical machine used for computation, and may be a CPU, a GPU, an FPGA (Field-Programmable Gate Array), a MIC (Many Integrated Core) processor, or another hardware device with computing capability. A storage device is hardware in a physical machine that can store data or code, such as the memory used by the above computing devices, for example host memory (also known as CPU memory) and various other memories such as GPU memory and FPGA memory, or external storage such as a hard disk or an optical disc.
Host: in a computer hardware system, the chassis (mainframe) that houses the mainboard and other main components, which may include, for example, a CPU, memory, a hard disk, a power supply, and other input/output interfaces. The input/output interface may be any one of a USB controller, a graphics card, a network interface card, or a sound card.
Dataflow graph: a graph-form data structure that expresses the flow of data and the computational relationships within computation logic, so as to reflect the design principles of the computation logic and its implementation process. This application is described using, as an example, a machine learning platform that commonly computes dataflow graphs. It should be noted that in a machine learning platform a dataflow graph may be preloaded onto the platform before computation; the preloading process includes the definition of the nodes and edges included in the dataflow graph and of the parameters on the edges.

In machine learning platforms, the computation logic of an algorithm is usually formally expressed as a dataflow graph. To compute a dataflow graph with such a machine learning platform, the dataflow graph must first be described in code; this process may be called the definition of the dataflow graph. After the dataflow graph has been defined, this part of the code is compiled; when the dataflow graph is computed, the compiled code is read and executed, rather than being executed in the coding order used when the dataflow graph was defined.
A dataflow graph is a directed acyclic graph (Directed Acyclic Graph) composed of several nodes and the connections between nodes (called "edges"); that is, an edge points from one node to another node. Nodes and edges can be interpreted at two levels: dataflow-graph definition time and run time. At definition time, a node represents an operator or a variable used in the computation process. An operator is a symbol expressing an operation rule, such as addition (+), subtraction (-), multiplication (×), division (÷), integration (∫), differentiation, exponentiation, logarithm (log or ln), or other functional forms. In fact, a variable can also be regarded as a special operator, namely an operator with zero inputs and one output. An edge represents the operational relationship between operators and/or variables. At run time, a node represents the storage of data, each node corresponding to a storage location. For example, a node may map to a physical or virtual address in a hard disk, memory, or CPU register. The data may be a variable, the value assigned to that variable, or an operation result, where the operation result may take a mathematical expression form such as a variable or a constant. An edge represents the transfer of data; that is, the data of one node is transferred to the node that the edge points to.
Communication peer: of the two parties to a communication, when describing the communication process of one party, this party is the local end and the other party is the communication peer. For example, the communication peer of the end that sends the data (the sending end of the data) is the end that receives the data (the receiving end of the data), and the communication peer of the end that receives the data (the receiving end of the data) is the end that sends the data (the sending end of the data). The two parties to a communication can be described at multiple granularities: physical machine, virtual machine, container, process, and thread. For example, if the end sending the data has a process or thread execute a send primitive to send the data, and the end receiving the data has a process or thread execute a receive primitive to receive the data, then the process or thread executing the receive primitive may be called the communication peer of the sending end of the data, or the communication peer of the send primitive; similarly, the process or thread executing the send primitive may be called the communication peer of the receiving end of the data, or the communication peer of the receive primitive. These expressions, where involved in the following description, are not explained again.
Peer compute node: for a communication process in which data is transferred from one compute node to another, the compute node that sends the data using a send primitive is the source compute node, and the compute node that receives the data using a receive primitive is the destination compute node. Then, for the source compute node of the data, the peer compute node is the destination compute node of the data; for the destination compute node of the data, the peer compute node is the source compute node of the data.
Dataflow-graph parameter: in a dataflow graph, a parameter is data carried on an edge of the graph, to be processed by a compute node or fed back by a compute node. That is, a dataflow-graph parameter is data to be transferred from one node (the source node of the edge) to the other node that the edge points to (the destination node of the edge). Clearly, for a given dataflow graph, the transfer of dataflow-graph parameters is part of computing the dataflow graph. Moreover, when the storage locations indicated by the nodes of a dataflow graph are in the same device (for example, the same CPU memory or the same GPU memory), the transfer of a dataflow-graph parameter may be an intra-process memory copy; on the other hand, when the storage locations indicated by the nodes of a dataflow graph span devices (for example, the CPU memory and GPU memory of the same host, or devices on different hosts), the transfer of a dataflow-graph parameter may be an inter-process communication process, and if the storage locations indicated by the source node and destination node span hosts, network-based communication is needed.
The source address of a dataflow-graph parameter is the storage location of the dataflow-graph parameter in the source compute node; the source address may be recorded in the source node of the edge that carries the dataflow-graph parameter.
The destination address of a dataflow-graph parameter is the storage location of the dataflow-graph parameter in the destination compute node; the destination address may be recorded in the destination node of the edge that carries the dataflow-graph parameter.
Node address: for a node in a dataflow graph, the physical or virtual address indicated by the node. Node addresses may be used in the communication of dataflow-graph parameters.
Size: for example, the size of a dataflow-graph parameter or the message size mentioned in this specification. It indicates the storage space occupied by a piece of data or a message, i.e., the amount of data the data or message contains, and is generally measured in bytes (Byte), such as 2 KB or 0.5 MB.
With reference to Fig. 1, the following illustrates how a dataflow graph, its nodes, its edges, and its dataflow-graph parameters express computation logic. Fig. 1 shows a dataflow graph expressing the computation logic "add two numbers and multiply the sum by a third number to obtain a result", which can be expressed by the formula E = (A + B) × D. The dataflow graph has five nodes A, B, C, D, and E, and four edges a, b, c, and d. At dataflow-graph definition time, nodes A, B, and D each represent a variable, while nodes C and E represent addition and multiplication operations, respectively. Edges a and b indicate that the two addends of the addition in node C come from nodes A and B; edges c and d indicate that the two factors of the multiplication in node E come from nodes C and D.
At dataflow-graph run time, nodes A, B, and D represent the storage locations of the input variables, node C represents the storage location of the addition result, and node E represents the storage location of the multiplication result. The storage location represented by a node may map to an address used for storing data in a physical device such as a hard disk, memory, or CPU register. Edges a and b represent the transfer of the data in the storage locations mapped by nodes A and B to the storage location mapped by node C; edges c and d represent the transfer of the data in the storage locations mapped by nodes C and D to the storage location mapped by node E. The data transfer represented by these edges may be an intra-process memory copy, for example when the storage locations indicated by the nodes connected by the edges are on the same host; it may also be network-based inter-process data communication, for example when the nodes connected by the edges are distributed across a distributed system. For example, as shown in the figure, if 1 is input at A, 3 is input at B, and 5 is input at D, then the value transferred on edge a is 1, the value transferred on edge b is 3, and the value transferred on edge d is 5; the value obtained at C is 4, the value transferred on edge c is 4, and the value obtained at E is 20. What this dataflow graph expresses is the computation (1 + 3) × 5 = 20.
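The Fig. 1 example can be made concrete with a tiny runnable sketch: nodes map to storage slots, edges feed values between them, and evaluating the graph in topological order yields the same result, (1 + 3) × 5 = 20. The evaluator below is purely illustrative, not the platform's implementation.

```python
import operator

# Node -> stored value (A, B, D are inputs; C, E are computed).
nodes = {"A": 1, "B": 3, "D": 5, "C": None, "E": None}

# Operator node -> (function, source nodes feeding its incoming edges).
ops = {"C": (operator.add, ["A", "B"]),   # edges a, b
       "E": (operator.mul, ["C", "D"])}   # edges c, d

for target in ["C", "E"]:                 # topological order of the DAG
    fn, srcs = ops[target]
    # Each edge "transfers" a source node's value to the target node.
    nodes[target] = fn(*(nodes[s] for s in srcs))

print(nodes["E"])  # 20, i.e. (1 + 3) * 5
```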
It should be understood that the computing platform involved in this application may be deployed on one or more compute nodes, which may be physical machines or virtual machines; that is, the computing platform involved in this application applies equally in the virtual machine domain. A physical machine may be a server, or a terminal with computing capability such as a personal computer or a laptop; this application imposes no restriction. This application is described using the computation of dataflow graphs as an example. Each of the above compute nodes may run one or more processes to operate on subgraphs or copies of a dataflow graph; the software and hardware runtime environment of the computing platform is described below by example. It should be understood that a compute node and a node in a dataflow graph are different technical concepts. A user loads the dataflow graph to be computed onto the computing platform and uses the computing platform to compute the dataflow graph.
The application scenario of this application is distributed computing platform software running on one or more compute nodes. For example, the computing platform may be a machine learning platform, such as a deep learning platform that performs machine learning on multilayer neural networks. That is, the computing tasks in this application run across devices; "across devices" refers to the case where the program code of the dataflow graph executes on multiple computing devices distributed across one or more servers, where a computing device may be a CPU, a GPU, an FPGA (Field-Programmable Gate Array), a MIC (Many Integrated Core) processor, or another hardware device with computing capability. Such platform software includes, but is not limited to, TensorFlow, MXNet, CNTK, and the like.
It should be understood that, in the above scenario where the computing platform software runs across devices, the dataflow graph to be computed may be partitioned into multiple subgraphs (each subgraph may be a part of the dataflow graph), or multiple copies of the dataflow graph may be distributed across multiple devices.
In the computing platform, the multiple copies or multiple subgraphs of the dataflow graph are computed by multiple processes. For example, one process may compute one subgraph or one copy; one process may compute the subgraphs or copies of multiple devices in a physical machine; or the subgraphs or copies of multiple devices in a physical machine may be computed by two or more processes. Fig. 2 describes, from the perspective of processes, the software and hardware resources of the computing platform corresponding to each process. The description takes as an example two processes in the computing platform used to compute a dataflow graph; the example of Fig. 2 does not restrict whether the two processes are on the same compute node, and one process may compute the subgraphs or copies of multiple devices on one physical machine.
Process 1 shown in Fig. 2, i.e., 1011, uses host memory 1012 and GPU memory 1016 in the course of computing the dataflow graph. Process 2, i.e., 1021, uses host memory 1022 and GPU memory 1026 in the course of computing the dataflow graph. Host memories 1012 and 1022 may be the host memory of the same physical machine or of different physical machines. GPU memories 1016 and 1026 may be GPU memories in different physical machines, or different GPU memories on the same physical machine. It should be understood that, in the case where host memories 1012 and 1022 are the host memory of the same physical machine, they represent the memory address spaces allocated in the host memory to process 1 and process 2, respectively, i.e., different address ranges. Platform runtime code 1013 and 1023 is loaded and run in host memories 1012 and 1022. The platform runtime code is the code of the computing platform's own system, used to run the platform's software environment; this part of the code cannot be edited by the user. The kernel-function code in Fig. 2 is loaded and run in the host memory and GPU memory corresponding to each process. The kernel-function code implements the various kernel functions that express local computation logic, and can be understood as a kernel-function library containing various kernel functions. A kernel function expresses a relatively complex logical operation rule and can be invoked by a node in the dataflow graph; for example, a kernel function may be a matrix operation such as a dot product or cross product, or a convolution computation, operations that require a relatively complex sequence of instructions to implement. In one implementation, the kernel functions deployed in the memories of devices of the same category (devices of the same model; GPUs of different models are not devices of the same category, and a GPU and a CPU are likewise not devices of the same category) are the same, while the kernel functions deployed for devices of different categories may differ. The sets of kernel-function types deployed on different categories of devices may intersect: for example, kernel functions A, B, and C may be deployed in a certain GPU memory, while kernel functions B, C, and D are deployed in the memory of a certain CPU. That is, one kind of kernel function may be deployed on multiple devices. How the kernel functions are deployed is decided by the computing platform, for example written into the computing platform's library, and is not elaborated in this specification. When a user computes with the resources of the computing platform, the platform can dynamically schedule the kernel functions on different devices according to the load of each device and the distribution of the dataflow graph.
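The per-device-category kernel library described above can be sketched as a registry keyed by (kernel name, device class). All names here are hypothetical illustrations, not the API of any particular platform; the point shown is that one kernel name may have implementations registered for several device classes, and dispatch picks one based on where the node's data lives.

```python
# Registry mapping (kernel name, device class) -> implementation.
KERNELS = {}

def register(name, device):
    """Decorator that registers a kernel for a given device class."""
    def deco(fn):
        KERNELS[(name, device)] = fn
        return fn
    return deco

@register("matmul", "cpu")
def matmul_cpu(a, b):
    # Naive nested-loop matrix multiply for the CPU device class.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

@register("matmul", "gpu")
def matmul_gpu(a, b):
    # Stand-in: a real GPU kernel would launch device code instead.
    return matmul_cpu(a, b)

def dispatch(name, device, *args):
    """Pick the implementation for the device holding the node's data."""
    return KERNELS[(name, device)](*args)

print(dispatch("matmul", "cpu", [[1, 2]], [[3], [4]]))  # [[11]]
```

A scheduler built on such a registry could choose the "cpu" or "gpu" entry dynamically according to device load, which mirrors the dynamic scheduling described above.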
The copies or subgraphs of the dataflow graph are loaded and stored in host memory and GPU memory, respectively. In Fig. 2, the circles in the dataflow graphs represent nodes, and the short arrowed lines between circles represent connecting edges: the circle at the start of a short line (the circle connected to the end without the arrowhead) is the source node of the connecting edge, and the circle the arrowhead points to is the destination node of the connecting edge. The source node and the destination node may each point to a physical or virtual address in host memory or GPU memory.
Specifically, in Fig. 2, kernel-function code 1014 runs in host memory 1012, and dataflow graph 1015 is stored in host memory 1012; kernel-function code 1024 runs in host memory 1022, and dataflow graph 1025 is stored in host memory 1022; kernel-function code 1017 runs in GPU memory 1016, and dataflow graph 1018 is stored in GPU memory 1016; kernel-function code 1027 runs in GPU memory 1026, and dataflow graph 1028 is stored in GPU memory 1026. Dataflow graphs 1015, 1018, 1028, and 1025 are copies of the same dataflow graph. For example, the address recorded in the source node of connecting edge 1019 points to an address in host memory 1012 used by process 1 (i.e., 1011), and the address recorded in its destination node points to an address in GPU memory 1016 used by process 1 (1011); the computation of this edge from source node to destination node therefore needs to perform intra-process cross-device dataflow-graph parameter communication 1010. As another example, the address recorded in the source node of connecting edge 1029 points to an address in GPU memory 1016 used by process 1 (i.e., 1011), and the address recorded in its destination node points to an address in GPU memory 1026 used by process 2 (i.e., 1021); the computation of this edge from source node to destination node therefore needs to perform inter-process cross-device dataflow-graph parameter communication 1020. If process 1 and process 2 are located on different hosts, the inter-process cross-device dataflow-graph parameter communication 1020 is communication across physical machines.
A software runtime environment and hardware architecture involved in this application are further illustrated below with reference to Fig. 3. In Fig. 3, taking a machine learning platform 3011 as an example, the platform runs on server 1 (i.e., 3001), server 2 (i.e., 3002), server 3 (i.e., 3003), and server 4 (i.e., 3004); these four servers communicate through a network switch 3031. Taking server 1 as an example, the software and hardware included in a server are described in detail. On the hardware side, each of servers 1 to 4 is fitted with a CPU (e.g., 3021 in server 1), host memory (e.g., 3022 in server 1), and a network interface card (e.g., 3024 in server 1). A server may also include a GPU card (e.g., 3023 in server 1), on which GPU memory is packaged (e.g., 3025 in server 1). On the software side, machine learning platform software (e.g., 3011 in server 1) is deployed on servers 1 to 4, and includes modules such as a programming interface (e.g., 3012 in server 1), a runtime engine (e.g., 3013 in server 1), memory management (e.g., 3014 in server 1), and communication management (e.g., 3015 in server 1). The parameters on the data flow diagram (e.g., 3016 in server 1) managed by the memory management module (e.g., 3014 in server 1) are stored in host memory (e.g., 3022 in server 1); some parameters may also be stored in GPU memory (e.g., 3025 in server 1). The memory management module 3014 reads the data flow diagram 3016 stored in host memory 3022 or GPU memory 3025 in order to compute; if data flow diagram parameters must be communicated with other servers during computation, data can be received and sent through the communication management module (e.g., 3015 in server 1) and the network interface card (e.g., 3024 in server 1). It should be understood that the machine learning platform 3011 runs as processes on server 1, and that the programming interface, runtime engine, memory management, and so on can be regarded as several pieces of code with different functions.
It should be understood that the machine learning platform is analogous to an operating system running on computer hardware and, like other operating systems, can be divided into an application layer and a core layer. The application layer is used by users to edit code or input data; interfaces are provided between the application layer and the core layer, so that instructions edited or functions called by the user pass through these interfaces and are executed by the core layer.
It should be understood that host memory here means the CPU memory usable by the process that computes the data flow diagram. Even if that process runs in a virtual machine or container, the memory it is allocated, such as virtual CPU memory, can also be referred to as host memory in this application.
The scenarios involved in this application require that multiple subgraphs or multiple copies of one data flow diagram be executed cooperatively across devices, including cooperative execution among multiple processes on one server or on multiple servers, and cooperative execution between the CPU and the accelerator hardware managed by one process on one server. However, cross-device data flow diagram computation necessarily involves the exchange of data flow diagram parameters, and the parameter exchange methods currently in use cannot meet a machine learning platform's needs for computational efficiency.
On the other hand, high performance computing (HPC) uses message passing (MP) mechanisms represented by the Message Passing Interface (MPI) technology; MPI includes a protocol and its semantic description, and this mechanism transmits parameters with high efficiency. High performance computing refers to carrying out large and complex computations on very large computers, for example genetic testing in biology, missile trajectory computation, reactor simulation in the nuclear industry, aircraft trajectory computation in aerospace, or star orbit computation in astronomical observation. The performance of the hardware used for such computations is far higher than that of computers in ordinary civilian or commercial scenarios, for example supercomputers. A supercomputer is a very large computer with very strong computing and data processing capability, characterized mainly by high speed and large capacity, with numerous external and peripheral devices and rich, high-performance software systems. Most existing supercomputers reach computing speeds above one tera (trillion) operations per second. To make the difference between supercomputers in the HPC field and servers used in other fields easier to understand, some examples follow. Some manufacturers stipulate that a computer whose average computing speed is ten million operations per second or more and whose storage capacity exceeds ten million bits counts as a supercomputer, for example the ILLIAC-IV in the United States, NEC in Japan, Eugene in Europe, and China's "Galaxy" computer. Other examples are supercomputers suited to modern cluster architectures, such as Blue Gene in the United States, Japan's "K computer", Piz Daint in Europe, and China's "Sunway" and "Tianhe". When supercomputers are used as compute nodes in distributed scenarios, high performance communication networks (hereinafter high performance networks) are used between compute nodes, such as the InfiniBand technology, with dedicated network devices (routers, network interface cards, cables, switches) whose cost is several times that of comparable enterprise devices (for example, an InfiniBand router costs about four times an ordinary router). Common TCP/IP or Ethernet transport protocols can also run on such dedicated network devices, and ordinary commercial or civilian servers can likewise be connected by high performance networks such as InfiniBand, but the payload ratio is low and the performance of such dedicated network devices (such as bandwidth and throughput) cannot be fully used. Therefore, in the high performance communication field, dedicated communication protocols are usually used to transmit data between such dedicated network devices and supercomputers. For example, data is transmitted with MPI technology, which on a device can manifest as an MPI library installed on the device.
It should be noted that an MPI library refers to an MPI development library, which includes various MPI interface functions (also called MPI instructions or MPI primitives) and the MPI communication protocol. From a software architecture perspective, the MPI interface functions can be understood as interfaces between the application layer and the core layer; by analogy with a software architecture using TCP/IP technology, an MPI interface function is the counterpart of a socket interface function (i.e., a socket).
In one implementation, an interface function in the MPI function library is used to carry out one data transmission. In that transmission, the source process is the process in the sending physical machine that sends the data using a send primitive, and the destination process is the process in the receiving physical machine that receives the data using a receive primitive. Send primitives and receive primitives in the MPI function library are used in pairs. For two communicating processes, the source process sends data to the destination process using one send primitive, and the destination process handles the data using one receive primitive; the interface parameters of this pair of primitives carry the same message size and the same message tag. In addition, the interface parameters of the send primitive carry an identifier of the destination process, which is used to indicate the destination process; it may be, for example, the sequence number of the destination process, or the destination process may be identified with information in other forms, which this application does not limit. Likewise, the interface parameters of the receive primitive carry an identifier of the source process, which is used to identify the source process; it may be, for example, the sequence number of the source process, or the source process may be identified with information in other forms, which this application does not limit.
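The pairing rule above — identical message tag and size, plus each side naming the other process — can be sketched as follows. This is a hypothetical model in Python, not a real MPI binding; the descriptor and field names are illustrative only:

```python
# Hypothetical model (not a real MPI binding) of how a send primitive and a
# receive primitive are matched through their paired interface parameters.
from dataclasses import dataclass

@dataclass(frozen=True)
class SendDescriptor:
    dest_rank: int   # identifier (sequence number) of the destination process
    tag: int         # message tag, identical on both sides
    size: int        # message size in bytes, identical on both sides

@dataclass(frozen=True)
class RecvDescriptor:
    source_rank: int  # identifier (sequence number) of the source process
    tag: int          # must equal the tag of the paired send primitive
    size: int         # must equal the size of the paired send primitive

def primitives_match(send, recv, sender_rank, receiver_rank):
    """A send/receive pair matches when tag and size agree and each side
    correctly names the other process."""
    return (send.tag == recv.tag
            and send.size == recv.size
            and send.dest_rank == receiver_rank
            and recv.source_rank == sender_rank)

# Process 0 sends a 4000-byte parameter with tag 17 to process 1.
send = SendDescriptor(dest_rank=1, tag=17, size=4000)
recv = RecvDescriptor(source_rank=0, tag=17, size=4000)
print(primitives_match(send, recv, sender_rank=0, receiver_rank=1))  # True
```

The tight coupling of these interface parameters is exactly the constraint discussed later: both sides must agree on tag, size, and peer identity before the primitives can be issued.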
It is to be appreciated that an MPI library can assign a unique process sequence number (the rank referred to below) to each process that can communicate. The host can maintain a mapping between the process sequence numbers in the MPI library and the physical machines where the processes are located; or a mapping between process sequence numbers and information about the devices corresponding to the processes; or a mapping among the process sequence number, the physical machine where the process is located, and the device corresponding to the process. From this it can be learned whether two communicating processes are on the same physical machine, and even whether they use the same device, so that an MPI send primitive can be sent through a network interface card, through shared memory, through a local loopback network device (i.e., a kernel virtual device), or in other ways. Which way is chosen to send an MPI primitive relates to whether the source process and destination process are in the same physical host or the same data center (DC), and also to the network communication technology used between the communicating processes (such as a high performance network or Ethernet), which is not elaborated here.
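A minimal sketch of the transport selection just described, assuming a hypothetical rank-to-(machine, device) table maintained on the host; the specific decision rules are an illustration, since the actual choice also depends on the network technology in use:

```python
# Hypothetical rank -> (physical machine, device) mapping maintained on the
# host, and a transport chosen for an MPI send primitive from that mapping.
RANK_LOCATION = {
    0: ("host-A", "gpu0"),
    1: ("host-A", "gpu1"),
    2: ("host-B", "gpu0"),
    3: ("host-A", "gpu0"),
}

def choose_transport(src_rank, dst_rank):
    src_host, src_dev = RANK_LOCATION[src_rank]
    dst_host, dst_dev = RANK_LOCATION[dst_rank]
    if src_host != dst_host:
        return "nic"            # different physical machines: send through the NIC
    if src_dev == dst_dev:
        return "loopback"       # same machine and device: local loopback (kernel virtual device)
    return "shared_memory"      # same machine, different devices: shared memory

print(choose_transport(0, 2))   # nic
print(choose_transport(0, 1))   # shared_memory
print(choose_transport(0, 3))   # loopback
```

The point of the mapping is that the sender can pick the cheapest path without consulting the receiver at send time.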
An MPI library can be paired with Remote Direct Memory Access (RDMA) technology: without the CPU of the host where the destination process runs executing a receive primitive, the source process can, by executing a send primitive, write the data and the information related to the data (the triple mentioned below) into the MPI buffer of the host where the destination process runs. Of course, the MPI library can also use the TCP/IP protocol, in which case the host where the destination process runs receives the data from the source process through an I/O (Input/Output) device and writes the information related to the data (the triple mentioned below) into the MPI buffer of that host, so that the data is written into the host's MPI buffer while the destination process executes a receive primitive. Which communication protocol MPI technology is paired with belongs to the different application scenarios involved in the technical solutions recorded in this application, and this application imposes no restriction. In scenarios where different communication protocols are paired for data transmission, how operations such as sending or receiving data work while an MPI primitive carries out its data transfer belongs to the internal operating mechanism of the MPI library; please refer to the relevant technical documentation and descriptions of the different versions of MPI libraries, which this application does not repeat.
In MPI technology, the host memory of a device that uses an MPI library contains a dedicated address space for storing the data handled by MPI primitives, called the MPI buffer. The MPI buffer is generally defined with a fixed size, such as 64 KB or 1 MB. It should be understood that the data to be transmitted by a send primitive can be smaller than the buffer size or larger than it; when the data to be transmitted is larger than the buffer size, the source process can split the data to be transmitted while executing the send primitive. Also, when the MPI buffer corresponding to the destination process is fully occupied, a send primitive cannot continue writing data into the destination process's MPI buffer. In that case, the destination process must execute a receive primitive to write the received data to the data's destination address. For example, the destination address may be located in a GPU chip, in the memory space used by a user (a portion of host memory that the system allocates to a user for storing that user's data), or in other memory devices. The destination process executes a receive primitive to receive the data; this receiving process may include detecting, through the receive primitive, that data sent by the source process has been written into the MPI buffer of the host where the destination process runs, and saving the data sent by the source process from the MPI buffer to the data's destination address (for example host memory or GPU memory). The destination process can then use the data.
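The splitting behavior described above — a send primitive dividing data larger than the fixed-size MPI buffer into chunks — can be sketched as follows (the 64 KB buffer size is one of the example values mentioned above; the function name is illustrative):

```python
# Sketch: split a payload larger than the fixed-size MPI buffer into chunks,
# as a send primitive may do internally during its execution.
MPI_BUFFER_SIZE = 64 * 1024  # e.g. a 64 KB MPI buffer

def split_for_buffer(payload, buf_size=MPI_BUFFER_SIZE):
    """Return the payload as a list of chunks no larger than the buffer."""
    return [payload[i:i + buf_size] for i in range(0, len(payload), buf_size)]

data = bytes(150 * 1024)             # a 150 KB data flow diagram parameter
chunks = split_for_buffer(data)
print(len(chunks))                   # 3 chunks: 64 KB + 64 KB + 22 KB
print([len(c) for c in chunks])      # [65536, 65536, 22528]
```

Reassembling the chunks in order recovers the original payload, which is why the receiver must drain its MPI buffer promptly: a full buffer stalls the remaining chunks.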
Table 1 illustrates common MPI interface functions.
Table 1
For example, the wait primitive in Table 1 is used to wait for another primitive to finish executing. Table 1 only illustrates a few primitives; for example, send primitives may also include other forms of MPI interface functions that realize the function of sending information, and receive primitives may also include other forms of MPI interface functions that realize the function of receiving information.
To improve a machine learning platform's data transmission performance in a distributed network environment (the data being mainly data flow diagram parameters), industry has begun trying to apply the MPI communication technology of the high performance computing field to computing platforms, for example the TensorFlow-Allreduce project developed by Baidu mentioned below. However, because the hardware conditions of computing platforms differ from those of the high performance computing field, existing uses of MPI technology on computing platforms impair data transmission performance, which in turn considerably limits the computational efficiency of the computing platform.
The scheme proposed by Baidu's TensorFlow-Allreduce project is briefly introduced below with reference to Fig. 4. The project deploys a machine learning platform in a distributed system. Its technical solution is to use the message passing primitives of an MPI library in the machine learning platform to construct collective communication in which all processes jointly participate, completing the reduction calculation and parameter distribution of data flow diagram parameters during collective communication, so as to improve the performance of data flow diagram parameter communication in high performance network environments such as InfiniBand. A collective communication step is a communication step in which multiple processes participate simultaneously and which includes a reduction calculation; a reduction calculation requires these processes to exchange their respective parameters. For example, taking a maximum, taking a minimum, and averaging are all reduction calculations. For averaging, multiple processes can each read a part of the values to be averaged, and the processes participating in the calculation need to send their respective data to the process that executes the averaging algorithm; that is, parameter communication between processes is needed. Reduction calculation is clearly a common algorithm in machine learning, so collective communication can be used to transmit data flow diagram parameters.
Fig. 4 gives the timing diagram of this technical solution at run time. In this scheme, each of the processes participating in the data flow diagram computation (2011, 2021, 2031) runs one subgraph or one copy. Each process cyclically and alternately executes two kinds of routines. A routine is a set of functions that performs a certain function; for example, a system provides external interfaces or services through routines: the APIs and services of an operating system are routines, and the standard functions and library functions provided by Delphi or C++ Builder are also routines. Routine A performs the local parameter generation calculation of the data flow diagram (2012, 2015, 2022, 2025, 2032, and 2035 in the figure are all routine A); routine B performs the global parameter reduction calculation of the data flow diagram (2013, 2016, 2023, 2026, 2033, and 2036 in the figure are all routine B). The technical realization of routine B is parameter communication based on the MPI_Send and MPI_Recv primitives (2014, 2017, 2024, 2027, 2034, 2037). The send/receive chains of these parameter communications form a ring, so that each process's reduction result can finally reach all other processes, completing one global collection and distribution of data flow diagram parameters. This collective communication process constitutes one global barrier synchronization (2041, 2042); that is, all processes must ensure they are in the same iteration round before the global parameter reduction calculation can be performed — only after all processes have completed the same round of calculation do they enter the next iteration round. 2041 and 2042 denote two rounds of iteration.
With specific reference to Fig. 4, global barrier synchronization can be understood as follows: a process that reaches routine B first has to wait for the other processes to also reach routine B before routine B can finally complete. For example, a mark or instruction can be set in the code so that each process, upon executing to that mark or instruction, detects whether the other processes have also executed to the same mark or instruction, and continues executing the next instruction only when all processes have executed the same mark or instruction.
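The global barrier synchronization between iteration rounds can be illustrated with a thread barrier. This is a single-machine sketch with threads standing in for processes; in the actual scheme the barrier effect arises from the ring of MPI send/receive operations, not from a shared-memory primitive:

```python
# Sketch: global barrier synchronization between iteration rounds. No
# worker enters round k+1 until all workers have finished round k
# (cf. the two rounds 2041 and 2042 in Fig. 4).
import threading

N = 3                                # three participating workers
barrier = threading.Barrier(N)
log = []
lock = threading.Lock()

def worker(rank):
    for rnd in range(2):             # two iteration rounds
        with lock:
            log.append((rnd, rank))  # "routine A": local computation
        barrier.wait()               # "routine B" cannot finish until all arrive

threads = [threading.Thread(target=worker, args=(r,)) for r in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

rounds = [rnd for rnd, _ in log]
print(rounds.index(1) == N)          # True: all N round-0 records come first
```

The cost visible even in this sketch is the one the text criticizes: the fastest worker idles at `barrier.wait()` until the slowest arrives.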
It should be noted that in this scheme, before the parameter communication shown in Fig. 4 is carried out with MPI primitives, the source process and destination process participating in the communication must first interact to obtain information about the correspondent peer; otherwise, data cannot be transmitted through MPI primitives.
However, the communication timing of computing platforms represented by TensorFlow is dynamic and random: information such as the source/target and the size of a message handled by an inter-process communication primitive only becomes known at run time, and the communication technologies used by existing machine learning platforms do not require the source and destination processes to use paired primitives and interface parameters. By contrast, message passing mechanisms represented by MPI libraries require that interface parameters such as the message source or target and the message size be specified for inter-process communication primitives at programming and development time, and the interface parameters of the paired primitives on the sending and receiving sides are tightly coupled. The TensorFlow-Allreduce scheme does not resolve this contradiction: it adds a new group of programming interfaces, writes its scheme using the added interfaces, develops several instructions for collective communication that encapsulate MPI interfaces, and changes TensorFlow's programming habits to adapt to the requirements of the MPI library. As a result, users must learn the development library interfaces provided by this scheme and rewrite or adapt their application code before they can obtain the performance advantage brought by message passing communication. The ease of use and generality of this scheme are therefore insufficient. More importantly, the dynamic and random communication timing of the computing platform makes it difficult for the two sides of a data communication to confirm the peer in time and negotiate; this negotiation process also adds to the burden of data communication in the computing platform, affecting the efficiency of data transmission.
On the other hand, message passing mechanisms represented by MPI libraries require communication operations to be ordered and synchronized, allowing a moderate amount of unmatched messages to be interleaved only with the aid of the library's internal buffering mechanism and asynchronous operations. However, the machine learning and deep learning platforms represented by TensorFlow do not synchronize their computation processes, so their communication timing is also out of order and asynchronous, with a large number of random communication operations interleaved in execution, and the processes are not required to be in the same iteration round when computing. The TensorFlow-Allreduce scheme does not resolve this contradiction either: it chooses to synchronize between computation iteration rounds with a global barrier, avoiding interleaved communication across rounds so as to satisfy the constraints of the MPI library. This makes the faster processes wait frequently, wasting computing resources. The time overhead of this synchronous waiting may also reduce or offset the performance advantage of message passing communication, so that the overall computation rate of the entire machine learning platform depends on its slowest process, which affects the computation rate of the machine learning platform.
Baidu's TensorFlow-Allreduce scheme does not make improvements inside the original TensorFlow system; rather, it is a development library wrapped outside the original TensorFlow system. This development library is an external function library, relatively independent, accessing the system through TensorFlow's external expansion interfaces; the collective communication transmission mode it provides (i.e., Allreduce) is another group of communication interfaces parallel to those offered by the original TensorFlow system. As an external function library, it does not modify the core layer code of the TensorFlow platform. The TensorFlow-Allreduce scheme is a separate body of code that calls, from outside the TensorFlow platform, the application programming interfaces (APIs) provided by the platform. It should be understood that a machine learning platform can also be divided into an application layer and a core layer: the application layer is used to receive the model input by the user, the data to be trained on or learned from, and to run the algorithms or code written by the user, while modules such as the runtime engine 3013, memory management 3014, and communication management 3015 described above with Fig. 3 can be considered to belong to the core layer. This development library cannot distinguish whether the physical location pointed to by the source address or destination address of a data flow diagram parameter to be communicated is host memory or GPU memory, because the TensorFlow system shields such information from the development library. This requires the MPI library it uses to be capable of perceiving, by itself, the physical location of the source or destination address of a data flow diagram parameter, so that the external mechanism can correctly read and write the data.
The TensorFlow-Allreduce scheme uses a CUDA-aware (Compute Unified Device Architecture aware) MPI library. If such an MPI library is paired with a GPU that supports the CUDA programming interface, such as an NVIDIA GPU, then the send primitive of the MPI library can determine which kind of memory holds the source address of the information to be handled by that send primitive, and the receive primitive of the MPI library can determine which kind of memory holds the destination address of the information to be handled by that receive primitive. For example, if the data to be sent is located in GPU memory, the MPI library's send primitive first copies the data in GPU memory to corresponding host memory and then sends it. In fact, however, not every MPI library is CUDA-aware, which limits the application of MPI technology to machine learning platforms and restricts the choice of MPI library.
On the other hand, in many machine learning platforms including TensorFlow, the core layer often calls non-thread-safe CUDA driver layer interfaces to access GPU memory. Thus, when a CUDA-aware MPI library is used together with a machine learning platform that uses CUDA interfaces, such as TensorFlow, the two contend for resources, which causes performance deficiencies. As can be seen, the MPI library and platforms such as TensorFlow access GPU memory through the same mechanism, and the MPI library and the TensorFlow platform core layer use different threads to access GPU memory, yet multiple threads cannot access GPU memory concurrently; that is, while one thread occupies the CUDA driver layer interface, other threads cannot use the interface and thus cannot access GPU memory. Scheduling schemes are therefore needed so that multiple threads can access GPU memory, for example a mutex lock or a stream synchronization mechanism. Because Baidu's scheme is an external function library, it cannot perceive, during the execution of a function, the relationships among the related sub-functions and the calling process. For example, when Baidu's scheme uses the send primitive and receive primitive of the MPI function library, if the transmitted data is located in GPU memory, then the thread executing the send primitive or receive primitive is locked during the entire execution of the primitive, or the computing platform treats the send or receive primitive as an instruction managed by a stream synchronization mechanism. In fact, however, the execution of a send primitive or receive primitive comprises multiple sub-processes, and not all sub-processes need to access GPU memory; this brings additional waiting time overhead and affects the efficiency of message transmission. Taking the send primitive as an example, executing a send primitive includes sub-processes such as splitting data, inserting pointers, and memory copying, and only memory copying needs to access GPU memory.
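The overhead just described — locking an entire send primitive although only its memory-copy sub-process touches GPU memory — can be illustrated schematically. The lock and the sub-process bodies below are stand-ins, not CUDA calls:

```python
# Schematic contrast between coarse locking (the whole primitive holds the
# lock guarding the CUDA driver interface) and fine locking (only the
# memory-copy sub-process, the one that actually touches GPU memory, holds it).
import threading

cuda_lock = threading.Lock()   # stand-in for the non-thread-safe CUDA interface

def send_primitive_coarse(payload):
    with cuda_lock:                                # lock held for everything
        chunks = [payload[i:i+4] for i in range(0, len(payload), 4)]  # split data
        copied = b"".join(chunks)                  # stand-in for the GPU memory copy
    return copied

def send_primitive_fine(payload):
    chunks = [payload[i:i+4] for i in range(0, len(payload), 4)]      # no lock needed
    with cuda_lock:                                # lock only the memory copy
        copied = b"".join(chunks)
    return copied

print(send_primitive_fine(b"parameters") == b"parameters")  # True
```

Under the coarse scheme, threads of the platform core layer are blocked even while the primitive is merely splitting data or inserting pointers — the waiting overhead the text attributes to Baidu's external-library approach.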
This application proposes a data transmission method in a distributed system that can simplify the communication process when MPI technology is applied to data flow diagram computation. The method can be realized by software code that is included in the computing platform software and deployed in the distributed computing system. The description below takes a data flow diagram as the computing object of the computing platform and the transmission of data flow diagram parameters during computation as its example; this application does not limit the computing objects of the computing platform, nor the types of data transmitted during computation. Copies or subgraphs of the data flow diagram to be trained are kept on multiple physical machines on which the computing platform is deployed, and the distributed computing system includes a first compute node and a second compute node. In the embodiment corresponding to Fig. 5, the first compute node and the second compute node are different compute nodes. At run time, the program code of the invention runs in the host memory of a server, or in host memory and GPU memory. The description below refers to Fig. 5. It should be understood that, unless otherwise stated, the numbering of S501 to S508 below does not represent an order of step execution; for example, no order is given for the execution of S501 and S502.
S501: the first compute node generates a first triple from the name, size, and correspondent-peer identifier of a first data flow diagram parameter in a first graph data structure, using a first interface parameter generation algorithm. The first triple includes a message tag, a message size, and a destination process sequence number, where the message tag corresponds to the name of the first data flow diagram parameter, the message size corresponds to the size of the first data flow diagram parameter, and the destination process sequence number corresponds to the process in the second compute node that receives the first data flow diagram parameter.
The first graph data structure in the first compute node stores the name, size, and correspondent-peer identifier of the first data flow diagram parameter in a first data flow diagram, where the first data flow diagram parameter is a parameter carried by a connection edge of the first data flow diagram, and the correspondent-peer identifier of the first data flow diagram parameter in the first data flow diagram corresponds to the second compute node.
The first graph data structure can be a different data structure on different computing platforms, which this application does not limit. For example, on the TensorFlow platform it can be the Tensor data structure. As can be seen, what the first data flow diagram and the second data flow diagram record as the first data flow diagram parameter is a data flow diagram parameter transmitted from the first compute node to the second compute node.
The name of the first data flow diagram parameter can be a field in the first graph data structure that identifies the first data flow diagram parameter, or it can be information dispersed within the first graph data structure; that is, the name of the first data flow diagram parameter may be obtained by analyzing the information in the first graph data structure. The specific realization differs between computing platforms; for TensorFlow, see the relevant paragraphs later in this application.
The size of the first data flow diagram parameter is used to indicate the storage space occupied by the first data flow diagram parameter, that is, the data volume of the data flow diagram parameter. The size of the first data flow diagram parameter can be obtained from a field in the first graph data structure, for example a field in bytes whose value records this parameter's size, such as 3000 or 4000. It can also be indicated by information dispersed within the first graph data structure; for example, multiple substructures in the first graph data structure may each identify a part of the data volume of the first data flow diagram parameter, and the size of the first data flow diagram parameter can be calculated from these pieces of information.
The communication peer identifier of the first data flow graph parameter in the first data flow graph may be: an identifier of the second computing node; an identifier of the storage device where the destination address of the first data flow graph parameter is located, that storage device residing in the second computing node; an identifier of the process in the second computing node that receives the first data flow graph parameter; or any other information indicating the receiving end of the first data flow graph parameter. The present application does not limit this.
In summary, the first graph data structure in the first computing node stores the name, size, and communication peer identifier of the first data flow graph parameter in the first data flow graph. The first graph data structure may contain fields carrying these three kinds of information directly, or it may store information from which the name, size, or communication peer identifier of the first data flow graph parameter can be derived. That is, "stores" means that the information can either be read directly from the first graph data structure or be obtained by analyzing information in it.
Of course, these three kinds of information may also be stored in one or more other data structures in the first computing node.
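As a non-limiting illustration, the per-edge metadata described above can be sketched as a small record type. The field and class names here are hypothetical, chosen for clarity; they are not taken from any real platform data structure.

```python
from dataclasses import dataclass

@dataclass
class EdgeParamMeta:
    """Illustrative record of the three pieces of information the graph
    data structure stores for a data flow graph parameter on an edge."""
    name: str      # name of the data flow graph parameter
    size: int      # storage space occupied, e.g. in bytes
    peer_id: str   # communication peer identifier (node, device, or process)

# Example entry: a 4000-byte parameter whose peer is the second computing node.
meta = EdgeParamMeta(name="edge_w_to_y", size=4000, peer_id="node2")
```

Whether these fields are literal struct members or values derived by analyzing the graph data structure is platform-specific, as the text notes.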
For example, on the TensorFlow platform, S501 and S503 can be implemented by adding a meta-information management module (concretely, a segment of code) to the memory management modules (such as 3014 in Fig. 3) of the first and second computing nodes. This module stores the information of the data flow graph parameters on the edges of the data flow graphs in the first and second computing nodes into a data structure containing the name, size, and communication peer identifier of those parameters.
In addition, it should be understood that on a typical machine learning platform, because communication is out of order and random, each process executes the operations corresponding to various primitives; that is, one process may execute both send operations and receive operations, and in most cases there is no process dedicated solely to executing send primitives or solely to executing receive primitives.
S503: The second computing node generates a second triple from the name, size, and communication peer identifier of the first data flow graph parameter in the second graph data structure, using a second interface parameter generation algorithm that is identical to the first interface parameter generation algorithm. The second triple comprises the message tag, the message size, and a source process sequence number, where the source process sequence number corresponds to the process in the first computing node that sends the first data flow graph parameter.
Here, the second graph data structure in the second computing node stores the name, size, and communication peer identifier of the first data flow graph parameter in the second data flow graph; the communication peer identifier of the first data flow graph parameter in the second data flow graph corresponds to the first computing node.
The second data flow graph is stored in the second computing node. It may be a copy of the first data flow graph, or the two may be two subgraphs of one data flow graph. For the name, size, and communication peer identifier of the first data flow graph parameter in the second data flow graph, refer to the corresponding part of S501; this is not repeated here.
In this way, the interface function parameters required by the MPI send primitive and the MPI receive primitive can be obtained without interacting with the communication peer or the user.
It should be noted that the message tag, the message size, and the source (or destination) process sequence number are usually generated with different algorithms; that is, the first interface parameter generation algorithm comprises a first algorithm, a second algorithm, and a third algorithm. These three algorithms convert the information in the first and second graph data structures into the above triples, in the format required by the MPI interface parameters.
Saying that the first interface parameter generation algorithm outlined above is identical to the second interface parameter generation algorithm means that the first interface parameter generation algorithm comprises a first, second, and third algorithm, and the second interface parameter generation algorithm comprises a first, second, and third algorithm identical or corresponding to those of the first.
For S501, one implementation is: determine the message tag in the first triple from the name of the first data flow graph parameter in the first graph data structure and the first algorithm; determine the message size in the first triple from the size of the first data flow graph parameter in the first graph data structure and the second algorithm; and determine the destination process sequence number in the first triple from the communication peer identifier of the first data flow graph parameter in the first graph data structure and the third algorithm.
Correspondingly, for S503, one implementation is: determine the message tag in the second triple from the name of the first data flow graph parameter in the second graph data structure and the first algorithm of the second interface parameter generation algorithm; determine the message size in the second triple from the size of the first data flow graph parameter in the second graph data structure and the second algorithm of the second interface parameter generation algorithm; and determine the source process sequence number in the second triple from the communication peer identifier of the first data flow graph parameter in the second graph data structure and the third algorithm of the second interface parameter generation algorithm.
Here, the message tag indicates the data sent by the MPI send primitive. The message tag can be obtained by processing the name of the first data flow graph parameter with the first algorithm. The first algorithm may be any algorithm that converts a value of arbitrary binary length into a value of fixed binary length, such as a hash algorithm, or any other algorithm that converts the name of the first data flow graph parameter into the format required of a message tag in the interface parameters of an MPI primitive.
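A minimal sketch of such a first algorithm follows, using a hash to map an arbitrary-length name to a fixed-width tag. The function name and the choice of MD5 are illustrative; the only property relied on is that the mapping is deterministic and fits the tag range (the MPI standard guarantees tags at least up to 32767).

```python
import hashlib

def name_to_tag(param_name: str, tag_bits: int = 15) -> int:
    """Map an arbitrary-length parameter name to a fixed-width non-negative
    integer usable as an MPI message tag (15 bits stays within the minimum
    tag range every MPI implementation must support)."""
    digest = hashlib.md5(param_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << tag_bits)

tag = name_to_tag("edge_w_to_y")
assert 0 <= tag < 32768
# Sender and receiver compute the same tag from the same name,
# with no negotiation between the two nodes.
assert tag == name_to_tag("edge_w_to_y")
```

Any other deterministic fixed-width mapping would serve equally, as the text says.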
The message size indicates the size of the information sent by the MPI send primitive. For the second algorithm: in one implementation, the value of the message size field equals the value of the size of the data flow graph parameter, i.e. size. In another implementation, the value of the message size field equals the size of the data flow graph parameter plus an increment, where the increment is the size of the other information the send primitive carries, such as the packet header length mentioned below. For example, if the information sent by the MPI send primitive comprises the data to be sent plus a packet header, the message size is the size of the data to be sent plus the size of the packet header.
The source process sequence number is the sequence number of the process in the first computing node that executes the MPI send primitive; the destination process sequence number is the sequence number of the process in the second computing node that executes the MPI receive primitive corresponding to that send primitive. It should be understood that, because the first and second data flow graphs in the present application store the source node and destination node of the first data flow graph parameter, the storage device corresponding to the parameter's source address and the storage device corresponding to its destination address are known; and, in the computing platform, transmitting data flow graph parameters is part of computing the data flow graph.
The third algorithm is a mapping between process sequence numbers and communication peer identifiers: the first computing node holds a mapping between destination process sequence numbers and communication peer identifiers, and the second computing node holds a mapping between source process sequence numbers and communication peer identifiers. The third algorithm may be a functional relation, or a mapping table maintained in the computing node; the present application does not limit this. For specific implementations of the first, second, and third algorithms, and for the description of process sequence numbers, refer to the detailed TensorFlow platform example below; the implementation may also be used on other computing platforms.
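Putting the three algorithms together, the triple generation at both ends can be sketched as below. The header length and the peer-to-rank table are invented placeholders; the point of the sketch is that sender and receiver each compute from their local graph data structure and agree without interaction.

```python
import hashlib

def tag_from_name(name: str) -> int:
    # first algorithm: arbitrary-length name -> fixed-width message tag
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big") % 32768

def size_with_header(payload_size: int, header_len: int = 8) -> int:
    # second algorithm: parameter size, optionally plus a packet header length
    return payload_size + header_len

# third algorithm: a maintained mapping from peer identifier to process rank
PEER_TO_RANK = {"node1": 0, "node2": 1}   # illustrative table

def make_triple(name: str, size: int, peer_id: str):
    return (tag_from_name(name), size_with_header(size), PEER_TO_RANK[peer_id])

# The sender (on node1) maps its peer identifier to the destination rank;
# the receiver (on node2) maps its peer identifier to the source rank.
sender_triple   = make_triple("edge_w_to_y", 4000, "node2")
receiver_triple = make_triple("edge_w_to_y", 4000, "node1")
assert sender_triple[:2] == receiver_triple[:2]   # tag and size agree
```

The asserted agreement is what lets the two nodes call their matching send and receive primitives without exchanging any parameters first.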
Clearly, the MPI send primitive that takes the first triple as interface parameters corresponds to the MPI receive primitive that takes the second triple as interface parameters. Because the first and second graph data structures include communication peer identifiers, the problem that the peer process is unknowable while the data flow graph runs is solved. Moreover, the two parties transmitting the first data flow graph parameter each generate a triple from the data flow graph information stored in their own computing node, using identical interface parameter generation algorithms, so neither party needs to send its own information to the peer, and no triple-generation algorithm needs to be negotiated. The method can run independently on the data sender and receiver, producing corresponding triples without any interaction between the two parties. This simplifies the process of communicating with MPI primitives and can improve the efficiency of data transmission in a distributed computing platform.
S505: The first computing node calls a message passing interface (MPI) send primitive, with the first triple as interface parameters, to send the first data flow graph parameter to the second computing node.
Note that the term "triple" in this application merely denotes the three parameters it contains, without constraining their order. The format of the three parameters conforms to the format required of the interface function parameters carried by the MPI send primitive. Furthermore, the interface parameters of the MPI send primitive include but are not limited to the first triple, and the interface parameters of the MPI receive primitive include but are not limited to the second triple.
In one implementation, in S505 the first computing node uses the MPI send primitive, with the first triple as interface parameters, to read the first data flow graph parameter from the host memory of the first computing node, thereby sending it to the second computing node.
In one implementation, the first computing node also stores information on the storage device holding the first data flow graph parameter, i.e. the memory data type described below. Before S505, the first computing node then executes S504: if the storage device information indicates another storage device, copy the first data flow graph parameter from that other storage device to the host memory of the first computing node, where the other storage device is a memory in the first computing node other than host memory.
The storage device information may be an identifier of the storage device, a number representing the storage device from which its storage class can be determined, information identifying the type of the storage device, or any other form of information serving the same function; the present application does not limit this. For specific implementations, refer to the relevant paragraphs below.
For example, the other storage device may be GPU memory, or the memory of another processing unit such as an FPGA or DSP. For better understanding, this step can be read together with the specific TensorFlow implementation described below. The step can be understood as being carried out by the aforementioned communication management module of the computing platform, using the computing platform core layer's mechanisms for accessing other storage devices. For GPU memory, for instance, the functions offered by the platform's CUDA programming interface can copy the data to be sent into host memory. In this way, before the first computing node uses the MPI send primitive, the first data flow graph parameter is already prepared in the host memory of the first computing node, and the MPI send primitive only reads the first data flow graph parameter from host memory, without contending with the computing platform for access to other storage devices. This improves the execution efficiency of the MPI send primitive. It also allows a more flexible choice of MPI library: the MPI library is not required to support access to other storage devices, and no contention arises between the computing platform and the MPI library over access to those devices; for a detailed discussion, refer to the relevant paragraphs below. Of course, if an MPI library that supports access to GPU memory is chosen, this step can also be performed by the MPI library.
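The staging logic of S504 followed by S505 can be sketched as below. The real system would use a CUDA copy (e.g. a device-to-host memcpy) and an MPI send; here both are injected as plain callables so the control flow stands alone, and all names are illustrative.

```python
def send_param(param, mpi_send, copy_to_host):
    """Sketch of S504 + S505: if the parameter lives outside host memory
    (e.g. in GPU memory), stage it into host memory first, so that the
    send step only ever reads host memory. `mpi_send` and `copy_to_host`
    stand in for the real MPI and CUDA calls."""
    if param["memory_type"] != "host":
        # S504: cross-device copy performed by the platform, not the MPI library
        param["buffer"] = copy_to_host(param["buffer"])
        param["memory_type"] = "host"
    # S505: the send primitive reads only host memory
    mpi_send(param["buffer"])

sent = []
send_param({"memory_type": "gpu", "buffer": b"\x01\x02"},
           mpi_send=sent.append,
           copy_to_host=lambda b: bytes(b))   # simulated device-to-host copy
assert sent == [b"\x01\x02"]
```

Keeping the copy on the platform side is what decouples the MPI library choice from device-memory support, as argued above.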
It should be understood that the first data flow graph parameter may be stored in a buffer area of host memory (explained in the relevant paragraphs of this application) or in host memory space allocated to the user; the present application does not limit this. For example, when the MPI library is paired with RDMA technology, data at any registered address in host memory can be obtained directly, whereas when the MPI library is paired with TCP/IP technology, a first data flow graph parameter stored in user memory space must first be copied into the MPI buffer area or the data buffer area (see below) before use.
That is, the data buffer area referred to below can be set up in both the source computing node and the destination computing node. Used together with the MPI library's original buffer area (the two may be collectively called the buffer area), and provided the buffer area is not fully occupied, it better tolerates asynchronous progress of send and receive operations, suiting the complex, asynchronous, and out-of-order transmission and reception of many pieces of data required on a learning platform.
S507: The second computing node calls an MPI receive primitive, according to the second triple, to process the first data flow graph parameter.
It should be appreciated that "processing" the first data flow graph parameter with an MPI receive primitive can correspond to different operations in different scenarios, which the present application does not limit. For example, it can be one or more of the following: calling the MPI receive primitive to receive the first data flow graph parameter into the data buffer area of host memory; calling the MPI receive primitive to modify the tag of the first data flow graph parameter, making the parameter in host memory available to the process computing the data flow graph; or storing the first data flow graph parameter from the data buffer area to its destination address. For how the MPI receive primitive processes the first data flow graph parameter, refer further to the relevant paragraphs in the explanation below that takes the TensorFlow platform as an example.
In one implementation, the receive primitive for the first data flow graph parameter carries the parameter's destination address. S507 is then implemented as: calling the MPI receive primitive with the second triple as its interface parameters, and storing the first data flow graph parameter from the data buffer area to the destination address. For example, the destination address lies in user memory space within host memory.
If the destination address lies in another storage device, i.e. a memory in the second computing node other than host memory, such as GPU memory, and the MPI library in use supports access to storage devices other than host memory, the first MPI receive primitive can itself store the data to the corresponding destination address in GPU memory. In another implementation, this case is handled after S507 through the computing platform's own mechanism for accessing other storage devices. That is, S508: if the destination address of the first data flow graph parameter corresponds to another storage device, the second computing node stores the first data flow graph parameter from host memory to the destination address, where the other storage device is a memory in the second computing node other than host memory.
S508 is similar to S504 of the first computing node described earlier; for its explanation and beneficial effects, refer to the paragraphs on that step and to the relevant paragraphs below. S507 can be regarded as the MPI client mentioned earlier.
It should be understood that each of the multiple physical machines on which the distributed machine learning platform is deployed stores the data flow graph, and processes execute the code of the machine learning platform to train it. So for the first data flow graph parameter in the data flow graph, the first computing node is the sending end, while for another piece of data in the data flow graph, that computing node may be the receiving end. For the specific implementation of S505 and S507, refer to the relevant description below.
Because the machine learning platform executes instructions out of order, data written into the receiving end's host memory may not be processed promptly by an MPI receive statement, and the MPI buffer area that comes with the MPI library is small and cannot satisfy the data transmission requirements of machine learning, easily several megabytes at a time. Therefore a data buffer area can be set aside in the host memory of the second computing node, dedicated to storing the data used by MPI primitives; for a concrete analysis, see the relevant paragraphs of this application.
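A minimal sketch of such a dedicated data buffer area follows. The capacity, the triple-keyed index, and the method names are all invented for illustration; the point is only that arrived messages can be parked, keyed by their triple, until a matching receive claims them.

```python
class DataBufferArea:
    """Sketch of the dedicated receive-side buffer described in the text:
    larger than the MPI library's own buffer, it parks messages whose
    matching receive primitive has not been posted yet."""

    def __init__(self, capacity=64 << 20):   # e.g. 64 MB; purely illustrative
        self.capacity = capacity
        self.used = 0
        self.pending = {}                    # triple -> payload

    def stage(self, triple, payload):
        """Park an arrived message; False means the caller must retry."""
        if self.used + len(payload) > self.capacity:
            return False
        self.pending[triple] = payload
        self.used += len(payload)
        return True

    def claim(self, triple):
        """A matching receive primitive takes the parked payload."""
        payload = self.pending.pop(triple)
        self.used -= len(payload)
        return payload

buf = DataBufferArea()
assert buf.stage((123, 4000, 0), b"x" * 4000)
assert buf.claim((123, 4000, 0)) == b"x" * 4000 and buf.used == 0
```

Decoupling arrival from consumption in this way is what tolerates the platform's out-of-order execution, as the surrounding text argues.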
Then, in one implementation, S507 includes: detecting, with an MPI probe primitive, the data buffer area in the host memory of the second computing node, to obtain the second triple of the first data flow graph parameter, the data buffer area being dedicated to storing data processed by MPI primitives; and calling the MPI receive primitive, whose interface parameters include the second triple, to process the first data flow graph parameter.
In one implementation, the second computing node runs a first thread and a second thread, and S507 includes: the first thread detects, with a message passing interface (MPI) probe primitive, the data buffer area in host memory, to obtain the second triple; the first thread calls a first MPI receive primitive according to the second triple in the data buffer area, to process the first data flow graph parameter, the second triple in the data buffer area being obtained by the second computing node from the MPI send primitive; and the second thread, on determining that the first data flow graph parameter has been processed by the first MPI receive primitive, modifies a second MPI receive primitive into an MPI wait primitive, where the second MPI receive primitive is the receive primitive corresponding to the first data flow graph parameter that the second thread has not yet executed, its interface parameters include the second triple generated by the second computing node, and the MPI wait primitive waits for the first MPI receive primitive to finish executing.
Here, the second triple may be obtained from the interface parameters of the received MPI send primitive, or analyzed out of the data transmitted by that send primitive; the present application does not limit this.
That is, the second computing node can start an additional thread (which may be called a polling thread) to execute the MPI probe primitive, detecting the buffer area of the second computing node's host memory, which includes the data buffer area described above; the data buffer area is often larger than the MPI buffer area that comes with the library (explained in the relevant paragraphs below). It can thereby discover data that has not been processed promptly by an MPI receive primitive. The thread can execute the MPI probe primitive to examine the buffer area in a polling manner; once it finds such data, it calls the MPI receive primitive corresponding to that data (called the first MPI receive primitive for distinction), and the MPI receive primitive originally still to be executed (called the second MPI receive primitive for distinction) is modified into an MPI wait primitive, which waits for the first MPI receive primitive to finish executing. When the first MPI receive primitive has finished, the thread continues polling, so as to go on processing data awaiting an MPI receive primitive.
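The polling thread's control flow can be sketched as follows. The real system would loop on an MPI probe primitive; here a queue stands in for the probe source and a dict records posted receives, so the sketch shows only the structure of "probe, issue first receive, downgrade posted second receive to a wait". All names are illustrative.

```python
import queue

def polling_loop(probe_queue, posted_recvs, handled):
    """Sketch of the polling thread: drain probe results (MPI probe in the
    real system), issue a first receive for each discovered triple, and mark
    any matching already-posted second receive so that the compute thread
    turns it into a wait primitive instead."""
    while True:
        try:
            triple = probe_queue.get_nowait()   # stands in for the MPI probe
        except queue.Empty:
            break                               # nothing pending; real code keeps polling
        handled.append(triple)                  # stands in for the first MPI receive
        if triple in posted_recvs:
            posted_recvs[triple] = "wait"       # second receive becomes an MPI wait

q = queue.Queue()
q.put((123, 4000, 0))
posted = {(123, 4000, 0): "recv"}
done = []
polling_loop(q, posted, done)
assert done == [(123, 4000, 0)] and posted[(123, 4000, 0)] == "wait"
```

The downgrade to a wait is the key step: the compute thread no longer re-receives data the polling thread already handled, it only waits for that first receive to complete.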
In this way, data can be processed by receive primitives more promptly, and the first computing node's other pending send primitives can execute sooner, thereby improving data transmission efficiency.
The above describes the machine learning platform's data transmission method from sending end to receiving end. By using the local graph data structure and the interface parameter generation algorithm, the method obtains the interface function parameters required to use the MPI primitives, avoiding the matching of parameters between sending end and receiving end before data is transmitted and improving communication efficiency. Further, it obtains the storage location of the data to be transmitted at the sending and receiving ends, so that when that location is not host memory, the data is moved across storage devices within the physical machine by the machine learning platform's own mechanism, before sending and after receiving. This widens the range of MPI libraries that can be chosen, and prevents the MPI library and the machine learning platform from contending for resources when moving data across storage devices. Furthermore, by setting up a dedicated data buffer area and a polling thread, the message-passing communication buffer area allows the message send primitive to transmit data, and to return immediately once transmission completes, even when the message receive primitive has not yet been called and the message's final destination address is unknown. The buffer area temporarily holds data for a future message receive primitive, so that the send primitive need not synchronize its operation with the receive primitive, removing the inherent timing constraint between the two. The sender need not wait synchronously, which saves execution time and helps improve performance.
The above improvements allow the MPI library to adapt well to a machine learning platform and improve communication efficiency. Because the MPI library is a technology of the high-performance transmission field, the machine learning platform can thereby fully use the resources of a high-performance transmission network, greatly improving communication efficiency and hence the computational efficiency of the computing platform.
For further technical details of the method of transmitting data in the machine learning platform of Fig. 5 above, and for detailed explanations of the terms and steps involved and the beneficial effects of each step, refer further to the other relevant paragraphs of this specification.
It should be appreciated that the idea described in the above method can be used on a variety of computing platforms, for example the machine learning platform described in detail below, a graph computing platform, a stream computing platform, and so on; the present application does not limit this.
A process by which a computing platform involved in this application computes a data flow graph is described below. It should be understood that this process merely explains how a computing platform computes a data flow graph; it is only an example, and the present application does not limit the process. The process applies to machine learning platforms such as TensorFlow, as referred to in this application. Generally, it comprises data flow graph creation (also called "data flow graph definition") and data flow graph running. In one implementation, data flow graph creation can be subdivided into substeps such as full-graph construction, subgraph extraction, graph partitioning, and graph optimization; data flow graph running can be subdivided into substeps such as input data filling, kernel function execution, and output data retrieval. For example, steps S501 to S507 of the method proposed in this embodiment can be regarded as belonging to the kernel function execution substep, while writing information such as the names and sizes of data flow graph parameters into the graph data structure before step S501 belongs to the data flow graph creation process.
Data flow graph creation converts an algorithm written by the user in a programming language into a data flow graph structure intelligible to the computing platform.
Specifically, it includes full-graph construction: all the algorithm code written by the user is converted into a data flow graph structure. Subgraph extraction is then applied to the converted data flow graph structure, because the data flow graph often contains nodes and edges unrelated to obtaining the final computation result. In one case of subgraph extraction, therefore, the computing platform extracts from the full graph the nodes and edges connected to the node holding the final computation result, as the subgraph to be run. Nodes and edges not connected to the final computation result are ignored and take no part in the subsequent running process. Here, "connected" can mean connected directly to the node holding the final computation result, or connected to it through several edges.
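A minimal sketch of subgraph extraction, under the assumption (consistent with the description above) that it amounts to backward reachability from the result node, follows; the representation of edges as (source, destination) pairs is illustrative.

```python
def extract_subgraph(edges, result_node):
    """Keep only the nodes and edges connected (directly or through several
    edges) to the node holding the final computation result; everything
    else is ignored, as described in the text."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, []).append(src)
    keep, stack = {result_node}, [result_node]
    while stack:                      # walk edges backward from the result
        for p in preds.get(stack.pop(), []):
            if p not in keep:
                keep.add(p)
                stack.append(p)
    return keep, [(s, d) for s, d in edges if s in keep and d in keep]

# a -> b -> y feeds the result y; c -> d is unrelated and is dropped
nodes, kept_edges = extract_subgraph([("a", "b"), ("b", "y"), ("c", "d")], "y")
assert nodes == {"a", "b", "y"} and ("c", "d") not in kept_edges
```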
Next, we illustrate the case where each device holds part of a subgraph. The computing platform applies graph partitioning to the extracted subgraph, cutting it into several local graphs, each local graph corresponding to one device. The cut can, for example, follow a device allocation strategy specified by the user. The algorithm logic corresponding to all the nodes on one subgraph is executed by the device on which those nodes are placed. It should be understood that graph partitioning never splits a single node of the subgraph across two devices, but it may cut an edge. In that case, the computing platform can automatically insert a paired data-send operation node (SendOp) and data-receive operation node (RecvOp) into the cut local graphs. With the aid of these communication operations, the overall computation logic of the local graphs split across different devices remains exactly the same as that of the subgraph before the cut. Evidently, once the data flow graph has been split, completing the computation of these subgraphs with multiple processes requires transmitting data flow graph parameters.
That is, the various data and information in the data flow graph, such as the data flow graph parameters and the information of those parameters, may be stored in graph data structures.
It should be understood that the machine learning platform may also distribute copies of one data flow graph to multiple devices, in which case no subgraph cutting is performed. Data flow graph parameters still need to be transmitted in this case, and nodes representing send operations and receive operations can similarly be inserted into the data flow graph. Because the data flow graph includes the information of edges and nodes, the machine learning platform has many ways to insert nodes representing send and receive operations into the data flow graph, which the present application does not limit. For ease of understanding, graph partitioning is illustrated schematically below with reference to Fig. 6.
As shown in Fig. 6, before graph partitioning the data flow graph comprises nodes a, b, c, w, y, and x, with an edge from a to b, an edge from a to c, an edge from w to y, and an edge from x to y. The partitioning cuts the edges from a to b, from a to c, and from x to y; the computing platform can insert send operation nodes s1 and s2 into the subgraphs holding nodes a and x, and insert receive operation nodes r1 and r2 into the subgraph holding nodes b, c, and y. The pairwise correspondences thus established between s1 and r1 and between s2 and r2 ensure that the overall computation logic of the two local graphs is exactly consistent with that before the cut.
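The edge-cut step of Fig. 6 can be sketched as below: any edge whose endpoints land on different devices is replaced by a paired send node and receive node. The node naming (s1/r1) mirrors the figure; the placement map and return shape are illustrative.

```python
def cut_graph(edges, placement):
    """Sketch of the Fig. 6 graph cut: edges within one device are kept;
    an edge crossing devices is replaced by a send node (s_i) on the
    source device paired with a receive node (r_i) on the destination."""
    local, pairs, i = [], [], 0
    for src, dst in edges:
        if placement[src] == placement[dst]:
            local.append((src, dst))          # stays inside one local graph
        else:
            i += 1
            pairs.append(((src, f"s{i}"), (f"r{i}", dst)))  # paired SendOp/RecvOp

    return local, pairs

placement = {"a": "dev0", "b": "dev1", "w": "dev0", "y": "dev0"}
local, pairs = cut_graph([("a", "b"), ("w", "y")], placement)
assert local == [("w", "y")]                       # w -> y is not cut
assert pairs == [(("a", "s1"), ("r1", "b"))]       # a -> b becomes s1 / r1
```

Because each cut edge yields exactly one send/receive pair, the combined local graphs compute the same result as the uncut subgraph.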
In other words, graph partitioning is also a form of graph distribution. In the course of partitioning or distributing the graph, the machine learning platform has necessarily determined the allocation plan of every node connected to the node holding the final result, i.e. which device each of those nodes is assigned to; it can therefore also determine the source computing node and destination computing node of the data carried on the edges between these nodes. As can be seen, and based on the description above and Fig. 2, these subgraphs are assigned to multiple processes of the computing platform for execution, so the process corresponding to each node in a subgraph is likewise determined. Based on this information, the present application proposes that, in some embodiments, new fields can be added to the Tensor data structure, into which the memory type of the peer device and the communication peer identifier are written.
In one implementation, graph optimization may also be performed after graph cutting, i.e., the subgraphs obtained by cutting are optimized so that the data flow graph runs faster in later execution, without changing its computation logic.
The steps above belong to the stage in which the computing platform creates the data flow graph.
Next, the computing platform runs the data flow graph. In this step, the computing platform schedules each device to execute the data flow graph and obtains the final computation result of the algorithm.
In one implementation, this includes input data filling: the computing platform reads the external data sets to be computed from storage devices and fills them into the variable nodes of the data flow graph, so that the operators in the compute nodes have input data. Then multiple computation threads compute the data flow graph by executing the kernel functions associated with the subgraphs on their corresponding devices. Specifically, a computation thread queues the nodes of its device's subgraph according to a certain scheduling strategy and executes the kernel function of the operator on each node in turn, obtaining intermediate results of the algorithm. The execution order of the nodes is determined dynamically by the scheduling strategy and the load at runtime. Common scheduling strategies include the fully synchronous strategy (Bulk Synchronous Parallel, BSP), semi-synchronous strategies such as SSP (Stale Synchronous Parallel), and the asynchronous strategy (Asynchronous Parallel, ASP). It should be noted that in the machine learning field the computation of multiple threads is not required to be synchronized, so in most cases the computation of such learning platforms is asynchronous and random in character.
It should be noted that among these pending nodes one kind is the communication operation node, i.e., the data send operation nodes (SendOp) and data receive operation nodes (RecvOp) inserted during the graph cutting process described earlier, such as s1, s2, r1 and r2 in Fig. 6. The communication process using MPI primitives described in this application is precisely the data send or data receive operation the computing platform performs when executing these nodes. By contrast, in the existing TensorFlow platform, communication operations are implemented with the Remote Procedure Call Protocol (RPC) mechanism.
Finally, the computing platform completes the computation, outputs the calculation result from the node representing the final result, and returns it to the user program.
Below, with reference to Fig. 7 and taking the open-source TensorFlow machine learning platform as an example, the process of data flow graph parameter communication based on the method described herein is explained. It should be understood that the implementation details of the following data flow graph parameter communication process are also applicable to other computing platforms, and this does not limit the application. With the following communication process, a common computing platform such as TensorFlow can apply MPI technology to communication during data flow graph computation in a simplified way, without negotiating client information with the communication peer before data transmission. The MPI interfaces are used more flexibly and in a manner that matches the computational characteristics of the platform, so the communication capability of a high-performance network can be better exploited. In tests on the hardware environment of the following embodiment, the methods below improved communication efficiency by 50%, substantially reducing the time the computing platform needs to compute a data flow graph.
It should be noted that in the following process the sender of the data may be regarded as the first compute node mentioned above, and the receiver as the second compute node mentioned above. The sender and the receiver may be the same compute node or different compute nodes, and may be deployed in the same physical machine or in different physical machines; the data flow graph parameter communication process below applies in all of these cases.
In one example, the server on which the TensorFlow machine learning platform runs is configured with NVIDIA GPU cards and an InfiniBand network card, and the message-passing communication library it uses provides MPI interface functions. The NVIDIA GPU cards provide computation acceleration through the CUDA programming interface, and the InfiniBand network card provides efficient communication capability through the RDMA protocol.
As shown in Fig. 7, the modules of the TensorFlow (5011) software framework involved in this embodiment include: the Distributed Runtime module 5012, which is the runtime engine of TensorFlow; it has the functions of the runtime engine described earlier in this application and can execute the corresponding method steps. The Common Runtime module 5013 implements memory management in TensorFlow; it has the functions of the memory management module described earlier and can execute the corresponding method steps. The Remote Rendezvous module 5014 implements communication management in TensorFlow; it has the functions of the communication management module described earlier and can execute the corresponding method steps. It should be understood that each module described in this embodiment is a segment of code, and the code of one module can be considered to be written together contiguously. The Common Runtime module 5013 contains the data flow graph, Graph 5016. The server further includes host memory 5021, GPU memory 5022 and the InfiniBand network card 5023.
The parts shown in dashed boxes in Fig. 7 are the improvements this embodiment makes on the basis of the existing TensorFlow software framework. Inside the Distributed Runtime, this embodiment adds an MPI scheduling function. Inside the Common Runtime, it adds a meta-information management function for managing the meta-information of data flow graph parameters, which includes the size, name and peer identifier of a data flow graph parameter as referred to below; in one implementation the meta-information further includes the storage location of the data flow graph parameter. Management may be at least one of operations such as adding, deleting and modifying. Inside Remote Rendezvous, this embodiment adds an MPI client function. An MPI library is integrated into the TensorFlow platform, and a message-passing communication buffer (namely the buffer described below) is also allocated in the host memory used by TensorFlow, for the instructions of the MPI library to use. As can be seen, the following improvements all lie in the core layer of the TensorFlow platform, so that operations such as interface parameter generation and interface invocation are hidden inside TensorFlow's original data flow graph creation and execution processes rather than being exposed for application developers to call. As an improvement mechanism integrated inside the TensorFlow platform, these changes do not alter TensorFlow's original programming model and can accelerate existing application programs.
The information required for message-passing communication is saved in the data structures of the data flow graph; specifically, this can be the name, size and peer identifier of a data flow graph parameter mentioned above. This step is executed by the Common Runtime module 5013 of TensorFlow when the data flow graph is created. TensorFlow's existing Graph 5016 module uses a series of data structures such as Tensor to save the connection edges and the data flow graph parameters they carry. These data structures already contain the information indicating the name and the size of a data flow graph parameter, and they are the data that the MPI primitives will transmit. The information contained in the existing Tensor data structure satisfies the existing transmission of data flow graph parameters, which uses RPC communication. However, the existing Tensor data structure contains neither the peer process information that the MPI send and receive primitives need to carry nor the memory type of the parameter to be communicated. In the embodiment of the present invention, a memory type field of the data flow graph parameter (for example denoted Dev type) and a peer identifier field are added to the Tensor data structure. It should be noted that, for the information mentioned above, fields representing it can be defined in the Tensor data structure. For example, a dev_type field can be defined to store the memory type corresponding to the local-end node of a data flow graph parameter: for the send primitive, the local-end node is the source node; for the receive primitive and the wait primitive executed by the receiving end, the local-end node is the destination node. As another example, a memory field can be defined to store the memory type corresponding to the peer node: for the send primitive, the peer node is the destination node; for the receive primitive and the wait primitive executed by the receiving end, the peer node is the source node. In another form, multiple fields can be defined to store the memory types corresponding to both the local-end node and the peer node of a data flow graph parameter. The information may also be carried in multiple sub-data-structures inside the Tensor data structure, in which case the parts relevant to the above information need to be parsed or concatenated; the embodiment of the present invention does not limit this, and the computing platform provides some auxiliary means to parse the contents of the Tensor data structure so as to obtain the four kinds of information above.
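The fields described above can be pictured with a minimal sketch: alongside the existing name and size, an embodiment records a memory-type value and a peer device identifier. The class and field names below (TensorMeta, dev_type, peer_dev) and the enumeration values are illustrative assumptions, not the platform's actual definitions.

```python
from dataclasses import dataclass

# example enumeration of memory types, as discussed below
HOST_MEM, CUDA_GPU = 0b01, 0b10

@dataclass
class TensorMeta:
    name: str        # data flow graph parameter name, e.g. "Edge_x_y_0"
    size: int        # parameter size in bytes, e.g. 4000
    dev_type: int    # memory type field added by the embodiment
    peer_dev: str    # peer device identifier field added by the embodiment

meta = TensorMeta(name="Edge_x_y_0", size=4000,
                  dev_type=CUDA_GPU, peer_dev="Dev A")
```

The send and receive primitives can then read all four kinds of information from one structure without consulting the peer.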
The size of a data flow graph parameter is the data volume of that parameter, i.e., the memory space the parameter occupies, in bytes. It records the numeric value of the size of the data flow graph parameter to be transmitted, for example 3000 or 4000.
While the Common Runtime module creates the data flow graph, it traverses at least the connection edges related to the node where the final result resides (it may, for example, also traverse all connection edges in the data flow graph). For the data flow graph parameter carried on a connection edge, based on the result of the data flow graph cutting, it learns the information indicating the memory type of the parameter carried on that edge and writes this information into the parameter's Tensor data structure — for example by filling in the defined memory type field, or by distributing it over multiple fields. In one implementation, the memory type refers to the memory type corresponding to the local-end node of the data flow graph parameter: for the send primitive the local-end node is the source node, and for the receive primitive and the wait primitive executed by the receiving end, the local-end node is the destination node. Based on the peer device identifier contained in the name of the edge, it learns the identifier of the peer device corresponding to the parameter carried on that edge and writes this identifier into the Tensor data structure, specifically into the peer identifier field therein. For example, in a TensorFlow data flow graph the name of a connection edge has the format:
[src_device];[src_incarnation];[dst_device];[tensor_name];[frame_id]:[iter_id]
Here the [dst_device] field indicates the device identifier of the destination node (i.e., the receiving end) of the connection edge, and the [src_device] field indicates the device identifier of the source node (i.e., the sending end). These device identifiers are usually character strings. [src_device] may be abbreviated Src Dev, and [dst_device] may be abbreviated Dst Dev.
For the memory type field, different enumerated values can identify different memory types: for example 01 for host memory and 10 for GPU memory; or 0 for host memory and 1 for GPU memory; or 001 for host memory, 010 for GPU memory and 100 for other hardware memories, and so on. This application imposes no restriction.
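A minimal parser for the connection-edge name format quoted above shows how the peer device identifier can be read straight off the edge name; the function name and the example edge name are ours.

```python
def parse_edge_name(edge_name):
    """Split a TensorFlow edge name of the form
    [src_device];[src_incarnation];[dst_device];[tensor_name];[frame_id]:[iter_id]
    into its components."""
    src_device, src_incarnation, dst_device, tensor_name, frame_iter = \
        edge_name.split(";")
    frame_id, iter_id = frame_iter.split(":")
    return {"src_device": src_device, "src_incarnation": src_incarnation,
            "dst_device": dst_device, "tensor_name": tensor_name,
            "frame_id": frame_id, "iter_id": iter_id}

# hypothetical edge name following the format above
info = parse_edge_name("Dev A;1;Dev B;Edge_x_y_0;0:0")
```

Here info["dst_device"] gives the receiving end's device identifier, which is what gets written into the peer identifier field.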
Before the Remote Rendezvous module of TensorFlow initiates the communication of a data flow graph parameter, it makes the MPI interface functions carry the information above. This step belongs to the process of running (i.e., computing) the data flow graph and can be regarded as a function of the MPI client mentioned above. It is noted that, in one implementation, the above information must be processed before it can be carried by the MPI interface functions. For example, the name and size of the data flow graph parameter and the peer identifier corresponding to it are processed and then carried as interface function arguments of the MPI primitives, while the memory type of the parameter is carried as part of the data transmitted by the MPI interface function. In this way the various existing general-purpose MPI interface functions can be used, greatly improving versatility. As another example, the memory type of the data flow graph parameter to be transmitted can also be made a function argument of the MPI interface function; in that case, of course, the definition and usage specification of the MPI interface function need to be modified, which this application does not describe in detail.
For ease of description, the MPI send primitive is taken as the example. For the MPI receive primitive, refer to the description of the MPI send primitive: the interface function arguments of the MPI receive primitive likewise carry a message size field, a message tag field and a target process rank field for a data flow graph parameter. In one implementation, the data carried by the MPI receive primitive includes the memory type corresponding to the destination address of the data flow graph parameter; details are not repeated here. The interface function arguments of the MPI send primitive include a message size field, which indicates the size of the information the send primitive is to send. In one implementation, the value of the message size field can be made equal to the value of the size of the data flow graph parameter, i.e., size. In another implementation, the value of the message size field can be made equal to the size of the data flow graph parameter plus an additional value; this added value is the size of the other information the send primitive carries, namely the header length of the send primitive mentioned below. In one implementation, the header includes the size of the data flow graph parameter, a mark identifying the parameter (such as its name), the rank of the corresponding target process and the rank of the corresponding source process. The message tag carried in the MPI send primitive indicates the data carried by the primitive, for example indicating the name of the data flow graph parameter. Since the message tag carried in the MPI send primitive is a binary value of fixed length, the name of the data flow graph parameter can be converted by some algorithm into a format conforming to the message tag, and the resulting value used as the message tag argument of the MPI send primitive; the algorithm can, for example, be a hash function. It should be understood that the checking mechanisms of the MPI interface functions can avoid the influence of hash collisions.
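The name-to-tag conversion just described can be sketched as follows, assuming a stable hash truncated to a fixed tag width; the TAG_BITS value and use of MD5 are our illustrative choices, not mandated by the text, which only requires some hash-like algorithm plus collision checking elsewhere.

```python
import hashlib

TAG_BITS = 31   # assumed tag width: MPI tags are non-negative integers

def name_to_tag(name: str) -> int:
    """Map a data flow graph parameter name to a fixed-length message tag."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    # take the first 4 bytes and mask down to the tag width
    return int.from_bytes(digest[:4], "big") & ((1 << TAG_BITS) - 1)

tag = name_to_tag("Edge_x_y_0")
```

Because both endpoints compute the tag from the same edge name, sender and receiver arrive at matching tags with no negotiation.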
On the other hand, the host executing the MPI send primitive finds the peer process rank from a process mapping table, according to the peer identifier field of the data flow graph parameter outlined above. The peer process is the process that executes the MPI receive primitive corresponding to this MPI send primitive. A process rank can be a number such as 0, 1, 2, 3 or 28. The process mapping table includes the mapping relationship between the identifier of a device in the computing platform and the rank of the process that uses that device. The peer identifier field of a data flow graph parameter stores the identifier of the receiving-end device of that parameter. It should be understood that once the process invoking a device has been pulled up, the correspondence between that process and the device does not change until the computing task finishes, and the process can know the identifier of the device it invokes; the machine learning platform can therefore generate the process mapping table. For example, during process pull-up, certain functions may transform the identifier of a device in the machine learning platform to obtain a number, and that number serves as the rank of the process corresponding to the device. Alternatively, after process pull-up, the mapping between the rank of a process and the identifier of the device it invokes may be recorded. Alternatively again, when the triple is generated, the peer identifier can be processed with some function to obtain the rank of the required process. It should be understood that these process ranks are used by the primitives of the MPI library; the machine learning platform may also save the mapping between the process ranks in the MPI library and the physical machines where the processes reside, and the rank of each process is distinct.
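The process mapping table can be sketched as a simple registry built at process pull-up time; the class and method names (ProcessMap, dev2rank) are ours, with dev2rank echoing the helper used in the MPI_Isend example below.

```python
class ProcessMap:
    """Device identifier -> MPI rank, recorded once at process pull-up so
    that no negotiation is needed when a primitive is issued."""
    def __init__(self):
        self._dev_to_rank = {}

    def register(self, device_id: str, rank: int):
        # the device/process pairing is fixed for the life of the task
        self._dev_to_rank[device_id] = rank

    def dev2rank(self, device_id: str) -> int:
        return self._dev_to_rank[device_id]

pmap = ProcessMap()
pmap.register("Dev A", 0)
pmap.register("Dev B", 1)
# the sender reads the peer-device field and looks up the rank locally
peer_rank = pmap.dev2rank("Dev B")
```

Since each process learns its own device at pull-up and ranks never change during the task, every process can hold an identical copy of this table.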
Of course, in one implementation the memory type is not carried in the MPI interface arguments but in the data transmitted by MPI; for example, the memory type is serialized into the byte stream as a field of the Tensor data structure. The MPI primitive transmits this to the peer process, and the peer process parses the received data to learn the memory type of the destination device corresponding to the received data flow graph parameter.
In the above, without any negotiation between the two transmitting sides — the program code of sender and receiver runs independently, with no interaction between them — the function argument triple <rank, size, tag> needed by the MPI send and receive primitives is obtained, where rank is the process rank of the peer, size is the size of the information to be transmitted, and tag is the message tag.
It should be noted that the function arguments of the MPI receive primitive include the destination address of the data to be transmitted (e.g., the data flow graph parameter), whereas the MPI send primitive carries only the information about the peer device and not the destination address of the data. Therefore the receiving end can receive the data the sending-end device transmits (for example into the receiving end's host memory, as described below), but the received data can only be used for training the data flow graph while the receiving-side process is calling the MPI receive primitive, or even after it has called it.
Take the cut data flow graph in Fig. 6 as an example: cross-device communication is needed between source node S2 and destination node R2. Node R2 is located on device B and carries the following information — Name: Edge_x_y_0, size: 4000, Src Dev: Dev A, Dev Type: CUDA_GPU — where Dev Type: CUDA_GPU indicates the memory type of the peer device, i.e., device A. Node S2 is located on device A and carries the following information — Name: Edge_x_y_0, size: 4000, Dst Dev: Dev B, Dev Type: CUDA_GPU — where Dev Type: CUDA_GPU indicates the memory type of the peer device, i.e., device B. The send primitive of source node S2 can then be written as MPI_Isend(tag=hash"Edge_x_y_0", size=4000+LEN_HEADER, rank=dev2rank("Dev B")), i.e., the send primitive carries the triple parameters mentioned above, while Dev Type: CUDA_GPU is carried in the data portion of the send primitive; the data portion of the send primitive may also include the name of the device where the source node resides, i.e., Dev A. The receive primitive of destination node R2 carries the interface information and can be written as MPI_Irecv(tag=hash"Edge_x_y_0", size=4000+LEN_HEADER, rank=dev2rank("Dev A")), i.e., the receive primitive carries the triple parameters mentioned above, while Dev Type: CUDA_GPU is carried in the data portion of the receive primitive; the data portion of the receive primitive may also include the name of the device where the destination node resides, i.e., Dev B.
In this example, the 4000 in size indicates the size of the data flow graph parameter carried by the MPI primitive, and LEN_HEADER indicates the length of the header described above. The data transmitted by the MPI primitive is the Tensor data structure after serialization, i.e., a group of bytes. Besides the data flow graph parameter to be transmitted, the Tensor data structure also contains some other information fields, such as the name of the Tensor data structure; after serialization these fields are called the "header". The length of the header is fixed, so a constant can be added to size.
In this way, during data flow graph computation the sending end and the receiving end obtain information such as the peer device's identifier, and generate the arguments of the MPI interface functions, without interacting, which reduces the number of communications and the waiting between processes.
The description continues with the process by which a process sends a data flow graph parameter. As stated above, the process executing the MPI send primitive must first obtain the data flow graph parameter that the send primitive is to carry. This process can be described as follows: determine whether the data flow graph parameter to be sent is located in host memory; if the parameter is located in another storage device, copy it into host memory. The other storage device may, for example, be a memory of the host other than host memory, such as GPU memory. This is because a general MPI interface function can only use data in host memory directly.
In one implementation, the host memory includes an MPI buffer, which is memory space allocated for the MPI library to use. Some MPI libraries include such a buffer for storing the data invoked by MPI primitives; the memory space may, for example, be 64 KB. Optionally, the MPI buffer may also be memory space allocated to a particular user, which can be multiplexed. Clearly the MPI buffer space is small: although it can break the synchrony of MPI-library communication to some extent — for example it can receive information smaller than 64 KB — it is easily exhausted and cannot satisfy the needs of machine learning scenarios, in which data flow graph parameters of several hundred KB or several MB are quite common. This application therefore also proposes another implementation: a further data cache is set up in the memory of the host where the computing platform resides. The address space of this data cache is larger than the MPI buffer, and it is dedicated to storing data invoked by MPI primitives. For example, a data cache of several MB or more than ten MB can be set up in memory — even one of hundreds of MB or several GB in the host's memory. It should be understood that the physical machines on which a machine learning platform is deployed are configured with fairly large memory on the hardware, which can satisfy the memory allocation demand above. Used together, the data cache and the MPI buffer expand the capacity of the MPI buffer, further strengthening the ability to handle the data used by MPI primitives. Cooperating with the polling thread described below, they accelerate the receiving end's processing of the data sent into host memory, thereby accelerating the execution of MPI send primitives and hence the data interaction in the machine learning platform. For ease of description, the MPI buffer and the data cache above are collectively called the buffer; what they have in common is that they are dedicated to storing data invoked by MPI primitives.
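The role of the enlarged data cache can be sketched as a pool of host-memory slots where early-arriving data is parked, keyed by message tag, until the matching receive primitive claims it. Slot sizes, the class name and the park/claim API are our illustrative assumptions.

```python
class DataCache:
    """Host-memory staging area dedicated to data moved by MPI primitives."""
    def __init__(self, slot_size=4 << 20, num_slots=64):   # e.g. 64 x 4 MB
        self.slot_size = slot_size
        self.free_slots = num_slots
        self.parked = {}                # tag -> bytes awaiting a receive

    def park(self, tag, data):
        """Store arriving data; False means no room, so the send must wait."""
        if self.free_slots == 0 or len(data) > self.slot_size:
            return False
        self.free_slots -= 1
        self.parked[tag] = data
        return True

    def claim(self, tag):
        """Called when the matching receive primitive finally runs."""
        self.free_slots += 1
        return self.parked.pop(tag)

cache = DataCache(slot_size=1 << 20, num_slots=2)
ok = cache.park(tag=42, data=b"x" * 1000)
```

With many large slots, sends and receives can proceed out of order; a full cache reproduces the stall described in the next paragraph.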
On the other hand, when MPI technology uses the RDMA communication mechanism, one host can remotely write data into the memory of another host without that host's CPU perceiving it; that is, even when the host that should receive the data has not yet executed the MPI receive primitive, the sending host can deliver data to it with the MPI send primitive. However, the MPI send primitive does not carry the destination address of the transmitted data — only the MPI receive primitive does — so the received data can be moved to the destination address only after the peer executes the MPI receive primitive; before the receive primitive is executed, the received data is first stored in these buffers. Consequently, if the free space in the buffers of the host that needs to receive data is insufficient to hold the data about to be sent, the MPI send primitive carrying that data cannot execute. In other words, MPI technology has synchronization and ordering constraints: for the sending end to execute MPI send primitives smoothly and continuously, the receiving end must execute the MPI receive primitives corresponding to the received data as soon as possible after the data arrives. Enlarging the buffers with the data cache described above therefore permits asynchronous, out-of-order send and receive operations to a great extent, without adding an extra message synchronization mechanism, thereby meeting the needs of computing platforms such as TensorFlow. An MPI library that requires synchronization and ordering can thus be docked to TensorFlow's asynchronous, out-of-order characteristics, which helps improve the performance of parameter communication during the machine learning platform's data flow graph communication.
In the TensorFlow platform, the step in which the sending process obtains the data flow graph parameter to be transmitted can be regarded as executed by the Remote Rendezvous module before the MPI send primitive (such as MPI_Send or MPI_Isend). For example, the Remote Rendezvous module reads the memory type field of the data flow graph parameter to be communicated and judges whether it lies in the host memory address space. If so, this step ends. If not — for example the parameter to be communicated lies in GPU memory — it executes the cudaMemcpy function provided by the CUDA programming interface to copy the parameter from GPU memory into host memory. In this way, whether or not the chosen MPI library supports accessing the GPU, it can be used in the machine learning platform without accessing GPU memory through MPI interface functions; this widens the range of choices of MPI libraries and also greatly mitigates the resource contention problem of GPU access in the Baidu scheme mentioned above. Moreover, since this step is executed by the Remote Rendezvous module in the TensorFlow platform, which belongs to the platform's core layer, a thread need not hold the process's lock on GPU memory for the whole execution of an MPI send or receive primitive; only the step of copying the data to be sent from GPU memory into host memory needs the lock. This shortens the waiting time of other threads and greatly reduces the lock contention of different processes accessing GPU memory.
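The narrowed critical section just described can be sketched as follows: only the device-to-host copy runs under the GPU lock, and the subsequent send runs lock-free. A plain byte copy stands in for cudaMemcpy, and gpu_lock and the function name are ours.

```python
import threading

gpu_lock = threading.Lock()   # stands in for the per-process GPU-memory lock

def stage_for_send(param_bytes: bytes, in_gpu_memory: bool) -> bytes:
    """Return a host-memory copy of the parameter, locking only for the copy."""
    if not in_gpu_memory:
        return param_bytes            # already in host memory: nothing to do
    with gpu_lock:                    # lock held only for the copy itself,
        host_copy = bytes(param_bytes)   # not for the whole MPI primitive
    return host_copy                  # the MPI send then runs without the lock

staged = stage_for_send(b"\x01" * 8, in_gpu_memory=True)
```

Other threads wanting the GPU wait only for the duration of one memcpy rather than for a complete send or receive.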
The computation of a data flow graph deployed across devices obviously includes the receiving and sending of data flow graph parameters, which may be regarded as the function of the MPI client mentioned above. Before a data flow graph parameter is sent, and before it is received (received here meaning processed by the MPI receive primitive at the receiving end), it is stored in the buffer of the corresponding host memory — namely the address segment in memory allocated specifically for the MPI library, which may be the MPI buffer mentioned above or the data cache.
The receiving and sending of data flow graph parameters are described below. In one implementation, the platform detects whether any data flow graph parameter in the data flow graph needs cross-device communication. If so, it determines whether the pending communication operation is a data send operation or a data receive operation. For a send operation, the data (i.e., the data flow graph parameter) is sent using the MPI_Send or MPI_Isend primitive; for a receive operation, the data is received using the MPI_Recv or MPI_Irecv primitive. The Remote Rendezvous module then uses the received data as a data flow graph parameter. Once a send or receive operation finishes, the platform again detects whether any parameter still needs cross-device communication, and the loop continues. This loop may be executed by multiple threads running on one physical machine. These threads are controlled by a scheduling mechanism and execute different instructions according to it so as to perform different operations; sending data and receiving data are two kinds of operations a thread may be dynamically scheduled to perform. For example, the scheduling mechanism defines which primitives are executed after which events occur, that is, which occurrence triggers the execution of which primitive. As mentioned above, for instance, when it is detected that a data flow graph parameter in the part of the data flow graph this host is responsible for computing needs cross-device communication, an MPI send primitive or MPI receive primitive is executed according to the operation type. Communication in a machine learning platform is often asynchronous and out of order; this mode is quite common in machine learning platforms.
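The dispatch loop described above can be sketched as follows. This is a minimal illustration, not the platform's actual code: `dispatch_pending`, the parameter dictionaries, and the `mpi_send`/`mpi_recv` stubs (which stand in for MPI_Send/MPI_Isend and MPI_Recv/MPI_Irecv) are all assumed names.

```python
# Illustrative sketch of the send/receive dispatch loop; all names are
# stand-ins, and the stubs below replace real MPI library calls.

def mpi_send(param):           # stand-in for MPI_Send / MPI_Isend
    return ("sent", param["name"])

def mpi_recv(param):           # stand-in for MPI_Recv / MPI_Irecv
    return ("received", param["name"])

def dispatch_pending(params):
    """Scan data flow graph parameters; for each one needing cross-device
    communication, issue the send or receive operation its type requires."""
    results = []
    for p in params:
        if not p["cross_device"]:      # parameter stays local: nothing to do
            continue
        if p["op"] == "send":
            results.append(mpi_send(p))
        else:                          # "recv"
            results.append(mpi_recv(p))
    return results

params = [
    {"name": "w0", "cross_device": True,  "op": "send"},
    {"name": "w1", "cross_device": False, "op": "send"},
    {"name": "w2", "cross_device": True,  "op": "recv"},
]
assert dispatch_pending(params) == [("sent", "w0"), ("received", "w2")]
```

In the real platform this scan runs repeatedly under the scheduling mechanism, with each iteration picked up by whichever thread is scheduled to it.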
It should be noted that the MPI receive primitive is also used to process the data waiting in the buffer, so that the processes or threads in the machine learning platform that perform computation can use the data in that buffer. For example, this processing may operate on the metadata of the data (e.g., confirming the state of the data, or confirming which of the data to be received has arrived), may be synchronization processing of the data (e.g., informing the computing processes or threads in the machine learning platform that the data is ready), or may be storing the data to its destination address. The program executed by an MPI receive primitive may differ from one MPI library to another.
Since one of the threads described above can execute only one MPI receive primitive at a time, and a machine learning platform deployed in a distributed manner involves interaction among multiple physical machines, a physical machine on which the machine learning platform is deployed may, within a short time, receive data sent by multiple MPI send primitives. The threads described above may then be unable to process, with MPI receive primitives, the data delivered into the physical machine's host memory in time. In another implementation, a physical machine on which the machine learning platform is deployed may allocate a thread dedicated to detecting data sent by MPI send primitives and to receiving the detected data. This is a kind of dedicated thread and does not need to be controlled by the scheduling mechanism described earlier. For example, it may be implemented by the Distributed Runtime module described above in cooperation with the Remote Rendezvous module. This improves the timeliness with which information is received by MPI receive primitives, and also reduces, at the receiving end, the time the remaining MPI send primitives wait to be executed.
The operation of this dedicated thread is described below. It uses polling to repeatedly detect whether the host memory contains the triple information of data that has been sent, and processes the data corresponding to any triple it detects, thereby accelerating the processing, by MPI receive primitives, of the data received into host memory. For ease of description, this thread may be called the polling thread. Specifically, the polling thread polls the buffer in host memory, i.e., the storage space described above that is dedicated to storing data invoked by MPI primitives; for example, it polls the MPI buffer, or, where the host memory contains both an MPI buffer and a data buffer, it polls the MPI buffer and the data buffer. This process can also be regarded as the function of the MPI scheduling described above, implemented in the host that acts as the receiving end of a data flow graph parameter. One round of the polling procedure runs as follows. The polling thread calls a detection primitive of the MPI library, such as MPI_Probe or MPI_Iprobe, to detect whether the host memory contains a data flow graph parameter that has been sent, or the triple corresponding to such a parameter, waiting for its corresponding MPI receive primitive; that is, the MPI receive primitive corresponding to the sent data flow graph parameter has not yet been executed. If not, the detection primitive continues polling the buffer in host memory. If so, the polling thread calls the MPI_Irecv primitive corresponding to the detected data flow graph parameter, so that the data flow graph parameter to be received can be received into local memory. The detection thread uses the corresponding triple <rank, size, tag> detected at the destination compute node side to determine the interface parameters of the MPI_Irecv primitive that handles the data flow graph parameter. Then, in the execution policy script executed by other threads, the MPI_Recv or MPI_Irecv primitive corresponding to this data flow graph parameter is changed to an MPI_Wait primitive, which waits for the polling thread to complete, via the MPI_Irecv primitive, the processing of the data flow graph parameter (for example, placing it into the storage space corresponding to its destination address). Once the polling thread has completed processing the data flow graph parameter via the MPI_Irecv primitive, the current round of polling ends, and the polling thread continues to poll the buffer in host memory to detect whether host memory contains another sent data flow graph parameter waiting for its corresponding MPI receive primitive. In fact, when the polling thread's MPI_Irecv primitive starts to execute, return information may be sent to the corresponding receive primitive in the execution policy script executed by other threads, to trigger changing that receive primitive into an MPI_Wait primitive. For example, an MPI request (MPI_Request) object is returned, which may contain the detected triple, so that the not-yet-executed MPI_Wait primitive originally intended to handle the data corresponding to that triple can be called according to the object. In this way, the data written into the buffer of the receiving end's host memory can be processed quickly, so the space occupied in that buffer by data that has already been received is freed and other data from the sending end can be written in quickly. That is, the sending end can reduce its waiting time before executing further send primitives, so the receiving end can process more buffered data within a shorter time, and the sending end can execute more send primitives within a shorter time.
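One round of the polling procedure described above can be sketched as follows. All names are illustrative stand-ins: `mpi_iprobe` and `mpi_irecv` imitate MPI_Iprobe and MPI_Irecv on an in-memory list rather than calling a real MPI library, and the probe stub consumes the detected triple to mimic the matching receive removing its record.

```python
# Sketch of one polling round; stubs replace the real MPI calls.
pending = []    # triples <rank, size, tag> already delivered to the buffer
received = {}   # tag -> (rank, size) of parameters handled so far

def mpi_iprobe():
    """Stand-in for MPI_Iprobe: report one waiting triple, if any."""
    return pending.pop(0) if pending else None

def mpi_irecv(rank, size, tag):
    """Stand-in for MPI_Irecv: its interface parameters come from the triple."""
    received[tag] = (rank, size)

def poll_round():
    triple = mpi_iprobe()
    if triple is None:
        return False        # nothing waiting; keep polling
    mpi_irecv(*triple)      # the detected triple determines the Irecv parameters
    return True             # the planned MPI_Recv for this tag becomes MPI_Wait

pending.append((1, 4096, 17))
assert poll_round() and received[17] == (1, 4096)
assert not poll_round()     # buffer drained; this round detects nothing
```

In the real thread the `False` branch simply loops back to the probe instead of returning.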
In one implementation, the detection can be performed continuously in an infinite loop; for example, the thread runs until the computation of the entire data flow graph is completed.
The following briefly describes how the detection thread detects data in host memory that has not yet been processed by an MPI receive primitive. In the MPI library, certain data structures keep the corresponding records: if a piece of data has not yet been processed by an MPI receive primitive, a record of it is stored in such a data structure, and an MPI detection primitive can detect the record and thereby determine that the corresponding data has not been processed by an MPI receive primitive. Once a piece of data has been received by an MPI receive primitive, its corresponding record is removed from that data structure, so that MPI detection primitives can no longer detect it. In one implementation, the above MPI receive primitive (such as MPI_Recv) and MPI wait primitive (such as MPI_Wait) can be regarded as being executed by the Remote Rendezvous module in the TensorFlow platform.
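The record-keeping just described can be sketched with a simplified in-memory structure (a real MPI library's internal data structures differ; the class and method names here are assumptions): a record stays visible to the probe until a receive consumes it.

```python
# Sketch of the bookkeeping MPI detection relies on: a sent message is
# recorded until a receive primitive consumes it, so the probe only ever
# sees unmatched messages. Not a real MPI internal API.
class PendingMessages:
    def __init__(self):
        self._records = []              # one record per unmatched message

    def on_send_arrived(self, triple):
        self._records.append(triple)    # visible to probe from now on

    def probe(self, tag):
        # like MPI_Probe: detect the record without removing it
        return any(t[2] == tag for t in self._records)

    def recv(self, tag):
        # like MPI_Recv: consume the record, so probe no longer sees it
        for i, t in enumerate(self._records):
            if t[2] == tag:
                return self._records.pop(i)
        return None

q = PendingMessages()
q.on_send_arrived((0, 1024, 7))
assert q.probe(7)        # unmatched message: detectable
q.recv(7)
assert not q.probe(7)    # consumed by the receive: no longer detectable
```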
In general, these data flow graph parameters are supplied, via MPI receive primitives, to the threads or processes at the receiving end that perform computation, so as to carry out the machine learning computation. For example, an MPI receive primitive places a data flow graph parameter into a storage space in host memory that belongs to a user, namely the user performing the machine learning. On the other hand, when it is determined that the destination address of the data flow graph parameter is not in host memory, the data flow graph parameter is copied to the device corresponding to the destination address. This step can also be regarded as part of the MPI client functionality in the Remote Rendezvous module. Taking GPU memory as the destination address as an example, the cudaMemcpy function provided by the CUDA programming interface can be called to copy the received data into GPU memory.
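The destination-address branch described above can be sketched as follows. `copy_to_device` is a hypothetical stand-in for the cudaMemcpy call, and plain dictionaries stand in for host and GPU memory.

```python
# Sketch of delivery after an MPI receive: data always lands in host
# memory first; if the destination is device (e.g. GPU) memory, a
# separate copy moves it there. copy_to_device stands in for cudaMemcpy.
host_mem = {}
gpu_mem = {}

def copy_to_device(addr, data):       # stand-in for cudaMemcpy
    gpu_mem[addr] = data

def deliver(param, dest_addr, dest_is_host):
    host_mem["mpi_buffer"] = param    # the MPI primitive wrote it here
    if dest_is_host:
        host_mem[dest_addr] = param   # destination is a host address
    else:
        copy_to_device(dest_addr, param)

deliver(b"weights", "0xA0", dest_is_host=False)
assert gpu_mem["0xA0"] == b"weights"
```

Because the copy into device memory is done with the platform's own programming interface, the MPI library itself never needs to access GPU memory.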
Because the primitives of the MPI library are used, data is first written into the buffer of host memory before being written to its destination address, while in the computation of a data flow graph the destination addresses of many data flow graph parameters are in other devices such as GPUs. Received data is therefore written into the GPU using the CUDA programming interface of the computing platform, without relying on the MPI library to support access to GPU memory. This greatly expands the types of MPI libraries that can be used, and likewise greatly alleviates the resource contention in accessing the GPU found in the Baidu solution mentioned above. Moreover, since this step is executed by the Remote Rendezvous module of the TensorFlow platform, which belongs to the platform's core layer, a process does not need to hold a process lock on the GPU memory for the whole execution of an MPI send primitive or MPI receive primitive; only the step of copying the data to be sent from GPU memory into host memory needs to be locked. This shortens the waiting time of other threads and greatly reduces lock contention among different processes accessing the GPU memory.
The methods corresponding to Fig. 5 to Fig. 7 can be run in the systems and servers shown in Fig. 3 and Fig. 7.
In summary, the data transmission method proposed in this application does not require negotiating peer information with the communication peer before data transmission. It resolves the contradiction between the synchronous, order-constrained communication of the MPI library and the asynchronous, out-of-order communication of data flow graphs, and alleviates the contention between the MPI library and the computing platform for access to GPU resources. This allows MPI technology to adapt better to a computing platform deployed in a distributed manner and to make full use of network transmission resources, improving the efficiency of data transmission in the machine learning platform and thereby increasing the business processing speed of the machine learning platform.
In another aspect, an embodiment of the present invention provides a data transmission apparatus in a distributed computing system, as shown in Fig. 8. The distributed computing system includes a first compute node and a second compute node, and the data transmission apparatus is located in the first compute node. The data transmission apparatus includes a determining module 801, configured to determine, from a first graph data structure in the first compute node, the name, size, and communication peer identifier of a first data flow graph parameter of a first data flow graph, where the first data flow graph parameter is a parameter carried on a connection edge of the first data flow graph, and the communication peer identifier corresponds to the second compute node.
A generation module 802, where the generation module 802 is configured to generate a first triple from the name, size, and communication peer identifier of the first data flow graph parameter in the first graph data structure, using a first interface-parameter generation algorithm. The first triple includes a message tag, a message size, and a destination process sequence number, where the message tag corresponds to the name of the first data flow graph parameter, the message size corresponds to the size of the first data flow graph parameter, and the destination process sequence number corresponds to the process in the second compute node that receives the first data flow graph parameter.
A communication module 803, where the communication module 803 is configured to call a message passing interface (MPI) send primitive with the first triple as interface parameters, so as to send the first data flow graph parameter to the second compute node, enabling the second compute node to call an MPI receive primitive, with a second triple corresponding to the first triple as interface parameters, to process the first data flow graph parameter. The second triple is generated from a second graph data structure in the second compute node using a second interface-parameter generation algorithm, and the second interface-parameter generation algorithm is the same as the first interface-parameter generation algorithm.
In one implementation, in the aspect of calling the MPI send primitive with the first triple as interface parameters to send the first data flow graph parameter to the second compute node, the communication module 803 is configured to read, with the first triple as interface parameters, the first data flow graph parameter from the host memory of the first compute node via the MPI send primitive, so as to send the first data flow graph parameter to the second compute node.
In one implementation, the first compute node also stores information about the storage device where the first data flow graph parameter resides, and the first compute node further includes a read module 804, configured to copy, when the storage device information indicates another storage device, the first data flow graph parameter from that other storage device into the host memory of the first compute node, where the other storage device is memory in the first compute node other than host memory.
In one implementation, the first interface-parameter generation algorithm includes a first algorithm, a second algorithm, and a third algorithm. In the aspect of generating the first triple from the name, size, and communication peer identifier of the first data flow graph parameter in the first graph data structure using the first interface-parameter generation algorithm, the generation module is configured to: determine the message tag in the first triple according to the name of the first data flow graph parameter in the first graph data structure and the first algorithm; determine the message size in the first triple according to the size of the first data flow graph parameter in the first graph data structure and the second algorithm; and determine the destination process sequence number in the first triple according to the communication peer identifier of the first data flow graph parameter in the first graph data structure and the third algorithm.
It can be seen that, in the implementations above, the data transmission apparatus shown in Fig. 8 acts as the sending end in a data transmission. In some other implementations, the data transmission apparatus shown in Fig. 8 can execute the operations corresponding to those of the sending end, so as to act as the receiving end in a data transmission. That is, in some cases, the data transmission apparatus shown in Fig. 8 can have the functions of both the sending end and the receiving end; in other words, the data transmission apparatus shown in Fig. 8 is the sending end in some data transmissions and the receiving end in others.
The following describes an implementation in which the data transmission apparatus in the distributed computing system shown in Fig. 8 acts as the data receiving end. The distributed computing system includes a first compute node and a second compute node, and the data transmission apparatus is located in the second compute node. The determining module 801 of the data transmission apparatus is configured to determine, from a second graph data structure in the second compute node, the name, size, and communication peer identifier of a first data flow graph parameter in a second data flow graph, where the communication peer identifier of the first data flow graph parameter in the second data flow graph corresponds to the first compute node.
The generation module 802 is configured to generate a second triple from the name, size, and communication peer identifier of the first data flow graph parameter in the second graph data structure, using a second interface-parameter generation algorithm. The second triple includes a message tag, a message size, and a source process sequence number, where the message tag corresponds to the name of the first data flow graph parameter, the message size corresponds to the size of the first data flow graph parameter, and the source process sequence number corresponds to the process in the first compute node that sends the first data flow graph parameter.
The communication module 803 is configured to call a message passing interface (MPI) receive primitive according to the second triple to process the first data flow graph parameter from the first compute node. The first data flow graph parameter is sent by the first compute node via an MPI send primitive whose interface parameters include a first triple corresponding to the second triple; the first triple is generated by the first compute node from a first graph data structure in the first compute node using a first interface-parameter generation algorithm, and the second interface-parameter generation algorithm is the same as the first interface-parameter generation algorithm.
In one implementation, the communication module 803 includes a first thread and a second thread, and the host memory of the second compute node includes a data buffer dedicated to storing data handled by MPI primitives. In the aspect of calling an MPI receive primitive, with the second triple as interface parameters, to process the first data flow graph parameter from the first compute node: the first thread is configured to detect the data buffer in the host memory via an MPI detection primitive, so as to obtain the second triple; the first thread is configured to call a first MPI receive primitive according to the second triple in the data buffer, so as to process the first data flow graph parameter, where the second triple in the data buffer is obtained by the second compute node according to the MPI send primitive; and the second thread is configured to modify a second MPI receive primitive into an MPI wait primitive after determining that the first data flow graph parameter has been processed by the first MPI receive primitive, where the second MPI receive primitive is the receive primitive, corresponding to the first data flow graph parameter, that has not yet been executed by the second thread, the interface parameters of the second MPI receive primitive include the second triple generated by the second compute node, and the MPI wait primitive is used to wait for the first MPI receive primitive to finish.
In this way, the processing of the data flow graph parameters already prepared in the data buffer, as indicated by the triples, is accelerated, which speeds up the receiving end's processing of received data and thereby speeds up the sending end's execution of send primitives. In addition, the data buffer strengthens the adaptability of MPI primitives to out-of-order, asynchronous send and receive operations, so that they can better match the characteristics of data transmission in the computing platform.
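The cooperation of the first thread (polling) and the second thread (which replaces its planned blocking receive with a wait) can be sketched as follows, with stubs in place of the real MPI calls; the `irecv_started` map plays the role of the MPI_Request object handed from the polling thread to the other thread.

```python
# Sketch of the two-thread coordination: if the polling thread has already
# started an Irecv for a tag, the worker thread waits on it (MPI_Wait)
# instead of issuing its own blocking MPI_Recv. Stubs replace real MPI.
import threading

irecv_started = {}   # tag -> Event standing in for an MPI_Request
lock = threading.Lock()

def polling_thread(tag):
    with lock:
        ev = irecv_started.setdefault(tag, threading.Event())
    # ... MPI_Irecv(tag) would start here; completion sets the event
    ev.set()                           # stand-in for the Irecv finishing

def worker_thread(tag, log):
    with lock:
        started = tag in irecv_started
    if started:
        log.append("MPI_Wait")         # receive already in flight: wait for it
        irecv_started[tag].wait()
    else:
        log.append("MPI_Recv")         # no Irecv yet: plain blocking receive

log = []
polling_thread(17)
worker_thread(17, log)
assert log == ["MPI_Wait"]
```

The essential invariant is that exactly one receive is ever posted per message: either the worker's blocking receive, or the polling thread's Irecv followed by the worker's wait.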
In the above implementation, in the aspect of calling the first MPI receive primitive according to the second triple in the data buffer to process the first data flow graph parameter, the first thread in the communication module 803 is configured to: when the destination address of the first data flow graph parameter corresponds to a memory space allocated to a user in the host memory of the second compute node, call the first MPI receive primitive with the second triple in the data buffer as its interface parameters, and store the first data flow graph parameter from the data buffer to the destination address of the first data flow graph parameter.
In one implementation, the data transmission apparatus further includes a storage module 805, configured to store the first data flow graph parameter from host memory to its destination address when the destination address corresponds to another storage device, where the other storage device is memory in the second compute node other than host memory.
In one implementation, the second interface-parameter generation algorithm includes a first algorithm, a second algorithm, and a third algorithm. In the aspect of generating the second triple from the name, size, and communication peer identifier of the first data flow graph parameter in the second graph data structure using the second interface-parameter generation algorithm, the generation module 802 is configured to: determine the message tag in the second triple according to the name of the first data flow graph parameter in the second graph data structure and the first algorithm in the second interface-parameter generation algorithm; determine the message size in the second triple according to the size of the first data flow graph parameter in the second graph data structure and the second algorithm in the second interface-parameter generation algorithm; and determine the source process sequence number in the second triple according to the communication peer identifier of the first data flow graph parameter in the second graph data structure and the third algorithm in the second interface-parameter generation algorithm.
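The key property relied on here — that the sender's first triple and the receiver's second triple match without any negotiation — can be illustrated as follows. The tag function is an arbitrary stand-in for the shared first algorithm; only its determinism matters.

```python
# Both nodes run the same generation algorithm over their own copy of the
# graph metadata, so their triples agree without any handshake.
def gen_triple(name, size, peer_rank):
    tag = sum(name.encode()) % 1000    # shared deterministic name -> tag map
    return (peer_rank, size, tag)

# Node A (sender, rank 0) and node B (receiver, rank 1) each consult
# their own graph data structure, which records the same name and size.
sender_triple = gen_triple("w/grad:0", 2048, peer_rank=1)    # peer = B
receiver_triple = gen_triple("w/grad:0", 2048, peer_rank=0)  # peer = A

# Message size and tag match, so B's receive matches A's send directly.
assert sender_triple[1:] == receiver_triple[1:]
```

Only the process sequence number differs between the two triples, since each side names the other as its peer.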
It can be seen that, in some cases, the data transmission apparatus corresponding to Fig. 8, corresponding to the first compute node and the second compute node above, can execute the sender-side or receiver-side methods described with reference to Fig. 5 to Fig. 7. For the various explanations involved in the data transmission apparatus corresponding to Fig. 8, and for the beneficial effects of the steps executed by each module, refer to the corresponding paragraphs above; details are not repeated here. In this way, the application of MPI technology to the communication process of computing a data flow graph is simplified: peer information does not need to be negotiated with the communication peer before data transmission, so MPI technology can better adapt to a computing platform deployed in a distributed manner, improving the efficiency of data transmission in the distributed computing system and thereby improving the efficiency with which the distributed computing system computes data flow graphs.
It should also be noted that the determining module 801, generation module 802, and communication module 803 in Fig. 8 may be different pieces of code, or processes or threads running code. The division shown in Fig. 8 is only an example; in some implementations these modules may be named or divided otherwise, for example with several modules combined into one. The determining module 801 may correspond to the memory management module 3014 above, or to the Common Runtime module 5013 in the TensorFlow platform. The generation module 802 and the communication module 803 may correspond to the communication management module 3015 above, or to the Remote Rendezvous module 5014 in the TensorFlow platform. In the case where the communication module 803 includes a first thread and a second thread, the first thread may be regarded as belonging to the runtime engine module 3013 mentioned above, or to the Distributed Runtime module 5012 in the TensorFlow platform, while the second thread belongs to the communication management module, or to the Remote Rendezvous module 5014 in the TensorFlow platform. The data transmission apparatus shown in Fig. 8 can therefore implement the functions of the modules described above; for specific implementations, refer to the descriptions above.
Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and brevity of description, only the division into the above functional modules is used as an example. In practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above. For the specific working processes of the apparatus and units described above, refer to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In another aspect, as shown in Fig. 9, which is a schematic block diagram of a physical machine provided by an embodiment of the present invention, this embodiment provides a physical machine including a processor 40 and a memory 42, where the memory 42 is a non-transitory computer-readable medium storing executable code. The physical machine may be the first compute node or the second compute node described above. The distributed computing platform runs on the processor 40 through the program in the memory 42. The physical machine can execute the executable code in the memory 42 via the processor 40 so as to perform the various methods described above. Clearly, Fig. 9 is a simpler representation than the servers shown in Fig. 3 or Fig. 7 that can run the various methods of this application. The data transmission apparatus shown in Fig. 8 may run in the architecture shown in Fig. 9, Fig. 3, or Fig. 7.
In one implementation, the non-transitory computer-readable medium storing the executable code is the memory 42, and the physical machine further includes an interface circuit 41 and a system bus 43.
The processor 40 may be implemented by multiple processors. The processor 40 may be a central processing unit (CPU). The processor 40 may also be another general-purpose processor, a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.
As with the physical machines in the embodiments corresponding to Fig. 5 to Fig. 7, the description takes a physical machine including a CPU and a GPU as an example. If the processor 40 includes a GPU, the GPU and the GPU memory are generally packaged in the same chip; that is, the processor 40 may include the memory of certain processors.
The interface circuit 41 may specifically be a communication interface of the hardware in the physical machine. The communication interface may be a wireless communication interface and may further include a radio-frequency circuit such as an antenna; for example, the wireless communication interface may be a wireless module of the physical machine. The processor 40 sends and receives data to and from other devices, such as other physical machines, through the interface circuit 41. The network interface card 3024 and the InfiniBand network interface card 5023 in the physical machines shown in Fig. 3 and Fig. 7 are one implementation of the communication interface.
The memory 42 may include volatile memory, such as random-access memory (RAM); the memory 42 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 42 may also include a combination of the above kinds of memory. There may be multiple memories 42 for use by the multiple processors 40 described above, such as the host memory and the GPU memory described in the foregoing embodiments.
The memory 42 may include an underlying storage medium and internal memory, where the internal memory is coupled to the underlying storage medium and serves as a cache for it.
The system bus 43 may include a data bus, a power bus, a control bus, a signal status bus, and the like. The system bus 43 connects the processor 40, the memory 42, and the interface circuit 41. For clarity of illustration, the various buses are all shown as the system bus 43 in Fig. 9 of this embodiment.
The physical machine provided by this embodiment of the present invention can be used to execute any of the methods described in this application, for example, the methods executed by the first compute node or the second compute node corresponding to Fig. 5 to Fig. 7. For the explanations of the terms involved in the physical machine corresponding to Fig. 9 and in the descriptions of the methods, and for the beneficial effects of the steps of the methods, refer to the corresponding paragraphs above; details are not repeated here. In this way, the application of MPI technology to the communication process of computing a data flow graph is simplified: peer information does not need to be negotiated with the communication peer before data transmission, so MPI technology can better adapt to a computing platform deployed in a distributed manner, improving the efficiency of data transmission in the distributed computing system and thereby improving the efficiency with which the distributed computing system computes data flow graphs.
Optionally, this embodiment also provides a non-transitory computer-readable medium storing an executable program, where the executable program includes instructions for executing any of the methods described in this application. The non-transitory computer-readable medium can be installed in a physical machine; when the physical machine runs, the processor of the physical machine executes the computer-executable instructions so that the physical machine executes any of the methods described in this application. For the explanations of the terms in the descriptions of the methods stored in the above non-transitory computer-readable medium, and for the beneficial effects of the steps, refer to the corresponding paragraphs above; details are not repeated here.
Optionally, the non-transitory computer-readable medium in this embodiment may be the memory 42 shown in Fig. 9.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is only a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software functional unit.
When the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
It is clear to a person skilled in the art that, for convenience and brevity of description, only the division into the foregoing functional modules is used as an example for illustration. In actual applications, the foregoing functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus is divided into different functional modules to complete all or some of the functions described above. For the specific working processes of the system, apparatus, and units described above, refer to the corresponding processes in the foregoing method embodiments; details are not described here again.
Claims (28)
1. A distributed computing system, the distributed computing system comprising a first computing node and a second computing node, characterized in that: a first graph data structure in the first computing node stores a name, a size, and a communication peer identifier of a first data flow diagram parameter in a first data flow diagram, the first data flow diagram parameter being a parameter carried on a connection edge of the first data flow diagram; a second graph data structure in the second computing node stores a name, a size, and a communication peer identifier of the first data flow diagram parameter in a second data flow diagram; the communication peer identifier of the first data flow diagram parameter in the first data flow diagram corresponds to the second computing node, and the communication peer identifier of the first data flow diagram parameter in the second data flow diagram corresponds to the first computing node;
the first computing node is configured to generate a first triple according to the name, the size, and the communication peer identifier of the first data flow diagram parameter in the first graph data structure by using a first interface parameter generation algorithm, the first triple comprising a message tag, a message size, and a destination process sequence number, wherein the message tag corresponds to the name of the first data flow diagram parameter, the message size corresponds to the size of the first data flow diagram parameter, and the destination process sequence number corresponds to a process in the second computing node that receives the first data flow diagram parameter;
the second computing node is configured to generate a second triple according to the name, the size, and the communication peer identifier of the first data flow diagram parameter in the second graph data structure by using a second interface parameter generation algorithm, the second interface parameter generation algorithm being the same as the first interface parameter generation algorithm, the second triple comprising the message tag, the message size, and a source process sequence number, wherein the source process sequence number corresponds to a process in the first computing node that sends the first data flow diagram parameter;
the first computing node is configured to invoke a message passing interface (MPI) send primitive with the first triple as an interface parameter, to send the first data flow diagram parameter to the second computing node; and
the second computing node is configured to invoke an MPI receive primitive according to the second triple, to process the first data flow diagram parameter.
2. The system according to claim 1, characterized in that, in the aspect of invoking the message passing interface (MPI) send primitive with the first triple as an interface parameter to send the first data flow diagram parameter to the second computing node, the first computing node is configured to read, with the first triple as the interface parameter, the first data flow diagram parameter from a host memory of the first computing node by using the MPI send primitive, so as to send the first data flow diagram parameter to the second computing node.
3. The system according to claim 2, characterized in that the first computing node further stores information about a storage device in which the first data flow diagram parameter is located, and the first computing node is further configured to: when the information about the storage device indicates another storage device, copy the first data flow diagram parameter from the other storage device to the host memory of the first computing node, the other storage device being a memory in the first computing node other than the host memory.
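The staging step in the claim above — copying a parameter that resides outside host memory into host memory before the MPI send primitive reads it — can be sketched as follows. This is a minimal simulation under stated assumptions: Python dicts stand in for device and host memory, and the `stage_to_host` helper and `"gpu:0"` label are hypothetical; a real implementation would use a device-specific copy such as `cudaMemcpy`.

```python
def stage_to_host(name, storage_info, device_mem, host_mem):
    """Ensure the named parameter is in host memory before MPI sends it."""
    if storage_info != "host":              # info indicates another storage device
        host_mem[name] = device_mem[name]   # simulated device-to-host copy
    return host_mem[name]

device_mem = {"w_grad": bytes(16)}   # parameter currently in device memory
host_mem = {}
buf = stage_to_host("w_grad", "gpu:0", device_mem, host_mem)
assert buf == device_mem["w_grad"] and "w_grad" in host_mem
```

After staging, the MPI send primitive can read the parameter directly from host memory, which is the case claims 2 and 3 distinguish.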
4. The system according to any one of claims 1 to 3, characterized in that the first interface parameter generation algorithm comprises a first algorithm, a second algorithm, and a third algorithm; in the aspect of generating the first triple according to the name, the size, and the communication peer identifier of the first data flow diagram parameter in the first graph data structure by using the first interface parameter generation algorithm, the first computing node is configured to:
determine the message tag in the first triple according to the name of the first data flow diagram parameter in the first graph data structure and the first algorithm; determine the message size in the first triple according to the size of the first data flow diagram parameter in the first graph data structure and the second algorithm; and determine the destination process sequence number in the first triple according to the communication peer identifier of the first data flow diagram parameter in the first graph data structure and the third algorithm;
correspondingly, in the aspect of generating the second triple according to the name, the size, and the communication peer identifier of the first data flow diagram parameter in the second graph data structure by using the second interface parameter generation algorithm, the second computing node is configured to:
determine the message tag in the second triple according to the name of the first data flow diagram parameter in the second graph data structure and the first algorithm in the second interface parameter generation algorithm; determine the message size in the second triple according to the size of the first data flow diagram parameter in the second graph data structure and the second algorithm in the second interface parameter generation algorithm; and determine the source process sequence number in the second triple according to the communication peer identifier of the first data flow diagram parameter in the second graph data structure and the third algorithm in the second interface parameter generation algorithm.
5. The system according to any one of claims 1 to 4, characterized in that, in the aspect of invoking the MPI receive primitive according to the second triple to process the first data flow diagram parameter, the second computing node is configured to: detect, by using an MPI probe primitive, a data buffer in a host memory of the second computing node, to obtain the second triple of the first data flow diagram parameter, the data buffer being dedicated to storing data processed by MPI primitives; and invoke the MPI receive primitive to process the first data flow diagram parameter, an interface parameter of the MPI receive primitive comprising the second triple.
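The probe-then-receive pattern of the claim above can be simulated in plain Python. An in-process dict stands in for the dedicated data buffer, and the `mpi_probe`/`mpi_recv` helpers are hypothetical stand-ins; with a real MPI library this would correspond to something like `MPI_Probe` inspecting a message envelope followed by `MPI_Recv` (e.g. via mpi4py), not the code below.

```python
# Dedicated buffer for data handled by MPI primitives: triple -> payload.
mpi_buffer = {(17, 4096, 0): b"\xab" * 4096}

def mpi_probe(buffer):
    """Return the triple of a pending message, like a probe primitive
    inspecting an incoming envelope (tag, size, source rank)."""
    return next(iter(buffer), None)

def mpi_recv(buffer, triple):
    """Consume the message matching the triple, like a receive primitive."""
    return buffer.pop(triple)

triple = mpi_probe(mpi_buffer)        # obtain the second triple
param = mpi_recv(mpi_buffer, triple)  # process the data flow diagram parameter
assert triple == (17, 4096, 0) and len(param) == 4096
```

Probing first lets the receiver learn the triple of an already-arrived message before posting the matching receive, which is why the buffer is dedicated to MPI-handled data.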
6. The system according to any one of claims 1 to 4, characterized in that the MPI receive primitive carries a destination address of the first data flow diagram parameter, and, in the aspect of processing the first data flow diagram parameter by using the MPI receive primitive, the second computing node is configured to invoke the MPI receive primitive with the second triple as an interface parameter of the MPI receive primitive, to store the first data flow diagram parameter from the data buffer to the destination address.
7. A data transmission method in a distributed computing system, the distributed computing system comprising a first computing node and a second computing node, characterized in that the method comprises:
determining, from a first graph data structure in the first computing node, a name, a size, and a communication peer identifier of a first data flow diagram parameter of a first data flow diagram, the first data flow diagram parameter being a parameter carried on a connection edge of the first data flow diagram, and the communication peer identifier corresponding to the second computing node;
generating a first triple according to the name, the size, and the communication peer identifier of the first data flow diagram parameter in the first graph data structure by using a first interface parameter generation algorithm, the first triple comprising a message tag, a message size, and a destination process sequence number, wherein the message tag corresponds to the name of the first data flow diagram parameter, the message size corresponds to the size of the first data flow diagram parameter, and the destination process sequence number corresponds to a process in the second computing node that receives the first data flow diagram parameter; and
invoking a message passing interface (MPI) send primitive with the first triple as an interface parameter, to send the first data flow diagram parameter to the second computing node, so that the second computing node invokes an MPI receive primitive with a second triple corresponding to the first triple as an interface parameter to process the first data flow diagram parameter, the second triple being generated according to a second graph data structure in the second computing node by using a second interface parameter generation algorithm, the second interface parameter generation algorithm being the same as the first interface parameter generation algorithm.
8. The method according to claim 7, characterized in that, in the aspect of invoking the message passing interface (MPI) send primitive with the first triple as an interface parameter to send the first data flow diagram parameter to the second computing node, the method comprises: reading, with the first triple as the interface parameter, the first data flow diagram parameter from a host memory of the first computing node by using the MPI send primitive, so as to send the first data flow diagram parameter to the second computing node.
9. The method according to claim 8, characterized in that the first computing node further stores information about a storage device in which the first data flow diagram parameter is located, and the method further comprises: when the information about the storage device indicates another storage device, copying the first data flow diagram parameter from the other storage device to the host memory of the first computing node, the other storage device being a memory in the first computing node other than the host memory.
10. The method according to any one of claims 7 to 9, characterized in that the first interface parameter generation algorithm comprises a first algorithm, a second algorithm, and a third algorithm, and, in the aspect of generating the first triple according to the name, the size, and the communication peer identifier of the first data flow diagram parameter in the first graph data structure by using the first interface parameter generation algorithm, the method comprises: determining the message tag in the first triple according to the name of the first data flow diagram parameter in the first graph data structure and the first algorithm; determining the message size in the first triple according to the size of the first data flow diagram parameter in the first graph data structure and the second algorithm; and determining the destination process sequence number in the first triple according to the communication peer identifier of the first data flow diagram parameter in the first graph data structure and the third algorithm.
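The decomposition into a first, second, and third algorithm in claims 4 and 10 could be realized as three independent mappings. The concrete rules below are assumptions for illustration only — a bounded CRC hash for the tag, the identity function for the size, and a rank lookup table for the peer identifier are hypothetical choices, not the patent's algorithms:

```python
import zlib

def first_algorithm(name: str) -> int:
    """Name -> MPI message tag, bounded to a valid tag range (assumed rule)."""
    return zlib.crc32(name.encode()) % 32768

def second_algorithm(size: int) -> int:
    """Parameter size -> MPI message size (identity here)."""
    return size

def third_algorithm(peer_id: str, rank_table: dict) -> int:
    """Communication peer identifier -> process sequence number (rank)."""
    return rank_table[peer_id]

ranks = {"node-a": 0, "node-b": 1}
# Sender side builds (tag, size, destination rank) from its graph record ...
first_triple = (first_algorithm("w_grad"), second_algorithm(4096),
                third_algorithm("node-b", ranks))
# ... receiver side builds (tag, size, source rank) from its own record.
second_triple = (first_algorithm("w_grad"), second_algorithm(4096),
                 third_algorithm("node-a", ranks))
assert first_triple[:2] == second_triple[:2]
```

Because all three sub-algorithms are deterministic functions of data both graph data structures already hold, the two nodes arrive at matching interface parameters independently.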
11. A data transmission apparatus in a distributed computing system, the distributed computing system comprising a first computing node and a second computing node, characterized in that the data transmission apparatus is located in the first computing node and comprises:
a determining module, configured to determine, from a first graph data structure in the first computing node, a name, a size, and a communication peer identifier of a first data flow diagram parameter of a first data flow diagram, the first data flow diagram parameter being a parameter carried on a connection edge of the first data flow diagram, and the communication peer identifier corresponding to the second computing node;
a generation module, configured to generate a first triple according to the name, the size, and the communication peer identifier of the first data flow diagram parameter in the first graph data structure by using a first interface parameter generation algorithm, the first triple comprising a message tag, a message size, and a destination process sequence number, wherein the message tag corresponds to the name of the first data flow diagram parameter, the message size corresponds to the size of the first data flow diagram parameter, and the destination process sequence number corresponds to a process in the second computing node that receives the first data flow diagram parameter; and
a communication module, configured to invoke a message passing interface (MPI) send primitive with the first triple as an interface parameter, to send the first data flow diagram parameter to the second computing node, so that the second computing node invokes an MPI receive primitive with a second triple corresponding to the first triple as an interface parameter to process the first data flow diagram parameter, the second triple being generated according to a second graph data structure in the second computing node by using a second interface parameter generation algorithm, the second interface parameter generation algorithm being the same as the first interface parameter generation algorithm.
12. The apparatus according to claim 11, characterized in that, in the aspect of invoking the message passing interface (MPI) send primitive with the first triple as an interface parameter to send the first data flow diagram parameter to the second computing node, the communication module is configured to read, with the first triple as the interface parameter, the first data flow diagram parameter from a host memory of the first computing node by using the MPI send primitive, so as to send the first data flow diagram parameter to the second computing node.
13. The apparatus according to claim 12, characterized in that the first computing node further stores information about a storage device in which the first data flow diagram parameter is located, the first computing node further comprises a reading module, and the reading module is configured to: when the information about the storage device indicates another storage device, copy the first data flow diagram parameter from the other storage device to the host memory of the first computing node, the other storage device being a memory in the first computing node other than the host memory.
14. The apparatus according to any one of claims 11 to 13, characterized in that the first interface parameter generation algorithm comprises a first algorithm, a second algorithm, and a third algorithm, and, in the aspect of generating the first triple according to the name, the size, and the communication peer identifier of the first data flow diagram parameter in the first graph data structure by using the first interface parameter generation algorithm, the generation module is configured to: determine the message tag in the first triple according to the name of the first data flow diagram parameter in the first graph data structure and the first algorithm; determine the message size in the first triple according to the size of the first data flow diagram parameter in the first graph data structure and the second algorithm; and determine the destination process sequence number in the first triple according to the communication peer identifier of the first data flow diagram parameter in the first graph data structure and the third algorithm.
15. A physical machine, comprising at least one processor and a non-transitory computer-readable medium storing executable code, to run a first computing node in a distributed computing system, the distributed computing system comprising the first computing node and a second computing node, characterized in that, when the executable code is executed by the at least one processor, the at least one processor is configured to perform the method according to any one of claims 7 to 10.
16. A non-transitory computer-readable medium storing an executable program, the executable program comprising a program for performing the method according to any one of claims 7 to 10.
17. A data transmission method in a distributed computing system, the distributed computing system comprising a first computing node and a second computing node, characterized in that the method comprises:
determining, from a second graph data structure in the second computing node, a name, a size, and a communication peer identifier of a first data flow diagram parameter in a second data flow diagram, the communication peer identifier of the first data flow diagram parameter in the second data flow diagram corresponding to the first computing node;
generating a second triple according to the name, the size, and the communication peer identifier of the first data flow diagram parameter in the second graph data structure by using a second interface parameter generation algorithm, the second triple comprising a message tag, a message size, and a source process sequence number, wherein the message tag corresponds to the name of the first data flow diagram parameter, the message size corresponds to the size of the first data flow diagram parameter, and the source process sequence number corresponds to a process in the first computing node that sends the first data flow diagram parameter; and
invoking, according to the second triple, a message passing interface (MPI) receive primitive to process the first data flow diagram parameter from the first computing node, the first data flow diagram parameter being sent by the first computing node by using an MPI send primitive, an interface parameter of the MPI send primitive comprising a first triple corresponding to the second triple, the first triple being generated by the first computing node according to a first graph data structure in the first computing node by using a first interface parameter generation algorithm, and the second interface parameter generation algorithm being the same as the first interface parameter generation algorithm.
18. The method according to claim 17, characterized in that the second computing node runs a first thread and a second thread, a host memory of the second computing node comprises a data buffer, and the data buffer is dedicated to storing data processed by MPI primitives; in the aspect of invoking the message passing interface (MPI) receive primitive with the second triple as an interface parameter to process the first data flow diagram parameter from the first computing node, the method comprises:
detecting, by the first thread by using an MPI probe primitive, the data buffer in the host memory, to obtain the second triple;
invoking, by the first thread according to the second triple in the data buffer, a first MPI receive primitive, to process the first data flow diagram parameter, the second triple in the data buffer being obtained by the second computing node according to the MPI send primitive; and
after determining that the first data flow diagram parameter has been processed by the first MPI receive primitive, modifying, by the second thread, a second MPI receive primitive into an MPI wait primitive, the second MPI receive primitive being a receive primitive that has not been executed by the second thread and that corresponds to the first data flow diagram parameter, an interface parameter of the second MPI receive primitive comprising the second triple generated by the second computing node, and the MPI wait primitive being used to wait for the first MPI receive primitive to finish execution.
19. The method according to claim 18, characterized in that, in the aspect of invoking, by the first thread according to the second triple in the data buffer, the first MPI receive primitive to process the first data flow diagram parameter, the method comprises: when a destination address of the first data flow diagram parameter corresponds to a memory space in the host memory of the second computing node that is allocated for user use, invoking, by the first thread, the first MPI receive primitive with the second triple in the data buffer as an interface parameter of the first MPI receive primitive, to store the first data flow diagram parameter from the data buffer to the destination address of the first data flow diagram parameter.
20. The method according to claim 17, characterized in that the method further comprises: when a destination address of the first data flow diagram parameter corresponds to another storage device, storing, by the second computing node, the first data flow diagram parameter from the host memory to the destination address, the other storage device being a memory in the second computing node other than the host memory.
21. The method according to any one of claims 17 to 20, characterized in that the second interface parameter generation algorithm comprises a first algorithm, a second algorithm, and a third algorithm, and, in the aspect of generating the second triple according to the name, the size, and the communication peer identifier of the first data flow diagram parameter in the second graph data structure by using the second interface parameter generation algorithm, the method comprises: determining the message tag in the second triple according to the name of the first data flow diagram parameter in the second graph data structure and the first algorithm in the second interface parameter generation algorithm; determining the message size in the second triple according to the size of the first data flow diagram parameter in the second graph data structure and the second algorithm in the second interface parameter generation algorithm; and determining the source process sequence number in the second triple according to the communication peer identifier of the first data flow diagram parameter in the second graph data structure and the third algorithm in the second interface parameter generation algorithm.
22. A data transmission apparatus in a distributed computing system, the distributed computing system comprising a first computing node and a second computing node, the data transmission apparatus being located in the second computing node, characterized in that the data transmission apparatus comprises:
a determining module, configured to determine, from a second graph data structure in the second computing node, a name, a size, and a communication peer identifier of a first data flow diagram parameter in a second data flow diagram, the communication peer identifier of the first data flow diagram parameter in the second data flow diagram corresponding to the first computing node;
a generation module, configured to generate a second triple according to the name, the size, and the communication peer identifier of the first data flow diagram parameter in the second graph data structure by using a second interface parameter generation algorithm, the second triple comprising a message tag, a message size, and a source process sequence number, wherein the message tag corresponds to the name of the first data flow diagram parameter, the message size corresponds to the size of the first data flow diagram parameter, and the source process sequence number corresponds to a process in the first computing node that sends the first data flow diagram parameter; and
a communication module, configured to invoke, according to the second triple, a message passing interface (MPI) receive primitive to process the first data flow diagram parameter from the first computing node, the first data flow diagram parameter being sent by the first computing node by using an MPI send primitive, an interface parameter of the MPI send primitive comprising a first triple corresponding to the second triple, the first triple being generated by the first computing node according to a first graph data structure in the first computing node by using a first interface parameter generation algorithm, and the second interface parameter generation algorithm being the same as the first interface parameter generation algorithm.
23. The data transmission apparatus according to claim 22, characterized in that the communication module comprises a first thread and a second thread, a host memory of the second computing node comprises a data buffer, and the data buffer is dedicated to storing data processed by MPI primitives; in the aspect of invoking the message passing interface (MPI) receive primitive with the second triple as an interface parameter to process the first data flow diagram parameter from the first computing node:
the first thread is configured to detect, by using an MPI probe primitive, the data buffer in the host memory, to obtain the second triple;
the first thread is configured to invoke, according to the second triple in the data buffer, a first MPI receive primitive, to process the first data flow diagram parameter, the second triple in the data buffer being obtained by the second computing node according to the MPI send primitive; and
the second thread is configured to: after determining that the first data flow diagram parameter has been processed by the first MPI receive primitive, modify a second MPI receive primitive into an MPI wait primitive, the second MPI receive primitive being a receive primitive that has not been executed by the second thread and that corresponds to the first data flow diagram parameter, an interface parameter of the second MPI receive primitive comprising the second triple generated by the second computing node, and the MPI wait primitive being used to wait for the first MPI receive primitive to finish execution.
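The two-thread coordination described in claims 18 and 23 can be sketched as a single-process simulation. All names here are hypothetical (the `pending` table of not-yet-executed receive primitives, the string markers `"MPI_Recv"`/`"MPI_Wait"`, and the dict standing in for the data buffer); the sketch only shows the control flow: one thread probes and receives early, and the other thread, finding its receive already satisfied, downgrades it to a wait.

```python
import threading

pending = {"recv_w_grad": "MPI_Recv"}   # receive primitives not yet executed
done = threading.Event()
result = {}

def first_thread(buffer):
    # Probe the dedicated data buffer, then issue the first MPI receive.
    triple = next(iter(buffer))
    result["param"] = buffer.pop(triple)  # simulated first MPI receive primitive
    done.set()

def second_thread():
    # The parameter was already received by the first thread, so replace the
    # not-yet-executed second receive primitive with a wait on the first one.
    done.wait()
    pending["recv_w_grad"] = "MPI_Wait"

buf = {(17, 4096, 0): b"\x01" * 4096}
t1 = threading.Thread(target=first_thread, args=(buf,))
t2 = threading.Thread(target=second_thread)
t2.start(); t1.start(); t1.join(); t2.join()
assert pending["recv_w_grad"] == "MPI_Wait" and len(result["param"]) == 4096
```

The design point this illustrates is avoiding a duplicate receive: once the probing thread has consumed the message, the originally scheduled receive must not run again, so it is converted into a wait for the completed receive.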
24. The apparatus according to claim 23, characterized in that, in the aspect of invoking the first MPI receive primitive according to the second triple in the data buffer to process the first data flow diagram parameter, the first thread is configured to: when a destination address of the first data flow diagram parameter corresponds to a memory space in the host memory of the second computing node that is allocated for user use, invoke the first MPI receive primitive with the second triple in the data buffer as an interface parameter of the first MPI receive primitive, to store the first data flow diagram parameter from the data buffer to the destination address of the first data flow diagram parameter.
25. The data transmission device according to claim 22, wherein the data transmission device further includes a storage module,
and the storage module is configured to: in a case in which the destination address of the first data flow graph parameter corresponds to another storage device,
store the first data flow graph parameter from the host memory into the destination address, where the other storage device is a
memory in the second computing node other than the host memory.
26. The data transmission device according to any one of claims 22 to 25, wherein the second interface parameter generation
algorithm includes a first algorithm, a second algorithm, and a third algorithm; and in the aspect of generating the second triple by using the second interface parameter generation algorithm according to a name, a size, and a
communication peer identifier of the first data flow graph parameter in the second graph data structure, the
generation module is configured to: determine a message tag in the second triple according to the name of the first data flow graph parameter in the second graph data structure and the
first algorithm in the second interface parameter generation algorithm; determine a message size in the second triple according to the size of the first data flow graph parameter in the second graph data
structure and the second algorithm in the second interface parameter generation algorithm; and determine a source process
sequence number in the second triple according to the communication peer identifier of the first data flow graph parameter in the second graph data structure and the
third algorithm in the second interface parameter generation algorithm.
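A minimal sketch of claim 26's triple generation, assuming the first/second/third algorithms may be any deterministic functions shared by both nodes. The md5-based tag derivation and the modulo rank mapping below are illustrative choices of mine, not the patent's algorithms; what matters is that both computing nodes run the identical algorithm over their own copies of the graph data structure, so the triples agree without being exchanged.

```python
import hashlib

def generate_triple(name, size, peer_id, nprocs):
    # First algorithm (assumed): hash the parameter name to a message tag;
    # any deterministic hash works as long as both nodes use the same one.
    message_tag = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")
    # Second algorithm (assumed): the message size is the parameter size itself.
    message_size = size
    # Third algorithm (assumed): map the communication peer identifier to a
    # source process sequence number, here by a simple modulo over the ranks.
    source_rank = peer_id % nprocs
    return (message_tag, message_size, source_rank)

# Sender and receiver compute the triple independently and get the same result.
sender_side = generate_triple("layer1/weights", 4096, peer_id=3, nprocs=2)
receiver_side = generate_triple("layer1/weights", 4096, peer_id=3, nprocs=2)
```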
27. A physical machine, wherein the physical machine includes at least one processor and a non-transitory computer-readable
medium storing executable code, to run a second computing node in a distributed computing system, wherein the distributed computing system includes a first computing node
and the second computing node; and when executed by a processor of the at least one processor, the executable code is configured
to perform the method according to any one of claims 17 to 21.
28. A non-transitory computer-readable medium storing an executable program, wherein the executable program includes a program for performing
the method according to any one of claims 17 to 21.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710769632.8A CN109426574B (en) | 2017-08-31 | 2017-08-31 | Distributed computing system, data transmission method and device in distributed computing system |
CN202210298513.XA CN114880133A (en) | 2017-08-31 | 2017-08-31 | Distributed computing system, data transmission method and device in distributed computing system |
PCT/CN2018/102919 WO2019042312A1 (en) | 2017-08-31 | 2018-08-29 | Distributed computing system, data transmission method and device in distributed computing system |
EP18850809.7A EP3667496B1 (en) | 2017-08-31 | 2018-08-29 | Distributed computing system, data transmission method and device in distributed computing system |
US16/805,007 US11010681B2 (en) | 2017-08-31 | 2020-02-28 | Distributed computing system, and data transmission method and apparatus in distributed computing system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710769632.8A CN109426574B (en) | 2017-08-31 | 2017-08-31 | Distributed computing system, data transmission method and device in distributed computing system |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210298513.XA Division CN114880133A (en) | 2017-08-31 | 2017-08-31 | Distributed computing system, data transmission method and device in distributed computing system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109426574A true CN109426574A (en) | 2019-03-05 |
CN109426574B CN109426574B (en) | 2022-04-05 |
Family
ID=65505134
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210298513.XA Pending CN114880133A (en) | 2017-08-31 | 2017-08-31 | Distributed computing system, data transmission method and device in distributed computing system |
CN201710769632.8A Active CN109426574B (en) | 2017-08-31 | 2017-08-31 | Distributed computing system, data transmission method and device in distributed computing system |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210298513.XA Pending CN114880133A (en) | 2017-08-31 | 2017-08-31 | Distributed computing system, data transmission method and device in distributed computing system |
Country Status (4)
Country | Link |
---|---|
US (1) | US11010681B2 (en) |
EP (1) | EP3667496B1 (en) |
CN (2) | CN114880133A (en) |
WO (1) | WO2019042312A1 (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111612528A (en) * | 2020-04-30 | 2020-09-01 | *** Communications Group Jiangsu Co., Ltd. | Method, device and equipment for determining user classification model and storage medium |
CN112231113A (en) * | 2020-09-18 | 2021-01-15 | 苏州浪潮智能科技有限公司 | Message transmission method and device |
CN112448898A (en) * | 2019-08-28 | 2021-03-05 | 无锡江南计算技术研究所 | Message order-preserving method based on sequence number mechanism |
CN112506677A (en) * | 2020-12-09 | 2021-03-16 | 上海交通大学 | TensorFlow distributed matrix calculation implementation method and system |
CN112966279A (en) * | 2021-02-08 | 2021-06-15 | 北京金山云网络技术有限公司 | Distributed data processing method and system |
CN113127491A (en) * | 2021-04-28 | 2021-07-16 | 深圳市邦盛实时智能技术有限公司 | Flow graph dividing system based on correlation characteristics |
CN113190528A (en) * | 2021-04-21 | 2021-07-30 | 中国海洋大学 | Parallel distributed big data architecture construction method and system |
WO2021175226A1 (en) * | 2020-03-06 | 2021-09-10 | 华为技术有限公司 | Fault recovery method for ring network, and physical node |
CN113553039A (en) * | 2020-04-23 | 2021-10-26 | 杭州海康威视数字技术股份有限公司 | Method and device for generating executable code of operator |
CN113746873A (en) * | 2020-05-27 | 2021-12-03 | 华为技术有限公司 | Abnormal node processing method in ring network and related equipment |
CN114244755A (en) * | 2021-12-15 | 2022-03-25 | 北京恒安嘉新安全技术有限公司 | Asset detection method, device, equipment and storage medium |
CN115809092A (en) * | 2023-02-13 | 2023-03-17 | 湖南大学 | Deep learning calculation library implementation method based on MT3000 heterogeneous processor |
CN115934385A (en) * | 2023-02-08 | 2023-04-07 | 苏州浪潮智能科技有限公司 | Method, system, equipment and storage medium for communication among multiple cores |
WO2023184834A1 (en) * | 2022-03-31 | 2023-10-05 | 深圳清华大学研究院 | Collective communication optimization method for global high-degree vertices, and application |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11474974B2 (en) | 2018-12-21 | 2022-10-18 | Home Box Office, Inc. | Coordinator for preloading time-based content selection graphs |
US11474943B2 (en) | 2018-12-21 | 2022-10-18 | Home Box Office, Inc. | Preloaded content selection graph for rapid retrieval |
US11829294B2 (en) | 2018-12-21 | 2023-11-28 | Home Box Office, Inc. | Preloaded content selection graph generation |
US11475092B2 (en) | 2018-12-21 | 2022-10-18 | Home Box Office, Inc. | Preloaded content selection graph validation |
US11204924B2 (en) * | 2018-12-21 | 2021-12-21 | Home Box Office, Inc. | Collection of timepoints and mapping preloaded graphs |
US11269768B2 (en) | 2018-12-21 | 2022-03-08 | Home Box Office, Inc. | Garbage collection of preloaded time-based graph data |
EP3699770A1 (en) * | 2019-02-25 | 2020-08-26 | Mellanox Technologies TLV Ltd. | Collective communication system and methods |
US11204745B2 (en) * | 2019-05-23 | 2021-12-21 | Xilinx, Inc. | Dataflow graph programming environment for a heterogenous processing system |
US11836485B1 (en) * | 2019-08-19 | 2023-12-05 | Rapid7, Inc. | Software code review |
US11694075B2 (en) * | 2019-09-05 | 2023-07-04 | Alibaba Group Holding Limited | Partitioning control dependency edge in computation graph |
US11876885B2 (en) | 2020-07-02 | 2024-01-16 | Mellanox Technologies, Ltd. | Clock queue with arming and/or self-arming features |
US20200358721A1 (en) * | 2020-07-30 | 2020-11-12 | Intel Corporation | Buffer allocation for parallel processing of data |
US11461143B2 (en) * | 2020-09-08 | 2022-10-04 | Megh Computing, Inc. | Computing resource allocation with subgraph isomorphism |
US11556378B2 (en) | 2020-12-14 | 2023-01-17 | Mellanox Technologies, Ltd. | Offloading execution of a multi-task parameter-dependent operation to a network device |
CN112560184B (en) * | 2020-12-22 | 2023-09-12 | 北京机电工程研究所 | Parallel computing system and method for aircraft simulation model |
US11755543B2 (en) | 2020-12-29 | 2023-09-12 | International Business Machines Corporation | Optimization of workflows with dynamic file caching |
US20230005096A1 (en) * | 2021-06-23 | 2023-01-05 | Nvidia Corporation | Memory allocation using graphs |
CN115525793A (en) * | 2021-06-24 | 2022-12-27 | 平头哥(上海)半导体技术有限公司 | Computer-implemented method, system, and storage medium |
US11582326B1 (en) * | 2021-08-05 | 2023-02-14 | Paypal, Inc. | Scalable messaging framework for providing machine learning services across multiple availability zones |
CN113918351B (en) * | 2021-12-08 | 2022-03-11 | 之江实验室 | Method and device for adapting to distributed training in deep learning framework and AI acceleration card |
US20230244391A1 (en) * | 2022-01-31 | 2023-08-03 | Nvidia Corporation | Graph-based memory storage |
CN114840322B (en) * | 2022-05-17 | 2022-12-09 | 北京百度网讯科技有限公司 | Task scheduling method and device, electronic equipment and storage |
CN115118727B (en) * | 2022-08-26 | 2022-11-29 | 北京数牍科技有限公司 | Data transmission method, device, equipment and storage medium of distributed computing architecture |
US11922237B1 (en) | 2022-09-12 | 2024-03-05 | Mellanox Technologies, Ltd. | Single-step collective operations |
CN115600671B (en) * | 2022-10-20 | 2023-06-20 | 北京百度网讯科技有限公司 | Data processing method, device, equipment and storage medium of deep learning framework |
CN115729688B (en) * | 2022-11-23 | 2023-09-12 | 北京百度网讯科技有限公司 | Multithreading scheduling method and device for processor, electronic equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103970580A (en) * | 2014-05-05 | 2014-08-06 | 华中科技大学 | Data flow compilation optimization method oriented to multi-core cluster |
CN104683488A (en) * | 2015-03-31 | 2015-06-03 | 百度在线网络技术(北京)有限公司 | Flow-type calculation system as well as dispatching method and dispatching device of flow-type calculation system |
CN105843706A (en) * | 2016-03-24 | 2016-08-10 | 华中科技大学 | Dynamic group system for layered rollback recovery protocols based on MPI (Message Passing Interface) high performance computing |
US20160378560A1 (en) * | 2014-02-28 | 2016-12-29 | Pivotal Software, Inc. | Executing a foreign program on a parallel computing system |
CN106293892A (en) * | 2015-06-26 | 2017-01-04 | 阿里巴巴集团控股有限公司 | Distributed stream calculates system, method and apparatus |
CN106506490A (en) * | 2016-11-03 | 2017-03-15 | 深圳智高点知识产权运营有限公司 | A kind of Distributed Calculation control method and distributed computing system |
CN107066241A (en) * | 2010-06-15 | 2017-08-18 | 起元技术有限责任公司 | System and method for dynamically loading graph-based computations |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9201688B2 (en) * | 2010-12-17 | 2015-12-01 | Microsoft Technology Licensing, Llc | Configuration of asynchronous message processing in dataflow networks |
CN102708404B (en) | 2012-02-23 | 2016-08-03 | 北京市计算中心 | A kind of parameter prediction method during MPI optimized operation under multinuclear based on machine learning |
CN106547522B (en) * | 2015-09-17 | 2020-02-14 | 华为技术有限公司 | Method and device for optimizing stream application |
CN105224502A (en) | 2015-09-28 | 2016-01-06 | 浪潮(北京)电子信息产业有限公司 | A kind of degree of depth learning method based on GPU and system |
CN105227669A (en) | 2015-10-15 | 2016-01-06 | 浪潮(北京)电子信息产业有限公司 | A kind of aggregated structure system of CPU and the GPU mixing towards degree of depth study |
US10409560B1 (en) * | 2015-11-18 | 2019-09-10 | Amazon Technologies, Inc. | Acceleration techniques for graph analysis programs |
2017
- 2017-08-31 CN CN202210298513.XA patent/CN114880133A/en active Pending
- 2017-08-31 CN CN201710769632.8A patent/CN109426574B/en active Active
2018
- 2018-08-29 EP EP18850809.7A patent/EP3667496B1/en active Active
- 2018-08-29 WO PCT/CN2018/102919 patent/WO2019042312A1/en unknown
2020
- 2020-02-28 US US16/805,007 patent/US11010681B2/en active Active
Non-Patent Citations (2)
Title |
---|
ABHINAV VISHNU ET AL: "Distributed TensorFlow with MPI", arXiv:1603.02339v1 [cs.CV] * |
Jin Peng: "Fundamentals of Parallel Technology" (《并行技术基础》), 28 February 2011 * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112448898A (en) * | 2019-08-28 | 2021-03-05 | 无锡江南计算技术研究所 | Message order-preserving method based on sequence number mechanism |
WO2021175226A1 (en) * | 2020-03-06 | 2021-09-10 | 华为技术有限公司 | Fault recovery method for ring network, and physical node |
CN113553039A (en) * | 2020-04-23 | 2021-10-26 | 杭州海康威视数字技术股份有限公司 | Method and device for generating executable code of operator |
CN111612528A (en) * | 2020-04-30 | 2020-09-01 | ***通信集团江苏有限公司 | Method, device and equipment for determining user classification model and storage medium |
CN113746873A (en) * | 2020-05-27 | 2021-12-03 | 华为技术有限公司 | Abnormal node processing method in ring network and related equipment |
CN112231113A (en) * | 2020-09-18 | 2021-01-15 | 苏州浪潮智能科技有限公司 | Message transmission method and device |
CN112231113B (en) * | 2020-09-18 | 2023-01-06 | 苏州浪潮智能科技有限公司 | Message transmission method and device |
CN112506677A (en) * | 2020-12-09 | 2021-03-16 | 上海交通大学 | TensorFlow distributed matrix calculation implementation method and system |
CN112966279A (en) * | 2021-02-08 | 2021-06-15 | 北京金山云网络技术有限公司 | Distributed data processing method and system |
CN112966279B (en) * | 2021-02-08 | 2023-11-03 | 北京金山云网络技术有限公司 | Distributed data processing method and system |
CN113190528A (en) * | 2021-04-21 | 2021-07-30 | 中国海洋大学 | Parallel distributed big data architecture construction method and system |
CN113190528B (en) * | 2021-04-21 | 2022-12-06 | 中国海洋大学 | Parallel distributed big data architecture construction method and system |
CN113127491B (en) * | 2021-04-28 | 2022-03-22 | 深圳市邦盛实时智能技术有限公司 | Flow graph dividing system based on correlation characteristics |
CN113127491A (en) * | 2021-04-28 | 2021-07-16 | 深圳市邦盛实时智能技术有限公司 | Flow graph dividing system based on correlation characteristics |
CN114244755A (en) * | 2021-12-15 | 2022-03-25 | 北京恒安嘉新安全技术有限公司 | Asset detection method, device, equipment and storage medium |
CN114244755B (en) * | 2021-12-15 | 2023-11-14 | 北京恒安嘉新安全技术有限公司 | Asset detection method, device, equipment and storage medium |
WO2023184834A1 (en) * | 2022-03-31 | 2023-10-05 | 深圳清华大学研究院 | Collective communication optimization method for global high-degree vertices, and application |
CN115934385A (en) * | 2023-02-08 | 2023-04-07 | 苏州浪潮智能科技有限公司 | Method, system, equipment and storage medium for communication among multiple cores |
CN115809092A (en) * | 2023-02-13 | 2023-03-17 | 湖南大学 | Deep learning calculation library implementation method based on MT3000 heterogeneous processor |
Also Published As
Publication number | Publication date |
---|---|
EP3667496B1 (en) | 2023-05-24 |
EP3667496A1 (en) | 2020-06-17 |
US20200202246A1 (en) | 2020-06-25 |
CN109426574B (en) | 2022-04-05 |
WO2019042312A1 (en) | 2019-03-07 |
CN114880133A (en) | 2022-08-09 |
US11010681B2 (en) | 2021-05-18 |
EP3667496A4 (en) | 2020-09-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109426574A (en) | Distributed computing system, data transmission method and device in distributed computing system | |
US8527739B2 (en) | Iterative process partner pairing scheme for global reduce operation | |
US11237880B1 (en) | Dataflow all-reduce for reconfigurable processor systems | |
US11392740B2 (en) | Dataflow function offload to reconfigurable processors | |
JP5479709B2 (en) | Server-processor hybrid system and method for processing data | |
CN113642734A (en) | Distributed training method and device for deep learning model and computing equipment | |
WO2022133047A1 (en) | Dataflow function offload to reconfigurable processors | |
TWI442248B (en) | Processor-server hybrid system for processing data | |
CN110324204A (en) | A kind of high speed regular expression matching engine realized in FPGA and method | |
CN113312283A (en) | Heterogeneous image learning system based on FPGA acceleration | |
Chu et al. | Dynamic kernel fusion for bulk non-contiguous data transfer on GPU clusters | |
CN115277553B (en) | Stream table storage method, device, equipment and computer readable storage medium | |
CN108140014A (en) | Create and use the system and method for the data structure for multiple programming | |
CN110417860A (en) | File transfer management method, apparatus, equipment and storage medium | |
TWI784845B (en) | Dataflow function offload to reconfigurable processors | |
Brasilino et al. | Data Distillation at the Network's Edge: Exposing Programmable Logic with InLocus | |
CN115687233A (en) | Communication method, device, equipment and computer readable storage medium | |
US11467836B2 (en) | Executing cross-core copy instructions in an accelerator to temporarily store an operand that cannot be accommodated by on-chip memory of a primary core into a secondary core | |
CN117614906B (en) | Method, computer device and medium for multi-thread multi-representation oral package | |
Pickartz et al. | Swift: A transparent and flexible communication layer for pcie-coupled accelerators and (co-) processors | |
CN115392471A (en) | Data synchronization method of quantum circuit in cluster environment and quantum simulation system | |
Kalms et al. | ArcvaVX: OpenVX Framework for Adaptive Reconfigurable Computer Vision Architectures |
CN111860788A (en) | Neural network computing system and method based on data flow architecture | |
WO2023147094A1 (en) | Code compilation for dynamic peer-to-peer networked code execution | |
Pelosi | CAST: a Declarative Language and its Execution Platform for Large-Scale Cloud Simulations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |