WO2024001850A1 - Data processing system, method, device and controller - Google Patents

Data processing system, method, device and controller

Info

Publication number
WO2024001850A1
WO2024001850A1 (PCT/CN2023/101171, CN2023101171W)
Authority
WO
WIPO (PCT)
Prior art keywords
node
data processing
data
nodes
identifier
Prior art date
Application number
PCT/CN2023/101171
Other languages
English (en)
French (fr)
Inventor
周轶刚
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2024001850A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake

Definitions

  • the present application relates to the field of data processing, and in particular, to a data processing system, method, device and controller.
  • Currently, multiple nodes are interconnected into a high-performance cluster through high-bandwidth, low-latency inter-chip interconnect buses and switches, commonly known as a super node.
  • Super nodes can provide higher computing power; compared with Ethernet interconnection bandwidth, super nodes can provide greater bandwidth.
  • super nodes are usually configured in a static mode and cannot be flexibly expanded to adapt to the computing needs of different application scenarios.
  • This application provides a data processing system, method, device and controller, thereby enabling flexible expansion of the scale of super nodes and adapting to computing needs in different application scenarios.
  • In a first aspect, a data processing system is provided, which includes a plurality of nodes and a controller.
  • the plurality of nodes includes a first node and a second node.
  • the plurality of nodes and the controller are connected through high-speed interconnection links.
  • a controller, configured to allocate a second node identifier to the second node when the second node requests access to the data processing system, where the global physical address of the second node in the data processing system includes the second node identifier and the physical address of the second node; the controller is also used to send the global physical address of the second node to the first node.
  • this application controls the access of new nodes through the controller, and can flexibly expand the scale of the data processing system on demand, for example, supporting access of accelerators and increasing the storage space of the global memory, so as to meet the computing needs of different application scenarios.
  • the data processing system further includes an interconnection device, which connects multiple nodes based on a high-speed interconnection link; the controller is also configured to send, to the interconnection device, the correspondence between the second node identifier and a port, where the port corresponding to the second node identifier is used to forward messages to the second node.
  • the interconnection device can also be called an interconnection chip or a switching chip.
  • the interconnection device is configured to forward a message from the first node to the second node based on the corresponding relationship between the second node identifier and the port. For example, the interconnection device stores a correspondence between the first node identifier and the first port, and the interconnection device forwards a message for the second node to access the first node based on the correspondence between the first node identifier and the first port.
  • the interconnection device stores a correspondence between the second node identifier and the second port, and the interconnection device forwards a message from the first node to the second node based on the correspondence between the second node identifier and the second port.
  • the interconnected devices forward the data communicated between nodes based on the node identification, so that the accelerator resources in the system can be shared by multiple nodes to adapt to the computing needs of different application scenarios.
  • the global physical address refers to an address that can be accessed by any node among multiple nodes included in the data processing system. Any node among the multiple nodes included in the data processing system stores the global physical address of other nodes, so that any node can access the storage space of other nodes based on the global physical addresses of other nodes.
  • the global physical address is used to uniquely indicate the storage space of a node in a data processing system. Understandably, the global physical address includes the node identification and the physical address of the node. Since the node identification is used to uniquely indicate a node in the data processing system, the physical address in the global physical address is used to uniquely indicate the storage space of a node in the data processing system.
  • the physical address of a node refers to the address of the storage space within the node. Although the physical addresses of storage spaces in different nodes may be the same, any node in the data processing system distinguishes the storage spaces in different nodes according to the node identifier in the global physical address.
  • the global physical address of the first node includes a first node identifier and a physical address of the first node. Since the first node identifier is used to uniquely indicate the first node, and the physical address of the first node is used to uniquely indicate the storage space of the first node, the global physical address of the first node can be used to indicate the storage space of the first node.
  • the global physical address of the second node includes a second node identifier and a physical address of the second node. Since the second node identifier is used to uniquely indicate the second node, and the physical address of the second node is used to uniquely indicate the storage space of the second node, the global physical address of the second node can be used to indicate the storage space of the second node.
  • the first node can access the storage space of the second node according to the global physical address of the second node.
  • the second node can access the storage space of the first node according to the global physical address of the first node.
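  • As an illustrative sketch (not part of the claims), the global physical address described above can be viewed as a node identifier concatenated with an intra-node physical address; the 12-bit/52-bit split below is borrowed from the descriptor field widths given later in this description, and the exact bit layout is an assumption:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: a 64-bit global physical address (GPA) built from a
 * 12-bit node identifier and a 52-bit intra-node physical address. The bit
 * layout is an assumption made only for this example. */
#define NODE_ID_BITS   12
#define PHYS_ADDR_BITS 52
#define PHYS_ADDR_MASK ((1ULL << PHYS_ADDR_BITS) - 1)

typedef uint64_t gpa_t;

static gpa_t gpa_make(uint16_t node_id, uint64_t phys_addr) {
    /* Node identifier in the upper 12 bits, physical address in the lower 52. */
    return ((gpa_t)(node_id & 0xFFF) << PHYS_ADDR_BITS) |
           (phys_addr & PHYS_ADDR_MASK);
}

static uint16_t gpa_node_id(gpa_t gpa)   { return (uint16_t)(gpa >> PHYS_ADDR_BITS); }
static uint64_t gpa_phys_addr(gpa_t gpa) { return gpa & PHYS_ADDR_MASK; }

int main(void) {
    /* Hypothetical values: node identifier 2, physical address 0x1000. */
    gpa_t gpa = gpa_make(2, 0x1000);
    printf("node id = %u, physical address = 0x%llx\n",
           gpa_node_id(gpa), (unsigned long long)gpa_phys_addr(gpa));
    return 0;
}
```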
  • the second node is used to obtain the data to be processed according to the source address indicated by the first node, where the source address is used to indicate the node identifier and physical address of the node that stores the data to be processed;
  • the second node is also used to process the data to be processed, and store the processed data according to the destination address indicated by the first node.
  • the destination address is used to indicate the node ID and physical address of the node where the processed data is stored.
  • the data processing request of a node is extended to different nodes in the system, and the data to be processed is obtained, or the processed data is stored, according to the physical address of the node uniquely indicated by the node identifier in the global physical address, so that the accelerator resources in the system can be shared by multiple nodes, adapting to the computing needs of different application scenarios.
  • the source address is used to indicate the global physical address of the first node.
  • the destination address is used to indicate the global physical address of the second node.
  • the second node is specifically configured to perform an acceleration operation on the data to be processed according to the operation identifier indicated by the first node to obtain processed data.
  • the second node includes any one of a processor, an accelerator and a memory controller.
  • the accelerator processes jobs with higher computing requirements (such as HPC jobs, big data jobs, database jobs, etc.) to solve the problem that the floating-point computing power of general-purpose processors is insufficient and cannot meet the heavy floating-point computing requirements of HPC, AI and other scenarios, thereby shortening data processing time, reducing system energy consumption, and improving system performance.
  • Accelerators can also be integrated inside nodes. Independently deployed accelerators and nodes integrating accelerators support flexible plugging and unplugging, and can flexibly expand the scale of the data processing system on demand to meet the computing needs of different application scenarios.
  • the storage media of multiple nodes are uniformly addressed to form a global memory pool.
  • the global memory pool includes the storage space indicated by the source address or/and the storage space indicated by the destination address. Therefore, during data processing, the node reads data from the global memory pool or writes data to the global memory pool to increase the speed of data processing.
  • the first node is also used to access the storage space of the first node according to the physical address of the first node.
  • the second node is also used to access the storage space of the second node according to the physical address of the second node.
  • the controller is also used to control the first node to age the global physical address of the second node when the second node exits the data processing system, and to control the interconnection device to age the correspondence between the second node identifier and the port.
  • a controller and an interconnection device are set up in the data processing system, so nodes can be flexibly added and removed, achieving an elastically scalable super-node architecture; this not only solves the problem that a traditional super-node architecture cannot dynamically expand, but also avoids the problems of limited scale and low bandwidth of a traditional IO bus architecture, and supports a dynamic fault-tolerance mechanism in the event of node or interconnection device failure.
  • In a second aspect, a data processing method is provided, applied to a data processing system that includes multiple nodes.
  • the multiple nodes include a first node and a second node.
  • the multiple nodes and the controller are connected through a high-speed interconnection link.
  • the method includes: when the second node requests access to the data processing system, the controller allocates a second node identifier to the second node, where the global physical address of the second node in the data processing system includes the second node identifier and the physical address of the second node; the controller sends the global physical address of the second node to the first node.
  • the data processing system further includes an interconnection device, and the interconnection device connects multiple nodes based on high-speed interconnection links.
  • the method also includes: the controller sends the corresponding relationship between the second node identifier and the port to the interconnected device, and the port corresponding to the second node identifier is used to forward the message to the second node.
  • the interconnection device forwards the message that the first node accesses the second node based on the corresponding relationship between the second node identifier and the port, and the second node identifier is used to uniquely indicate the second node.
  • the method also includes: when the second node exits the data processing system, the controller controls the first node to age the global physical address of the second node, and controls the interconnection device to age the correspondence between the second node identifier and the port.
  • the method further includes: the second node obtains the data to be processed according to the source address indicated by the first node, processes the data to be processed, and stores the processed data according to the destination address indicated by the first node.
  • the destination address is used to indicate the node ID and physical address of the node where the processed data is stored.
  • the source address is used to indicate the node identification and physical address of the node where the data to be processed is stored.
  • In a third aspect, a control device is provided, which includes various modules for performing the method performed by the controller in the second aspect or any possible design of the second aspect.
  • a fourth aspect provides a data transmission device, which includes various modules for performing the method performed by the interconnected device in the second aspect or any possible design of the second aspect.
  • In a fifth aspect, a data processing node is provided, which includes various modules for executing the method performed by the node in the second aspect or any possible design of the second aspect.
  • In a sixth aspect, a controller is provided, which includes at least one processor and a memory, where the memory is used to store a set of computer instructions; when the processor, acting as the controller in the second aspect or any possible implementation of the second aspect, executes the set of computer instructions, the operational steps of the data processing method in the second aspect or any possible implementation of the second aspect are performed.
  • a chip is provided, including a processor and a power supply circuit, where the power supply circuit is used to supply power to the processor, and the processor is used to execute the operation steps of the data processing method in the second aspect or any possible implementation of the second aspect.
  • a computer-readable storage medium is provided, including computer software instructions; when the computer software instructions are run in a computing device, the computing device is caused to execute the operation steps of the method in the second aspect or any of the possible implementations of the second aspect.
  • a computer program product is provided.
  • when the computer program product is run on a computer, it causes the computing device to perform the operation steps of the method described in the second aspect or any possible implementation of the second aspect.
  • Figure 1 is a schematic architectural diagram of a data processing system provided by this application.
  • Figure 2 is a schematic diagram of a deployment scenario of a global memory pool provided by this application.
  • Figure 3 is a schematic flow chart of a node access data processing system method provided by this application.
  • Figure 4 is a schematic flow chart of a data processing method provided by this application.
  • Figure 5 is a schematic structural diagram of a descriptor provided by this application.
  • Figure 6 is a schematic flow chart of a method for node exiting the data processing system provided by this application.
  • FIG. 7 is a schematic structural diagram of a control device provided by this application.
  • Figure 8 is a schematic structural diagram of a data processing node provided by this application.
  • Figure 9 is a schematic structural diagram of a computing device provided by this application.
  • Super Node refers to interconnecting multiple nodes into a high-performance cluster through high-bandwidth, low-latency inter-chip interconnect buses and switches.
  • the scale of the supernode is larger than the node scale under the Cache-Coherent Non-Uniform Memory Access (CC-NUMA) architecture, and the interconnection bandwidth of the nodes within the supernode is larger than the Ethernet interconnection bandwidth.
  • High Performance Computing (HPC) cluster refers to a computer cluster system.
  • HPC clusters contain multiple computers connected together using various interconnect technologies.
  • the interconnection technology may be, for example, InfiniBand, Remote Direct Memory Access over Converged Ethernet (RoCE) or Transmission Control Protocol (TCP).
  • HPC provides ultra-high floating-point computing capabilities and can be used to solve the computing needs of computing-intensive and massive data processing services.
  • the combined computing power of multiple computers connected together can handle large computing problems. For example, HPC clusters are used to solve the large-scale computing problems and computing needs involved in industries such as scientific research, weather forecasting, finance, simulation experiments, biopharmaceuticals, gene sequencing and image processing. Using HPC clusters to handle large-scale computing problems can effectively shorten the computing time for processing data and improve computing accuracy.
  • Memory operation instructions can be called memory semantics or memory operation functions.
  • Memory operation instructions include at least one of memory allocation (malloc), memory set (memset), memory copy (memcpy), memory move (memmove), memory release (memory release) and memory comparison (memcmp).
  • Memory allocation is used to allocate a section of memory to support application running.
  • Memory settings are used to set the data mode of the global memory pool, such as initialization.
  • Memory copy is used to copy the data stored in the storage space indicated by the source address (source) to the storage space indicated by the destination address (destination).
  • Memory movement is used to copy the data stored in the storage space indicated by the source address (source) to the storage space indicated by the destination address (destination), and delete the data stored in the storage space indicated by the source address (source).
  • Memory comparison is used to compare whether the data stored in two storage spaces are equal.
  • Memory release is used to release data stored in memory to improve the utilization of system memory resources and thereby improve system performance.
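  • For illustration only, the memory operation instructions listed above correspond to the familiar C library memory functions; a minimal local sketch follows (a real global memory pool would expose analogous memory-semantic interfaces across nodes):

```c
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

/* Illustrative sketch using the standard C memory functions that correspond
 * to the memory operation instructions listed above. */
int main(void) {
    char *src = malloc(16);           /* memory allocation                  */
    char *dst = malloc(16);
    if (!src || !dst) return 1;

    memset(src, 0, 16);               /* memory set: initialize the buffer  */
    snprintf(src, 16, "node data");
    memcpy(dst, src, 16);             /* memory copy: source -> destination */
    memmove(dst, dst + 5, 5);         /* memory move: overlapping copy      */

    /* memory comparison: check whether two buffers hold equal data */
    printf("equal: %d\n", memcmp(src, src, 16) == 0);

    free(src);                        /* memory release */
    free(dst);
    return 0;
}
```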
  • Broadcast communication refers to a transmission method in which, when data packets are transmitted in a computer network, the destination address indicates the devices in the broadcast domain of the computer network.
  • Data packets sent in broadcast communication may be called broadcast messages.
  • this application provides a data processing system including a controller and multiple nodes connected through an interconnection device.
  • the multiple nodes include a first node and a second node.
  • the controller assigns a second node identifier to the second node, and the interconnection device forwards messages from the first node to the second node based on the second node identifier, where the second node identifier is used to uniquely indicate the second node.
  • the second node obtains the data to be processed according to the source address indicated by the first node, processes the data to be processed, and stores the processed data according to the destination address indicated by the first node.
  • the scale of the data processing system can be flexibly expanded on demand, and the data processing requests of the nodes can be expanded to different nodes in the system, so that the accelerator resources in the system can be shared by multiple nodes to adapt to the computing needs of different application scenarios.
  • FIG. 1 is a schematic architectural diagram of a data processing system provided by this application.
  • data processing system 100 is an entity that provides high performance computing.
  • Data processing system 100 includes a plurality of nodes 110 .
  • Node 110 may be a processor, a server, a desktop computer, a controller and memory of a storage array, or the like.
  • the processor can be a central processing unit (CPU), a graphics processing unit (GPU), a data processing unit (DPU), a neural processing unit (NPU), and an embedded processor.
  • nodes 110 may include compute nodes and storage nodes.
  • When the node 110 is an XPU with high computing power for data processing, such as a GPU, DPU or NPU, the node 110 can be used as an accelerator: the tasks of a general-purpose processor (such as a CPU) are offloaded to the accelerator, and the accelerator processes jobs with high computing requirements (such as HPC jobs, big data jobs, database jobs, etc.), solving the problem that the floating-point computing power of general-purpose processors is insufficient to meet the heavy floating-point computing needs of HPC, AI and other scenarios, thereby shortening data processing time, reducing system energy consumption, and improving system performance.
  • the computing power of a node can also be called the computing power of the node.
  • An accelerator can also be integrated inside node 110. Independently deployed accelerators and nodes integrating accelerators support flexible plugging and unplugging, and can flexibly expand the scale of the data processing system on demand to meet the computing needs of different application scenarios.
  • an interconnection device 120 (such as a switch) connects multiple nodes 110 based on high-speed interconnection links.
  • the interconnection device 120 connects multiple nodes 110 through fiber optics, copper cables, or copper wires.
  • Interconnect devices may be called switching chips or interconnect chips.
  • the data processing system 100 composed of multiple nodes 110 connected by the interconnection device 120 based on a high-speed interconnection link may also be called a super node.
  • Multiple supernodes are connected through a data center network.
  • the data center network includes multiple core switches and multiple aggregation switches. Data center networks can form a scale domain. Multiple supernodes can form a performance domain. Two or more super nodes can form a macro cabinet. Macro cabinets can also be connected based on the data center network.
  • data processing system 100 also includes controller 130 connected to interconnected devices 120 .
  • the controller 130 communicates with the interconnected devices 120 based on control plane links and communicates with the plurality of nodes 110 based on data plane links.
  • the controller 130 is used to control nodes to access or exit the data processing system 100 .
  • the controller 130 assigns a node identifier to the node requesting access, configures the node identifier of the access node and the physical address of the access node to the active nodes that have already accessed the data processing system 100 (such as the multiple nodes 110), and configures, on the interconnection device, the correspondence between the node identifier of the access node and a port.
  • the node requesting access may be referred to as an access node for short.
  • the interconnection device forwards data to the access node based on the corresponding relationship between the node identifier and the port of the access node.
  • Each active node stores the global physical address of the access node, and the global physical address includes the node identification of the access node and the physical address of the access node.
  • the physical address in the global physical address is used to uniquely indicate the storage space of the access node; that is, the physical address of the access node refers to the address of the storage space within the access node, enabling the active node to access the access node based on its global physical address.
  • the global physical address refers to an address that can be accessed by any active node among the multiple active nodes included in the data processing system 100 .
  • Any active node among the multiple active nodes included in the data processing system 100 stores the global physical address of other active nodes, so that any active node can access the storage space of other active nodes based on the global physical addresses of other active nodes.
  • the controller 130 ages the global physical address of the node that requested the exit.
  • the node requesting exit may be referred to as the exit node.
  • the controller 130 sends the first aging message to the interconnected device 120, and broadcasts the second aging message.
  • the first aging message is used to instruct the interconnection device to age the correspondence between the node identifier of the exit node and the port.
  • the second aging message is used to instruct the active nodes to age the global physical address of the exit node; that is, each active node in the data processing system 100 receives the second aging message and ages the node identifier of the exit node and the physical address of the exit node.
  • the controller 130 allocates a first node identifier for the first node to uniquely indicate the first node, and configures, for the interconnection device, the correspondence between the first node identifier and a first port.
  • the global physical address of the first node includes the first node identifier and the physical address of the first node, so that the interconnection device forwards data to the first node based on the correspondence between the first node identifier and the first port.
  • each node 110 stores the global physical address of the first node, so that each node 110 accesses the storage space of the first node according to the global physical address of the first node; that is, it determines to access the first node according to the first node identifier, and accesses the storage space of the first node according to the physical address of the first node.
  • nodes perform memory accesses based on None-Posted Write instructions.
  • the None-Posted Write instruction is used to indicate the physical address of the memory of the first node, so that the first node accesses the memory of the first node.
  • the None-Posted Write instruction is used to indicate the node ID of the remote node and the physical address of the remote node, so that the first node accesses the memory of the remote node.
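  • A minimal sketch of the dispatch described above, assuming hypothetical issue_local_write/issue_remote_write helpers in place of the actual None-Posted Write bus machinery:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: an access carrying only a local physical address
 * targets the node's own memory, while an access carrying a remote node
 * identifier plus a physical address is sent over the interconnect. The
 * helper functions are hypothetical placeholders, not an actual bus API. */
typedef struct {
    uint16_t node_id;     /* identifier of the target node     */
    uint64_t phys_addr;   /* physical address within that node */
} remote_access_t;

static void issue_local_write(uint64_t phys_addr, const void *buf, size_t len) {
    (void)buf;
    printf("local write: pa=0x%llx len=%zu\n", (unsigned long long)phys_addr, len);
}

static void issue_remote_write(const remote_access_t *t, const void *buf, size_t len) {
    (void)buf;
    /* In the system described above this would be a None-Posted Write
     * carried over the high-speed interconnection link. */
    printf("remote write: node=%u pa=0x%llx len=%zu\n",
           t->node_id, (unsigned long long)t->phys_addr, len);
}

int main(void) {
    const char payload[] = "descriptor";
    uint16_t local_node_id = 1;                       /* hypothetical value */
    remote_access_t target = { .node_id = 2, .phys_addr = 0x4000 };

    if (target.node_id == local_node_id)
        issue_local_write(target.phys_addr, payload, sizeof(payload));
    else
        issue_remote_write(&target, payload, sizeof(payload));
    return 0;
}
```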
  • the data processing system 100 supports running big data, database, high-performance computing, artificial intelligence, distributed storage, cloud native and other applications.
  • the data in the embodiments of this application include business data for applications such as big data, databases, high-performance computing, artificial intelligence (Artificial Intelligence, AI), distributed storage, and cloud native.
  • the controller 130 may receive a processing request sent by a user operating the client and control the job indicated by the processing request.
  • the client can refer to a computer or a workstation.
  • Controllers and nodes can be separate physical devices.
  • a controller may also be called a control node, control device, or named node.
  • a node may be called a computing device or a data node or a storage node.
  • the storage media of the nodes 110 in the data processing system 100 are uniformly addressed to form a global memory pool, enabling memory semantic access across nodes within the supernode (referred to as: cross-node).
  • the global memory pool is a node-shared resource composed of the node's storage media through unified addressing.
  • the global memory pool provided by this application may include the storage medium of the computing node and the storage medium of the storage node in the super node.
  • the storage medium of the computing node includes at least one of a local storage medium within the computing node and an extended storage medium connected to the computing node.
  • the storage medium of the storage node includes at least one of a local storage medium within the storage node and an extended storage medium connected to the storage node.
  • the global memory pool includes local storage media within computing nodes and local storage media within storage nodes.
  • the global memory pool includes local storage media within the computing node, extended storage media connected to the computing node, and any one of local storage media within the storage node and extended storage media connected to the storage node.
  • the global memory pool includes local storage media within the computing node, extended storage media connected to the computing node, local storage media within the storage node, and extended storage media connected to the storage node.
  • the global memory pool 200 includes a storage medium 210 in each of the N computing nodes, an extended storage medium 220 connected to each of the N computing nodes, a storage medium 230 in each of the M storage nodes, and an extended storage medium 240 connected to each of the M storage nodes.
  • the storage capacity of the global memory pool may include part of the storage capacity in the storage medium of the computing node and part of the storage capacity in the storage medium of the storage node.
  • the global memory pool is a storage medium that can be accessed by both computing nodes and storage nodes in the super node through unified addressing.
  • the storage capacity of the global memory pool can be used by computing nodes or storage nodes through memory interfaces such as large memory, distributed data structures, data caches, and metadata. Compute nodes running applications can use these memory interfaces to perform memory operations on the global memory pool.
  • the global memory pool constructed based on the storage capacity of the storage media of the computing nodes and the storage nodes provides a unified memory interface northbound for the computing nodes, allowing the computing nodes to use the unified memory interface to write data into the storage space provided by the computing nodes or the storage space provided by the storage nodes, realizing the computation and storage of data based on memory operation instructions, reducing the delay of data processing, and increasing the speed of data processing.
  • the above description takes the storage medium in the computing node and the storage medium in the storage node to construct a global memory pool as an example.
  • the deployment method of the global memory pool can be flexible and changeable, and is not limited in the embodiments of this application.
  • the global memory pool is built from the storage media of the storage nodes.
  • the global memory pool is constructed from the storage media of computing nodes. Using the storage media of separate storage nodes or the storage media of computing nodes to build a global memory pool can reduce the occupation of storage resources on the storage side and provide a more flexible expansion solution.
  • the storage media of the global memory pool provided by the embodiment of this application include dynamic random access memory (DRAM), solid-state drive (SSD) and storage-class memory (SCM).
  • the global memory pool can be set according to the type of storage medium, that is, one type of storage medium is used to construct one memory pool, and different types of storage media construct different types of global memory pools, so that when using the global memory pool, the computing node can select storage media based on the access characteristics of the application, which enhances the user's control over the system, improves the user's experience with the system, and expands the applicable application scenarios of the system.
  • the DRAM in the computing node and the DRAM in the storage node are uniformly addressed to form a DRAM memory pool.
  • the DRAM memory pool is used in application scenarios that require high access performance, moderate data capacity, and no data persistence requirements.
  • the SCM in the computing node and the SCM in the storage node are uniformly addressed to form an SCM memory pool.
  • the SCM memory pool is used in application scenarios that are not sensitive to access performance, have large data capacity, and require data persistence.
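  • A minimal sketch of selecting a typed memory pool from the application's access characteristics, following the DRAM/SCM example above; the selection rule and names are assumptions for illustration:

```c
#include <stdio.h>

/* Illustrative sketch of choosing a typed global memory pool based on the
 * application's access characteristics. */
typedef enum { POOL_DRAM, POOL_SCM } pool_type_t;

typedef struct {
    int needs_persistence;   /* data must survive power loss     */
    int latency_sensitive;   /* requires high access performance */
    int large_capacity;      /* working set is large             */
} access_profile_t;

static pool_type_t select_pool(const access_profile_t *p) {
    if (p->needs_persistence || p->large_capacity)
        return POOL_SCM;     /* large capacity, persistence required    */
    return POOL_DRAM;        /* high performance, no persistence needed */
}

int main(void) {
    access_profile_t cache_workload = { 0, 1, 0 };
    access_profile_t database_log   = { 1, 0, 1 };
    printf("cache workload -> %s\n",
           select_pool(&cache_workload) == POOL_DRAM ? "DRAM pool" : "SCM pool");
    printf("database log   -> %s\n",
           select_pool(&database_log) == POOL_DRAM ? "DRAM pool" : "SCM pool");
    return 0;
}
```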
  • FIG 3 is a schematic flow chart of a node access data processing system method provided by this application.
  • The following takes, as an example, the case in which the scale of the data processing system 100 is expanded and the accelerator 140 requests access to the data processing system 100.
  • the method includes the following steps.
  • Step 310 The accelerator 140 sends a broadcast message, and the broadcast message is used to request authentication of the accelerator 140.
  • After the accelerator 140 establishes a physical connection with the interconnection device 120 in the data processing system 100 and is powered on, the accelerator 140 requests access to the data processing system 100.
  • the accelerator 140 sends a broadcast message, that is, sends a broadcast message to multiple nodes 110 and interconnected devices 120 in the data processing system 100 to request access to the data processing system 100 .
  • the broadcast message includes the device identification (Device_ID) of the accelerator 140 and the physical address of the storage space of the accelerator 140 .
  • the accelerator 140 stores a device identifier, and the device identifier may be pre-configured when the accelerator 140 leaves the factory. The node 110 receives the broadcast message and discards it; the controller 130 receives the broadcast message and authenticates the accelerator 140 as indicated by the broadcast message, that is, executes steps 320 to 340.
  • Step 320 The controller 130 assigns a node identifier to the accelerator 140.
  • the controller 130 stores a device identification table, which includes device identifications of authenticated active nodes in the data processing system 100 .
  • the controller 130 receives the broadcast message and queries the device identification table according to the device identification of the accelerator 140 . If the controller 130 determines that the device identification table includes the device identification of the accelerator 140, it means that the accelerator 140 has been connected to the data processing system 100 and the accelerator 140 is an active node. If the controller 130 determines that the device identification table does not include the same identification as the device identification of the accelerator 140, indicating that the accelerator 140 is the node requesting authentication, the controller 130 updates the device identification table, that is, writes the device identification of the accelerator 140 into the device identification table.
  • the controller 130 can assign a node identifier to the accelerator 140, and the node identifier of the accelerator 140 is used to uniquely indicate the accelerator 140.
  • the interconnection device 120 sends data to the accelerator 140 based on the node identifier of the accelerator 140, so that the node 110 communicates with the accelerator 140.
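  • A minimal sketch of steps 310-320 as described above, assuming a simple in-memory device identification table and sequential node identifier allocation (both are illustrative assumptions, not the claimed implementation):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative sketch: the controller keeps a device identification table of
 * authenticated active nodes; when a broadcast access request carries an
 * unknown device identifier, the controller records it and allocates a new
 * node identifier. */
#define MAX_NODES 64

typedef struct {
    char     device_id[32];   /* factory-configured device identifier   */
    uint16_t node_id;         /* identifier allocated by the controller */
} device_entry_t;

static device_entry_t device_table[MAX_NODES];
static int            device_count;
static uint16_t       next_node_id = 1;

/* Returns the node identifier for the device, allocating one if the device
 * is not yet in the table (i.e., it is a node requesting authentication). */
static uint16_t controller_authenticate(const char *device_id) {
    for (int i = 0; i < device_count; i++)
        if (strcmp(device_table[i].device_id, device_id) == 0)
            return device_table[i].node_id;      /* already an active node */

    if (device_count >= MAX_NODES) return 0;     /* table full (sketch only) */
    device_entry_t *e = &device_table[device_count++];
    snprintf(e->device_id, sizeof(e->device_id), "%s", device_id);
    e->node_id = next_node_id++;
    return e->node_id;
}

int main(void) {
    printf("accelerator-140 -> node id %u\n", controller_authenticate("accelerator-140"));
    printf("accelerator-140 -> node id %u\n", controller_authenticate("accelerator-140"));
    return 0;
}
```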
  • Step 330 The controller 130 sends the corresponding relationship between the node identifier and the port of the accelerator 140 to the interconnected device 120.
  • the controller 130 sends the corresponding relationship between the node identifier and the port of the accelerator 140 to the interconnection device 120 through physical media such as optical fiber, copper cable, or copper wire that connects the interconnection device 120 .
  • Step 340 The interconnection device 120 updates the forwarding table according to the corresponding relationship between the node identifier and the port.
  • the forwarding table is used to instruct the interconnection device 120 to forward communication traffic to the node indicated by the node identifier according to the corresponding relationship between the node identifier and the port.
  • the forwarding table includes the correspondence between node identifiers and ports.
  • the port corresponding to the node identification of the accelerator 140 may be used to instruct the interconnection device 120 to forward data to the accelerator 140 .
  • After receiving the correspondence between the node identifier of the accelerator 140 and the port, the interconnection device 120 updates the forwarding table, that is, writes the correspondence between the node identifier of the accelerator 140 and the port into the forwarding table.
  • the corresponding relationship between node identifiers and ports can be presented in the form of a table, as shown in Table 1.
  • the interconnection device 120 uses port 1 to connect to the accelerator 140.
  • the interconnection device 120 receives the node identifier 1, looks up the table 1 according to the node identifier 1, determines that the node identifier 1 corresponds to the port 1, and sends data to the accelerator 140 through the port 1.
  • Table 1 only illustrates, in the form of a table, the storage form of the correspondence in the storage device, and does not limit the storage form of the correspondence in the storage device; the correspondence can also be stored in other forms, which is not limited in this embodiment.
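  • A minimal sketch of the forwarding table behaviour described in steps 330-340, assuming a fixed-size table and linear lookup (illustrative assumptions only):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: the interconnection device keeps a forwarding table
 * mapping node identifiers to ports, updates it when the controller sends a
 * new correspondence, and looks it up when forwarding a message. */
#define FWD_TABLE_SIZE 64

typedef struct {
    uint16_t node_id;
    uint16_t port;
    int      valid;
} fwd_entry_t;

static fwd_entry_t fwd_table[FWD_TABLE_SIZE];

static void fwd_table_update(uint16_t node_id, uint16_t port) {
    for (int i = 0; i < FWD_TABLE_SIZE; i++) {
        if (!fwd_table[i].valid || fwd_table[i].node_id == node_id) {
            fwd_table[i] = (fwd_entry_t){ node_id, port, 1 };
            return;
        }
    }
}

/* Returns the egress port for a destination node identifier, or -1 if the
 * node is unknown. */
static int fwd_table_lookup(uint16_t node_id) {
    for (int i = 0; i < FWD_TABLE_SIZE; i++)
        if (fwd_table[i].valid && fwd_table[i].node_id == node_id)
            return fwd_table[i].port;
    return -1;
}

int main(void) {
    fwd_table_update(1, 1);     /* node identifier 1 reached through port 1 */
    printf("node 1 -> port %d\n", fwd_table_lookup(1));
    printf("node 9 -> port %d\n", fwd_table_lookup(9));
    return 0;
}
```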
  • Step 350 The controller 130 sends the information of the accelerator 140 to multiple nodes 110.
  • the controller 130 sends the node identification of the accelerator 140 , the device identification of the accelerator 140 and the physical address of the storage space of the accelerator 140 to the plurality of nodes 110 .
  • Multiple nodes 110 refer to certified active nodes (Active Nodes) in the data processing system 100 .
  • the node identification of the accelerator 140 and the physical address of the accelerator 140 may be used as the global physical address of the accelerator 140 in the data processing system 100 .
  • the node identification of the accelerator 140 is used to uniquely indicate the accelerator 140 .
  • the physical address of the accelerator 140 refers to the address of the storage space in the accelerator 140 .
  • the global physical address of accelerator 140 is used to uniquely indicate the storage space of accelerator 140 in data processing system 100 .
  • Each active node stores the node identification of the accelerator 140 and the physical address of the storage space of the accelerator 140, so that the active node can perform a read operation or write operation on the storage space of the accelerator 140 according to the node identification of the accelerator 140 and the physical address of the accelerator 140, that is, Data is written to the storage space indicated by the physical address of the accelerator 140, or data is read from the storage space indicated by the physical address of the accelerator 140.
  • Each active node queries the device list according to the device identification of the accelerator 140, determines the software driver corresponding to the device identification of the accelerator 140, and runs that software driver, so that the multiple nodes 110 can communicate with the accelerator 140 and realize the function of accessing the accelerator 140.
  • the scale of the data processing system can be flexibly expanded on demand, supporting access of accelerators, increasing the storage space of the global memory, and so on, so as to meet the computing requirements of different application scenarios.
  • FIG. 4 is a schematic flow chart of a data processing method provided by this application. As shown in Figure 4, the method includes the following steps.
  • Step 410 The node 110 sends an access request, and the access request is used to instruct the accelerator 140 to perform an acceleration operation.
  • the access request may include the node identification of the accelerator 140 so that the interconnected device 120 forwards the access request based on the node identification of the accelerator 140 .
  • the access request may also include a source address, a destination address and an operation identifier, so that the accelerator 140 obtains the data to be processed according to the source address, processes the data to be processed, and stores the processed data according to the destination address.
  • the source address is used to indicate the node identifier and physical address of the node that stores the data to be processed.
  • the destination address is used to indicate the node ID and physical address of the node where the processed data is stored.
  • the source address including a node identifier and a physical address and the destination address including a node identifier and a physical address described in the embodiments of this application may be global physical addresses, so that the data processing request of a node can be extended to different nodes within the system; the data to be processed is obtained, or the processed data is stored, according to the physical address of the node uniquely indicated by the node identifier, so that the accelerator resources in the system can be shared by multiple nodes to adapt to the computing needs of different application scenarios.
  • the embodiments of the present application do not limit the nodes that store the data to be processed and the nodes that store the processed data.
  • the source address contains the node ID and physical address of the node on which the acceleration operation is requested.
  • the destination address contains the node ID and physical address of the node performing the acceleration operation.
  • the node 110 requests the accelerator 140 to perform an acceleration operation.
  • the source address includes the node identifier of the node 110 and the physical address of the node 110 .
  • the destination address includes the node identifier of the accelerator 140 and the physical address of the accelerator 140 .
  • the physical address contained in the source address may indicate any node 110 or accelerator 140 .
  • the physical address contained in the destination address may indicate any node 110 or accelerator 140 .
  • the node 110 may reuse a descriptor (Descriptor) in a Domain Specific Architecture (DSA) architecture to indicate accelerated operations.
  • Figure 5 is a schematic structural diagram of a descriptor provided by this application. The descriptor includes the operation identifier, the source address and the destination address.
  • the source address is used to indicate the storage location of the data to be processed used by the acceleration operation.
  • the destination address is used to indicate the storage location of the result of the acceleration operation, that is, the storage location of the processed data.
  • the descriptor can be a 64-byte descriptor.
  • the length of the source address is 64 bits.
  • the length of the destination address is 64 bits.
  • the length of the node identifier can be 12 bits.
  • the length of the physical address can be 52 bits.
  • Memory configuration includes memory storage capacity and memory type. Memory types include DRAM, SSD, and SCM.
  • the descriptor may also include an operation custom field, which is used to indicate customizable operations according to different operation identifiers.
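  • A minimal sketch of a 64-byte descriptor with the fields named above; the field ordering and the sizes of the flags and operation custom fields are assumptions chosen only so the structure totals 64 bytes:

```c
#include <stdint.h>
#include <assert.h>
#include <stdio.h>

/* Illustrative sketch of a 64-byte descriptor carrying an operation
 * identifier, a 64-bit source address and a 64-bit destination address (each
 * packing a 12-bit node identifier and a 52-bit physical address), plus an
 * operation custom field. */
typedef struct {
    uint32_t op_id;            /* operation identifier (e.g. compress, encrypt)  */
    uint32_t flags;            /* assumed padding/control field                   */
    uint64_t src_addr;         /* source: 12-bit node id | 52-bit phys addr       */
    uint64_t dst_addr;         /* destination: 12-bit node id | 52-bit phys addr  */
    uint8_t  op_custom[40];    /* operation custom field                          */
} descriptor_t;

static uint64_t pack_gpa(uint16_t node_id, uint64_t phys_addr) {
    return ((uint64_t)(node_id & 0xFFF) << 52) | (phys_addr & ((1ULL << 52) - 1));
}

int main(void) {
    assert(sizeof(descriptor_t) == 64);   /* matches the 64-byte descriptor */

    descriptor_t d = {
        .op_id    = 5,                    /* hypothetical: compression       */
        .src_addr = pack_gpa(1, 0x2000),  /* data to be processed on node 1  */
        .dst_addr = pack_gpa(3, 0x8000),  /* processed data stored on node 3 */
    };
    printf("descriptor: op=%u src=0x%llx dst=0x%llx\n",
           d.op_id, (unsigned long long)d.src_addr, (unsigned long long)d.dst_addr);
    return 0;
}
```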
  • the node 110 generates a 64-byte descriptor through the accelerator driver and runs the None-Posted Write instruction to write the descriptor into the request queue of the accelerator 140 through the interconnection bus between the node 110 and the interconnection device 120.
  • the node 110 maps, in a memory-mapped IO (MMIO) manner, the read/write register address of the request queue of the accelerator 140, and accesses the register through the None-Posted Write instruction over the interconnect bus.
  • the operation identifier can also be called an operation operator.
  • the operation ID is used to indicate the acceleration operation performed. Acceleration operations include any of the following.
  • Memory interaction (SWAP).
  • Reduce refers to performing arithmetic or logical calculations on multiple local data.
  • Broadcast refers to broadcasting local data blocks to nodes in the system.
  • Search refers to the query operation of the database and returns matching results.
  • Encryption/Decryption refers to data block encryption and decryption operations.
  • Compress/Decompress refers to data block compression/decompression operations.
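  • For illustration, the acceleration operations listed above can be expressed as an operation-identifier enumeration; the numeric encodings below are arbitrary assumptions, since the description does not assign values:

```c
#include <stdio.h>

/* Illustrative sketch of the acceleration operations as operation identifiers. */
typedef enum {
    OP_SWAP = 1,        /* memory interaction                           */
    OP_REDUCE,          /* arithmetic/logical reduction over local data */
    OP_BROADCAST,       /* broadcast local data blocks to system nodes  */
    OP_SEARCH,          /* database query returning matching results    */
    OP_ENCRYPT,         /* data block encryption                        */
    OP_DECRYPT,         /* data block decryption                        */
    OP_COMPRESS,        /* data block compression                       */
    OP_DECOMPRESS       /* data block decompression                     */
} accel_op_t;

static const char *op_name(accel_op_t op) {
    switch (op) {
    case OP_SWAP:       return "swap";
    case OP_REDUCE:     return "reduce";
    case OP_BROADCAST:  return "broadcast";
    case OP_SEARCH:     return "search";
    case OP_ENCRYPT:    return "encrypt";
    case OP_DECRYPT:    return "decrypt";
    case OP_COMPRESS:   return "compress";
    case OP_DECOMPRESS: return "decompress";
    }
    return "unknown";
}

int main(void) {
    printf("operation identifier %d -> %s\n", OP_COMPRESS, op_name(OP_COMPRESS));
    return 0;
}
```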
  • Step 420 The interconnection device 120 forwards the access request based on the port corresponding to the node identification of the accelerator 140.
  • the interconnection device 120 stores the corresponding relationship between node identifiers and ports. After receiving the access request, the interconnected device 120 obtains the node identifier of the accelerator 140, queries the forwarding table, and forwards the access request to the accelerator 140 based on the port corresponding to the node identifier of the accelerator 140 determined in the forwarding table.
  • Step 430 The accelerator 140 performs an acceleration operation according to the operation identifier.
  • the accelerator 140 reads the descriptor from the request queue, parses each field of the descriptor, and performs the acceleration operation indicated by the operation identifier.
  • the accelerator 140 allocates a section of storage space in the local memory of the accelerator 140 according to the descriptor requirements. This segment of storage space may be used to store intermediate data for the accelerator 140 to perform acceleration operations.
  • the accelerator 140 drives the local SDMA engine to write the data in the memory of the remote node (such as the node 110 that issued the access request) into the local memory of the accelerator 140.
  • if the descriptor indicates a request for local data of the accelerator 140, the data read engine is driven, and the data in the network or local memory is read into the local cache of the accelerator 140.
  • the accelerator 140 reads the data to be processed according to the node identification and physical address indicated by the source address.
  • the physical address indicated by the source address includes any one of the physical address of the storage space in the node 110 that issued the access request, the physical address of the storage space in other nodes 110 in the system, and the physical address of the storage space in the accelerator 140 . That is, the location where the accelerator 140 reads the data to be processed includes any one of the storage space of the node 110 that issued the access request, the storage space of other nodes 110 in the system, and the storage space of the accelerator 140 .
  • the accelerator 140 stores the processed data of the acceleration operation according to the node identification and physical address indicated by the destination address, so that the node 110 that issues the access request reads the processed data of the acceleration operation according to the physical address indicated by the destination address.
  • the physical address indicated by the destination address includes any one of the physical address of the storage space in the node 110 that issued the access request, the physical address of the storage space in other nodes 110 in the system, and the physical address of the storage space in the accelerator 140 . That is, the accelerator 140 stores the processed data in any of the following locations, including the storage space of the node 110 that issued the access request, the storage space of other nodes 110 in the system, and the storage space of the accelerator 140 .
  • the physical address indicated by the source address and the physical address indicated by the destination address may indicate the storage space of the storage medium in the global memory pool. That is, the accelerator 140 can read the data to be processed from the global memory pool or/and store the processed data in the global memory pool to increase the speed of data processing.
  • the accelerator 140 performs an acceleration operation on the locally cached data to be processed according to the operation identifier described in the descriptor.
  • the operation identifier indicates a compression operation, and the accelerator 140 performs the compression operation on the locally cached data to be processed.
  • the operation identifier indicates an encryption operation, and the accelerator 140 performs an encryption operation on the locally cached data to be processed.
  • After the accelerator 140 completes the acceleration operation, it releases the local cache.
  • the accelerator 140 can also trigger an interrupt at the node that issued the descriptor through a request completion interrupt message, so that the node 110 that issued the access request retrieves the result of the acceleration operation.
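  • A minimal sketch of the accelerator-side handling in step 430, with hypothetical dma_read/dma_write/raise_completion_interrupt helpers standing in for the SDMA engine and interrupt machinery:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch: the accelerator pops a request, stages the data to be
 * processed in a local buffer, performs the operation named by the operation
 * identifier, writes the result to the destination address, and signals
 * completion. All helper functions are hypothetical stand-ins. */
typedef struct {
    uint32_t op_id;
    uint64_t src_addr;   /* global physical address of the input  */
    uint64_t dst_addr;   /* global physical address of the output */
    uint64_t length;
} request_t;

static void dma_read(uint64_t src, void *buf, uint64_t len)        { (void)src; (void)buf; (void)len; }
static void dma_write(uint64_t dst, const void *buf, uint64_t len) { (void)dst; (void)buf; (void)len; }
static void raise_completion_interrupt(uint64_t dst) {
    printf("completion: result at gpa 0x%llx\n", (unsigned long long)dst);
}

static void accelerator_handle(const request_t *req) {
    void *buf = malloc(req->length);          /* local staging buffer */
    if (!buf) return;

    dma_read(req->src_addr, buf, req->length);

    switch (req->op_id) {                     /* acceleration operation */
    case 7: /* hypothetical: compress */ break;
    case 5: /* hypothetical: encrypt  */ break;
    default: /* other operations      */ break;
    }

    dma_write(req->dst_addr, buf, req->length);
    free(buf);                                /* release the local cache */
    raise_completion_interrupt(req->dst_addr);
}

int main(void) {
    request_t req = { .op_id = 7, .src_addr = 0x1000, .dst_addr = 0x2000, .length = 64 };
    accelerator_handle(&req);
    return 0;
}
```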
  • the jobs of general-purpose processors are offloaded to the accelerator, and the accelerator handles jobs with higher computing requirements (such as HPC jobs, big data jobs, database jobs, etc.), solving the problem that the floating-point computing power of general-purpose processors is insufficient and cannot meet the heavy floating-point computing requirements of HPC, AI and other scenarios, thereby shortening data processing time, reducing system energy consumption, and improving system performance.
  • Independently deployed accelerators and nodes integrating accelerators support flexible plugging and unplugging, and can flexibly expand the scale of the data processing system on demand to meet the computing needs of different application scenarios.
  • any active node in the data processing system can exit the system.
  • Figure 6 is a schematic flow chart of a method for node exiting the data processing system provided by this application.
  • the accelerator 140's request to exit the data processing system 100 is taken as an example for description. As shown in Figure 6, the method includes the following steps.
  • the accelerator 140 may actively exit the data processing system, that is, the accelerator 140 performs step 610.
  • Step 610 The accelerator 140 broadcasts an aging message, and the aging message is used to indicate that the accelerator 140 exits the data processing system; steps 630 and 640 are then performed.
  • Alternatively, if the controller 130 receives link failure information sent by the interconnection device 120, or the heartbeat message between the accelerator 140 and other nodes 110 times out, step 620 is executed.
  • Step 620 The controller 130 broadcasts an aging message, and the aging message is used to indicate that the accelerator 140 exits the data processing system; steps 630 and 640 are then performed.
  • the aging message may include the node identification of the accelerator 140, the physical address of the accelerator 140, and the device identification of the accelerator 140.
  • Step 630 The interconnection device 120 ages the correspondence between the node identifier of the accelerator 140 and the port.
  • the interconnected device 120 deletes the forwarding table entry corresponding to the node identifier and the port of the accelerator 140 in the forwarding table.
  • Step 640 The node 110 ages the node ID of the accelerator 140 and the physical address of the accelerator 140.
  • the node 110 deletes the stored node identification and physical address of the accelerator 140 and the software driver corresponding to the device identification of the accelerator 140 in the device list.
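  • A minimal sketch of the aging in steps 630-640, assuming simple in-memory tables on the interconnection device and the active nodes (illustrative assumptions only):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: on receiving an aging message naming the exiting
 * node, the interconnection device invalidates the matching forwarding-table
 * entry, and each active node drops the stored global physical address of
 * that node. */
#define MAX_ENTRIES 64

typedef struct { uint16_t node_id; uint16_t port;      int valid; } fwd_entry_t;
typedef struct { uint16_t node_id; uint64_t phys_addr; int valid; } gpa_entry_t;

static fwd_entry_t fwd_table[MAX_ENTRIES];   /* on the interconnection device */
static gpa_entry_t gpa_table[MAX_ENTRIES];   /* on each active node           */

static void age_node(uint16_t exiting_node_id) {
    for (int i = 0; i < MAX_ENTRIES; i++) {
        if (fwd_table[i].valid && fwd_table[i].node_id == exiting_node_id)
            fwd_table[i].valid = 0;          /* delete the forwarding entry      */
        if (gpa_table[i].valid && gpa_table[i].node_id == exiting_node_id)
            gpa_table[i].valid = 0;          /* drop the stored global address   */
    }
}

int main(void) {
    fwd_table[0] = (fwd_entry_t){ 1, 1, 1 };
    gpa_table[0] = (gpa_entry_t){ 1, 0x1000, 1 };
    age_node(1);                              /* node identifier 1 exits */
    printf("forwarding entry valid: %d, address entry valid: %d\n",
           fwd_table[0].valid, gpa_table[0].valid);
    return 0;
}
```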
  • a controller and an interconnection device are set up in the data processing system, so nodes can be flexibly added and removed, achieving an elastically scalable super-node architecture; this not only solves the problem that a traditional super-node architecture cannot dynamically expand, but also avoids the problems of limited scale and low bandwidth of a traditional IO bus architecture, and supports a dynamic fault-tolerance mechanism in the event of node or interconnection device failure.
  • the controller includes corresponding hardware structures and/or software modules that perform each function.
  • the units and method steps of each example described in conjunction with the embodiments disclosed in this application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or computer software driving the hardware depends on the specific application scenarios and design constraints of the technical solution.
  • Figure 7 is a schematic structural diagram of a possible control device provided in this embodiment. These control devices can be used to implement the functions of the controller in the above method embodiments, and therefore can also achieve the beneficial effects of the above method embodiments.
  • the control device may be the controller 130 shown in FIG. 3, FIG. 4, and FIG. 6, or may be a module (such as a chip) applied to the server.
  • the control device 700 includes a communication module 710 , a control module 720 and a storage module 730 .
  • the control device 700 is used to implement the functions of the controller 130 in the method embodiments shown in FIG. 3, FIG. 4, and FIG. 6.
  • the communication module 710 is configured to receive an access request sent by the second node, where the access request is used to indicate authentication of the second node.
  • the control module 720 is configured to allocate a second node identifier to the second node when the second node requests access to the data processing system, and the second node identifier is used to uniquely indicate the second node. For example, the control module 720 is used to execute step 320 in FIG. 3 .
  • the communication module 710 is also configured to send the corresponding relationship between the second node identifier and the second port to the interconnected device, and the second port corresponding to the second node identifier is used to forward the message to the second node.
  • the control module 720 is used to execute step 330 in FIG. 3 .
  • the communication module 710 is also used to send the global physical address of the second node to other nodes in the data processing system, for example, broadcast the node identifier and physical address of the second node to other nodes in the data processing system.
  • the control module 720 is used to execute step 350 in FIG. 3 .
  • the communication module 710 is also configured to send a first aging message to the interconnected device.
  • the first aging message is used to instruct the interconnected device to age the corresponding relationship between the second node identifier and the second port.
  • the control module 720 is used to execute step 620 in FIG. 6 .
  • The communication module 710 is also configured to broadcast a second aging message, where the second aging message is used to indicate that the second node identifier and the physical address of the second node are to be aged.
  • The storage module 730 is used to store the node identifiers of nodes, so that the control module 720 can control nodes accessing the data processing system and exiting the data processing system.
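  • Taken together, the modules above suggest the following sketch of the controller's admission and exit handling. The identifier-allocation policy, the table size, and the message helpers (send_to_interconnect, broadcast_to_nodes, send_first_aging, broadcast_second_aging) are assumptions for illustration only, not the controller's actual interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical transport helpers of the communication module 710. */
void send_to_interconnect(uint16_t node_id, uint16_t port);           /* step 330: push (identifier, port) */
void broadcast_to_nodes(uint16_t node_id, uint64_t phys_addr,
                        uint32_t device_id);                          /* step 350: announce the new node */
void send_first_aging(uint16_t node_id);                              /* step 620: towards the interconnection device */
void broadcast_second_aging(uint16_t node_id, uint64_t phys_addr);    /* step 620: towards the nodes */

static bool     id_in_use[4096];        /* storage module 730: which identifiers are allocated */
static uint16_t next_id = 1;

/* Admission: allocate an identifier, program the interconnection device, announce the node. */
int controller_admit(uint32_t device_id, uint64_t phys_addr, uint16_t ingress_port,
                     uint16_t *node_id_out)
{
    uint16_t id = next_id;
    while (id_in_use[id])               /* pick the next free identifier; the policy is an assumption */
        id = (uint16_t)((id + 1) % 4096);
    id_in_use[id] = true;
    next_id = (uint16_t)((id + 1) % 4096);

    send_to_interconnect(id, ingress_port);
    broadcast_to_nodes(id, phys_addr, device_id);
    *node_id_out = id;
    return 0;
}

/* Exit (or detected failure): age the interconnection device entry and the nodes' copies of the address. */
void controller_evict(uint16_t node_id, uint64_t phys_addr)
{
    send_first_aging(node_id);
    broadcast_second_aging(node_id, phys_addr);
    id_in_use[node_id] = false;
}
```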
  • Figure 8 is a schematic structural diagram of a possible data processing node provided in this embodiment. These data processing nodes can be used to implement the functions of the nodes in the data processing system in the above method embodiments, and therefore can also achieve the beneficial effects of the above method embodiments. In this embodiment, the data processing node may be the node 110 or the accelerator 140 shown in FIG. 3, FIG. 4, and FIG. 6, or may be a module (such as a chip) applied to a server.
  • As shown in FIG. 8, the data processing node 800 includes a communication module 810, a data processing module 820, and a storage module 830. The data processing node 800 is used to implement the functions of the node 110 or the accelerator 140 in the method embodiments shown in FIG. 3, FIG. 4, and FIG. 6.
  • The communication module 810 is configured to receive an access request sent by the first node, where the access request includes a source address, a destination address, and an operation identifier. The source address is used to indicate the node identifier and physical address of the node that stores the data to be processed, and the destination address is used to indicate the node identifier and physical address of the node where the processed data is to be stored.
  • The data processing module 820 is configured to perform an acceleration operation on the data to be processed according to the operation identifier to obtain the processed data, and to store the processed data according to the destination address indicated by the first node. For example, the data processing module 820 is used to perform step 430 in FIG. 4.
  • The storage module 830 is used to store memory operation instructions, data to be processed, or processed data, so that the data processing module 820 can perform acceleration operations.
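  • To make the request handled by the data processing node 800 concrete, the sketch below lays out a descriptor carrying the operation identifier and the source and destination global addresses, together with an outline of how the data processing module 820 might serve it. The 12-bit node identifier and 52-bit physical address follow the example descriptor widths given in this application; the numeric operation codes, the length field, and the helpers global_read, global_write, and run_operation are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Operation identifiers; the set of acceleration operations follows the description,
 * but the numeric values are arbitrary. */
enum op_id { OP_SWAP = 1, OP_REDUCE, OP_BROADCAST, OP_SEARCH, OP_ENCRYPT, OP_COMPRESS };

/* 64-bit global address: 12-bit node identifier plus 52-bit physical address (example widths). */
typedef struct {
    uint64_t node_id   : 12;
    uint64_t phys_addr : 52;
} global_addr_t;

typedef struct {
    uint32_t      op;          /* operation identifier */
    global_addr_t src;         /* where the data to be processed is stored */
    global_addr_t dst;         /* where the processed data is to be stored */
    uint32_t      length;      /* assumed field: size of the data to be processed */
} descriptor_t;

/* Hypothetical helpers that move data across the interconnect by global address. */
void global_read(global_addr_t src, void *buf, size_t len);
void global_write(global_addr_t dst, const void *buf, size_t len);
void run_operation(uint32_t op, void *buf, size_t len);   /* compress, encrypt, reduce, ... */

/* Serve one descriptor taken from the request queue. */
void handle_descriptor(const descriptor_t *d)
{
    static uint8_t local_buf[1 << 20];            /* local memory set aside for the operation */
    size_t len = d->length;
    if (len > sizeof(local_buf))
        len = sizeof(local_buf);

    global_read(d->src, local_buf, len);          /* fetch the data to be processed */
    run_operation(d->op, local_buf, len);         /* acceleration operation per the identifier */
    global_write(d->dst, local_buf, len);         /* store the processed data at the destination */
}
```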
  • It should be understood that the control device 700 and the data processing node 800 in the embodiments of this application may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the data processing methods shown in FIG. 3, FIG. 4, and FIG. 6 are implemented by software, the control device 700 and the data processing node 800 and their respective modules may also be software modules.
  • The control device 700 and the data processing node 800 according to the embodiments of this application may correspond to performing the methods described in the embodiments of this application, and the above and other operations and/or functions of the respective units in the control device 700 and the data processing node 800 are respectively intended to implement the corresponding processes of the methods in FIG. 3, FIG. 4, and FIG. 6. For the sake of brevity, details are not described here again.
  • FIG. 9 is a schematic structural diagram of a computing device 900 provided in this embodiment.
  • As shown in the figure, the computing device 900 includes a processor 910, a bus 920, a memory 930, a communication interface 940, and a memory unit 950 (which may also be referred to as a main memory unit). The processor 910, the memory 930, the memory unit 950, and the communication interface 940 are connected through the bus 920.
  • It should be understood that in this embodiment, the processor 910 may be a CPU, and the processor 910 may also be another general-purpose processor, a digital signal processor (DSP), an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
  • The processor may also be a graphics processing unit (GPU), a neural network processing unit (NPU), a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the program of the solution of this application.
  • The communication interface 940 is used to implement communication between the computing device 900 and external devices or components. In this embodiment, when the computing device 900 is used to implement the functions of the controller 130 shown in FIG. 3, the communication interface 940 is used to obtain the broadcast message, and the processor 910 assigns a node identifier to the node 110 and broadcasts the node identifier and the physical address of the node. When the computing device 900 is used to implement the functions of the controller 130 shown in FIG. 6, the communication interface 940 is used to broadcast the aging message.
  • The bus 920 may include a path for transferring information between the above components (such as the processor 910, the memory unit 950, and the memory 930). In addition to a data bus, the bus 920 may also include a power bus, a control bus, a status signal bus, and the like. However, for clarity of illustration, the various buses are all labeled as bus 920 in the figure.
  • The bus 920 may be a Peripheral Component Interconnect Express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), or the like. The bus 920 may be divided into an address bus, a data bus, a control bus, and so on.
  • As an example, the computing device 900 may include multiple processors. A processor may be a multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits, and/or computing units for processing data (for example, computer program instructions).
  • In this embodiment, when the computing device 900 is used to implement the functions of the accelerator 140 shown in FIG. 4, the processor 910 is also used to obtain the data to be processed according to the source address indicated by the first node, perform an acceleration operation on the data to be processed according to the operation identifier indicated by the first node to obtain the processed data, and store the processed data according to the destination address indicated by the first node. The source address is used to indicate the node identifier and physical address of the node that stores the data to be processed; the destination address is used to indicate the node identifier and physical address of the node where the processed data is stored.
  • When the computing device 900 is used to implement the functions of the controller 130 shown in FIG. 4, the processor 910 is also used to assign a node identifier to a node when the node requests access to the data processing system, send the global physical address of the node to the nodes in the data processing system, and send the correspondence between the node identifier and a port to the interconnection device, where the port corresponding to the node identifier is used to forward messages to the node; and, when the node exits the data processing system, to age the global physical address of the node and the correspondence between the node identifier and the port.
  • When the computing device 900 is used to implement the functions of the interconnection device 120 shown in FIG. 4, the processor 910 is also used to forward data communicated between nodes based on the correspondence between node identifiers and ports.
  • It is worth noting that FIG. 9 only takes the computing device 900 including one processor 910 and one memory 930 as an example. Here, the processor 910 and the memory 930 each indicate a type of device or component; in specific embodiments, the quantity of each type of device or component may be determined according to service requirements.
  • The memory unit 950 may correspond to the global memory pool used to store information such as the data to be processed and the processed data in the above method embodiments. The memory unit 950 may be a pool of volatile or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory.
  • The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM).
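  • Because the memory unit 950 may belong to a globally addressed memory pool, a node resolving a global physical address has to distinguish its own storage space from that of a remote node. The sketch below assumes the same example 12-bit/52-bit split as the descriptor, a hypothetical remote_read helper, and a directly mapped local physical address; none of these details are specified by this application.

```c
#include <stdint.h>
#include <string.h>

#define NODE_ID_SHIFT 52
#define PHYS_MASK     ((1ULL << 52) - 1)

/* Pack a global physical address: node identifier in the top 12 bits,
 * physical address of the storage space in the low 52 bits (example widths). */
static inline uint64_t make_global(uint16_t node_id, uint64_t phys)
{
    return ((uint64_t)node_id << NODE_ID_SHIFT) | (phys & PHYS_MASK);
}

/* Hypothetical helper that issues a remote read over the high-speed interconnect. */
void remote_read(uint16_t node_id, uint64_t phys, void *buf, uint64_t len);

extern uint16_t local_node_id;    /* identifier assigned to this node by the controller */

/* Read from the global memory pool: access local memory directly, otherwise go remote. */
void pool_read(uint64_t global_addr, void *buf, uint64_t len)
{
    uint16_t node_id = (uint16_t)(global_addr >> NODE_ID_SHIFT);
    uint64_t phys    = global_addr & PHYS_MASK;

    if (node_id == local_node_id)
        memcpy(buf, (void *)(uintptr_t)phys, len);    /* local storage space, assumed directly mapped */
    else
        remote_read(node_id, phys, buf, len);         /* storage space of a remote node */
}
```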
  • The memory 930 may correspond to the storage medium used to store information such as computer instructions, memory operation instructions, and node identifiers in the above method embodiments, for example, a magnetic disk such as a mechanical hard disk or a solid-state drive.
  • The above-mentioned computing device 900 may be a general-purpose device or a special-purpose device. For example, the computing device 900 may be an edge device (for example, a box carrying a chip with processing capability). Optionally, the computing device 900 may also be a server or another device with computing capability.
  • It should be understood that the computing device 900 according to this embodiment may correspond to the control device 700 and the data processing node 800 in this embodiment, and may correspond to the corresponding entity performing any of the methods in FIG. 3, FIG. 4, and FIG. 6. The above and other operations and/or functions of the modules in the control device 700 and the data processing node 800 are respectively intended to implement the corresponding processes of the methods in FIG. 3, FIG. 4, and FIG. 6. For the sake of brevity, details are not described here again.
  • The method steps in this embodiment may be implemented by hardware, or may be implemented by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art.
  • An exemplary storage medium is coupled to the processor so that the processor can read information from the storage medium and write information to the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may be located in an ASIC, and the ASIC may be located in a computing device. Of course, the processor and the storage medium may also exist as discrete components in a computing device.
  • The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer programs or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are executed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, user equipment, or another programmable apparatus.
  • The computer programs or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer programs or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means.
  • The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media such as floppy disks, hard disks, and magnetic tapes; optical media such as digital video discs (DVDs); or semiconductor media such as solid-state drives (SSDs).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Multi Processors (AREA)
  • Hardware Redundancy (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing system, method, apparatus, and controller are disclosed, relating to the field of data processing. The data processing system includes a controller and multiple nodes connected through an interconnection device, and the multiple nodes include a first node and a second node. When the second node requests access to the data processing system, the controller allocates a second node identifier to the second node, and the interconnection device forwards, based on the second node identifier, messages by which the first node accesses the second node; the second node identifier is used to uniquely indicate the second node. The second node obtains data to be processed according to the source address indicated by the first node, and stores, according to the destination address indicated by the first node, the processed data obtained by processing the data to be processed. In this way, the scale of the data processing system can be elastically expanded on demand, data processing requests of a node can be spread across different nodes in the system, the accelerator resources in the system can be shared by multiple nodes, and computing needs in different application scenarios can be met.

Description

数据处理***、方法、装置和控制器
本申请要求于2022年06月27日提交国家知识产权局、申请号为202210733448.9申请名称为“一种内存数据访问的方法和计算***”的中国专利申请的优先权,本申请要求于2022年10月14日提交国家知识产权局、申请号为202211260921.2申请名称为“数据处理***、方法、装置和控制器”的中国专利申请的优先权,这些全部内容通过引用结合在本申请中。
技术领域
本申请涉及数据处理领域,尤其涉及一种数据处理***、方法、装置和控制器。
背景技术
目前,通过高带宽、低时延的片间互联总线和交换机将多个节点互联成一个高性能集群,俗称超节点(Super Node)。相对异构计算服务器架构(如:中央处理器(central processing unit,CPU)+领域特定结构(Domain Specific Architecture,DSA)),超节点能够提供更高的计算能力;相对以太网络互联带宽,超节点能够提供更大的带宽。但是,超节点通常以静态模式进行配置,无法灵活扩展,无法适应不同的应用场景下的计算需求。
发明内容
本申请提供了数据处理***、方法、装置和控制器,由此实现灵活扩展超节点的规模,适应不同的应用场景下的计算需求。
第一方面,提供了一种数据处理***。数据处理***包括多个节点和控制器,多个节点包括第一节点和第二节点,多个节点和控制器通过高速互联链路连接。控制器,用于当第二节点请求接入数据处理***时,为第二节点分配第二节点标识,其中,第二节点在数据处理***中的全局物理地址为第二节点的节点标识和第二节点的物理地址;控制器,还用于将第二节点的全局物理地址发送给第一节点。
相对于在数据处理***的启动阶段以静态模式进行配置,本申请通过控制器控制接入新节点,可以按需弹性扩展数据处理***的规模,例如,支持接入加速器、增大全局内存的存储空间等,从而满足不同的应用场景下的计算需求。
结合第一方面,在一种可能的实现方式中,数据处理***还包括互联设备,互联设备基于高速互联链路连接多个节点;控制器,还用于向互联设备发送第二节点标识与端口的对应关系,第二节点标识对应的端口用于向第二节点转发消息。
其中,互联设备也可以称为互联芯片或交换芯片。互联设备用于基于第二节点标识与端口的对应关系转发第一节点访问第二节点的消息。例如,互联设备存储有第一节点标识与第一端口的对应关系,互联设备基于第一节点标识与第一端口的对应关系转发第二节点访问第一节点的消息。互联设备存储有第二节点标识与第二端口的对应关系,互联设备基于第二节点标识与第二端口的对应关系转发第一节点访问第二节点的消息。
从而,使互联设备基于节点标识转发节点间进行通信的数据,使***内的加速器资源可以为多个节点共享,适应不同的应用场景下的计算需求。
需要说明的是,全局物理地址是指由数据处理***包含的多个节点中任意一个节点可以访问的地址。数据处理***包含的多个节点中任意一个节点存储有其他节点的全局物理地址,以便于任意一个节点根据其他节点的全局物理地址访问其他节点的存储空间。全局物理地址用于唯一指示数据处理***中的一个节点的存储空间。可理解地,全局物理地址包含节点标识和节点的物理地址。由于节点标识用于唯一指示数据处理***中的一个节点,则全局物理地址中的物理地址用于唯一指示数据处理***中的一个节点的存储空间。节点的物理地址是指节点内存储空间的地址。虽然不同的节点内存储空间的物理地址可以是相同的,但是数据处理***中的任一节点根据全局物理地址中的节点标识区分不同节点内的存储空间。
例如,第一节点的全局物理地址包括第一节点标识和第一节点的物理地址。由于第一节点标识用于唯一指示第一节点,第一节点的物理地址用于唯一指示第一节点的存储空间,则第一节点的全局物理地址可以用于指示第一节点的存储空间。
第二节点的全局物理地址包括第二节点标识和第二节点的物理地址。由于第二节点标识用于唯一指示第二节点,第二节点的物理地址用于唯一指示第二节点的存储空间,则第二节点的全局物理地址可以用于指示第二节点的存储空间。
由此,第一节点可以根据第二节点的全局物理地址访问第二节点的存储空间。第二节点可以根据第一节点的全局物理地址访问第一节点的存储空间。
结合第一方面,在另一种可能的实现方式中,第二节点用于根据第一节点指示的源地址获取待处理数据,源地址用于指示存储待处理数据的节点的节点标识和物理地址;第二节点还用于处理待处理数据,以及根据第一节点指示的目的地址存储处理后数据,目的地址用于指示存储处理后数据的节点的节点标识和物理地址。
如此,将节点的数据处理请求扩充到***内不同节点间,根据全局物理地址中节点标识唯一指示的节点的物理地址,获取待处理数据或存储处理后数据,从而使***内的加速器资源可以为多个节点共享,适应不同的应用场景下的计算需求。
示例地,源地址用于指示第一节点的全局物理地址。目的地址用于指示第二节点的全局物理地址。
结合第一方面,在另一种可能的实现方式中,第二节点具体用于根据第一节点指示的操作标识对待处理数据执行加速操作得到处理后数据。其中,第二节点包括处理器、加速器和内存控制器中任一个。例如将通用处理器(如:CPU)的作业卸载到加速器,由加速器处理计算需求较高的作业(如:HPC、大数据作业、数据库作业等),解决由于通用处理器浮点算力不足,无法满足HPC、AI等场景的重浮点计算需求的问题,从而,缩短数据处理时长以及降低***能耗,提升***性能。节点内部也可以集成加速器。独立部署的加速器和集成加速器的节点支持灵活插拔,可以按需弹性扩展数据处理***的规模,从而满足不同的应用场景下的计算需求。
结合第一方面,在另一种可能的实现方式中,多个节点的存储介质经过统一编址构成全局内存池。例如,全局内存池包括源地址指示的存储空间或/和目的地址指示的存储空间。从而,节点执行数据处理的过程中从全局内存池读取数据或对全局内存池写入数据,以提升数据处理的速度。
结合第一方面,在另一种可能的实现方式中,第一节点还用于根据第一节点的物理地址访问第一节点的存储空间。第二节点还用于根据第二节点的物理地址访问第一节点的存储空间。
结合第一方面,在另一种可能的实现方式中,控制器还用于当第二节点退出数据处理***时,控制第一节点老化第二节点的全局物理地址,以及控制互联设备老化第二节点标识与端口的对应关系。
如此,在数据处理***中设置控制器和互联设备,基于节点接入机制和退出机制,可弹性增加及减少节点,实现了可弹性扩展的超节点架构,既解决了传统超节点架构无法动态扩展的问题,又避免了传统IO总线架构规模受限和带宽低问题,并支持在节点或者互联设备故障情况下的动态容错机制。
第二方面,提供了一种数据处理方法,数据处理***包括多个节点,多个节点包括第一节点和第二节点,多个节点和控制器通过高速互联链路连接。方法包括:当第二节点请求接入数据处理***时,控制器为第二节点分配第二节点标识,其中,第二节点在数据处理***中的全局物理地址为第二节点的节点标识和第二节点的物理地址;控制器将第二节点的全局物理地址发送给第一节点。
结合第二方面,在一种可能的实现方式中,数据处理***还包括互联设备,互联设备基于高速互联链路连接多个节点。方法还包括:控制器向互联设备发送第二节点标识与端口的对应关系,第二节点标识对应的端口用于向第二节点转发消息。
其中,互联设备基于第二节点标识与端口的对应关系转发第一节点访问第二节点的消息,第二节点标识用于唯一指示第二节点。
结合第二方面,在另一种可能的实现方式中,方法还包括:当第二节点退出数据处理***时, 控制器控制第一节点老化第二节点的全局物理地址,以及控制互联设备老化第二节点标识与端口的对应关系。
结合第二方面,在另一种可能的实现方式中,方法还包括:第二节点根据第一节点指示的源地址获取待处理数据,进而,处理待处理数据,以及根据第一节点指示的目的地址存储处理后数据。其中,目的地址用于指示存储处理后数据的节点的节点标识和物理地址。源地址用于指示存储待处理数据的节点的节点标识和物理地址。
第三方面,提供了一种控制装置,所述装置包括用于执行第二方面或第二方面任一种可能设计中的控制器执行的方法的各个模块。
第四方面,提供了一种数据传输装置,所述装置包括用于执行第二方面或第二方面任一种可能设计中的互联设备执行的方法的各个模块。
第五方面,提供了一种数据处理节点,所述节点包括用于执行第二方面或第二方面任一种可能设计中的节点执行的方法的各个模块。
第六方面,提供一种控制器,该控制器包括至少一个处理器和存储器,存储器用于存储一组计算机指令;当处理器作为第二方面或第二方面任一种可能实现方式中的控制器执行所述一组计算机指令时,执行第二方面或第二方面任一种可能实现方式中的数据处理方法的操作步骤。
第七方面,提供一种芯片,包括:处理器和供电电路;其中,所述供电电路用于为所述处理器供电;所述处理器用于执行第二方面或第二方面任一种可能实现方式中的数据处理方法的操作步骤。
第八方面,提供一种计算机可读存储介质,包括:计算机软件指令;当计算机软件指令在计算设备中运行时,使得计算设备执行如第二方面或第二方面任意一种可能的实现方式中所述方法的操作步骤。
第九方面,提供一种计算机程序产品,当计算机程序产品在计算机上运行时,使得计算设备执行如第二方面或第二方面任意一种可能的实现方式中所述方法的操作步骤。
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
附图说明
图1为本申请提供的一种数据处理***的架构示意图;
图2为本申请提供的一种全局内存池的部署场景示意图;
图3为本申请提供的一种节点接入数据处理***方法的流程示意图;
图4为本申请提供的一种数据处理方法的流程示意图;
图5为本申请提供的一种描述符的结构示意图;
图6为本申请提供的一种节点退出数据处理***方法的流程示意图;
图7为本申请提供的一种控制装置的结构示意图;
图8为本申请提供的一种数据处理节点的结构示意图;
图9为本申请提供的一种计算设备的结构示意图。
具体实施方式
为了便于描述,首先对本申请涉及的术语进行简单介绍。
超节点(Super Node),指通过高带宽、低时延的片间互联总线和交换机将多个节点互联成一个高性能集群。超节点的规模大于缓存一致非统一内存寻址(Cache-Coherent Non Uniform Memory Access,CC-NUMA)架构下的节点规模,超节点内节点的互联带宽大于以太网络互联带宽。
高性能计算(High Performance Computing,HPC)集群,指一个计算机集群***。HPC集群包含利用各种互联技术连接在一起的多个计算机。互联技术例如可以是InfiniBand、基于聚合以太网的远程直接内存访问(Remote Direct Memory Access over Converged Ethernet,RoCE)或传输控制协议(Transmission Control Protocol,TCP)。HPC提供了超高浮点计算能力,可用于解决计算密集型和海量数据处理等业务的计算需求。连接在一起的多个计算机的综合计算能力可以来处理大型计算问题。例如,科学研究、气象预报、金融、仿真实验、生物制药、基因测序和图像处 理等行业涉及的利用HPC集群来解决的大型计算问题和计算需求。利用HPC集群处理大型计算问题可以有效地缩短处理数据的计算时间,以及提高计算精度。
内存操作指令,可以称为内存语义或内存操作函数。内存操作指令包括内存分配(malloc)、内存设置(memset)、内存复制(memcpy)、内存移动(memmove)、内存释放(memory release)和内存比较(memcmp)中至少一种。
内存分配用于支持应用程序运行分配一段内存。
内存设置用于设置全局内存池的数据模式,例如初始化。
内存复制用于将源地址(source)指示的存储空间存储的数据复制到目的地址(destination)指示的存储空间。
内存移动用于将源地址(source)指示的存储空间存储的数据复制到目的地址(destination)指示的存储空间,并删除源地址(source)指示的存储空间存储的数据。
内存比较用于比较两个存储空间存储的数据是否相等。
内存释放用于释放内存中存储的数据,以提高***内存资源的利用率,进而提升***性能。
广播(broadcast)通信,指在计算机网络中传输数据包时,目的地址指示计算机网络中广播域的设备的一种传输方式。以广播通信方式发送的数据包可以称为广播消息。
为了解决超节点的规模无法灵活扩展,无法适应不同的应用场景下的计算需求的问题,本申请提供一种数据处理***包括通过互联设备连接的控制器和多个节点,多个节点包括第一节点和第二节点。当第二节点请求接入数据处理***时,控制器为第二节点分配第二节点标识,互联设备基于第二节点标识转发第一节点访问第二节点的消息,第二节点标识用于唯一指示第二节点。第二节点根据第一节点指示的源地址获取待处理数据,根据第一节点指示的目的地址存储处理待处理数据的处理后数据。如此,按需弹性扩展数据处理***的规模,将节点的数据处理请求扩充到***内不同节点间,使***内的加速器资源可以为多个节点共享,适应不同的应用场景下的计算需求。
图1为本申请提供的一种数据处理***的架构示意图。如图1所示,数据处理***100是一种提供高性能计算的实体。数据处理***100包括多个节点110。
节点110可以是处理器、服务器、台式计算机、存储阵列的控制器和存储器等。处理器可以是中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)、数据处理单元(data processing unit,DPU)、神经处理单元(neural processing unit,NPU)和嵌入式神经网络处理器(neural-network processing unit,NPU)等用于数据处理的XPU。例如,节点110可以包括计算节点和存储节点。
当节点110是计算能力(Computing Power)较高的GPU、DPU、NPU等数据处理的XPU时,节点110可以作为加速器,将通用处理器(如:CPU)的作业卸载到加速器,由加速器处理计算需求较高的作业(如:HPC、大数据作业、数据库作业等),解决由于通用处理器浮点算力不足,无法满足HPC、AI等场景的重浮点计算需求的问题,从而,缩短数据处理时长以及降低***能耗,提升***性能。节点的计算能力也可以称为节点的计算算力。节点110内部也可以集成加速器。独立部署的加速器和集成加速器的节点支持灵活插拔,可以按需弹性扩展数据处理***的规模,从而满足不同的应用场景下的计算需求。
多个节点110基于具有高带宽、低时延的高速互联链路连接。示例地,如图1所示,互联设备120(如:交换机)基于高速互联链路连接多个节点110。例如,互联设备120通过光纤、铜缆或铜线连接多个节点110。互联设备可称为交换芯片或互联芯片。
互联设备120基于高速互联链路连接的多个节点110组成的数据处理***100也可以称为超节点。多个超节点通过数据中心网络进行连接。数据中心网络包括多个核心交换机和多个汇聚交换机。数据中心网络可以组成一个规模域。多个超节点可以组成一个性能域。两个以上超节点可以组成宏机柜。宏机柜之间也可以基于数据中心网络连接。
如图1所示,数据处理***100还包括连接互联设备120的控制器130。控制器130基于控制平面链路与互联设备120进行通信,基于数据平面链路与多个节点110进行通信。控制器130用于控制节点接入或退出数据处理***100。
例如,控制器130为请求接入的节点分配节点标识,向数据处理***100中已接入的活跃节点(如:多个节点110)配置接入节点的节点标识和接入节点的物理地址,向互联设备配置节点标识与端口的对应关系。请求接入的节点可以简称为接入节点。可理解地,互联设备基于接入节点的节点标识与端口的对应关系向接入节点转发数据。每个活跃节点存储接入节点的全局物理地址,全局物理地址包括接入节点的节点标识和接入节点的物理地址。由于接入节点的节点标识用于唯一指示接入节点,则全局物理地址中的物理地址用于唯一指示接入节点的存储空间,即接入节点的物理地址是指接入节点内存储空间的地址,使活跃节点基于接入节点的全局物理地址访问接入节点。
需要说明的是,全局物理地址是指由数据处理***100包含的多个活跃节点中任意一个活跃节点可以访问的地址。数据处理***100包含的多个活跃节点中任意一个活跃节点存储有其他活跃节点的全局物理地址,以便于任意一个活跃节点根据其他活跃节点的全局物理地址访问其他活跃节点的存储空间。
又如,当活跃节点退出数据处理***100时,控制器130老化请求退出的节点的全局物理地址。请求退出的节点可以简称为退出节点。控制器130向互联设备120发送第一老化消息,以及广播第二老化消息。第一老化消息用于指示互联设备老化退出节点的节点标识与端口的对应关系。第二老化消息用于指示活跃节点老化退出节点的全局物理地址,即数据处理***100中每个活跃节点接收到第二老化消息,老化请求退出节点的节点标识和节点的物理地址。
示例地,当第一节点请求接入数据处理***100时,控制器130为第一节点分配用于唯一指示第一节点的第一节点标识,并对互联设备配置第一节点标识对应的第一端口的对应关系,以及对多个节点110配置第一节点的全局物理地址,第一节点的全局物理地址包括第一节点标识和第一节点的物理地址,使互联设备基于第一节点标识与第一端口的对应关系向第一节点转发数据,每个节点110存储有第一节点的全局物理地址,使每个节点110根据第一节点的全局物理地址访问第一节点的存储空间,即根据第一节点标识确定访问第一节点,根据第一节点的物理地址访问第一节点的存储空间。当第一节点退出数据处理***100时,控制互联设备老化第一节点标识与第一端口的对应关系;控制多个节点110老化第一节点标识和第一节点的物理地址。
在一种可能的示例中,节点根据None-Posted Write指令进行内存访问。当第一节点访问第一节点的内存时,None-Posted Write指令用于指示第一节点的内存的物理地址,使第一节点访问第一节点的内存。当第一节点访问***中的远端节点的内存时,None-Posted Write指令用于指示远端节点的节点标识和远端节点的物理地址,使第一节点访问远端节点的内存。
其中,数据处理***100支持运行大数据、数据库、高性能计算、人工智能、分布式存储和云原生等应用。本申请实施例中数据包括大数据、数据库、高性能计算、人工智能(Artificial Intelligence,AI)、分布式存储和云原生等应用的业务数据。在一些实施例中,控制器130可以接收用户操作客户端发送的处理请求,对处理请求指示的作业进行控制。客户端可以是指计算机,也可称为工作站(workstation)。
控制器和节点可以是独立的物理设备。控制器也可称为控制节点、控制设备或命名节点。节点可以称为计算设备或数据节点或存储节点。
在一些实施例中,数据处理***100中节点110的存储介质经过统一编址构成全局内存池,实现跨超节点内节点(简称:跨节点)的内存语义访问。全局内存池为由节点的存储介质经过统一编址构成的节点共享的资源。
本申请提供的全局内存池可以包括超节点中计算节点的存储介质和存储节点的存储介质。计算节点的存储介质包括计算节点内的本地存储介质和计算节点连接的扩展存储介质中至少一种。存储节点的存储介质包括存储节点内的本地存储介质和存储节点连接的扩展存储介质中至少一种。
例如,全局内存池包括计算节点内的本地存储介质和存储节点内的本地存储介质。
又如,全局内存池包括计算节点内的本地存储介质、计算节点连接的扩展存储介质,以及存储节点内的本地存储介质和存储节点连接的扩展存储介质中任意一种。
又如,全局内存池包括计算节点内的本地存储介质、计算节点连接的扩展存储介质、存储节点内的本地存储介质和存储节点连接的扩展存储介质。
示例地,如图2所示,为本申请提供的一种全局内存池的部署场景示意图。全局内存池200包括N个计算节点中每个计算节点内的存储介质210、N个计算节点中每个计算节点连接的扩展存储介质220、M个存储节点中每个存储节点内的存储介质230和M个存储节点中每个存储节点连接的扩展存储介质240。
应理解,全局内存池的存储容量可以包括计算节点的存储介质中的部分存储容量和存储节点的存储介质中的部分存储容量。全局内存池是经过统一编址的超节点内计算节点和存储节点均可以访问的存储介质。全局内存池的存储容量可以通过大内存、分布式数据结构、数据缓存、元数据等内存接口供计算节点或存储节点使用。计算节点运行应用程序可以使用这些内存接口对全局内存池进行内存操作。如此,基于计算节点的存储介质的存储容量和存储节点的存储介质构建的全局内存池北向提供了统一的内存接口供计算节点使用,使计算节点使用统一的内存接口将数据写入全局内存池的计算节点提供的存储空间或存储节点提供的存储空间,实现基于内存操作指令的数据的计算和存储,以及降低数据处理的时延,提升数据处理的速度。
上述是以计算节点内的存储介质和存储节点内的存储介质构建全局内存池为例进行说明。全局内存池的部署方式可以灵活多变,本申请实施例不予限定。例如,全局内存池由存储节点的存储介质构建。又如,全局内存池由计算节点的存储介质构建。使用单独的存储节点的存储介质或计算节点的存储介质构建全局内存池可以减少存储侧的存储资源的占用,以及提供更灵活的扩展方案。
依据存储介质的类型划分,本申请实施例提供的全局内存池的存储介质包括动态随机存取存储器(Dynamic Random Access Memory,DRAM)、固态驱动器(Solid State Disk或Solid State Drive,SSD)和存储级内存(storage-class-memory,SCM)。
在一些实施例中,可以根据存储介质的类型设置全局内存池,即利用一种类型的存储介质构建一种内存池,不同类型的存储介质构建不同类型的全局内存池,使全局内存池应用于不同的场景,计算节点根据应用的访问特征选择存储介质,增强了用户对***控制权限,提升了用户的***体验又扩展了***适用的应用场景。例如,将计算节点中的DRAM和存储节点中的DRAM进行统一编址构成DRAM内存池。DRAM内存池用于对访问性能要求高,数据容量适中,无数据持久化诉求的应用场景。又如,将计算节点中的SCM和存储节点中的SCM进行统一编址构成SCM内存池。SCM内存池则用于对访问性能不敏感,数据容量大,对数据持久化有诉求的应用场景。
接下来,结合图3至图6对本申请提供的数据处理方法的实施方式进行详细描述。
图3为本申请提供的一种节点接入数据处理***方法的流程示意图。在这里以扩展数据处理***100的规模,加速器140请求接入数据处理***100为例进行说明。如图3所示,该方法包括以下步骤。
步骤310、加速器140发送广播消息,广播消息用于指示对加速器140进行认证。
加速器140与数据处理***100中的互联设备120建立物理连接并上电后,加速器140请求接入数据处理***100。例如加速器140发送广播消息,即向数据处理***100中的多个节点110和互联设备120发送广播消息,请求接入数据处理***100。广播消息包括加速器140的设备标识(Device_ID)和加速器140的存储空间的物理地址。加速器140存储有设备标识,设备标识可以是加速器140出厂时预先配置的。其中,节点110接收到广播消息丢弃,控制器130接收到广播消息,根据广播消息的指示对加速器140进行认证,即执行步骤320至步骤340。
步骤320、控制器130为加速器140分配节点标识。
控制器130存储有设备标识表,设备标识表包括数据处理***100中已认证的活跃节点的设备标识。控制器130接收到广播消息,根据加速器140的设备标识查询设备标识表。如果控制器130确定设备标识表包括加速器140的设备标识,表示加速器140已接入数据处理***100,加速器140为活跃节点。如果控制器130确定设备标识表未包括与加速器140的设备标识相同的标识,表示加速器140为请求认证的节点,则控制器130更新设备标识表,即将加速器140的设备标识写入设备标识表。进而,控制器130可以为加速器140分配节点标识,加速器140的节点标识用于唯一指示加速器140,互联设备120基于加速器140的节点标识向加速器140发送数据,使节点110与加速器140进行通信。
步骤330、控制器130向互联设备120发送加速器140的节点标识与端口的对应关系。
控制器130通过连接互联设备120的光纤、铜缆或铜线等物理介质,向互联设备120发送加速器140的节点标识与端口的对应关系。
步骤340、互联设备120根据节点标识与端口的对应关系更新转发表。
转发表用于指示互联设备120根据节点标识与端口的对应关系将通信流量转发到节点标识指示的节点。转发表包括节点标识与端口的对应关系。
例如,加速器140的节点标识对应的端口可以用于指示互联设备120向加速器140转发数据。互联设备120接收到加速器140的节点标识与端口的对应关系后,更新转发表,即将加速器140的节点标识与端口的对应关系写入转发表。
在一种示例中,节点标识与端口的对应关系可以以表格的形式呈现,如表1所示。
表1
如表1所示,假设加速器140的节点标识为节点标识1,互联设备120采用端口1连接加速器140。互联设备120接收到节点标识1,根据节点标识1查询表1,确定节点标识1对应端口1,通过端口1向加速器140发送数据。
需要说明的是,表1只是以表格的形式示意对应关系在存储设备中的存储形式,并不是对该对应关系在存储设备中的存储形式的限定,当然,该对应关系在存储设备中的存储形式还可以以其他的形式存储,本实施例对此不做限定。
步骤350、控制器130向多个节点110发送加速器140的信息。
控制器130向多个节点110发送加速器140的节点标识、加速器140的设备标识和加速器140的存储空间的物理地址。多个节点110是指数据处理***100中的已认证的活跃节点(Active Nodes)。
加速器140的节点标识和加速器140的物理地址可以作为加速器140在数据处理***100中的全局物理地址。加速器140的节点标识用于唯一指示加速器140。加速器140的物理地址是指加速器140内存储空间的地址。加速器140的全局物理地址用于唯一指示数据处理***100中加速器140的存储空间。每个活跃节点存储加速器140的节点标识和加速器140的存储空间的物理地址,以便于活跃节点根据加速器140的节点标识和加速器140的物理地址对加速器140的存储空间进行读操作或写操作,即向加速器140的物理地址指示的存储空间写入数据,或者,从加速器140的物理地址指示的存储空间读取数据。
每个活跃节点根据加速器140的设备标识查询设备列表,确定加速器140的设备标识对应的软件驱动,运行加速器140的设备标识对应的软件驱动,以便于多个节点110与加速器140进行通信,实现访问加速器140的功能。
如此,相对于在数据处理***的启动阶段以静态模式进行配置,可以按需弹性扩展数据处理***的规模,支持接入加速器、增大全局内存的存储空间等,从而满足不同的应用场景下的计算需求。
在加速器140接入数据处理***100后,数据处理***100中活跃节点可以基于加速器140的节点标识和加速器140的物理地址与加速器140进行通信。假设第一节点为节点110,第二节点为加速器140,节点110请求加速器140执行加速操作。图4为本申请提供的一种数据处理方法的流程示意图。如图4所示,该方法包括以下步骤。
步骤410、节点110发送访问请求,访问请求用于指示请求加速器140执行加速操作。
访问请求可以包括加速器140的节点标识,以便于互联设备120基于加速器140的节点标识转发访问请求。
访问请求还可以包括源地址、目的地址和操作标识,以便于加速器140根据源地址获取待处理数据,处理待处理数据,以及根据目的地址存储处理后数据。其中,源地址用于指示存储待处 理数据的节点的节点标识和物理地址。目的地址用于指示存储处理后数据的节点的节点标识和物理地址。
需要说明的是,本申请实施例所述的包括节点标识和物理地址的源地址和包括节点标识和物理地址的目的地址可以是一种全局物理地址,将节点的数据处理请求扩充到***内不同节点间,根据节点标识唯一指示的节点的物理地址,获取待处理数据或存储处理后数据,从而使***内的加速器资源可以为多个节点共享,适应不同的应用场景下的计算需求。
另外,本申请实施例对存储待处理数据的节点和存储处理后数据的节点不予限定。例如,源地址包含请求执行加速操作的节点的节点标识和物理地址。目的地址包含执行加速操作的节点的节点标识和物理地址。节点110请求加速器140执行加速操作,源地址包括节点110的节点标识和节点110的物理地址,目的地址包括加速器140的节点标识和加速器140的物理地址。又如,源地址包含的物理地址可以指示任意一个节点110或加速器140。目的地址包含的物理地址可以指示任意一个节点110或加速器140。
在一些实施例中,节点110可以复用领域特定结构(Domain Specific Architecture,DSA)架构中的描述符(Descriptor)指示加速操作。示例地,如图5所示,为本申请提供的一种描述符的结构示意图。其中,描述符包括操作标识、源地址和目的地址。
源地址用于指示加速操作所使用的待处理数据的存储位置。目的地址用于指示加速操作的结果的存储位置,即处理后数据的存储位置。
例如,描述符可以是一个64字节的描述符。源地址的长度为64比特。目的地址的长度为64比特。节点标识的长度可以是12比特。物理地址的长度可以是52比特。根据节点中不同的内存配置,自适应配置描述符中节点标识的长度和物理地址的长度。内存配置包括内存的存储容量大小和内存类型。内存类型包括DRAM、SSD和SCM。
描述符还可以包括操作自定义域,操作自定义域用于指示按照不同的操作标识可自定义的操作。
节点110通过加速器驱动生成64字节描述符,并运行None-Posted Write指令,通过节点110与互联设备120的互联总线将描述符写入加速器140的请求队列。节点110以内存映射输入输出(Memory Mapped IO,MMIO)方式将读写节点110的请求队列的寄存器地址映射给加速器140,而加速器140通过None-Posted Write指令基于互联总线访问该寄存器。
另外,操作标识也可以称为操作算子。操作标识用于指示所执行的加速操作。加速操作包括以下任一项。
内存交互(SWAP):指示节点的内存之间的数据块交换。
归约(Reduce):指针对多份本地数据执行算术或逻辑计算。
广播(Broadcast):指将本地数据块广播给***中的节点。
查询(Search):指数据库的查询操作,返回匹配结果。
加密/解密(Encryption/Decryption):指数据块加解密操作。
压缩/解压缩(Compress/Decompress):指数据块压缩/解压缩操作。
步骤420、互联设备120基于加速器140的节点标识对应的端口转发访问请求。
互联设备120存储有节点标识与端口的对应关系。互联设备120接收到访问请求后,获取到加速器140的节点标识,查询转发表,根据转发表所确定的加速器140的节点标识对应的端口向加速器140转发访问请求。
步骤430、加速器140根据操作标识执行加速操作。
加速器140从请求队列中读取描述符,并解析描述符的各字段,执行操作标识指示的加速操作。
例如,加速器140根据描述符需求在加速器140的本地内存中分配一段存储空间。该段存储空间可以用于存储加速器140执行加速操作的中间数据。
又如,如果操作标识用于指示对远端节点的数据处理,加速器140驱动本地的SDMA引擎,将远端节点(如:发出访问请求的节点110)的内存的数据写入到加速器140的本地内存。
又如,如果描述符指示对加速器140的本地数据的请求,驱动数据读引擎,将网络或者本地 内存中的数据读取到加速器140的本地缓存中。
可理解地,加速器140根据源地址指示的节点标识和物理地址读取待处理数据。源地址指示的物理地址包括发出访问请求的节点110中存储空间的物理地址、***中其他的节点110中存储空间的物理地址和加速器140中存储空间的物理地址中任意一个。即,加速器140读取待处理数据的位置包括发出访问请求的节点110的存储空间、***中其他的节点110的存储空间和加速器140的存储空间中任意一个。
加速器140根据目的地址指示的节点标识和物理地址存储加速操作的处理后数据,以便于发出访问请求的节点110根据目的地址指示的物理地址读取加速操作的处理后数据。目的地址指示的物理地址包括发出访问请求的节点110中存储空间的物理地址、***中其他的节点110中存储空间的物理地址和加速器140中存储空间的物理地址中任意一个。即,加速器140将处理后数据存储到以下任意一个位置,包括发出访问请求的节点110的存储空间、***中其他的节点110的存储空间、加速器140的存储空间。
可选地,源地址指示的物理地址和目的地址指示的物理地址可以指示全局内存池中存储介质的存储空间。即加速器140可以从全局内存池中读取待处理数据或/和将处理后数据存储到全局内存池,以提升数据处理的速度。
加速器140按照描述符中描述的操作标识对本地缓存的待处理数据执行加速操作。例如,操作标识指示压缩操作,加速器140对本地缓存的待处理数据执行压缩操作。又如,操作标识指示加密操作,加速器140对本地缓存的待处理数据执行加密操作。
加速器140执行完加速操作后,释放本地缓存。加速器140还可以通过请求完成中断(Request Completion Interrupt)消息触发请求描述符节点的中断,由发出访问请求的节点110取回加速操作的结果。
从而,将通用处理器(如:CPU)的作业卸载到加速器,由加速器处理计算需求较高的作业(如:HPC、大数据作业、数据库作业等),解决由于通用处理器浮点算力不足,无法满足HPC、AI等场景的重浮点计算需求的问题,缩短数据处理时长以及降低***能耗,提升***性能。独立部署的加速器和集成加速器的节点支持灵活插拔,可以按需弹性扩展数据处理***的规模,从而满足不同的应用场景下的计算需求。
在另一些实施例中,数据处理***中的任一个活跃节点可以退出***。图6为本申请提供的一种节点退出数据处理***方法的流程示意图。在这里以加速器140请求退出数据处理***100为例进行说明。如图6所示,该方法包括以下步骤。
在一些实施例中,加速器140可以主动退出数据处理***,即加速器140执行步骤610。
步骤610、加速器140广播老化消息,老化消息用于指示加速器140退出数据处理***。执行步骤630和步骤640。
在另一些实施例中,控制器130接收到互联设备120发送的链路故障信息,或者加速器140与其他节点110的心跳消息超时,执行步骤620。
步骤620、控制器130广播老化消息,老化消息用于指示加速器140退出数据处理***。执行步骤630和步骤640。
老化消息可以包括加速器140的节点标识、加速器140的物理地址和加速器140的设备标识。
步骤630、互联设备120老化加速器140的节点标识与端口的对应关系。
例如,互联设备120接收到第一老化消息,删除转发表中加速器140的节点标识与端口对应的转发表项。
步骤640、节点110老化加速器140的节点标识和加速器140的物理地址。
例如,节点110接收到第二老化消息后,删除存储的加速器140的节点标识和物理地址和设备列表中加速器140的设备标识对应的软件驱动。
如此,在数据处理***中设置控制器和互联设备,基于节点接入机制和退出机制,可弹性增加及减少节点,实现了可弹性扩展的超节点架构,既解决了传统超节点架构无法动态扩展的问题,又避免了传统IO总线架构规模受限和带宽低问题,并支持在节点或者互联设备故障情况下的动态容错机制。
可以理解的是,为了实现上述实施例中的功能,控制器包括了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本申请中所公开的实施例描述的各示例的单元及方法步骤,本申请能够以硬件或硬件和计算机软件相结合的形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用场景和设计约束条件。
上文中结合图1至图6,详细描述了根据本实施例所提供的数据处理方法,下面将结合图7,描述根据本实施例所提供的控制装置和数据处理节点。
图7为本实施例提供的可能的控制装置的结构示意图。这些控制装置可以用于实现上述方法实施例中控制器的功能,因此也能实现上述方法实施例所具备的有益效果。在本实施例中,该控制装置可以是如图3、图4、图6所示的控制器130,还可以是应用于服务器的模块(如芯片)。
如图7所示,控制装置700包括通信模块710、控制模块720和存储模块730。控制装置700用于实现上述图3、图4、图6中所示的方法实施例中控制器130的功能。
通信模块710用于接收第二节点发送的接入请求,接入请求用于指示对第二节点进行认证。
控制模块720,用于当第二节点请求接入数据处理***时,为第二节点分配第二节点标识,第二节点标识用于唯一指示第二节点。例如,控制模块720用于执行图3中步骤320。
通信模块710,还用于向互联设备发送第二节点标识与第二端口的对应关系,第二节点标识对应的第二端口用于向第二节点转发消息。例如,控制模块720用于执行图3中步骤330。
通信模块710,还用于将第二节点的全局物理地址发送给数据处理***中的其他节点,例如,向数据处理***中的其他节点广播第二节点的节点标识和物理地址。例如,控制模块720用于执行图3中步骤350。
通信模块710,还用于向互联设备发送第一老化消息,第一老化消息用于指示互联设备老化第二节点标识与第二端口的对应关系。例如,控制模块720用于执行图6中步骤620。
通信模块710,还用于广播第二老化消息,第二老化消息用于指示老化第二节点标识和第二节点的物理地址。
存储模块730用于存储节点的节点标识,以便于控制模块720控制节点接入数据处理***和退出数据处理***。
图8为本实施例提供的可能的数据处理节点的结构示意图。这些数据处理节点可以用于实现上述方法实施例中数据处理***中节点的功能,因此也能实现上述方法实施例所具备的有益效果。在本实施例中,该数据处理节点可以是如图3、图4、图6所示的节点110或加速器140,还可以是应用于服务器的模块(如芯片)。
如图8所示,数据处理节点800包括通信模块810、数据处理模块820和存储模块830。数据处理节点800用于实现上述图3、图4、图6中所示的方法实施例中节点110或加速器140的功能。
通信模块810用于接收所述第一节点发送的访问请求,所述访问请求包括源地址、目的地址和操作标识,所述源地址用于指示存储待处理数据的节点的节点标识和物理地址,所述目的地址用于指示存储处理后数据的节点的节点标识和物理地址。
数据处理模块820,用于根据所述操作标识对所述待处理数据执行加速操作得到处理后数据,以及根据所述第一节点指示的目的地址存储所述处理后数据。例如,数据处理模块820用于执行图4中步骤430。
存储模块830用于存储内存操作指令、待处理数据或处理后数据,以便于数据处理模块820执行加速操作。
应理解的是,本申请实施例的控制装置700和数据处理节点800可以通过专用集成电路(application-specific integrated circuit,ASIC)实现,或可编程逻辑器件(programmable logic device,PLD)实现,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD),现场可编程门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。也可以通过软件实现图3、图4、图6所示的数据处理方法时,及其各个模块也可以为软件模块,控制装置700和数据处理节点800及其各个模块也可以为软件模块。
根据本申请实施例的控制装置700和数据处理节点800可对应于执行本申请实施例中描述的方法,并且控制装置700和数据处理节点800中的各个单元的上述和其它操作和/或功能分别为了 实现图3、图4、图6中的各个方法的相应流程,为了简洁,在此不再赘述。
图9为本实施例提供的一种计算设备900的结构示意图。如图所示,计算设备900包括处理器910、总线920、存储器930、通信接口940和内存单元950(也可以称为主存(main memory)单元)。处理器910、存储器930、内存单元950和通信接口940通过总线920相连。
应理解,在本实施例中,处理器910可以是CPU,该处理器910还可以是其他通用处理器、数字信号处理器(digital signal processing,DSP)、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者是任何常规的处理器等。
处理器还可以是图形处理器(graphics processing unit,GPU)、神经网络处理器(neural network processing unit,NPU)、微处理器、ASIC、或一个或多个用于控制本申请方案程序执行的集成电路。
通信接口940用于实现计算设备900与外部设备或器件的通信。在本实施例中,计算设备900用于实现图3所示的控制器130的功能时,通信接口940用于获取广播消息,处理器910为节点110分配节点标识,广播节点标识和节点的物理地址。计算设备900用于实现图6所示的控制器130的功能时,通信接口940用于广播老化消息。
总线920可以包括一通路,用于在上述组件(如处理器910、内存单元950和存储器930)之间传送信息。总线920除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都标为总线920。总线920可以是快捷***部件互连标准(Peripheral Component Interconnect Express,PCIe)总线,或扩展工业标准结构(extended industry standard architecture,EISA)总线、统一总线(unified bus,Ubus或UB)、计算机快速链接(compute express link,CXL)、缓存一致互联协议(cache coherent interconnect for accelerators,CCIX)等。总线920可以分为地址总线、数据总线、控制总线等。
作为一个示例,计算设备900可以包括多个处理器。处理器可以是一个多核(multi-CPU)处理器。这里的处理器可以指一个或多个设备、电路、和/或用于处理数据(例如计算机程序指令)的计算单元。在本实施例中,计算设备900用于实现图4所示的加速器140的功能时,处理器910还用于根据第一节点指示的源地址获取待处理数据,根据第一节点指示的操作标识对待处理数据执行加速操作得到处理后数据,以及根据第一节点指示的目的地址存储处理后数据,源地址用于指示存储待处理数据的节点的节点标识和物理地址;第二节点所述目的地址用于指示存储处理后数据的节点的节点标识和物理地址。
计算设备900用于实现图4所示的控制器130的功能时,处理器910还用于在节点请求接入数据处理***时,为所述节点分配节点标识,将所述节点的全局物理地址发送给数据处理***中的节点,向互联设备发送所述节点标识与端口的对应关系,所述节点标识对应的端口用于向所述节点转发消息,以及在所述节点退出所述数据处理***时,老化所述节点的全局物理地址和节点标识与端口的对应关系。
计算设备900用于实现图4所示的互联设备120的功能时,处理器910还用于基于节点标识与端口的对应关系转发节点间进行通信的数据。
值得说明的是,图9中仅以计算设备900包括1个处理器910和1个存储器930为例,此处,处理器910和存储器930分别用于指示一类器件或设备,具体实施例中,可以根据业务需求确定每种类型的器件或设备的数量。
内存单元950可以对应上述方法实施例中用于存储待处理数据和处理后数据等信息的全局内存池。内存单元950可以是易失性存储器池或非易失性存储器池,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率 同步动态随机存取存储器(double data date SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。
存储器930可以对应上述方法实施例中用于存储计算机指令、内存操作指令、节点标识等信息的存储介质,例如,磁盘,如机械硬盘或固态硬盘。
上述计算设备900可以是一个通用设备或者是一个专用设备。例如,计算设备900可以是边缘设备(例如,携带具有处理能力芯片的盒子)等。可选地,计算设备900也可以是服务器或其他具有计算能力的设备。
应理解,根据本实施例的计算设备900可对应于本实施例中的控制装置700和数据处理节点800,并可以对应于执行根据图3、图4、图6中任一方法中的相应主体,并且控制装置700和数据处理节点800中的各个模块的上述和其它操作和/或功能分别为了实现图3、图4、图6中的各个方法的相应流程,为了简洁,在此不再赘述。
本实施例中的方法步骤可以通过硬件的方式来实现,也可以由处理器执行软件指令的方式来实现。软件指令可以由相应的软件模块组成,软件模块可以被存放于随机存取存储器(random access memory,RAM)、闪存、只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)、寄存器、硬盘、移动硬盘、CD-ROM或者本领域熟知的任何其它形式的存储介质中。一种示例性的存储介质耦合至处理器,从而使处理器能够从该存储介质读取信息,且可向该存储介质写入信息。当然,存储介质也可以是处理器的组成部分。处理器和存储介质可以位于ASIC中。另外,该ASIC可以位于计算设备中。当然,处理器和存储介质也可以作为分立组件存在于计算设备中。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机程序或指令。在计算机上加载和执行所述计算机程序或指令时,全部或部分地执行本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、网络设备、用户设备或者其它可编程装置。所述计算机程序或指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机程序或指令可以从一个网站站点、计算机、服务器或数据中心通过有线或无线方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是集成一个或多个可用介质的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,例如,软盘、硬盘、磁带;也可以是光介质,例如,数字视频光盘(digital video disc,DVD);还可以是半导体介质,例如,固态硬盘(solid state drive,SSD)。以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (25)

  1. 一种数据处理***,其特征在于,所述数据处理***包括多个节点和控制器,所述多个节点包括第一节点和第二节点,所述多个节点和所述控制器通过高速互联链路连接;
    所述控制器,用于当所述第二节点请求接入所述数据处理***时,为所述第二节点分配第二节点标识,其中,所述第二节点在所述数据处理***中的全局物理地址为所述第二节点的节点标识和所述第二节点的物理地址;
    所述控制器,还用于将所述第二节点的全局物理地址发送给所述第一节点。
  2. 根据权利要求1所述的***,其特征在于,所述数据处理***还包括互联设备,所述互联设备基于高速互联链路连接所述多个节点;
    所述控制器,还用于向所述互联设备发送所述第二节点标识与端口的对应关系,所述第二节点标识对应的端口用于向所述第二节点转发消息。
  3. 根据权利要求2所述的***,其特征在于,
    所述互联设备用于基于所述第二节点标识与端口的对应关系转发所述第一节点访问所述第二节点的消息,所述第二节点标识用于唯一指示所述第二节点。
  4. 根据权利要求1-3中任一项所述的***,其特征在于,所述多个节点的存储介质经过统一编址构成全局内存池。
  5. 根据权利要求1-4中任一项所述的***,其特征在于,
    所述第二节点用于根据所述第一节点指示的源地址获取待处理数据,所述源地址用于指示存储所述待处理数据的节点的节点标识和物理地址;
    所述第二节点还用于处理所述待处理数据,以及根据所述第一节点指示的目的地址存储处理后数据,所述目的地址用于指示存储所述处理后数据的节点的节点标识和物理地址。
  6. 根据权利要求5所述的***,其特征在于,所述多个节点的存储介质构成的全局内存池包括所述源地址指示的存储空间或/和所述目的地址指示的存储空间。
  7. 根据权利要求5或6所述的***,其特征在于,所述第二节点具体用于根据所述第一节点指示的操作标识对所述待处理数据执行加速操作得到所述处理后数据。
  8. 根据权利要求5-7中任一项所述的***,其特征在于,
    所述第一节点还用于根据所述第一节点的物理地址访问所述第一节点的存储空间。
  9. 根据权利要求1-8中任一项所述的***,其特征在于,所述第二节点包括处理器、加速器和内存控制器中任一个。
  10. 根据权利要求1-9所述的***,其特征在于,
    所述控制器还用于当所述第二节点退出所述数据处理***时,控制所述第一节点老化所述第二节点的全局物理地址,以及控制互联设备老化所述第二节点标识与端口的对应关系。
  11. 一种数据处理方法,其特征在于,数据处理***包括多个节点,所述多个节点包括第一节点和第二节点,所述多个节点和所述控制器通过高速互联链路连接;所述方法包括:
    当所述第二节点请求接入所述数据处理***时,所述控制器为所述第二节点分配第二节点标识,其中,所述第二节点在所述数据处理***中的全局物理地址为所述第二节点的节点标识和所述第二节点的物理地址;
    所述控制器将所述第二节点的全局物理地址发送给所述第一节点。
  12. 根据权利要求11所述的方法,其特征在于,所述数据处理***还包括互联设备,所述互联设备基于高速互联链路连接所述多个节点;所述方法还包括:
    所述控制器向所述互联设备发送所述第二节点标识与端口的对应关系,所述第二节点标识对应的端口用于向所述第二节点转发消息。
  13. 根据权利要求12所述的方法,其特征在于,所述方法还包括:
    所述互联设备基于所述第二节点标识与端口的对应关系转发所述第一节点访问所述第二节点的消息,所述第二节点标识用于唯一指示所述第二节点。
  14. 根据权利要求11-13中任一项所述的方法,其特征在于,所述多个节点的存储介质经过统一编址构成全局内存池。
  15. 根据权利要求11-14中任一项所述的方法,其特征在于,所述方法还包括:
    所述第二节点根据所述第一节点指示的源地址获取待处理数据,所述源地址用于指示存储所述待处理数据的节点的节点标识和物理地址;
    所述第二节点处理所述待处理数据,以及根据所述第一节点指示的目的地址存储处理后数据,所述目的地址用于指示存储所述处理后数据的节点的节点标识和物理地址。
  16. 根据权利要求15所述的方法,其特征在于,所述多个节点的存储介质构成的全局内存池包括所述源地址指示的存储空间或/和所述目的地址指示的存储空间。
  17. 根据权利要求15或16所述的方法,其特征在于,所述第二节点处理所述待处理数据包括:
    所述第二节点根据所述第一节点指示的操作标识对所述待处理数据执行加速操作得到所述处理后数据。
  18. 根据权利要求15-17中任一项所述的方法,其特征在于,所述方法还包括:
    所述第一节点根据所述第一节点的物理地址访问所述第一节点的存储空间。
  19. 根据权利要求11-18中任一项所述的方法,其特征在于,所述第二节点包括处理器、加速器和内存控制器中任一个。
  20. 根据权利要求11-19所述的方法,其特征在于,所述方法还包括:
    当所述第二节点退出所述数据处理***时,所述控制器控制所述第一节点老化所述第二节点的全局物理地址,以及控制互联设备老化所述第二节点标识与端口的对应关系。
  21. 一种控制装置,其特征在于,所述控制装置应用于数据处理***,所述数据处理***包括基于高速互联技术连接的多个节点,所述多个节点包括第一节点和第二节点,所述装置包括:
    控制模块,用于当所述第二节点请求接入所述数据处理***时,为所述第二节点分配第二节点标识,其中,所述第二节点标识用于唯一指示所述第二节点,所述第二节点在所述数据处理***中的全局物理地址为所述第二节点的节点标识和所述第二节点的物理地址;
    所述控制模块,还用于将所述第二节点的全局物理地址发送给所述第一节点。
  22. 根据权利要求21所述的装置,其特征在于,所述数据处理***还包括互联设备,所述互联设备基于高速互联链路连接所述多个节点;所述装置还包括通信模块;
    所述通信模块,还用于向所述互联设备发送所述第二节点标识与端口的对应关系,所述第二节点标识对应的端口用于向所述第二节点转发消息。
  23. 根据权利要求22所述的装置,其特征在于,
    所述通信模块,还用于当所述第二节点退出所述数据处理***时,向所述互联设备发送第一老化消息,所述第一老化消息用于指示所述互联设备老化所述第二节点标识与端口的对应关系;
    所述通信模块,还用于向所述第一节点发送第二老化消息,所述第二老化消息用于指示所述第二节点的全局物理地址。
  24. 一种数据处理节点,其特征在于,所述数据处理节点为数据处理***中基于高速互联链路连接的多个节点中的其中一个节点,所述装置包括:
    通信模块,用于接收所述多个节点中的其他节点发送的访问请求,所述访问请求包括源地址、目的地址和操作标识,所述源地址用于指示存储待处理数据的节点的节点标识和物理地址,所述目的地址用于指示存储处理后数据的节点的节点标识和物理地址;
    数据处理模块,用于根据所述操作标识对所述待处理数据执行加速操作得到处理后数据,以及根据所述目的地址存储所述处理后数据。
  25. 一种控制器,其特征在于,所述控制器包括存储器和至少一个处理器,所述存储器用于存储一组计算机指令;当所述处理器执行所述一组计算机指令时,控制器执行如权利要求11、12和20中任一所述的方法。
PCT/CN2023/101171 2022-06-27 2023-06-19 数据处理***、方法、装置和控制器 WO2024001850A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210733448 2022-06-27
CN202210733448.9 2022-06-27
CN202211260921.2 2022-10-14
CN202211260921.2A CN117312224A (zh) 2022-06-27 2022-10-14 数据处理***、方法、装置和控制器

Publications (1)

Publication Number Publication Date
WO2024001850A1 true WO2024001850A1 (zh) 2024-01-04

Family

ID=89285447

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101171 WO2024001850A1 (zh) 2022-06-27 2023-06-19 数据处理***、方法、装置和控制器

Country Status (2)

Country Link
CN (1) CN117312224A (zh)
WO (1) WO2024001850A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200183854A1 (en) * 2018-12-10 2020-06-11 International Business Machines Corporation Identifying location of data granules in global virtual address space
CN111654519A (zh) * 2017-09-06 2020-09-11 华为技术有限公司 用于传输数据处理请求的方法和装置
CN112165505A (zh) * 2020-08-21 2021-01-01 杭州安恒信息技术股份有限公司 去中心化的数据处理方法、电子装置和存储介质
CN113157611A (zh) * 2021-02-26 2021-07-23 山东英信计算机技术有限公司 一种数据传输控制方法、装置、设备及可读存储介质
CN113568562A (zh) * 2020-04-28 2021-10-29 华为技术有限公司 一种存储***、内存管理方法和管理节点

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111654519A (zh) * 2017-09-06 2020-09-11 华为技术有限公司 用于传输数据处理请求的方法和装置
US20200183854A1 (en) * 2018-12-10 2020-06-11 International Business Machines Corporation Identifying location of data granules in global virtual address space
CN113568562A (zh) * 2020-04-28 2021-10-29 华为技术有限公司 一种存储***、内存管理方法和管理节点
CN112165505A (zh) * 2020-08-21 2021-01-01 杭州安恒信息技术股份有限公司 去中心化的数据处理方法、电子装置和存储介质
CN113157611A (zh) * 2021-02-26 2021-07-23 山东英信计算机技术有限公司 一种数据传输控制方法、装置、设备及可读存储介质

Also Published As

Publication number Publication date
CN117312224A (zh) 2023-12-29

Similar Documents

Publication Publication Date Title
US11929927B2 (en) Network interface for data transport in heterogeneous computing environments
US10365830B2 (en) Method, device, and system for implementing hardware acceleration processing
US9977618B2 (en) Pooling of memory resources across multiple nodes
CN115543204A (zh) 用于针对加速器板提供共享存储器的技术
US10248346B2 (en) Modular architecture for extreme-scale distributed processing applications
CN117348976A (zh) 用于流处理的数据处理单元
JP2022526660A (ja) コヒーレントアクセラレーションのためのドメイン支援プロセッサピア
US11169846B2 (en) System and method for managing tasks and task workload items between address spaces and logical partitions
WO2013082809A1 (zh) 协处理加速方法、装置及***
WO2019233322A1 (zh) 资源池的管理方法、装置、资源池控制单元和通信设备
US10404800B2 (en) Caching network fabric for high performance computing
US10873630B2 (en) Server architecture having dedicated compute resources for processing infrastructure-related workloads
US20240061802A1 (en) Data Transmission Method, Data Processing Method, and Related Product
WO2024082985A1 (zh) 一种安装有加速器的卸载卡
WO2023104194A1 (zh) 一种业务处理方法及装置
US11029847B2 (en) Method and system for shared direct access storage
WO2024051292A1 (zh) 数据处理***、内存镜像方法、装置和计算设备
WO2024037239A1 (zh) 一种加速器调度方法及相关装置
WO2024001850A1 (zh) 数据处理***、方法、装置和控制器
TW202416145A (zh) 用於控制池化記憶體裝置或記憶體擴展器的設備和方法
US20240061805A1 (en) Host endpoint adaptive compute composability

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23830030

Country of ref document: EP

Kind code of ref document: A1