CN116610630B - Multi-core system and data transmission method based on network-on-chip - Google Patents

Multi-core system and data transmission method based on network-on-chip

Publication number: CN116610630B
Application number: CN202310870019.0A
Authority: CN (China)
Other versions: CN116610630A (in Chinese)
Inventor: Zhou Huamin (周华民)
Assignee: Shanghai Xinfeng Microelectronics Co., Ltd.
Legal status: Active (granted)


Classifications

    • G06F15/7825 — System on chip: globally asynchronous, locally synchronous, e.g. network on chip
    • G06F15/17306 — Interprocessor communication using an interconnection network: intercommunication techniques
    • G06F15/17337 — Interprocessor communication using an interconnection network: direct connection machines, e.g. completely connected computers, point-to-point communication networks
    • G06F15/781 — System on chip: on-chip cache; off-chip memory
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present disclosure provides a network-on-chip-based multi-core system and a data transmission method. The multi-core system includes a plurality of network-on-chip units, each of the network-on-chip units including: a router; a processor connected to the router; and a memory module connected to the router, the memory module including an on-chip memory and a direct memory accessor connected to each other. Different network-on-chip units are connected to one another through their routers, and the direct memory accessor carries out data handling between different storage spaces in the multi-core system through the routers. In embodiments of the disclosure, the processor can obtain the operation data it needs by accessing the on-chip memory in its own network-on-chip unit, which shortens the processor's access distance and alleviates the access-latency problem.

Description

Multi-core system and data transmission method based on network-on-chip
Technical Field
The present disclosure relates to the field of semiconductor technologies, and in particular, to a network-on-chip-based multi-core system and a data transmission method.
Background
As chip integration levels rise and ever more processors and other hardware modules are connected within a chip, the communication bandwidth required between modules grows, and the on-chip bus becomes heavily congested. Network-on-Chip (NOC) is an interconnection structure with good scalability. The NOC architecture takes its inspiration from computer networks: a network-like fabric is implemented on the chip, each module is connected to a router, and data are transmitted as packets that travel through the routers to their destination modules.
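For illustration, the packet-based transmission described above can be sketched in C. The header fields, widths, and names below are hypothetical and are not part of the disclosed design; the point is only that each router inspects destination coordinates to forward a packet toward its target module:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of a NOC packet header (illustrative only). */
typedef struct {
    uint8_t  dst_x, dst_y;   /* mesh coordinates of the target module */
    uint8_t  src_x, src_y;   /* mesh coordinates of the sender        */
    uint16_t payload_len;    /* bytes of data carried by the packet   */
} noc_pkt_hdr_t;

/* A router delivers the packet to its local module once the destination
 * coordinates match its own; otherwise it forwards the packet onward. */
static int is_local(const noc_pkt_hdr_t *h, uint8_t my_x, uint8_t my_y) {
    return h->dst_x == my_x && h->dst_y == my_y;
}
```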
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a multi-core system and a data transmission method based on a network on chip.
In a first aspect, an embodiment of the present disclosure provides a network-on-chip-based multi-core system, the multi-core system including a plurality of network-on-chip units, each of the network-on-chip units including:
a router;
a processor connected to the router;
a memory module connected to the router, the memory module comprising an on-chip memory and a direct memory accessor connected to each other; wherein different network-on-chip units are connected to one another through their routers, and the direct memory accessor carries out data handling between different storage spaces in the multi-core system through the routers.
In some embodiments, the processor is configured to: transmit a data handling instruction;
the direct memory accessor is configured to: receive the data handling instruction, read data from a first storage space in the multi-core system, and write the read data into a second storage space in the multi-core system.
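As an illustrative sketch (not the disclosed hardware), such a data handling instruction can be modeled as a descriptor naming the two storage spaces; all type names, field names, and widths below are hypothetical, and the transfer is modeled in software by a plain copy:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor for one "data handling" request: the processor
 * fills it in and hands it to the DMA engine, which then moves the data
 * between the two storage spaces through the routers. */
typedef struct {
    uint64_t src_addr;   /* first storage space (read side)   */
    uint64_t dst_addr;   /* second storage space (write side) */
    size_t   len;        /* bytes to carry                    */
} dma_desc_t;

/* Software model of the transfer: read from the source space, write to
 * the destination space. In hardware this is done by the DMA engine,
 * freeing the processor for other work. */
static void dma_execute(const dma_desc_t *d,
                        uint8_t *mem_src, uint8_t *mem_dst) {
    memcpy(mem_dst + d->dst_addr, mem_src + d->src_addr, d->len);
}
```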
In some embodiments, the multi-core system further comprises off-chip memory;
Data handling between different memory spaces within the multi-core system includes at least one of:
data handling between on-chip memories within different ones of the network-on-chip units;
data handling between on-chip memory within the network-on-chip unit and the off-chip memory.
In some embodiments, the on-chip memory is configured to: storing first operation data;
the processor is configured to: acquire the first operation data from the on-chip memory of the same network-on-chip unit and perform a data operation according to the first operation data.
In some embodiments, the direct memory accessor is configured to: carry second operation data from the on-chip memories of network-on-chip units other than the one to which it belongs into the on-chip memory of its own network-on-chip unit;
the on-chip memory is further configured to: store the second operation data;
the processor is further configured to: acquire the second operation data from the on-chip memory of the same network-on-chip unit and perform a data operation according to the second operation data;
wherein the process in which the processor of the same network-on-chip unit acquires the first operation data and performs a data operation according to it overlaps in time with the process in which the direct memory accessor carries the second operation data.
In some embodiments, the multi-core system further comprises off-chip memory;
the direct memory accessor is configured to: carry third operation data from the off-chip memory into the on-chip memory of the network-on-chip unit to which it belongs;
the on-chip memory is further configured to: store the third operation data;
the processor is further configured to: acquire the third operation data from the on-chip memory of the same network-on-chip unit and perform a data operation according to the third operation data;
wherein the process in which the processor of the same network-on-chip unit acquires the first operation data and performs a data operation according to it overlaps in time with the process in which the direct memory accessor carries the third operation data.
In some embodiments, the on-chip memory comprises dynamic random access memory, DRAM.
In a second aspect, an embodiment of the present disclosure provides a data transmission method applied to a network-on-chip-based multi-core system, the multi-core system including a plurality of network-on-chip units, each network-on-chip unit including: a router; a processor connected to the router; and a memory module connected to the router, the memory module comprising an on-chip memory and a direct memory accessor connected to each other; wherein different network-on-chip units are connected to one another through their routers.
The data transmission method comprises the following steps:
the processor sends a data handling instruction;
the direct memory accessor receives the data handling instruction and carries out data handling between different storage spaces in the multi-core system through the router.
In some embodiments, the data transmission method includes:
the on-chip memory stores first operation data;
the processor acquires the first operation data from the on-chip memory of the same network-on-chip unit and performs a data operation according to the first operation data.
In some embodiments, the data transmission method further comprises:
the direct memory accessor carries second operation data from the on-chip memories of network-on-chip units other than the one to which it belongs into the on-chip memory of its own network-on-chip unit;
the on-chip memory stores the second operation data;
the processor acquires the second operation data from the on-chip memory of the same network-on-chip unit and performs a data operation according to the second operation data;
wherein the process in which the processor of the same network-on-chip unit acquires the first operation data and performs a data operation according to it overlaps in time with the process in which the direct memory accessor carries the second operation data.
In some embodiments, the multi-core system further comprises off-chip memory; the data transmission method further comprises the following steps:
the direct memory accessor carries third operation data from the off-chip memory into the on-chip memory of the network-on-chip unit to which it belongs;
the on-chip memory stores the third operation data;
the processor acquires the third operation data from the on-chip memory of the same network-on-chip unit and performs a data operation according to the third operation data;
wherein the process in which the processor of the same network-on-chip unit acquires the first operation data and performs a data operation according to it overlaps in time with the process in which the direct memory accessor carries the third operation data.
The embodiments of the present disclosure provide a network-on-chip-based multi-core system and a data transmission method. In the embodiments of the disclosure, the multi-core system includes a plurality of network-on-chip units, each comprising a router, a processor connected to the router, and a memory module connected to the router, the memory module including an on-chip memory and a direct memory accessor connected to each other. The on-chip memory provides a large-capacity cache, and the direct memory accessor carries out data handling between different storage spaces in the multi-core system, so that a processor can obtain the operation data it needs by accessing the on-chip memory in its own network-on-chip unit, which shortens the processor's access distance and alleviates the access-latency problem.
Drawings
FIG. 1 is a graph of performance versus year for a processor, microprocessor, and dynamic random access memory;
FIG. 2 is a schematic diagram of a chip with an extensible mesh connection architecture according to some embodiments;
FIG. 3 is a schematic diagram of a multi-core system provided by some embodiments;
FIG. 4 is a schematic diagram of a multi-core system according to an embodiment of the disclosure;
FIG. 5 is a flowchart of a data transmission method according to an embodiment of the disclosure.
Detailed Description
The following is a clear and complete description of the embodiments of the present disclosure with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments that one of ordinary skill in the art can obtain from the embodiments in this disclosure without inventive effort are intended to fall within the scope of this disclosure.
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present disclosure. However, it will be apparent to one skilled in the art that the present disclosure may be practiced without one or more of these details. In other instances, well-known features have not been described in order to avoid obscuring the present disclosure; that is, not all features of an actual implementation are described in detail herein, and well-known functions and constructions are not described in detail.
In the drawings, the size of layers, regions, elements and their relative sizes may be exaggerated for clarity. Like numbers refer to like elements throughout.
It will be understood that when an element or layer is referred to as being "on," "adjacent to," "connected to," or "coupled to" another element or layer, it can be directly on, adjacent to, connected to, or coupled to the other element or layer, or intervening elements or layers may be present. In contrast, when an element is referred to as being "directly on," "directly adjacent to," "directly connected to," or "directly coupled to" another element or layer, there are no intervening elements or layers present. It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another. Thus, a first element, component, region, layer or section discussed below could be termed a second element, component, region, layer or section without departing from the teachings of the present disclosure. When a second element, component, region, layer or section is discussed, it does not necessarily mean that a first element, component, region, layer or section is present in the present disclosure.
Spatially relative terms, such as "under," "beneath," "below," "over," "above," and the like, may be used herein for ease of description to describe one element's or feature's relationship to another element or feature as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use and operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements or features described as "under" or "beneath" other elements would then be oriented "above" the other elements or features. Thus, the exemplary terms "under" and "beneath" can encompass both an orientation of above and below. The device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of the associated listed items.
For a thorough understanding of the present disclosure, detailed steps and detailed structures will be presented in the following description in order to illustrate the technical aspects of the present disclosure. Preferred embodiments of the present disclosure are described in detail below, however, the present disclosure may have other implementations in addition to these detailed descriptions.
Before explaining the embodiments of the present disclosure in detail, terms and terminology that may be involved in the embodiments of the present disclosure are explained.
Dynamic random access memory (DRAM) stores 1 bit of data using 1 transistor and 1 capacitor (1T1C) and must be refreshed periodically during use to retain its contents.
Low-power double data rate synchronous dynamic random access memory (LPDDR) transfers data on both the rising and falling edges of each clock cycle, giving a high data transfer rate.
Static random access memory (SRAM) typically stores 1 bit of data in 6 transistors and does not need periodic refreshing to retain its contents during use.
A three-dimensional integrated circuit (3D IC) packages various heterogeneous chips into the same packaged device by vertical stacking and bump bonding, achieving advantages in performance, power consumption, and area.
A through-silicon via (TSV) is a circuit interconnect technology that interconnects chips by making vertical vias between chips and between wafers. Through-silicon vias maximize chip stacking density in three dimensions, minimize physical size, and greatly improve chip speed and performance.
Distributed shared memory (DSM): in a distributed operating system, multiple nodes can each access a block of address space on other nodes, referred to as shared memory. Distributed shared memory facilitates sharing among nodes and effectively improves application performance and reusability.
NOC systems include a variety of architectural approaches. A conventional two-dimensional mesh (2D Mesh) architecture includes resource nodes, communication nodes (i.e., routing nodes), resource network interfaces, and channels. Resource nodes comprise computing nodes and storage nodes: a computing node contains a processor, and a storage node contains a memory. Communication nodes carry out data communication tasks between different resource nodes, converting communication between resource nodes into communication between routing nodes. The resource network interface serves as the interface between a routing node and its local resource node; for a given routing node, the resource node directly connected to it (i.e., the resource node closest to it) is its local resource node. A channel is a bidirectional metal link that carries data between nodes. Channels are divided into internal channels (the metal link between a communication node and its local resource node) and external channels (the metal links between communication nodes).
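One common routing policy for such a 2D mesh is dimension-order (XY) routing: a packet first travels along the X dimension until its column matches the destination, then along Y. This is an illustrative example of mesh routing, not a policy specified by the disclosure, and the names below are hypothetical:

```c
#include <assert.h>

typedef enum { GO_EAST, GO_WEST, GO_NORTH, GO_SOUTH, ARRIVED } hop_t;

/* Dimension-order (XY) routing for a 2D mesh: correct X first, then Y.
 * Routing X strictly before Y makes the policy deadlock-free on a mesh. */
static hop_t next_hop(int cur_x, int cur_y, int dst_x, int dst_y) {
    if (cur_x < dst_x) return GO_EAST;
    if (cur_x > dst_x) return GO_WEST;
    if (cur_y < dst_y) return GO_NORTH;
    if (cur_y > dst_y) return GO_SOUTH;
    return ARRIVED;  /* packet is at its destination router */
}
```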
The technical scheme of the present disclosure will be described in detail with reference to examples and drawings.
With the increase in the computing power of processing units such as central processing units (CPUs), neural network processors (NPUs), and graphics processing units (GPUs), compute capability and memory have become progressively mismatched. Referring to FIG. 1, FIG. 1 is a graph of performance versus year for processors, microprocessors, and dynamic random access memory. As shown in FIG. 1, the performance of processors, microprocessors, and DRAM has all increased over the years. However, processor and microprocessor performance has grown far faster than DRAM performance. There is therefore a large rate gap between the processor (or microprocessor) and DRAM: if the data needed by each instruction had to be fetched from DRAM outside the NOC chip, processor performance would drop significantly.
The NOC architecture is mainly applied in many-core scenarios, which require high memory bandwidth and low latency. To address this problem, common practice is to build a deeper memory hierarchy between the CPU and the DRAM outside the NOC chip.
In many-core chips, memory bandwidth and data access latency limit chip performance. The NOC bus is the interconnect most commonly used in many-core applications, most implementations being the Coherent Mesh Network (CMN) 600/700 developed by Arm. Currently, the processors of many high-performance computing (HPC) systems employ the CMN-600/700 interconnect bus.
Since DRAM is typically attached at the outermost periphery of the NOC chip, the latency for processors in a NOC architecture to access DRAM is greatly increased, and bandwidth is also limited by the network throughput of the NOC architecture. Common practice is to add multiple levels of large-capacity SRAM as caches between the processor and DRAM to alleviate this problem. Arm, for example, incorporates a large amount of SRAM into the NOC architecture as a system level cache (SLC), thereby improving bandwidth and latency and thus processor performance.
Referring to FIG. 2, FIG. 2 is a schematic structural diagram of a chip with a scalable mesh connection architecture according to some embodiments. As shown in FIG. 2, the chip 100 with a scalable coherency fabric (SCF) may be a Grace chip developed by NVIDIA. The chip 100 comprises: a cache switch node (CSN) 101, a processor core (Core) 102, and a scalable coherent cache (SCC) 103. The processor core 102 is connected to the cache switch node 101, and the scalable coherent cache 103, which serves as a cache partition of the chip, is also connected to the cache switch node 101; the cache switch node 101 acts as the interface between the processor core, the cache, and the rest of the chip.
The chip 100 further includes a first communication interface 104 and a second communication interface 105. The first communication interface 104 may be used for chip-to-chip data communication; the second communication interface 105 may use the NVLink bus and communication protocol developed by NVIDIA for connecting a CPU to a GPU or for interconnecting multiple GPUs. The chip 100 further includes an external memory 106 connected to the cache switch node 101; the processor core 102 may also retrieve data from the external memory 106 via the cache switch node 101.
In a specific example, the SCC may include SRAM; the external memory may include an LPDDR.
It should be noted that although SRAM can be used as cache memory in a NOC chip, its storage capacity is small and its cost is high. Moreover, the larger the SRAM capacity, the larger its area; in view of area and cost, it is therefore difficult to improve chip performance by increasing the SRAM capacity.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a multi-core system provided in some embodiments. As shown in fig. 3, the multi-core system 20 includes: a plurality of interconnected routers 210; a plurality of processors 220, each processor 220 being coupled to a corresponding router 210; a plurality of system level caches 230, each system level cache 230 being coupled to a corresponding router 210. As shown by the dashed boxes in fig. 3, multiple routers 210, multiple processors 220, and multiple system level caches 230 are all located on the same NOC chip.
In some embodiments, the processor may be an NPU, a processor designed for neural-network computation that adopts a "data-driven parallel computing" architecture and excels at processing massive multimedia data such as video and images. Of course, the processor may also be a CPU, a GPU, or the like.
As also shown in FIG. 3, the multi-core system 20 further includes: an off-chip memory controller 241 connected to a router 210, and a memory device 240 located outside the NOC chip. The off-chip memory comprises the memory device 240 and the off-chip memory controller 241 coupled to it; the off-chip memory controller 241 controls the memory device 240.
In a specific example, the system level cache 230 may comprise SRAM; the off-chip memory may include DRAM. Although each processor illustrated in fig. 3 is connected to a corresponding system level cache (e.g., SRAM), the memory capacity of the SRAM is limited and the processor still needs to retrieve the operational data from the DRAM external to the NOC chip.
In the multi-core system illustrated in fig. 3, the processor 220 has three data transmission routes, namely, a first data transmission route, a second data transmission route, and a third data transmission route. In the first data transmission path, the processor 220 may directly obtain the operation data from the SRAM corresponding to the processor 220 through the router 210 connected thereto, and perform the data operation according to the operation data. In the second data transmission path, the processor 220 may obtain operation data from the SRAM corresponding to the other processor through the plurality of routers 210 connected to each other, and perform data operation according to the operation data. In the third data transmission path, the processor 220 may acquire operation data from an off-chip memory (e.g., DRAM) through the router 210 and other devices, and perform data operations according to the operation data.
Compared with the first data transmission route, in which the processor obtains operation data directly from its corresponding SRAM, the second route must traverse more routers to obtain operation data from the SRAM corresponding to another processor. The second route therefore takes more time than the first, and its power consumption is also greater.
In the third data transmission route, obtaining operation data from off-chip memory (e.g., DRAM) requires still more routers and other devices than obtaining data from SRAM inside the NOC chip via the first or second route. The third route therefore takes more time than either the first or the second route, and its power consumption is likewise greater than that of the first and second routes.
It should be noted that the processor preferentially acquires operation data from its corresponding SRAM; if that acquisition fails, it attempts to acquire the data from the SRAMs corresponding to other processors inside the NOC chip; and if that fails as well, it acquires the data from off-chip memory (e.g., DRAM). That is, the multi-core system preferentially takes the first data transmission route, falls back to the second route when the first fails, and takes the third route only when the second also fails.
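The three-route priority described above can be sketched as a simple tiered lookup. The sketch is illustrative only: the enum and function names are hypothetical, and the residency flags stand in for the hit/miss outcome of each tier:

```c
#include <assert.h>

typedef enum {
    ROUTE_FAIL        = 0,
    ROUTE_LOCAL_SRAM  = 1,  /* first route: processor's own SRAM        */
    ROUTE_REMOTE_SRAM = 2,  /* second route: SRAM of another unit       */
    ROUTE_OFF_CHIP    = 3   /* third route: off-chip memory (e.g. DRAM) */
} route_t;

/* Model of the fallback order: each flag says whether the data is
 * resident in that tier; the first tier that holds it wins. */
static route_t select_route(int in_local, int in_remote, int in_off_chip) {
    if (in_local)    return ROUTE_LOCAL_SRAM;
    if (in_remote)   return ROUTE_REMOTE_SRAM;
    if (in_off_chip) return ROUTE_OFF_CHIP;
    return ROUTE_FAIL;
}
```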
In general, in a case where a single processor cannot acquire required operation data from its corresponding SRAM, the processor needs to acquire the required operation data from the SRAM corresponding to other processors via a plurality of routers, and even needs to acquire the required operation data from off-chip memory (e.g., DRAM), which certainly lengthens the access distance of the processor, resulting in access delay.
The process by which the processor acquires the required operation data from the SRAM (or DRAM) is a process of reading the operation data from the SRAM (or DRAM), so the time taken to acquire the required operation data is long and the acquisition efficiency is low. Further, the processor cannot perform other data operations while it acquires the required operation data from the SRAM (or DRAM). In other words, the process of the processor acquiring the required operation data from the SRAM (or DRAM) and the process of the processor performing other data operations cannot be performed in parallel.
In view of this, the embodiments of the present disclosure provide a multi-core system and a data transmission method based on a network on chip. In the embodiment of the disclosure, the on-chip memory is utilized to realize high-capacity cache, and the direct memory accessor is utilized to realize data handling among different storage spaces in the multi-core system, so that a processor can obtain required operation data by accessing the on-chip memory in the same network unit on the same chip, thereby shortening the access distance of the processor and solving the access delay problem.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a multi-core system according to an embodiment of the disclosure. As shown in fig. 4, the embodiment of the present disclosure provides a multi-core system 30 including a plurality of network-on-chip units 300, each network-on-chip unit 300 including:
a Router (Router) 310;
a processor (Processing Element, PE) 320 connected to the router 310;
a memory module 330 coupled to the router 310, the memory module 330 comprising an on-chip memory and a direct memory accessor (Direct Memory Access, DMA) 332 connected to each other; wherein different network-on-chip units 300 are connected to each other through the routers 310 in the network-on-chip units 300; the direct memory accessor 332 enables data handling between different storage spaces within the multi-core system 30 through the router 310.
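As a rough illustration of this composition (not an implementation of the disclosed hardware), one network-on-chip unit can be modeled with hypothetical classes; every name below is an assumption:

```python
class MemoryModule:
    """Memory module 330: on-chip memory and DMA connected to each other."""
    def __init__(self):
        self.on_chip_memory = {}      # stands in for the storage device
        self.dma_busy = False         # stands in for the direct memory accessor

class NetworkOnChipUnit:
    """One unit 300: router, processor, and memory module."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.processor = f"PE{node_id}"   # placeholder processing element
        self.memory_module = MemoryModule()
        self.peers = []                   # routers of other units

    def connect(self, other):
        """Units connect to each other through their routers."""
        self.peers.append(other)
        other.peers.append(self)
```

The `peers` list plays the role of the router-to-router links over which different units 300 communicate.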
As shown in fig. 4, the plurality of routers in the multi-core system are arranged in an array and connected with each other. It should be noted that fig. 4 illustrates an example in which the router 310 of each network-on-chip unit 300 is connected to 1 processor 320 and to 1 memory module 330. In fact, the number of interfaces of the router in the network-on-chip unit is not particularly limited in the embodiments of the present disclosure.
In an embodiment of the present disclosure, an on-chip memory includes a storage device and an on-chip memory controller coupled to the storage device, the on-chip memory controller configured to control the storage device. Fig. 4 illustrates an on-chip memory controller 331, the on-chip memory controller 331 and a direct memory access 332 being interconnected. The multi-core system provided by the embodiment of the disclosure comprises a logic chip and a storage chip which are bonded with each other, wherein the logic chip is provided with a router, a processor connected with the router, an on-chip memory controller connected with the router and a DMA; the memory chip is provided with a plurality of memory devices (i.e. memory particles), and the memory devices can be arranged in an array; each on-chip memory controller on the logic chip is electrically connected with a corresponding memory device on the memory chip and is used for controlling the corresponding memory device; each DMA on a logic chip enables data handling between different memory spaces (i.e., different memory devices on a memory chip) through a router. Fig. 4 illustrates only the router 310, the processor 320, the on-chip memory controller 331, and the direct memory access 332 on a logic chip, and fig. 4 does not illustrate the memory devices on a memory chip.
In the embodiment of the disclosure, the logic chip and the memory chip can be electrically connected through the TSV.
In some embodiments, the processor may be an NPU, i.e., a neural-network processing unit, which adopts a "data-driven parallel computing" architecture and is good at processing massive multimedia data such as video and images. Of course, the processor may also be a CPU, a GPU, or the like.
Here, the on-chip memory may be configured to store operation data. More specifically, the storage devices (i.e., memory granules) within the on-chip memory may be configured to store operational data.
In some embodiments, the on-chip memory may include DRAM. That is, a router, a processor connected to the router, and a DRAM controller and DMA connected to the router are provided on the logic chip.
Here, the same-sized DRAM has a larger storage capacity than the SRAM, and can store more operation data. The on-chip memory may be used to store operational data required by the processor to perform data operations. The data operation may be a plurality of data operations performed based on a preset algorithm program, or the data operation may be a plurality of data operations in a program, and the program may be stored in a storage device (i.e., a memory granule) of the on-chip memory.
Illustratively, the cost of SRAM is about 80 times that of DRAM, so it is difficult to balance cost and storage capacity when SRAM is used as the on-chip memory. By setting the on-chip memory as DRAM, the multi-core system provided by the embodiment of the disclosure can balance the cost and performance of the on-chip memory, giving the multi-core system a large-capacity DSM.
Here, the DMA is configured to enable data handling between different memory spaces within the multi-core system. In the multi-core system provided by the embodiment of the disclosure, the data handling work is transferred to the DMA by the processor, so that the process of executing the data operation by the processor and the process of executing the data handling work by the DMA can be executed in parallel.
Here, routers within a plurality of network-on-chip units are interconnected for data communication. The router comprises at least three types of communication interfaces, namely a first communication interface, a second communication interface and a third communication interface; the router is connected with the processor through a first communication interface, is connected with the storage module through a second communication interface, and is connected with other routers through a third communication interface. The number of the communication interfaces of the router is not particularly limited, and can be designed according to actual requirements.
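A toy model of the three interface types named above; all class and field names are assumed for illustration:

```python
class Router:
    """Router with the three communication interface types described above."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.first_interface = None     # to the processor of the same unit
        self.second_interface = None    # to the memory module of the same unit
        self.third_interfaces = {}      # node_id -> neighboring routers

    def link(self, other):
        """Routers connect to other routers via third communication interfaces."""
        self.third_interfaces[other.node_id] = other
        other.third_interfaces[self.node_id] = self

    def interface_for(self, target):
        """Pick the interface class used to reach a given kind of endpoint."""
        if target == "processor":
            return "first"
        if target == "memory":
            return "second"
        return "third"                  # any other router
```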
The multi-core system provided by the embodiment of the disclosure may further include: an off-chip memory located external to the NOC chip. In the embodiment of the present disclosure, the type of the off-chip memory is not particularly limited. The off-chip memory may also include a storage device and an off-chip memory controller coupled to the storage device, the off-chip memory controller configured to control the storage device. The off-chip memory controller may be connected to the router.
In the embodiment of the disclosure, temporary data and programs may be stored in the on-chip memory, and result data may be stored in a storage device of the off-chip memory. Thus, the processor can complete access to temporary data and programs in the NOC chip, thereby shortening the access distance of the processor and relieving the access delay problem.
In a particular example, the off-chip memory may include DRAM.
In the multi-core system provided by the embodiment of the disclosure, processors in the same network unit on the same chip are connected with a storage module through a router, wherein the storage module is a near-end memory of the processor, and more specifically, an on-chip memory in the storage module is the near-end memory of the processor. Alternatively, the storage module may also be referred to as a local memory of the processor, and more specifically, the storage device of the on-chip memory is the local memory of the processor.
In some embodiments, the processor 320 is configured to: transmitting a data handling instruction;
the direct memory accessor 332 is configured to: receiving the data handling instruction, reading data from a first storage space in the multi-core system 30, and writing the read data into a second storage space in the multi-core system 30.
After receiving the data handling instruction, the DMA reads data from the first storage space and writes the read data into the second storage space, thereby carrying the data from the first storage space to the second storage space, i.e., a data handling process. The first storage space is the original storage space of the data, and the second storage space is the target storage space of the data. The first storage space and the second storage space are both storage spaces in the multi-core system, and the first storage space (or the second storage space) may be the on-chip memory or the off-chip memory.
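A minimal sketch of this handling step, with plain dictionaries standing in for the first and second storage spaces; the function and parameter names are assumptions:

```python
def dma_handle(first_space, second_space, addresses):
    """Carry data: read each address from the first (source) storage space
    and write it into the second (target) storage space."""
    for addr in addresses:
        second_space[addr] = first_space[addr]   # read, then write
    return second_space
```

Note the copy semantics: after handling, the data exists in both spaces, which is consistent with a read-then-write transfer.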
In some embodiments, the on-chip memory is configured to: storing first operation data;
the processor 320 is configured to: the first operation data is obtained from the on-chip memory of the same network-on-chip unit 300, and the data operation is performed according to the first operation data.
Here, the on-chip memory may store first operation data, which refers to operation data required for the processor to perform a current data operation. The processor can acquire first operation data from an on-chip memory of the same network-on-chip unit through a router connected with the processor, and perform data operation according to the first operation data. In this process, the process of acquiring the first operation data and performing the data operation according to the first operation data are both completed in the same network-on-chip unit. Thus, the access distance of the processor can be shortened, and the access delay problem can be solved.
In the multi-core system provided by the embodiments of the present disclosure, each network-on-chip unit may be regarded as a Node, and for a fixed network-on-chip unit (i.e., a Node), the processor obtains first operation data from the on-chip memory, and performs data operation according to the first operation data. In other words, the acquisition of the first operation data and the data operation according to the first operation data are realized within the node.
In some embodiments, direct memory access 332 is configured to: carrying the second operation data from the on-chip memories of other on-chip network units except the affiliated on-chip network unit to the on-chip memories of the affiliated on-chip network unit;
The on-chip memory is further configured to: storing second operation data;
the processor 320 is further configured to: acquiring second operation data from an on-chip memory of the same network-on-chip unit 300, and performing data operation according to the second operation data;
wherein, the process of acquiring the first operation data by the processor 320 in the same network-on-chip unit 300 and performing the data operation according to the first operation data and the process of handling the second operation data by the direct memory access 332 overlap in time.
Here, the handling of the second operational data between the first memory space (i.e. the on-chip memory of the other network-on-chip unit) and the second memory space (i.e. the on-chip memory of the belonging network-on-chip unit) may be achieved with DMA. More specifically, the DMA reads the second operation data from the on-chip memory of the other on-chip network unit, and writes the read second operation data to the on-chip memory of the on-chip network unit to which it belongs.
Here, the on-chip memory may further store second operation data, which may refer to the operation data required for the processor to perform the next data operation. That is, the processor may send a data handling instruction to the DMA of the affiliated network-on-chip unit, and the DMA may carry the second operation data required for the next data operation while the processor performs the data operation on the first operation data in the on-chip memory of the affiliated network-on-chip unit. The difference between the current data operation and the next data operation is that the execution time of the current data operation is earlier: the processor performs the current data operation according to the first operation data at the current time, and performs the next data operation according to the second operation data at the next time. The embodiments of the present disclosure place no particular limitation on the type, duration, etc. of the current data operation and the next data operation.
In the embodiment of the disclosure, a processor can acquire first operation data from an on-chip memory of the same network-on-chip unit through a router connected with the processor, and perform current data operation according to the first operation data; meanwhile, the DMA may transfer the second operation data from the on-chip memory of the other on-chip network units other than the affiliated on-chip network unit to the on-chip memory of the affiliated on-chip network unit; the process of the processor obtaining the first operation data and carrying out the current data operation according to the first operation data and the process of the DMA carrying the second operation data are executed in parallel. In other words, there may be overlap in time between the process of the processor acquiring the first operational data and performing the current data operation in accordance with the first operational data and the process of the DMA handling the second operational data.
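The overlap described here can be mimicked in software with a background thread playing the role of the DMA; everything below is an illustrative analogy with assumed names, not the hardware mechanism:

```python
import threading

def run_overlapped(local_mem, remote_mem, first_key, second_key, compute):
    """Overlap the current data operation with DMA handling of the second
    operation data from a remote on-chip memory into the local one."""
    results = {}

    def dma_carry():                    # DMA: remote on-chip memory -> local
        local_mem[second_key] = remote_mem[second_key]

    t = threading.Thread(target=dma_carry)
    t.start()                           # handling runs in parallel with...
    results["current"] = compute(local_mem[first_key])  # ...the current operation
    t.join()                            # second operation data is now local
    results["next"] = compute(local_mem[second_key])
    return results
```

By the time the current operation finishes, the next operation's data is already in the local memory, so the next access never leaves the node.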
Further, in the embodiment of the disclosure, the DMA, rather than the processor, is used for carrying data, so that the processor is released from the data handling work; this in turn releases the bandwidth of the NOC chip and improves the performance of the NOC chip. Here, the data operation includes arithmetic operations and other operations.
In the embodiment of the disclosure, when operation data (i.e., first operation data) required by a processor to perform a data operation (i.e., a current data operation) is stored in an on-chip memory of the same network-on-chip unit, the process of acquiring the first operation data and performing the data operation according to the first operation data by the processor is completed in the network-on-chip unit; in the case where the operation data (i.e., the second operation data) required for the processor to perform the data operation (i.e., the next data operation) is stored in the on-chip memory of the other on-chip network unit, the second operation data may be carried into the on-chip memory of the same on-chip network unit by using DMA, so that the processes of acquiring the second operation data and performing the data operation according to the second operation data by the processor are completed in the on-chip network unit. In this way, even if the operation data (i.e., the second operation data) required for the processor to perform the data operation (the next data operation) is not stored in the on-chip memory of the same on-chip network unit, the required operation data can be transferred to the on-chip memory of the same on-chip network unit by DMA; thus, the access distance of the operation data (namely, the second operation data) required by the processor access can be shortened, and the access delay problem can be solved.
In the multi-core system provided by the embodiment of the disclosure, each network-on-chip unit may be regarded as a node, and for a fixed network-on-chip unit (i.e., a node), after the second operation data is carried to the on-chip memory by using the DMA, the processor obtains the second operation data from the on-chip memory, and performs a data operation according to the second operation data. In other words, even if the operation data (i.e., the second operation data) required by the processor is not stored in the on-chip memory, the acquisition of the second operation data and the data operation according to the second operation data can be realized in the node using the DMA. In other words, the acquisition of the second operation data and the data operation according to the second operation data are realized within the node.
In some embodiments, the multi-core system 30 also includes off-chip memory;
the direct memory access 332 is configured to: carrying the third operation data from the off-chip memory into the on-chip memory of the affiliated network-on-chip unit 300;
the on-chip memory is further configured to: storing third operation data;
the processor 320 is further configured to: acquiring third operation data from an on-chip memory of the same network-on-chip unit 300, and performing data operation according to the third operation data;
The process of acquiring the first operation data by the processor 320 in the same network-on-chip unit 300 and performing the data operation according to the first operation data and the process of carrying the third operation data by the direct memory access 332 overlap in time.
Here, the handling of the third operation data between the first storage space (i.e., the off-chip memory) and the second storage space (i.e., the on-chip memory of the affiliated network-on-chip unit) may be achieved with the DMA. More specifically, the DMA reads the third operation data from the off-chip memory and writes the read third operation data into the on-chip memory of the network-on-chip unit to which it belongs.
Here, the on-chip memory may further store third operation data, which may refer to the operation data required for the processor to perform the next data operation. That is, the processor may send a data handling instruction to the DMA of the affiliated network-on-chip unit, and the DMA may carry the third operation data required for the next data operation while the processor performs the data operation on the first operation data in the on-chip memory of the affiliated network-on-chip unit. The difference between the current data operation and the next data operation is that the execution time of the current data operation is earlier: the processor performs the current data operation according to the first operation data at the current time, and performs the next data operation according to the third operation data at the next time. The embodiments of the present disclosure place no particular limitation on the type, duration, etc. of the current data operation and the next data operation.
In the embodiment of the disclosure, a processor can acquire first operation data from an on-chip memory of the same network-on-chip unit through a router connected with the processor, and perform current data operation according to the first operation data; meanwhile, the DMA can transfer the third operation data from the off-chip memory to the on-chip memory of the affiliated network-on-chip unit; the process of the processor obtaining the first operation data and carrying out the current data operation according to the first operation data and the process of the DMA carrying the third operation data are executed in parallel. In other words, there may be overlap in time between the process of the processor acquiring the first operational data and performing the current data operation according to the first operational data and the process of the DMA handling the third operational data.
In the embodiment of the disclosure, when operation data (i.e., first operation data) required by a processor to perform a data operation (i.e., a current data operation) is stored in an on-chip memory of the same network-on-chip unit, the process of acquiring the first operation data and performing the data operation according to the first operation data by the processor is completed in the network-on-chip unit; in the case where the operation data (i.e., the third operation data) required for the processor to perform the data operation (i.e., the next data operation) is stored in the off-chip memory, the third operation data may be carried into the on-chip memory of the same network-on-chip unit using DMA, so that the process of the processor obtaining the third operation data and performing the data operation according to the third operation data is completed within the network-on-chip unit. As such, even if the data (i.e., the third operation data) required for the processor to perform the data operation (the next data operation) is not stored in the on-chip memory of the same on-chip network unit, the required operation data can be carried into the on-chip memory of the same on-chip network unit by DMA; thus, the access distance of the operation data (i.e., the third operation data) required for the processor access can be shortened, and the access delay problem can be solved.
In the multi-core system provided by the embodiment of the disclosure, each network-on-chip unit may be regarded as a node, and for a fixed network-on-chip unit (i.e., a node), after the third operation data is carried to the on-chip memory by using the DMA, the processor obtains the third operation data from the on-chip memory, and performs a data operation according to the third operation data. In other words, even if the operation data (i.e., the third operation data) required by the processor is not stored in the on-chip memory, the acquisition of the third operation data and the data operation according to the third operation data can be realized in the node using the DMA. In other words, the acquisition of the third operation data and the data operation according to the third operation data are realized within the node.
Illustratively, taking a grid architecture as an example, the processor takes 55 nanoseconds to access the on-chip memory of the affiliated network-on-chip unit and 154 nanoseconds to access the off-chip memory. That is, the processor takes more time to access the off-chip memory than to access the on-chip memory of the affiliated network-on-chip unit. In the embodiment of the disclosure, the DMA is utilized to carry data, so that the processor can obtain any required operation data from the on-chip memory of the affiliated network-on-chip unit, effectively shortening the access distance and access time of the processor.
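Taking the quoted figures at face value, the per-access saving works out as follows; this is a back-of-the-envelope check, not a measurement from the embodiment:

```python
# Latency figures quoted above for the example grid.
on_chip_ns = 55        # access to the on-chip memory of the same unit
off_chip_ns = 154      # access to the off-chip memory

saving_ns = off_chip_ns - on_chip_ns    # saved per access once data is local
speedup = off_chip_ns / on_chip_ns      # ~2.8x faster per access
```

So each access that the DMA redirects from off-chip memory to the local on-chip memory saves about 99 ns, roughly a 2.8x improvement per access.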
Further, the DMA may carry the second operation data from the on-chip memory of another network-on-chip unit to the on-chip memory of the affiliated network-on-chip unit, or the DMA may carry the third operation data from the off-chip memory to the on-chip memory of the affiliated network-on-chip unit. Since the process of the DMA carrying data and the process of the processor acquiring operation data from the on-chip memory and performing data operations according to the operation data are executed in parallel, the DMA can carry data at any time. Therefore, the internal cache of the DMA can be small, which is beneficial to reducing the area of the memory module and, further, the area of the multi-core system.
In some embodiments, data handling between different memory spaces within the multi-core system 30 includes at least one of:
data handling between on-chip memories within different network-on-chip units 300;
data handling between on-chip memory and off-chip memory within network-on-chip unit 300.
Here, the multi-core system may include a first network-on-chip unit including a first router, a first processor, and a first memory module, wherein the first memory module includes a first on-chip memory and a first direct memory accessor connected to each other; the second network-on-chip unit comprises a second router, a second processor and a second storage module, wherein the second storage module comprises a second on-chip memory and a second direct memory access device which are connected with each other. The first on-chip memory may store first operational data and the second on-chip memory may store second operational data. In the case that the first processor performs the data operation requiring the second operation data, the first processor is not required to access the second operation data in the second on-chip memory through the first router and the second router, but the second operation data is carried from the second on-chip memory of the second on-chip network unit to the first on-chip memory of the first on-chip network unit through DMA. Data handling between on-chip memories within different network-on-chip units may include: the first direct memory access device can directly carry second operation data from a second on-chip memory of the second on-chip network unit; alternatively, the second direct memory access device may handle the second operand data from the second on-chip memory and the first direct memory access device may handle the second operand data from the second direct memory access device.
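The two handling paths just described can be sketched as follows, with dictionaries standing in for the first and second on-chip memories; all names are illustrative:

```python
def direct_path(first_mem, second_mem, key):
    """The first DMA reads the second operation data directly from the
    second unit's on-chip memory into the first unit's on-chip memory."""
    first_mem[key] = second_mem[key]
    return first_mem

def relayed_path(first_mem, second_mem, key):
    """The second DMA stages the data from its own on-chip memory, and the
    first DMA then takes it from the second DMA."""
    staged = {key: second_mem[key]}     # second DMA reads and stages
    first_mem[key] = staged[key]        # first DMA completes the handling
    return first_mem
```

Both paths leave the same data in the first on-chip memory; they differ only in which direct memory accessor performs each read.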
Referring to fig. 5, fig. 5 is a flowchart illustrating a data transmission method according to an embodiment of the disclosure. As shown in fig. 4 and 5, the embodiment of the present disclosure provides a data transmission method applied to a multi-core system, the multi-core system 30 including a plurality of network-on-chip units 300, each network-on-chip unit 300 including: a router 310; a processor 320 connected to the router 310; a memory module 330 coupled to the router 310, the memory module 330 comprising an inter-connected on-chip memory and direct memory accessor 332; wherein different network-on-chip units 300 are connected to each other through a router 310 in the network-on-chip unit 300;
the data transmission method comprises the following steps:
step S501: the processor sends a data handling instruction;
step S502: the direct memory access device receives the data handling instruction and realizes the data handling among different storage spaces in the multi-core system through the router.
In some embodiments, the data transmission method includes:
the on-chip memory stores first operation data;
the processor acquires first operation data from an on-chip memory of the same network-on-chip unit, and performs data operation according to the first operation data.
In some embodiments, the data transmission method further includes:
the direct memory access device carries the second operation data from the on-chip memories of other on-chip network units except the affiliated on-chip network unit to the on-chip memories of the affiliated on-chip network unit;
the on-chip memory stores second operation data;
the processor acquires second operation data from an on-chip memory of the same network-on-chip unit and performs data operation according to the second operation data;
the process of acquiring the first operation data by the processor in the same network element on the same chip and performing data operation according to the first operation data and the process of carrying the second operation data by the direct memory access device overlap in time.
In some embodiments, the multi-core system further comprises an off-chip memory; the data transmission method further comprises the following steps:
the direct memory access device carries the third operation data from the off-chip memory to the on-chip memory of the affiliated network-on-chip unit;
the on-chip memory stores third operation data;
the processor acquires third operation data from an on-chip memory of the same network-on-chip unit, and performs data operation according to the third operation data;
the process of acquiring the first operation data by the processor in the same network element on the same chip and performing data operation according to the first operation data and the process of carrying the third operation data by the direct memory access device overlap in time.
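Putting steps S501 and S502 together with the subsequent data operation, one node's flow might be sketched as follows; the function and field names are hypothetical:

```python
def data_transmission_method(local_mem, source_mem, key, operate):
    """One node's flow: S501 the processor issues a data handling
    instruction; S502 the DMA executes it; then the processor performs the
    data operation on the now-local copy."""
    instruction = {"src": source_mem, "dst": local_mem, "key": key}   # S501
    instruction["dst"][key] = instruction["src"][key]                 # S502 (DMA)
    return operate(local_mem[key])        # data operation on local on-chip copy
```

Here `source_mem` may stand for either another unit's on-chip memory (second operation data) or the off-chip memory (third operation data); the handling step is the same in both cases.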
The embodiments of the present disclosure stack logic chips (i.e., routers, processors, on-chip memory controllers, and DMAs) and memory chips (i.e., storage devices) in a NOC architecture through 3DIC packaging techniques to form distributed memory modules, so that the processors have a larger-capacity near-end memory (or local memory), solving the problem of bandwidth mismatch between the processors and the DRAM, increasing the memory bandwidth, controlling cost within an acceptable range, and reducing access latency. The near-end memory of a processor refers to the on-chip memory located in the same network-on-chip unit as the processor. More specifically, the near-end memory of a processor refers to the storage device (i.e., memory granule) of the on-chip memory located within the same network-on-chip unit.
The disclosed embodiments utilize 3DIC packaging techniques to stack logic chips (i.e., routers, processors, on-chip memory controllers, and DMAs) and memory chips (i.e., storage devices) in a NOC architecture to form distributed memory modules. Each network-on-chip unit can be regarded as a node; this shortens the access distance between the processor and the DRAM to solve the problem of large access latency, and changes data access between nodes of the NOC chip into data access within a node, solving the problem that the processor's access to the DRAM is limited by the throughput of the NOC chip.
The multi-core system based on the network on chip provided by the embodiment of the disclosure comprises the following key points:
first point: the distributed on-chip memory controllers (e.g., DRAM controllers) and the DMAs constitute a DSM; the DRAM controllers within the DSM achieve data sharing through the DMAs carrying data;
second point: the DSM module consists of a DRAM controller and a DMA, which are tightly bound for external memory or other DRAM, data handling with the bound DRAM, and specific functions. The DRAM and the DMA are tightly combined, so that the internal cache of the DMA is small, and the area of the stacking structure is reduced;
third point: in order to ensure that DSM data is stored in a scattered way without affecting the performance of a processor, DMA is added to take charge of data coordination and format conversion; in other words, the DMA can not only carry out data handling, but also coordinate data and convert format in the image processing system;
fourth point: stacking high capacity DRAM at nodes of each NOC chip using 3DIC packaging techniques, the DRAM capacity at each node being configurable to approximately 100MB or so depending on the application and generation constraints;
fifth point: programs and data that often need to be accessed may be placed in the on-chip memory of the network-on-chip unit to which the processor belongs; thus, access to other nodes in the NOC chip can be reduced, and the problem of path blockage of the NOC chip is alleviated.
In the embodiment of the disclosure, the DMA belongs to a functional module and is responsible for carrying out data handling, and the processor can directly access the on-chip memory and is responsible for carrying out data operation. After the processor transfers the data handling function to the DMA, the process of the data operation of the processor and the process of the data handling of the DMA may be performed in parallel.
In the embodiments of the disclosure, DRAM is stacked onto the nodes of the NOC chip by 3DIC packaging technology to serve as distributed shared memory, and the DMA is used to handle data, which solves the problem of low interaction efficiency among the multiple cores of the DSM. The DSM thus provides a new memory architecture for many-core NOC chips.
In the related art, the throughput of the NOC mesh architecture limits the bandwidth with which a processor can access the DRAM at the periphery of the NOC chip. In the embodiments of the disclosure, the high-capacity DRAM resides within the nodes of the NOC chip, so a processor can access any operation data it requires within its own node without contending for NOC network throughput, which gives many-core interconnects greater scalability. The multi-core system provided by the embodiments of the disclosure is well suited to interconnect systems with a NOC mesh architecture and can build a distributed memory by exploiting the NOC's interconnect characteristics.
The embodiments of the disclosure provide a network-on-chip-based multi-core system and a data transmission method. In the embodiments, the multi-core system includes a plurality of network-on-chip units, each comprising a router, a processor connected to the router, and a storage module connected to the router, where the storage module includes an on-chip memory and a direct memory accessor connected to each other. The on-chip memory provides a large-capacity cache, and the direct memory accessor handles data between different storage spaces in the multi-core system, so a processor can obtain the operation data it needs by accessing the on-chip memory of its own network-on-chip unit; this shortens the processor's access distance and addresses the problem of access latency.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present disclosure, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure. The foregoing embodiment numbers of the present disclosure are merely for description and do not represent advantages or disadvantages of the embodiments.
The foregoing description is only of the preferred embodiments of the present disclosure, and is not intended to limit the scope of the present disclosure, but rather, the equivalent structural changes made by the present disclosure and the accompanying drawings under the inventive concept of the present disclosure, or the direct/indirect application in other related technical fields are included in the scope of the present disclosure.

Claims (11)

1. A network-on-chip-based multi-core system, the multi-core system comprising a plurality of network-on-chip units, each network-on-chip unit comprising:
a router;
a processor connected to the router;
a storage module connected to the router, the storage module comprising an on-chip memory and a direct memory accessor connected to each other; wherein different network-on-chip units are connected to each other through the routers in the network-on-chip units; and the direct memory accessor implements data handling between different storage spaces in the multi-core system through the router.
2. The network-on-chip-based multi-core system of claim 1, wherein
the processor is configured to: send a data handling instruction; and
the direct memory accessor is configured to: receive the data handling instruction, read data from a first storage space in the multi-core system, and write the read data into a second storage space in the multi-core system.
3. The network-on-chip-based multi-core system of claim 1, further comprising an off-chip memory;
wherein data handling between different storage spaces within the multi-core system includes at least one of:
data handling between the on-chip memories of different network-on-chip units;
data handling between the on-chip memory of a network-on-chip unit and the off-chip memory.
4. The network-on-chip-based multi-core system of claim 1, wherein
the on-chip memory is configured to: store first operation data; and
the processor is configured to: acquire the first operation data from the on-chip memory of the same network-on-chip unit and perform a data operation according to the first operation data.
5. The network-on-chip-based multi-core system of claim 4, wherein
the direct memory accessor is configured to: carry second operation data from the on-chip memory of a network-on-chip unit other than the one to which the direct memory accessor belongs into the on-chip memory of the network-on-chip unit to which it belongs;
the on-chip memory is further configured to: store the second operation data;
the processor is further configured to: acquire the second operation data from the on-chip memory of the same network-on-chip unit and perform a data operation according to the second operation data; and
the process in which the processor acquires the first operation data from the same network-on-chip unit and performs a data operation according to it overlaps in time with the process in which the direct memory accessor carries the second operation data.
6. The network-on-chip-based multi-core system of claim 4, further comprising an off-chip memory; wherein
the direct memory accessor is configured to: carry third operation data from the off-chip memory into the on-chip memory of the network-on-chip unit to which the direct memory accessor belongs;
the on-chip memory is further configured to: store the third operation data;
the processor is further configured to: acquire the third operation data from the on-chip memory of the same network-on-chip unit and perform a data operation according to the third operation data; and
the process in which the processor acquires the first operation data and performs a data operation according to it overlaps in time with the process in which the direct memory accessor carries the third operation data.
7. The network-on-chip-based multi-core system of claim 1, wherein the on-chip memory comprises a dynamic random access memory (DRAM).
8. A data transmission method applied to a network-on-chip-based multi-core system, the multi-core system comprising a plurality of network-on-chip units, each network-on-chip unit comprising: a router; a processor connected to the router; and a storage module connected to the router, the storage module comprising an on-chip memory and a direct memory accessor connected to each other; wherein different network-on-chip units are connected to each other through the routers in the network-on-chip units;
the data transmission method comprises the following steps:
the processor sends a data handling instruction;
the direct memory accessor receives the data handling instruction and implements data handling between different storage spaces in the multi-core system through the router.
9. The data transmission method according to claim 8, characterized in that the data transmission method comprises:
the on-chip memory stores first operation data;
the processor acquires the first operation data from the on-chip memory of the same network-on-chip unit, and performs a data operation according to the first operation data.
10. The data transmission method according to claim 9, characterized in that the data transmission method further comprises:
the direct memory accessor carries second operation data from the on-chip memory of a network-on-chip unit other than the one to which the direct memory accessor belongs into the on-chip memory of the network-on-chip unit to which it belongs;
the on-chip memory stores the second operation data;
the processor acquires the second operation data from an on-chip memory of the same network-on-chip unit, and performs data operation according to the second operation data;
the process in which the processor acquires the first operation data from the same network-on-chip unit and performs a data operation according to it overlaps in time with the process in which the direct memory accessor carries the second operation data.
11. The data transmission method of claim 9, wherein the multi-core system further comprises an off-chip memory; the data transmission method further comprises the following steps:
the direct memory accessor carries third operation data from the off-chip memory into the on-chip memory of the network-on-chip unit to which it belongs;
the on-chip memory stores the third operation data;
the processor acquires the third operation data from the on-chip memory of the same network-on-chip unit and performs a data operation according to the third operation data;
the process in which the processor acquires the first operation data and performs a data operation according to it overlaps in time with the process in which the direct memory accessor carries the third operation data.
CN202310870019.0A 2023-07-14 2023-07-14 Multi-core system and data transmission method based on network-on-chip Active CN116610630B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310870019.0A CN116610630B (en) 2023-07-14 2023-07-14 Multi-core system and data transmission method based on network-on-chip


Publications (2)

Publication Number Publication Date
CN116610630A CN116610630A (en) 2023-08-18
CN116610630B true CN116610630B (en) 2023-11-03

Family

ID=87682071


Country Status (1)

Country Link
CN (1) CN116610630B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117149700B (en) * 2023-10-27 2024-02-09 北京算能科技有限公司 Data processing chip, manufacturing method thereof and data processing system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739241A (en) * 2008-11-12 2010-06-16 中国科学院微电子研究所 On-chip multi-core DSP cluster and application extension method
CN102521201A (en) * 2011-11-16 2012-06-27 刘大可 Multi-core DSP (digital signal processor) system-on-chip and data transmission method
CN105183662A (en) * 2015-07-30 2015-12-23 复旦大学 Cache consistency protocol-free distributed sharing on-chip storage framework
CN112506437A (en) * 2020-12-10 2021-03-16 上海阵量智能科技有限公司 Chip, data moving method and electronic equipment
CN113923157A (en) * 2021-10-14 2022-01-11 芯盟科技有限公司 Multi-core system and processing method based on network on chip
WO2022139646A1 (en) * 2020-12-23 2022-06-30 Imsys Ab A novel data processing architecture and related procedures and hardware improvements
CN115408328A (en) * 2022-08-25 2022-11-29 北京灵汐科技有限公司 Many-core system, processing method and processing unit
WO2023274032A1 (en) * 2021-07-02 2023-01-05 西安紫光国芯半导体有限公司 Storage access circuit, integrated chip, electronic device and storage access method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8214845B2 (en) * 2008-05-09 2012-07-03 International Business Machines Corporation Context switching in a network on chip by thread saving and restoring pointers to memory arrays containing valid message data
US9680765B2 (en) * 2014-12-17 2017-06-13 Intel Corporation Spatially divided circuit-switched channels for a network-on-chip
US10838908B2 (en) * 2018-07-20 2020-11-17 Xilinx, Inc. Configurable network-on-chip for a programmable device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Optimized design of the storage and network subsystems of multi-core processors for cloud computing; Su Wen; Wang Huandong; Tai Yunfang; Wang Jing; High Technology Letters, No. 04, pp. 34-41 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant