CN111860788A - Neural network computing system and method based on data flow architecture

Neural network computing system and method based on data flow architecture

Info

Publication number
CN111860788A
CN111860788A
Authority
CN
China
Prior art keywords
data
chip memory
neural network
transmission time
calculation
Prior art date
Legal status
Pending
Application number
CN202010733604.2A
Other languages
Chinese (zh)
Inventor
王佳东
李远超
蔡权雄
牛昕宇
Current Assignee
Shenzhen Corerain Technologies Co Ltd
Original Assignee
Shenzhen Corerain Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Corerain Technologies Co Ltd filed Critical Shenzhen Corerain Technologies Co Ltd
Priority to CN202010733604.2A priority Critical patent/CN111860788A/en
Publication of CN111860788A publication Critical patent/CN111860788A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The embodiment of the invention discloses a neural network computing system and method based on a data flow architecture. The neural network computing system based on the data flow architecture comprises an off-chip memory and a neural network acceleration module, wherein the neural network acceleration module comprises a conversion unit, an on-chip memory and a calculation unit. The off-chip memory is used for storing data. The conversion unit is connected between the on-chip memory and the off-chip memory and is used for converting between a first communication bus of the off-chip memory and a second communication bus of the on-chip memory, so that the data can be stored into the on-chip memory. The calculation unit is directly connected to the on-chip memory through the second communication bus and performs calculations based on the data received from the on-chip memory. By arranging a conversion unit between the on-chip memory and the off-chip memory, the speed and efficiency of neural network calculation under the data flow architecture are improved while the cost is kept under control.

Description

Neural network computing system and method based on data flow architecture
Technical Field
The embodiments of the present application relate to the technical field of neural networks, and for example to a neural network computing system and method based on a data flow architecture.
Background
With the rapid development of computers, the computation of neural network data has become increasingly important. Neural network computation requires a large amount of data. A data flow architecture completes the whole calculation process through the continuous flow of data, with no instruction set involved. A traditional instruction set architecture completes one operation through several stages, namely instruction fetch, instruction decode, execution, memory access and result write-back, so the efficiency of the whole process is very low. Compared with an instruction set architecture, a data flow architecture can maximize the performance and efficiency of the chip, but because a data flow architecture requires data to be flowing at all times, it places stringent requirements on storage, bus and bandwidth.
Currently, for artificial intelligence acceleration chips with a data flow architecture, the data used in calculation mainly comes from two sources. The first approach accesses external storage through a bus, for example accessing off-chip Double Data Rate (DDR) memory through an AXI (Advanced eXtensible Interface) bus. The second approach places a relatively large Random Access Memory (RAM) on the chip, so that the output of each layer of the neural network is stored on-chip.
However, while the cost of the first approach is relatively modest, its transfer rate is limited by the rate and bandwidth of the bus. A chip often has multiple devices accessing the external storage at the same time, and the external storage can perform only a read operation or a write operation at any given moment; reading and writing cannot be performed simultaneously. This severely affects data transmission efficiency, and because the neural network can start calculating only after the data has been received, the data transmission time and the data processing time of the first approach are difficult to overlap, which creates a performance bottleneck. In the second approach, although local storage saves data transmission time, the capacity of the local storage must be larger than that of the layer with the largest data amount in the neural network in order to realize complete functionality, which results in a large chip area and a sharp increase in chip cost.
Disclosure of Invention
The embodiment of the invention provides a neural network computing system and method based on a data flow architecture, which aim to improve the speed and efficiency of neural network computation under a data flow architecture while keeping the cost under control.
In a first aspect, an embodiment of the present invention provides a neural network computing system based on a data flow architecture, including:
an off-chip memory and a neural network acceleration module, wherein the neural network acceleration module comprises a conversion unit, an on-chip memory and a calculation unit;
the off-chip memory is used for storing data;
the conversion unit is connected between the on-chip memory and the off-chip memory and used for realizing conversion between a first communication bus of the off-chip memory and a second communication bus of the on-chip memory so as to store the data into the on-chip memory;
the calculation unit is directly connected to the on-chip memory through a second communication bus for performing calculations based on the data received from the on-chip memory.
Optionally, the computing unit is directly connected to the off-chip memory through a first communication bus, and is configured to perform computation based on the data directly received from the off-chip memory.
Optionally, the computing unit includes:
and the data flow direction selection subunit is used for controlling the flow direction of the data.
Optionally, the first communication bus is an AXI bus, and the data is exchanged between the conversion unit and the off-chip memory through the AXI bus protocol or between the computation unit and the off-chip memory through the AXI bus protocol.
Optionally, the second communication bus is a Sif bus, the data is exchanged between the conversion unit and the on-chip memory through the Sif bus protocol, and the data is exchanged between the on-chip memory and the calculation unit through the Sif bus protocol.
Optionally, the number of channels of the Sif bus is at least one.
Optionally, the calculation unit includes an operator.
In a second aspect, an embodiment of the present invention provides a neural network computing method based on a data flow architecture, which is applied to a neural network computing system according to any embodiment of the present invention, where the neural network is multi-layer, and the method includes:
acquiring data to be calculated of a current layer of a neural network;
predicting the calculation time and the transmission time of the current layer according to the data;
and determining a transmission path of the data needing to be calculated according to the calculation time and the transmission time.
Optionally, the transmission time includes a first transmission time and a second transmission time, the first transmission time is a data transmission time between the computing unit and the off-chip memory, the second transmission time is a data transmission time between the computing unit and the on-chip memory, and the first transmission time is greater than the second transmission time;
the determining a transmission path of the data to be calculated according to the calculation time and the transmission time includes:
judging whether the calculation time is greater than the first transmission time or not;
controlling a computing unit to perform data interaction with an off-chip memory or controlling the computing unit to perform data interaction with an on-chip memory under the condition that the computing time is longer than the first transmission time;
and the computing unit performs data interaction with the on-chip memory under the condition that the computing time is less than or equal to the first transmission time.
Optionally, the controlling the computing unit to perform data interaction with an off-chip memory, or controlling the computing unit to perform data interaction with an on-chip memory includes:
judging whether the data are arranged in sequence;
under the condition that the data are arranged in sequence, controlling the computing unit to interact with the off-chip memory;
and controlling the computing unit to interact with the on-chip memory under the condition that the data are not arranged in sequence.
The neural network computing system based on the data flow architecture provided by the embodiment of the invention comprises an off-chip memory and a neural network acceleration module, wherein the neural network acceleration module comprises a conversion unit, an on-chip memory and a calculation unit. The off-chip memory is used for storing data. The conversion unit is connected between the on-chip memory and the off-chip memory and is used for converting between a first communication bus of the off-chip memory and a second communication bus of the on-chip memory, so that the data can be stored into the on-chip memory. The calculation unit is directly connected to the on-chip memory through the second communication bus and performs calculations based on the data received from the on-chip memory. By arranging a conversion unit between the on-chip memory and the off-chip memory, the problem of an excessively slow transmission rate caused by accessing the off-chip memory and the problem of a large chip area caused by placing a large RAM on the chip are both avoided, and the speed and efficiency of neural network calculation under the data flow architecture are improved while the cost is kept under control.
Drawings
FIG. 1 is a schematic structural diagram of a neural network computing system based on a data flow architecture according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of another neural network computing system based on a data flow architecture according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a neural network computing method based on a data flow architecture according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of another neural network computing method based on a data flow architecture according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of yet another neural network computing method based on a data flow architecture according to an embodiment of the present application.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc.
Furthermore, the terms "first," "second," and the like may be used herein to describe various orientations, actions, steps, elements, or the like, but the orientations, actions, steps, or elements are not limited by these terms. These terms are only used to distinguish one direction, action, step or element from another direction, action, step or element. For example, a first datum may be referred to as a second datum, and similarly, a second datum may be referred to as a first datum, without departing from the scope of the present application. The first data and the second data are both data, but they are not the same data. The terms "first", "second", etc. are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Examples
Fig. 1 is a schematic diagram of a neural network computing system based on a data flow architecture according to an embodiment of the present disclosure, where the embodiment is applicable to a scenario in which data of a neural network is computed through the structure.
The neural network computing system based on the data flow architecture provided by the embodiment of the application comprises an off-chip memory 110 and a neural network acceleration module 120. The neural network acceleration module 120 includes a conversion unit 121, an on-chip memory 122 and a calculation unit 123, which in this embodiment are integrated on the same chip.
The off-chip memory 110 is used to store data.
The conversion unit 121 is connected between the on-chip memory 122 and the off-chip memory 110, and is configured to implement conversion between a first communication bus 130 of the off-chip memory 110 and a second communication bus 140 of the on-chip memory 122, so as to transmit and store data to the on-chip memory 122 through the first communication bus 130, the conversion unit 121, and the second communication bus 140 in sequence. In one embodiment, the conversion unit 121 receives the data from the off-chip memory 110 through the first communication bus 130 and stores the data to the on-chip memory 122 through the second communication bus 140.
The calculation unit 123 is directly connected to the on-chip memory 122 via a second communication bus 140 for performing calculations based on data received from the on-chip memory 122.
The main function of the off-chip memory 110 is to store various data, and it can be accessed automatically and at high speed while the computer or chip is running. The off-chip memory 110 is a device with a "memory" function that stores information using physical elements with two stable states. The storage capacity of the off-chip memory 110 is large enough to meet the demands of neural network data computation; compared with using only the on-chip memory 122, this can greatly reduce chip cost. Illustratively, the off-chip memory 110 may be a Dynamic Random Access Memory (DRAM) or a Double Data Rate (DDR) synchronous dynamic random access memory. For example, the off-chip memory 110 is a DDR memory in order to achieve higher data transfer efficiency.
The on-chip memory 122 is located in the neural network acceleration module 120, and is an on-chip memory core of a data flow architecture, mainly for improving the efficiency of chip data transmission. The size of the on-chip memory 122 needs to be determined by considering the comprehensive cost, and can be set according to the configuration of a common network, and the size does not need to be too large as long as the performance bottleneck of the neural network can be supported. In one embodiment, the size of the on-chip memory 122 needs to be larger than the amount of data that needs to be calculated by the current layer of the neural network. Compared with the simple use of the off-chip memory 110, the on-chip memory 122 has a large bandwidth, which can greatly increase the speed and efficiency of data reading. The on-chip memory 122 may be a Cache or a Cache Buffer, and the type of the on-chip memory 122 is not limited herein. Illustratively, the on-chip memory 122 is a cache Buffer.
Because the communication bus protocols of the off-chip memory 110 and the on-chip memory 122 are not the same, data cannot be transmitted directly between the off-chip memory 110 and the on-chip memory 122; a conversion unit 121 needs to be added between them so that data can be transmitted, through bus conversion, to memories with different bus protocols. Because the data amount of each layer of the neural network is different, adding the conversion unit 121 allows the two kinds of storage to be used in combination to balance performance and cost, which greatly improves product competitiveness.
The computing unit 123 is the core computing engine (ENGINE) of the neural network accelerator; it contains the various operators required for neural network computation and is the computing core of the data flow architecture. The conversion unit 121 implements the conversion of the bus communication protocol and is the core of the data flow architecture responsible for data exchange.
In one embodiment, the neural network is computed layer by layer, and the output of the previous layer serves as the data input of the current layer. Illustratively, the neural network has layers A, B and C, where B is the layer after A, and C is the layer after B and the last layer. The data is stored in the off-chip memory 110 before the neural network begins computation. In the case that the amount of data to be calculated by layer A can be stored in the on-chip memory 122, all of layer A's data can be transmitted from the off-chip memory 110 to the on-chip memory 122 through the first communication bus 130, the conversion unit 121 and the second communication bus 140 before the neural network calculation is performed, so that the calculation unit 123 can pull the data from the on-chip memory 122 when it starts to calculate layer A. After the calculation unit 123 completes the calculation of layer A, the result is transmitted to the on-chip memory 122 for simple arrangement and then transmitted back to the calculation unit 123 to continue with the calculation of layer B. After the computation unit 123 completes the calculation of layer C, the final result may be transmitted to the on-chip memory 122, and the on-chip memory 122 transmits the final result to the off-chip memory 110 through the conversion unit 121. After one computation of the neural network is completed, the final result is stored in the off-chip memory 110. In an alternative embodiment, the final calculation result may also be stored directly from the calculation unit 123 to the on-chip memory 122 via the second communication bus 140.
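The layer-by-layer flow just described can be summarized with a small software model. The sketch below is only an illustration of the data path of this embodiment (off-chip memory, conversion unit, on-chip memory, calculation unit, with intermediate results staged on-chip and the final result written back off-chip); all class and function names are hypothetical and do not correspond to any real hardware interface of the chip.

```python
# Minimal software model of the data path described above (hypothetical names,
# not an actual hardware interface).

class OffChipMemory:
    def __init__(self, data):
        self.data = dict(data)        # key -> stored data

class OnChipMemory:
    def __init__(self):
        self.buffer = None            # holds the data for the layer being computed

class ConversionUnit:
    """Bridges the off-chip bus and the on-chip bus (AXI <-> Sif in this embodiment)."""
    def load(self, off_chip, on_chip, key):
        on_chip.buffer = off_chip.data[key]      # first bus -> conversion -> second bus

    def store(self, on_chip, off_chip, key):
        off_chip.data[key] = on_chip.buffer      # second bus -> conversion -> first bus

def run_network(layers, off_chip, on_chip, conv):
    """layers: list of (name, layer_fn); each layer consumes the previous output."""
    conv.load(off_chip, on_chip, layers[0][0])   # stage layer A's input on-chip
    for _name, layer_fn in layers:
        result = layer_fn(on_chip.buffer)        # calculation unit pulls from on-chip memory
        on_chip.buffer = result                  # intermediate result staged on-chip
    conv.store(on_chip, off_chip, "final")       # final result written back off-chip
    return off_chip.data["final"]

# Example: three layers A, B, C, each a trivial stand-in for a real layer.
off_chip = OffChipMemory({"A": [1, 2, 3]})
result = run_network(
    [("A", lambda x: [v * 2 for v in x]),
     ("B", lambda x: [v + 1 for v in x]),
     ("C", lambda x: sum(x))],
    off_chip, OnChipMemory(), ConversionUnit())
print(result)  # 15
```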
In an embodiment, in the case that, after the computing unit 123 finishes the current computing task, only a small amount of data from the off-chip memory 110 is needed for the next task, the data corresponding to the next task may be transmitted to the on-chip memory 122 through the conversion unit 121 and pre-sorted while the current task is being executed. After the computing unit 123 completes the current task, it can immediately obtain the data corresponding to the next task from the on-chip memory 122 through the faster second communication bus 140 and start computing. In addition, since the data corresponding to the next task in the on-chip memory 122 has already been pre-sorted, the calculation unit 123 can pull it and compute directly, which improves the calculation efficiency of the neural network.
In one embodiment, sorting the data includes sorting data of different types and sorting data of the same type. Taking the sorting of different types of data as an example, the data types may include data, weights, biases and the like, and different computing units 123 may require the data types to arrive in different orders: some may require the data first and then the weights and biases, and others the biases first and then the weights. The order therefore needs to be determined according to the order of data types required by the computing unit 123. Arranging the data types in the required order in advance improves the computing capability of the computing unit 123.
Sorting is also required within data of the same type. Taking a picture as an example (a picture is also a kind of data), assume the picture has 10 × 8 pixels, each with three colors, namely red, blue and yellow, so that the total data size of the picture is 10 × 8 × 3. The data may be ordered color by color, for example all the red values of the pixels first, then all the yellow values, and finally all the blue values; alternatively, the red, blue and yellow values of the first pixel may be placed together, followed by the three colors of the second pixel, and so on. The sorting requirement for data of the same type is determined by the requirement of the computing unit 123 and by the way the data is stored.
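As a concrete illustration of the two orderings mentioned above, the following sketch lays out the 10 × 8 × 3 picture either pixel by pixel (the three colors of the first pixel, then of the second pixel, and so on) or color plane by color plane. The helper names are hypothetical; which layout is used would, as stated above, depend on the requirement of the computing unit 123 and on the storage format.

```python
# Two possible orderings for a 10 x 8 x 3 picture (illustration only).

WIDTH, HEIGHT, CHANNELS = 10, 8, 3          # 10 * 8 * 3 values in total

# picture[y][x] = (red, blue, yellow) values of one pixel
picture = [[(x, y, x + y) for x in range(WIDTH)] for y in range(HEIGHT)]

def interleaved(pic):
    """Pixel-major order: the three colors of pixel 1, then of pixel 2, and so on."""
    return [value for row in pic for pixel in row for value in pixel]

def planar(pic):
    """Color-major order: every value of the first color, then of the second, then of the third."""
    return [pixel[c] for c in range(CHANNELS) for row in pic for pixel in row]

assert len(interleaved(picture)) == len(planar(picture)) == WIDTH * HEIGHT * CHANNELS
```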
According to the technical scheme of this embodiment, the conversion unit 121 arranged between the on-chip memory 122 and the off-chip memory 110 enables data to be transmitted between the two memories. Staging data in the on-chip memory 122 during the calculation shortens the data transmission time and thus improves the calculation efficiency, and the final calculation result is output to the off-chip memory 110 after the calculation is completed. This avoids both the situation in which the transmission rate is limited by the rate and bandwidth of the bus when the off-chip DDR memory is accessed through the AXI bus, and the situation in which the chip area becomes large because a larger RAM is placed on the chip, thereby improving the speed and efficiency of neural network calculation.
On the basis of the above embodiments, the present embodiment refines a partial structure, and the present embodiment is applicable to a scenario in which data of a neural network is calculated by using the structure, as follows:
as shown in fig. 2, the calculation unit 123 is directly connected to the off-chip memory 110 through a first communication bus 130 for performing calculation based on data.
The computing unit 123 is directly connected to the off-chip memory 110 and may pull data directly from the off-chip memory 110 for computation. The calculation unit 123 may pull data from the on-chip memory 122 or directly from the off-chip memory 110 to perform the neural network calculation, which improves the flexibility of data acquisition. The path through which the calculation unit 123 pulls data directly from the off-chip memory 110 is denoted as path A, and the path through which it pulls data from the on-chip memory 122 is denoted as path B; the data transmission time of path A is denoted as the first transmission time, the data transmission time of path B is denoted as the second transmission time, and the second transmission time is shorter than the first transmission time. Similarly, the final calculation result of the neural network can be transmitted directly to the off-chip memory 110 through the first communication bus 130, which also improves the efficiency of data transmission. The calculation unit 123 includes the operators 1232 required for neural network calculation. An operator 1232 refers to the smallest unit of operation that can be executed by the CPU/GPU.
In an embodiment, the computing unit 123 includes a data flow direction selection subunit 1231, which controls the flow direction of the data and is the data flow gating control core of the data flow architecture. For example, the data flow direction selection subunit 1231 may control the computing unit 123 to receive the data stored in the off-chip memory 110 via the first communication bus 130, or to receive the data in the on-chip memory 122 via the second communication bus 140. Similarly, the data flow direction selection subunit 1231 may also control whether the final data result is stored in the on-chip memory 122 or in the off-chip memory 110.
In an embodiment, when all the data required by the current layer of the neural network can be stored in the on-chip memory 122 and the calculation time of the calculation unit 123 for the current layer is greater than the first transmission time, the calculation unit 123 may receive the data stored in the off-chip memory 110 through the first communication bus 130 or receive the data in the on-chip memory 122 through the second communication bus 140; that is, the data required by the current layer may be acquired through either path A or path B. In this case, if the data in the off-chip memory 110 is already in sorted order, the calculation unit 123 may read the data directly from the off-chip memory 110. If the data is not in order, it can be sent through the conversion unit 121 to the on-chip memory 122 for simple processing and then sent to the calculation unit 123 for calculation, thereby reducing the preparation time of the calculation unit 123 before computing.
When all the data required by the current layer of the neural network can be stored in the on-chip memory 122 and the calculation time of the calculation unit 123 for the current layer is less than or equal to the first transmission time, the calculation unit 123 receives the data through the on-chip memory 122, that is, the data required by the current layer is acquired through path B. In one embodiment, when the computing unit 123 has calculated the data of the previous layer, the output result is sent to the on-chip memory 122 for sorting and then sent back to the computing unit 123. When the calculation result of the current layer will be used by a subsequent layer, the data is first stored in the on-chip memory 122 and the calculation is performed when that subsequent layer is reached.
The calculation result of the previous layer is stored in the on-chip memory when the capacity of the on-chip memory can hold it, and is stored in the off-chip memory when the capacity of the on-chip memory cannot hold it.
When the data required to be calculated by the current layer of the neural network cannot be stored in the on-chip memory 122, and the calculation time of the calculation unit 123 of the current layer of the neural network is longer than the first transmission time, the calculation unit 123 receives the data stored in the off-chip memory 110 through the first communication bus 130, that is, the data required to be calculated by the current layer of the neural network is acquired through the path a.
Under the condition that the current layer is the last layer, the final calculation result of the current layer can be directly output to the off-chip memory 110 without being transmitted to the off-chip memory 110 through the on-chip memory 122, so that the calculation efficiency of the neural network is improved.
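A minimal sketch of the placement rule described in the preceding paragraphs: intermediate results go to the on-chip memory when its capacity allows, otherwise to the off-chip memory, and the result of the last layer goes straight to the off-chip memory. The function and its parameters are assumptions used purely for illustration, not the chip's actual control logic.

```python
# Sketch of the output-placement rule described above (assumed capacity check).

def output_destination(result_size_bytes, on_chip_capacity_bytes, is_last_layer):
    """Return 'off-chip' or 'on-chip' as the destination for a layer's output."""
    if is_last_layer:
        return "off-chip"                  # final result goes straight to off-chip memory
    if result_size_bytes <= on_chip_capacity_bytes:
        return "on-chip"                   # keep the intermediate result on-chip for the next layer
    return "off-chip"                      # too large for the on-chip memory: spill off-chip

print(output_destination(64 * 1024, 256 * 1024, is_last_layer=False))   # on-chip
print(output_destination(512 * 1024, 256 * 1024, is_last_layer=False))  # off-chip
print(output_destination(64 * 1024, 256 * 1024, is_last_layer=True))    # off-chip
```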
In an embodiment, the first communication bus 130 may be an AXI bus, and the data is exchanged between the conversion unit 121 and the off-chip memory 110 through the AXI bus protocol or between the calculation unit 123 and the off-chip memory 110 through the AXI bus protocol.
The second communication bus 140 may be a Sif bus; the data is exchanged between the conversion unit 121 and the on-chip memory 122 through the Sif bus protocol, and between the on-chip memory 122 and the calculation unit 123 through the Sif bus protocol. Sif (Streaming Interface) is a simple, practical and compatible bus protocol. It is a handshake protocol based on flow control, places no requirement on the data format, supports full-duplex transmission, adapts fully to the AXI bus, and can efficiently transfer data between a source end and a destination end.
Because neural network computation needs to switch back and forth between multiple data types, the Sif bus protocol, which has a low resource overhead, is adopted and multiple channels are provided to meet the different data transmission requirements; the conversion unit 121 is therefore needed to convert between the communication bus protocols. To maximize chip efficiency and make full use of both memories, the data flow direction selection subunit 1231 needs to switch flexibly between the two storage modes: the calculation unit 123 may receive the data stored in the off-chip memory 110 via the AXI bus serving as the first communication bus 130, or receive the data cached in the on-chip memory 122 via the Sif bus serving as the second communication bus 140.
The Sif bus has at least one channel. In one embodiment, the number of Sif channels is determined by the types and the amount of data to be transmitted. One Sif channel can transmit only one type of data, but one type of data can be transmitted through multiple Sif channels. Illustratively, if there are data types A, B and C, there are at least three Sif channels. If the amount of type-A data is large relative to the amount of type-B and type-C data, the type-A data can be transmitted through two Sif channels simultaneously, which reduces the transmission time and prevents slow data transmission from limiting the calculation efficiency of the neural network.
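The channel rule described above (one data type per Sif channel, with a disproportionately large type allowed to occupy several channels) can be sketched as follows. The proportional allocation strategy and the function name are assumptions for illustration; the description above only states the constraints, not a specific allocation algorithm.

```python
# Sketch of a channel-allocation rule consistent with the description above.

def allocate_channels(type_sizes, total_channels):
    """type_sizes: data type -> amount to transfer; returns data type -> channel count."""
    if total_channels < len(type_sizes):
        raise ValueError("need at least one Sif channel per data type")
    counts = {t: 1 for t in type_sizes}                # every type gets its own channel
    for _ in range(total_channels - len(type_sizes)):  # hand spare channels to the type
        worst = max(type_sizes,                        # with the worst per-channel load
                    key=lambda t: type_sizes[t] / counts[t])
        counts[worst] += 1
    return counts

# Example: type A carries much more data than B and C, so it receives the extra channel.
print(allocate_channels({"A": 800, "B": 100, "C": 100}, total_channels=4))
# {'A': 2, 'B': 1, 'C': 1}
```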
According to the technical scheme of this embodiment, the conversion unit arranged between the on-chip memory and the off-chip memory enables data to be transmitted between the two memories. Staging data in the on-chip memory during the calculation shortens the data transmission time and thus improves the calculation efficiency, and the final calculation result is output to the off-chip memory after the calculation is completed. This avoids both the situation in which the transmission rate is limited by the rate and bandwidth of the bus when the off-chip DDR memory is accessed through the AXI bus, and the situation in which the chip area becomes large because a larger RAM is placed on the chip, thereby improving the speed and efficiency of neural network calculation.
Fig. 3 is a schematic flowchart of a data flow architecture-based neural network computing method according to an embodiment of the present application, where the present embodiment is applicable to a scenario of computing data of a neural network, and the method can be executed by the data flow architecture-based neural network computing system, and includes steps S310 to S330.
In step S310, data required to be calculated by the current layer of the neural network is obtained.
The current layer is the layer that currently needs to perform data calculation. In one embodiment, the neural network has multiple layers, and after the data calculation of the previous layer is completed, the calculation result is output to the next layer as its input data. Illustratively, the neural network comprises four layers A, B, C and D. When the computation of layer A is completed, the computation result of layer A is output to layer B; layer B has not been computed yet and is ready to start computation, so layer B is the current layer.
In step S320, the calculation time and the transmission time are estimated according to the data.
The calculation time refers to the time required for the current layer to calculate the data. In one embodiment, since the calculation configuration of each layer of the neural network acceleration module is fixed, the calculation time can be estimated from the amount of data and the calculation configuration of the current layer. The transmission time refers to the time required to transmit the data to the computing unit in the neural network acceleration module.
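As a rough illustration of these estimates, the sketch below derives a calculation time from the operation count of the current layer and a transmission time for each path from the amount of data and the bandwidth of the corresponding bus. The throughput and bandwidth figures are placeholders (assumptions), not parameters of the actual system.

```python
# Rough model of the estimates described above (placeholder figures).

def estimate_compute_time(layer_ops, engine_ops_per_s):
    """The layer configuration is fixed, so compute time scales with the operation count."""
    return layer_ops / engine_ops_per_s

def estimate_transfer_time(data_bytes, bus_bytes_per_s):
    return data_bytes / bus_bytes_per_s

FIRST_BUS_BW = 4e9     # bytes/s over the first communication bus (placeholder)
SECOND_BUS_BW = 32e9   # bytes/s over the second communication bus (placeholder)

data_bytes = 2_000_000
compute_time = estimate_compute_time(layer_ops=5e8, engine_ops_per_s=1e12)
first_transmission = estimate_transfer_time(data_bytes, FIRST_BUS_BW)    # path A
second_transmission = estimate_transfer_time(data_bytes, SECOND_BUS_BW)  # path B
print(compute_time, first_transmission, second_transmission)
```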
In step S330, a transmission path of the data to be calculated is determined according to the calculation time and the transmission time.
The transmission path refers to the way in which the data is transmitted to the computing unit. In this embodiment, there are two transmission paths, path A and path B. For example, the computing unit may perform data transfer with the on-chip memory (path B) or with the off-chip memory (path A).
Referring to fig. 4, step S330 includes steps S331 to S333.
In step S331, it is determined whether the calculated time is greater than the first transmission time.
Since the calculation time and the transmission time can be estimated, they can be compared. In one embodiment, since there are two transmission paths, the data transmission time of each path is calculated separately. For example, if the transmission path includes two paths, namely path A and path B, the transmission time of path A and the transmission time of path B both need to be calculated; the transmission time of path A is recorded as the first transmission time and the transmission time of path B as the second transmission time, where the second transmission time is less than the first transmission time.
In step S332, if the calculation time is longer than the first transmission time, the calculation unit is controlled to perform data interaction with the off-chip memory or the calculation unit is controlled to perform the data interaction with the on-chip memory.
The data required by the current layer of the neural network can be stored in the on-chip memory. In one embodiment, path A controls the computing unit to interact with the off-chip memory, and path B controls the computing unit to interact with the on-chip memory, where interaction refers to sending and receiving data. In the case that the calculation time is longer than the first transmission time, data transmission may be performed along either path. Illustratively, the path with the shortest transmission time is selected for data transmission; for example, the data is first transferred from the off-chip memory to the on-chip memory, and the data interaction then takes place between the on-chip memory and the computing unit.
In step S333, in the case that the calculation time is less than or equal to the first transmission time, the calculation unit performs the data interaction with the on-chip memory.
Referring to fig. 5, step S332 includes steps S3321 to S3323.
In step S3321, it is determined whether the data is sorted.
Here, the data refers to the data stored in the off-chip memory, and being in order means being arranged according to a certain rule. In this embodiment, the data is considered in order when the data to be calculated in the current layer is grouped according to its type, that is, when data of the same type are arranged together.
In step S3322, the computing unit is controlled to interact with the off-chip memory when the data is ordered.
In step S3323, if the data is not ordered, the computing unit is controlled to interact with the on-chip memory.
In one embodiment, the off-chip memory sends the data to the on-chip memory for simple sorting and then transmits the data to the computing unit.
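Putting steps S331 to S3323 together, the decision can be sketched as the following function. It is only an illustration of the described flow, using the path names A and B from the description, and is not the actual control logic of the data flow direction selection subunit.

```python
# Sketch of the transmission-path decision in steps S331 to S3323 (illustration only).

def choose_path(compute_time, first_transmission_time, data_is_sorted):
    """Return 'A' (computing unit <-> off-chip memory) or 'B' (computing unit <-> on-chip memory)."""
    if compute_time > first_transmission_time:
        # Transmission is hidden behind computation, so either memory works:
        # sorted data can come straight from off-chip, while unsorted data is
        # staged on-chip for simple sorting first.
        return "A" if data_is_sorted else "B"
    # Otherwise transmission would dominate: use the faster on-chip path.
    return "B"

print(choose_path(2.0e-3, 0.5e-3, data_is_sorted=True))   # A
print(choose_path(2.0e-3, 0.5e-3, data_is_sorted=False))  # B
print(choose_path(0.2e-3, 0.5e-3, data_is_sorted=True))   # B
```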
In an alternative embodiment, it may also be determined whether the calculation result of the current layer will be needed by a later layer. If the calculation result of the current layer is needed by a later layer, the data of the current layer can be temporarily stored in the on-chip memory, sorted, and kept until that layer uses it. Illustratively, the neural network comprises layer A, layer B and layer C in sequence; if layer C needs the data of layer A during its calculation, the data of layer A is temporarily stored in the on-chip memory, and the on-chip memory transmits the data of layer A to the calculation unit when layer C is calculated. If the calculation result of the current layer is not needed by any later layer, the result is output directly to the off-chip memory to release the storage space of the on-chip memory.
In one embodiment, sorting the data includes sorting data of different types and sorting data of the same type. Taking the sorting of different types of data as an example, the data types may include data, weights, biases and the like, and different computing units 123 may require the data types to arrive in different orders: some may require the data first and then the weights and biases, and others the biases first and then the weights. The order therefore needs to be determined according to the order of data types required by the computing unit 123. Arranging the data types in the required order in advance improves the computing capability of the computing unit.
Sorting is also required within data of the same type. Taking a picture as an example (a picture is also a kind of data), generally only the three colors red, blue and yellow are processed when the picture is calculated by the first layer of the neural network. Assume the picture has 10 × 8 pixels, each with the three colors red, blue and yellow, so that the total data size of the picture is 10 × 8 × 3. The data may be ordered color by color, for example all the red values first, then all the yellow values, and finally all the blue values; alternatively, the red, blue and yellow values of the first pixel may be placed together, followed by the three colors of the second pixel, and so on. The sorting requirement for data of the same type is determined by the requirement of the computing unit and by the way the data is stored.
According to the technical scheme of this embodiment, the amount of data required by the current layer of the neural network is obtained, the transmission time and the calculation time are estimated from this amount, and the transmission path is determined according to the calculation time and the transmission time. In this way an optimal path is selected to satisfy the calculation efficiency of the neural network, the situations in which the calculation efficiency is too low or the chip area is too large are avoided, and the calculation speed and efficiency of the neural network are improved.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A neural network computing system based on a dataflow architecture, comprising:
an off-chip memory and a neural network acceleration module, wherein the neural network acceleration module comprises a conversion unit, an on-chip memory and a calculation unit;
the off-chip memory is used for storing data;
the conversion unit is connected between the on-chip memory and the off-chip memory and used for realizing conversion between a first communication bus of the off-chip memory and a second communication bus of the on-chip memory so as to store the data into the on-chip memory;
the calculation unit is directly connected to the on-chip memory through a second communication bus for performing calculations based on the data received from the on-chip memory.
2. The neural network computing system of claim 1, wherein the computing unit is directly connected to the off-chip memory through a first communication bus for performing computations based on the data received directly from the off-chip memory.
3. The neural network computing system of claim 2, wherein the computing unit comprises:
and the data flow direction selection subunit is used for controlling the flow direction of the data.
4. The neural network computing system of claim 1 or 2, wherein the first communication bus is an AXI bus, the data being exchanged between the conversion unit and the off-chip memory via the AXI bus protocol or between the computing unit and the off-chip memory via the AXI bus protocol.
5. The neural network computing system of claim 1, wherein the second communication bus is a Sif bus, the data is exchanged between the conversion unit and the on-chip memory through the Sif bus protocol, and the data is exchanged between the on-chip memory and the computing unit through the Sif bus protocol.
6. The neural network computing system of claim 5, wherein the channels of the Sif bus are at least one.
7. The neural network computing system of claim 1, wherein the computational units comprise operators.
8. A neural network computing method based on a data flow architecture, applied to the neural network computing system according to any one of claims 1-7, wherein the neural network has multiple layers, and the method comprises:
acquiring data to be calculated of a current layer of a neural network;
predicting the calculation time and the transmission time of the current layer according to the data;
and determining a transmission path of the data needing to be calculated according to the calculation time and the transmission time.
9. The method of claim 8, wherein the transmission times include a first transmission time and a second transmission time, the first transmission time being a data transmission time between the computing unit and the off-chip memory, the second transmission time being a data transmission time between the computing unit and the on-chip memory, the first transmission time being greater than the second transmission time;
the determining a transmission path of the data to be calculated according to the calculation time and the transmission time includes:
judging whether the calculation time is greater than the first transmission time or not;
controlling a computing unit to perform data interaction with an off-chip memory or controlling the computing unit to perform data interaction with an on-chip memory under the condition that the computing time is longer than the first transmission time;
and the computing unit performs data interaction with the on-chip memory under the condition that the computing time is less than or equal to the first transmission time.
10. The method of claim 9, wherein controlling the computing unit to interact with off-chip memory or controlling the computing unit to interact with on-chip memory comprises:
judging whether the data are arranged in sequence;
under the condition that the data are arranged in sequence, controlling the computing unit to interact with the off-chip memory;
and controlling the computing unit to interact with the on-chip memory under the condition that the data are not arranged in sequence.
CN202010733604.2A 2020-07-27 2020-07-27 Neural network computing system and method based on data flow architecture Pending CN111860788A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010733604.2A CN111860788A (en) 2020-07-27 2020-07-27 Neural network computing system and method based on data flow architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010733604.2A CN111860788A (en) 2020-07-27 2020-07-27 Neural network computing system and method based on data flow architecture

Publications (1)

Publication Number Publication Date
CN111860788A 2020-10-30

Family

ID=72948229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010733604.2A Pending CN111860788A (en) 2020-07-27 2020-07-27 Neural network computing system and method based on data flow architecture

Country Status (1)

Country Link
CN (1) CN111860788A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination