CN116472537A - Data processing method and processor - Google Patents

Data processing method and processor

Info

Publication number
CN116472537A
CN116472537A
Authority
CN
China
Prior art keywords
calculation
layer
data
run
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180077853.3A
Other languages
Chinese (zh)
Inventor
熊旭红
石洁珂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN116472537A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a data processing method and a processor, relate to the field of artificial intelligence, and solve the problem that a processor needs to read and write data many times, causing a large amount of power consumption. The specific scheme is as follows: first data for performing a first calculation run of a first calculation layer is acquired. The first data is stored in a first line cache of the first calculation layer, the first line cache of the first calculation layer being included in a local cache. The first calculation run of the first calculation layer is calculated to obtain second data. The second data is stored in a first line cache of a second calculation layer, the second calculation layer being a calculation layer after the first calculation layer among the N calculation layers. In the case that the accumulated data stored in the first line cache of the second calculation layer is sufficient for performing the first calculation run of the second calculation layer, the first calculation run of the second calculation layer is calculated to acquire fifth data corresponding to the first calculation run of the second calculation layer.

Description

Data processing method and processor
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a data processing method and a processor.
Background
Currently, neural networks are widely used in scenarios such as image classification, video processing, speech recognition, and data analysis. Take the example of a processor processing an image using a neural network. Since the feature map (FM) data of an image to be processed is generally too large to be stored in the local cache of the processor, the feature map data can be stored in an external memory having a large storage space. When processing the image, the processor may read the feature map data of the image (e.g., referred to as the original input feature map) from the external memory into the processor and perform calculation according to the neural network model. After obtaining the calculation result (e.g., referred to as an output feature map), the processor may store the output feature map in the external memory.
It should be noted that a neural network model generally includes a plurality of different calculation layers, such as convolution layers and pooling/activation layers, and the computation process is different for each calculation layer. One calculation layer may correspond to a kernel function (kernel); according to the kernel function, the processor may perform computation on the input feature map fed into that calculation layer and obtain a corresponding output feature map. After the calculation of one calculation layer is completed and the corresponding output feature map is obtained, the output feature map can be stored in the external memory, so that when the next calculation layer is computed, the data stored in the external memory can be read and used as the input feature map of that calculation layer.
It can be seen that, to complete one round of calculation of the neural network model, the processor needs to read or write a large amount of data from or to the external memory many times, which causes a large amount of power consumption for the device performing the neural network calculation. In addition, as the number of objects to be processed (e.g., the number of images to be processed) and their complexity (e.g., the data amount of the feature maps of the images to be processed) increase, the calculation power consumption also increases.
Disclosure of Invention
The embodiment of the application provides a data processing method and a processor, so as to reduce the power consumption of neural network calculation. In order to achieve the above purpose, the following technical solutions are adopted in the embodiments of the present application.
In a first aspect, a data processing method is provided. The method is applied to a processor that performs computation of a neural network, where the neural network includes N calculation layers and N is an integer greater than or equal to 2. The processor is provided with a local cache. The method includes the following steps. First data is acquired, where the first data is used to perform a first calculation run of a first calculation layer, and the first calculation layer is any one of the N calculation layers. The first data is stored in a first line cache of the first calculation layer, where the first line cache of the first calculation layer is included in the local cache. The first calculation run of the first calculation layer is calculated to obtain second data corresponding to the first calculation run of the first calculation layer, where the first calculation run of the first calculation layer includes a convolution calculation performed on one or more lines of the first data using a convolution window of the first calculation layer. The second data is stored in a first line cache of a second calculation layer included in the local cache, where the second calculation layer is a calculation layer after the first calculation layer among the N calculation layers. In the case that the accumulated data stored in the first line cache of the second calculation layer is sufficient for performing the first calculation run of the second calculation layer, the first calculation run of the second calculation layer is calculated to acquire fifth data corresponding to the first calculation run of the second calculation layer, where the first calculation run of the second calculation layer includes a convolution calculation performed on one or more lines of the second data using a convolution window of the second calculation layer.
Based on this scheme, a pipelined computing mechanism between multiple calculation layers is provided. In this example, when performing the convolution calculation of one calculation layer, the processor only needs to obtain the data required to perform one calculation run. Take convolution calculation as an example. One calculation run may be the calculation performed by the convolution window sliding from the leftmost position to the rightmost position. For example, if the convolution window has A rows, the processor only needs to acquire A rows of data to start the calculation before performing the convolution calculation of the calculation layer, and does not need to acquire the full input feature map required by the current calculation layer. Since the data amount of A rows of data is very small, it can be stored in the local cache rather than in an external memory (such as a DDR). Thus, when the current layer is calculated, the A rows of data can be read directly from the local cache, and one run of the current calculation layer can be calculated from them. It will be appreciated that when the current calculation layer is not the first calculation layer of the neural network, the A rows of data may be the calculation results of the previous calculation layer. Compared with the prior art, in the scheme provided in this example, since the previous calculation layer only needs to calculate and provide the A rows of data, the intermediate data between the previous calculation layer and the current calculation layer does not need to be written into the DDR and then read back from the DDR by the processor. Instead, after calculating the A rows of data, the previous calculation layer may store them in the line cache configured for the current calculation layer in the local cache. That is, the intermediate data does not need to be written into the DDR, and therefore does not need to be read from the DDR when performing the calculation of the current layer. Reading and writing data in the local cache does not require multiple data interactions with the DDR, thereby saving power consumption.
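As an illustration only, the following Python sketch shows what the calculation of one run over a small per-layer line cache could look like. It is not the patented implementation; the function name compute_run, the example data values, and the kernel are assumptions introduced purely for the example.

```python
# A minimal sketch, assuming a line cache that holds exactly as many rows as
# the convolution window is tall. One "calculation run" slides the window from
# the leftmost to the rightmost position over those buffered rows.

def compute_run(line_cache, kernel, stride=1):
    """Compute one run over the buffered rows and return one output row."""
    a, b = len(kernel), len(kernel[0])           # window: a rows x b columns
    assert len(line_cache) == a, "line cache holds exactly A rows"
    width = len(line_cache[0])
    out_row = []
    for col in range(0, width - b + 1, stride):  # left-to-right sliding
        acc = 0
        for i in range(a):
            for j in range(b):
                acc += line_cache[i][col + j] * kernel[i][j]
        out_row.append(acc)
    return out_row

# Example: a 2x2 window over two buffered rows of a 6-wide feature map.
rows = [[1, 2, 3, 4, 5, 6],
        [7, 8, 9, 10, 11, 12]]
kernel = [[1, 0],
          [0, 1]]
print(compute_run(rows, kernel))  # five results, one per window position
```

The point of the sketch is that only A rows of the input ever need to be resident at once; everything else can stay in external memory until it is needed.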
In one possible design, the method further includes: calculating a second calculation run of the first calculation layer in the case that the accumulated data is insufficient for performing the first calculation run of the second calculation layer, where the second calculation run of the first calculation layer is a calculation run after the first calculation run of the first calculation layer. Based on this scheme, a rollback mechanism in the interlayer computing process is provided. In this example, after the previous calculation layer completes the calculation of one run, the processor may determine whether the calculation of one run of the current layer can be performed. If the data stored in the line cache corresponding to the current calculation layer is insufficient, the processor may fall back to the previous layer and continue to perform the calculation of the next run, so as to acquire a new line of calculation results and update it into the line cache of the current layer. The processor may then repeat the above scheme: determine whether the data stored in the current line cache can support the current calculation layer in completing the calculation of one run; if so, perform the calculation of one run of the current calculation layer; if not, continue to return to the previous calculation layer for calculation. By analogy, a similar judgment-and-rollback mechanism can be executed for the subsequent calculation layers, so that the system calculation is not blocked on any calculation layer, and each calculation layer only needs to occupy, at any time, a number of line caches corresponding to the number of rows of its convolution window.
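Purely as a hedged sketch of this rollback decision, and not the claimed design, the readiness check and the fallback search could look as follows; the helper names can_run and next_layer_to_compute, the dictionary fields, and the example numbers are all assumptions.

```python
# Hedged sketch of the rollback decision: if a layer's line cache does not yet
# hold enough rows for one run, fall back to an earlier layer and compute one
# more run there first. All names here are illustrative assumptions.

def can_run(accumulated_rows, window_rows, stride, runs_done):
    """True if enough rows have accumulated for this layer's next run.

    The first run needs window_rows rows; every later run needs stride more.
    """
    return accumulated_rows >= window_rows + runs_done * stride

def next_layer_to_compute(layers):
    """Walk from the last layer backwards to find the deepest layer that can
    already perform its next run (the rollback search).

    layers: list of dicts with keys 'accumulated', 'window_rows', 'stride',
            'runs_done'; layer 0 is assumed to always have input available.
    """
    for idx in range(len(layers) - 1, 0, -1):
        lyr = layers[idx]
        if can_run(lyr['accumulated'], lyr['window_rows'],
                   lyr['stride'], lyr['runs_done']):
            return idx
    return 0  # roll all the way back to the first layer

# Example: layer 2 has a 3-row window but only 2 rows accumulated, so the
# scheduler falls back to layer 1 (index 0 in this list).
layers = [
    {'accumulated': 10**9, 'window_rows': 2, 'stride': 1, 'runs_done': 0},
    {'accumulated': 2,     'window_rows': 3, 'stride': 2, 'runs_done': 0},
]
print(next_layer_to_compute(layers))  # -> 0 (fall back to layer 1)
```

The check mirrors the text: the first run of a layer waits for a full window of rows, and each later run only waits for one additional step size worth of rows.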
In one possible design, the number of lines of the first line cache is equal to the number of rows of the convolution window of the first calculation layer. Based on this scheme, a specific definition of the line count of the first line cache is provided. It will be appreciated that the first line cache, which is used to store the first data, may be a storage space configured for the first calculation layer in the local cache of the processor, for storing the data needed by any one run of the first calculation layer. The number of lines of the first line cache needs to be at least equal to the number of rows of the convolution window of the first calculation layer in order to store enough data for the calculation of one run.
In one possible design, when the first calculation layer is the first calculation layer of the neural network, acquiring the first data includes: reading the first data from an external memory, where the first data is at least part of an input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor. Based on this scheme, a data acquisition mechanism is provided for the case where the first calculation layer is the first calculation layer of the neural network. It will be appreciated that, since the data amount of the input feature map is typically large, it may be stored in an external memory (e.g., DDR) that can interact with the processor. Before executing one calculation run of the first calculation layer, the processor may read the corresponding data from the DDR and write it into the line cache configured for the first calculation layer, so as to perform the calculation of that run of the first calculation layer.
In one possible design, the first data is part of an input feature map stored in an external memory, and the method further includes: acquiring third data from the external memory, where the third data is another part of the input feature map and is used to perform a second calculation run of the first calculation layer. The third data is stored by overwriting fourth data, where the fourth data is the data in the first data that will no longer participate in the calculation of the first calculation layer. Based on this scheme, a mechanism for dynamically adjusting the data in the line cache is provided. In this example, for each calculation in one calculation run, there is a portion of the first data that will not be used again in subsequent calculations. The processor can then read some new data from the DDR and overwrite the data that will not be used in subsequent calculations, thereby updating the data. In this way, after the current calculation run is completed, the data needed to perform a new calculation run is already stored in the corresponding line cache. It should be noted that, in some embodiments of the present application, the replacement of the data may be performed after a calculation run is completed, or may be performed during the execution of a calculation run.
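A minimal sketch of this in-place update is shown below, assuming a row-granularity refresh built on Python's collections.deque; the description above also allows element-by-element replacement during a run, and the names and values here are illustrative only.

```python
# Hedged sketch: data that will no longer be used is overwritten by newly read
# rows, so the line cache never grows beyond the window height. The iterator
# below stands in for reads from external memory such as DDR.

from collections import deque

def refresh_line_cache(line_cache, feature_map_rows, stride):
    """Drop the oldest `stride` rows and pull in the next unread rows.

    line_cache:       deque of rows with maxlen == window height.
    feature_map_rows: iterator over the remaining rows of the input feature map.
    """
    for _ in range(stride):
        line_cache.append(next(feature_map_rows))  # maxlen evicts the oldest row
    return line_cache

# Example: a 2-row cache over a 4x3 input, advancing by one row per run.
fm = iter([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
cache = deque([next(fm), next(fm)], maxlen=2)   # rows 1 and 2 buffered
refresh_line_cache(cache, fm, stride=1)
print(list(cache))                              # rows 2 and 3 remain buffered
```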
In one possible design, storing the second data in the first line cache of the second calculation layer includes: during the first calculation run of the first calculation layer, each time a calculation result of the convolution window of the first calculation layer at one position is obtained, storing that calculation result in the first line cache of the second calculation layer. Based on this scheme, a writing mechanism for the second data is provided. In this example, each calculation in the first calculation layer produces one calculation result, which may be stored at the corresponding location in the line cache of the second calculation layer. Thus, after one calculation run of the first calculation layer is completed, one or more lines of data stored in the line cache of the second calculation layer can be used to perform the calculation of the second calculation layer.
In one possible design, after acquiring the fifth data corresponding to the first calculation run of the second calculation layer, the method further includes: storing the fifth data in a first line cache of a third calculation layer, where the first line cache of the third calculation layer is included in the local cache, the third calculation layer is a calculation layer after the second calculation layer, and the fifth data is used to perform the convolution calculation of the third calculation layer. Based on this scheme, a computing mechanism for the other calculation layers included in the neural network is provided. For example, each time the second calculation layer completes the calculation of one run, the calculation result may be stored in the line cache corresponding to the next calculation layer (for example, the third calculation layer), so that, after sufficient data has been accumulated, the calculation of one run of the third calculation layer can be performed.
In a second aspect, a data processing apparatus is provided for performing a neural network calculation, where the neural network includes N calculation layers and N is an integer greater than or equal to 2. The data processing apparatus is provided with a local cache. The apparatus includes: an acquiring unit configured to acquire first data for performing a first calculation run of a first calculation layer, where the first calculation layer is any one of the N calculation layers; a storage unit configured to store the first data in a first line cache of the first calculation layer, where the first line cache of the first calculation layer is included in the local cache; and a calculation unit configured to calculate the first calculation run of the first calculation layer to obtain second data corresponding to the first calculation run of the first calculation layer, where the first calculation run of the first calculation layer includes a convolution calculation performed on one or more lines of the first data using a convolution window of the first calculation layer. The storage unit is further configured to store the second data in a first line cache of a second calculation layer, where the first line cache of the second calculation layer is included in the local cache and the second calculation layer is a calculation layer after the first calculation layer among the N calculation layers. The calculation unit is further configured to, in the case that the accumulated data stored in the first line cache of the second calculation layer is sufficient for performing the first calculation run of the second calculation layer, calculate the first calculation run of the second calculation layer to obtain fifth data corresponding to the first calculation run of the second calculation layer, where the first calculation run of the second calculation layer includes a convolution calculation performed on one or more lines of the second data using a convolution window of the second calculation layer.
In one possible design, the calculation unit is further configured to calculate a second calculation run of the first calculation layer in the case that the accumulated data is insufficient for performing the first calculation run of the second calculation layer, where the second calculation run of the first calculation layer is a calculation run after the first calculation run of the first calculation layer.
In one possible design, the number of lines of the first line cache is equal to the number of lines of the convolution window of the first calculation layer.
In one possible design, the obtaining unit is configured to read the first data from an external memory, where the first data is at least part of an input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor.
In one possible design, the first data is part of an input feature map stored in an external memory, and the acquiring unit is further configured to acquire third data from the external memory, where the third data is another part of the input feature map and is used to perform a second calculation run of the first calculation layer. The third data is stored by overwriting fourth data, where the fourth data is the data in the first data that will no longer participate in the calculation of the first calculation layer.
In one possible design, the storage unit is further configured to, during the first calculation run of the first calculation layer, store a calculation result of the convolution window of the first calculation layer at one position in the first line cache of the second calculation layer each time such a calculation result is acquired.
In one possible design, the acquiring unit is further configured to acquire the fifth data corresponding to the first calculation run of the second calculation layer. The storage unit is further configured to store the fifth data in a first line cache of a third calculation layer, where the first line cache of the third calculation layer is included in the local cache, the third calculation layer is a calculation layer after the second calculation layer, and the fifth data is used to perform the convolution calculation of the third calculation layer.
In a third aspect, there is provided a processor comprising one or more computational cores, and a local cache, the processor being configured to implement the data processing method of any of the first aspect and its possible designs.
In a fourth aspect, there is provided an electronic device comprising one or more processors as described in the third aspect and one or more memories. The memory is coupled to the processor, the memory storing computer instructions. The computer instructions, when executed by the processor, cause the electronic device to perform the data processing method of any of the first aspect and its possible designs.
In a fifth aspect, a computer-readable storage medium is provided, including computer instructions which, when executed, cause the data processing method of any one of the first aspect and its possible designs to be performed.
For example, the second aspect to the fifth aspect and any of their possible designs may correspond to the first aspect and any of the possible designs of the first aspect, and thus may bring about similar technical effects, which are not repeated here.
Drawings
FIG. 1 is a schematic diagram of a convolutional neural network;
fig. 2 is a schematic structural diagram of a neural network computing device according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a convolutional layer according to an embodiment of the present disclosure;
fig. 4 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a computing logic provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of another computing logic provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of another computing logic provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of a line cache according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a further line cache provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of another computing logic provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of another computing logic provided in an embodiment of the present application;
fig. 12 is a schematic structural diagram of a neural network according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a computation logic timing diagram in a single-core and multi-core scenario according to an embodiment of the present disclosure;
FIG. 14 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
fig. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Neural networks commonly used in the field of artificial intelligence may include convolutional neural networks (Convolutional Neural Network, CNN), recursive neural networks (Recursive Neural Network, RNN), and the like. By way of example, FIG. 1 shows a schematic diagram of a convolutional neural network. As shown in FIG. 1, a convolutional neural network may be provided with convolution layers including one or more convolution calculation layers, and a pooling/activation layer including one or more calculation layers. When data is input into the convolution layers, the processor can perform convolution calculation on the input feature map according to the convolution kernel corresponding to each convolution calculation layer in the convolution layers. The processor may use the convolution window corresponding to a preset convolution kernel to perform sliding calculation on the input feature map according to a preset step size (stride), so as to obtain the calculation result corresponding to each window position, and these calculation results may be combined into the corresponding output feature map. When data is input into the pooling/activation layer, the processor may process the input feature map according to a specific function, such as a pooling and/or activation process. As shown in FIG. 1, the convolutional neural network may further include an input layer. The input layer may be used to store the feature map data of the images that need to be processed, so that the convolution layers may obtain an input feature map from the input layer. In some implementations, the input layer may be provided in an external memory connected to the processor performing the convolution calculations.
It should be noted that the structure of the convolutional neural network may differ in different application scenarios. In some implementations, the convolution calculation layers may be interleaved with the calculation layers in the pooling/activation layer. For example, as shown in FIG. 1, a pooling or activation process may be performed after part of the convolution calculations are performed, followed by further convolution calculations. In other implementations, the convolution calculations may be performed first, followed by optional pooling and/or activation processing. After the calculation is completed, the result of one round of calculation of the convolutional neural network may be obtained and output through an output layer.
It should be noted that a local buffer may be provided inside the processor. The local cache may be used to store small amounts of data, and data stored in the local cache can be read and written quickly. For example, the local cache may be used to store data such as the convolution kernels corresponding to the respective calculation layers of a convolutional neural network model. When a neural network calculation is performed, the data amount of the original input feature map is generally large and cannot be stored in the local cache. Therefore, the original input feature map may be stored in an external memory connected to the processor, where the external memory may be a storage medium having a large storage space. For example, the storage medium may be a double data rate synchronous dynamic random access memory (DDR SDRAM) or the like. In this example, the DDR SDRAM may also be referred to simply as DDR. Take the external memory being a DDR as an example. The processor may read the original input feature map from the DDR at the beginning of the calculation to perform the neural network calculation. When multiple calculation layers are included in the neural network model, the output feature map of an upper layer may be used as the input feature map of the next layer. The feature map data between two calculation layers may also be referred to as intermediate data. In general, the data amount of the intermediate data in the neural network calculation process is not greater than that of the original input feature map; however, it still far exceeds the storage capacity of the local cache. If the processor writes the intermediate data into the DDR, the processor has to perform multiple read-write interactions with the DDR involving a large amount of data, thereby causing a large amount of power consumption. In addition, with the continuous improvement of processor computing power, the shortage of read-write bandwidth also limits the computing efficiency of the neural network.
To address the above problem, the feature map data stored in the DDR may be split so that the resulting feature map slices can be stored in the local cache. Thus, when performing the neural network calculation, the processor can read one slice from the DDR into the local cache and calculate on the data of that slice. In combination with the above description, the data amount of the intermediate data and the output feature map in the calculation process of one slice is not greater than the data amount of the input feature map corresponding to that slice. After completing the calculation of one slice, the processor may read the next slice from the DDR into the local cache and repeat the above steps to perform the calculation, and so on until the calculation of all the slices is completed. In this way, the DDR stores the output feature maps corresponding to the plurality of slices, and the processor then needs to combine the output feature maps of the respective slices to obtain the complete output feature map corresponding to the original input feature map. In order to leave no gaps between the output feature maps of the slices, adjacent slices need to include repeated data when the original input feature map is sliced. This results in the repeated data being calculated multiple times, which reduces the efficiency of the overall calculation process and prevents the power consumption from being optimized.
To address these problems, the data processing method provided in the embodiments of the present application establishes a corresponding pipelined computing mechanism across the calculations of different calculation layers, so that the calculation of the previous layer does not need to be fully completed before the calculation of the next layer is performed. The scheme can significantly reduce the data amount of the intermediate data, so that the intermediate data can be stored in the local cache, avoiding the power consumption overhead of multiple large-scale read-write interactions between the processor and an external memory (such as a DDR). Meanwhile, because a pipelined computing mechanism is established, the whole computing process does not need to wait, and the computing efficiency can be effectively improved. With this scheme, the pipelined computing mechanism avoids repeated, invalid calculation of data, so the calculation efficiency is significantly higher than that of the prior-art scheme. In particular, the data processing method provided in the embodiments of the present application provides a calculation method with a rollback mechanism, which is applicable to different convolution calculation scenarios with a step size greater than or equal to 1. The scheme provided in the embodiments of the present application is described in detail below with reference to the accompanying drawings.
Referring to FIG. 2, a logic structure diagram of a neural network computing device 200 according to an embodiment of the present application is shown. The neural network computing device 200 may be used to implement the computation of neural network models, including convolutional neural network models, according to the methods provided in the embodiments of the present application. For ease of illustration, an external storage module 230 coupled to the neural network computing device 200 is also shown in FIG. 2. The neural network computing device 200 may interact with the external storage module 230 through an interface provided on it, for example, to read feature map data to be processed (e.g., the original input feature map) from the external storage module 230, or to write the output feature map data of a completed neural network calculation into the external storage module 230. The neural network computing device 200 may be the processor described above.
In some implementations, the external storage module 230 may include the DDR in the above description. In other implementations, the external storage module 230 may also include a system cache, which may be the system cache of the device provided with the neural network computing apparatus 200 shown in FIG. 2. The system cache may be implemented via different storage media in different devices. For example, the system cache may be implemented via flash memory (flash). For another example, the system cache may also be a solid state disk (Solid State Device, SSD) or another storage medium.
As shown in fig. 2, in the neural network computing device 200 provided in the embodiment of the present application, a computing module 210 and a local cache 220 may be included. The computing module 210 may be a module in the neural network computing device 200 for implementing various computing functions. For example, the calculation module 210 may include a convolution calculation unit 211, an activation calculation unit 212, a pooling calculation unit 213, an Eltwise calculation unit 214, and the like. Among them, the convolution calculation unit 211 may be used to perform convolution calculation. As one possible implementation, the convolution calculation unit 211 may include one or more multipliers or other components capable of performing convolution calculations. The activation calculation unit 212 may be used to perform an activation process. The pooling calculation unit 213 may be used to implement the functions of the pooling process. The Eltwise calculation unit 214 may be used to implement a data-by-data (elementwise) calculation function.
It should be noted that, in the foregoing examples, the structural example of the computing module 210 is only one possible implementation, and in the computation for different neural network models, the units included in the computing module 210 for implementing the respective functions may be the same as or different from the foregoing examples. For example, when there is no need for activating computation and pooling computation in the neural network model, the activating computation unit 212 and the pooling computation unit 213 may not be provided in the computation module 210. For another example, the Eltwise calculation unit 214 may not be provided in the calculation module 210 when there is no calculation requirement of the elementwise in the neural network model.
As a possible implementation, the computing module 210 may be a neural network processor (neural-network processing unit, NPU), a field programmable gate array (Field Programmable Gate Array, FPGA), a central processing unit (Central Processing Unit, CPU), or a graphics processing unit (Graphics Processing Unit, GPU) that can implement the corresponding computing functions. It should be noted that, taking the computing module 210 being an NPU as an example, the NPU may be a single-core NPU or a multi-core NPU with multiple computing cores; the embodiments of the present application are not limited in this regard. It can be appreciated that the processing logic of the data processing method provided in the embodiments of the present application, when applied on a single-core NPU, can be reused on a multi-core NPU. When the scheme is applied to a multi-core NPU, the parallel computing mechanism of the multi-core NPU can be utilized to further improve the computing efficiency. For example, when there are multiple computing cores in the NPU, the multiple computing cores may be interconnected. For example, a network on chip (Network On Chip, NOC) structure may be used to interconnect the multiple computing cores. It can be understood that, with the NOC interconnection, the interconnection relationship between different computing cores can be dynamically configured according to the network structure, so that the computation amount can be dynamically allocated according to the computing pressure of each computing core, achieving dynamic scheduling of computation and thereby improving the computing efficiency of the multi-core NPU.
With continued reference to fig. 2, a local cache 220 may also be provided in the neural network computing device 200 provided in an embodiment of the present application. The local cache 220 may be used for fast reading and writing of data. In some implementations of the present application, to save the cost of the neural network computing device 200 while taking into account the size requirements of the neural network computing device 200, the local cache 220 may be a storage medium with a smaller storage space. For example, taking the function of the computing module 210 implemented by the NPU as an example, the local cache 220 may be an internal cache of the NPU. In this application, the local cache 220 may be used to support line buffer (line buffer) techniques. For example, as shown in FIG. 2, the local cache 220 may have multiple line caches configured therein.
As an example, the plurality of line caches may respectively correspond to different calculation layers in the neural network model. The number of line caches corresponding to a calculation layer may be determined according to the window size of the kernel function of that calculation layer. For example, take the convolution window of a convolution calculation layer having M rows and N columns as an example: M line caches may be configured in the local cache 220 for that convolution calculation layer. Similarly, for the other calculation layers in the neural network model, corresponding line caches may also be respectively configured in the local cache 220. It will be appreciated that, since the number of rows of a kernel function is generally small, the sum of the numbers of line caches configured for the various calculation layers of the neural network model is not excessive, and the above configuration is feasible within the space of current local caches 220.
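As a hedged illustration of this sizing rule only, the layer names, kernel heights, and feature-map width below are made-up examples rather than values from the embodiment:

```python
# Hedged sketch: the number of line-cache rows reserved per calculation layer
# simply follows the row count of that layer's kernel window.

layer_kernel_rows = {"conv1": 2, "conv2": 3, "pool1": 2, "conv3": 3}
feature_map_width = 224  # illustrative width of one feature-map row

line_cache_plan = {
    name: {"rows": rows, "row_width": feature_map_width}
    for name, rows in layer_kernel_rows.items()
}

total_rows = sum(cfg["rows"] for cfg in line_cache_plan.values())
print(line_cache_plan)
print("total buffered rows:", total_rows)  # small compared with a full map
```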
It should be noted that, in other implementations of the present application, the configuration of the line cache of the computation layer may also be performed in combination with the step size of the current computation layer and the related computation layer, and/or the special computation (such as elementwise computation) requirement in the neural network model. The specific configuration will be set forth in detail in the following description.
In the following example, the data processing method provided in the embodiments of the present application is described by taking the neural network model being a convolutional neural network model with the structure shown in FIG. 1, the calculation module being a single-core NPU, the local cache being the local cache in the NPU, and the external storage module being a DDR as an example. In the following description, the process of performing each convolution calculation layer in the convolution layers by using the data processing method provided in the embodiments of the present application is described in detail. It will be appreciated that the calculation of the other layers in the convolutional neural network model may refer to the calculation process in the convolution layers. As an example, the convolutional neural network may include convolution layers in which N convolution calculation layers are disposed, as shown in FIG. 3. As shown in FIG. 3, the N convolution calculation layers may be layer 1, layer 2, ..., layer N. In the process of performing the convolution calculation, the input feature map of layer 1 may be the original input feature map, and the output feature map obtained after layer N completes the convolution calculation may be referred to as the convolution output feature map.
The NPU may be initialized at the beginning of the convolution calculation. Illustratively, during the initialization process, the NPU may read, from the DDR, a number of lines of data corresponding to the number of rows of the convolution window of layer 1, and write the data into the line caches configured for layer 1 in the local cache. In the following description, the data in the ith row and jth column of the original input feature map is denoted a_ij, where i and j are integers greater than or equal to 1. Take the layer 1 convolution kernel having a size of A1 rows and B1 columns as an example. A1 line caches may be configured for layer 1 in the local cache. During initialization, the NPU may read the first A1 lines of data of the original input feature map from the DDR and write them into the line caches configured for layer 1 in the local cache. After initialization is complete, the NPU may perform the layer 1 convolution calculation on the data written into the local cache. For example, the convolution window corresponding to layer 1 is used to perform convolution calculation on the A1 line caches from left to right in sequence, so as to obtain the convolution result of each corresponding position.
Here, the calculation process performed by the convolution window from the leftmost position to the rightmost position may be referred to as the calculation of one run. The calculation of one run includes the calculation of the part of one or more lines of data that is located within the window. After the calculation of one run is completed, the 1st row of the output feature map corresponding to layer 1 can be obtained. The output feature map of layer 1 may be used as the input feature map of layer 2. Thus, each time the convolution calculation for one position is completed, the NPU may store the calculation result at the corresponding location in the line cache configured for layer 2 in the local cache. For example, after the convolution calculation of the 1st run of layer 1 is completed, the 1st line of data of the input feature map corresponding to layer 2 is stored in the line cache corresponding to layer 2. It should be noted that, in some implementations of the present application, as the convolution calculation in layer 1 proceeds, the NPU may read new data from the DDR to overwrite the layer 1 data that is no longer in use, so that after the calculation of one run is completed, layer 1 can continue to perform the calculation of the next run without waiting for the NPU to read data from the DDR.
Illustratively, take the step size of the layer 1 convolution kernel as S1. After the layer 1 convolution window completes the 1st convolution calculation in the corresponding run, the data in the first S1 rows and first S1 columns (e.g., data a_11 to data a_(S1,S1)) will not be used again. Therefore, after layer 1 completes the 1st convolution calculation, the NPU may read from the DDR the S1 columns of data in the rows of the original input feature map that are newly required for the next run, and store them in the local cache at the locations where the first S1 columns of the first S1 rows were originally stored. By analogy, after the convolution calculation of the 1st run of layer 1 is completed, the data required for the convolution calculation of the next run is already stored in the A1 line caches configured for layer 1. In other implementations of the present application, the NPU may instead read the data needed for the next run from the DDR after completing the calculation of one run, and store the data in the A1 line caches of layer 1.
Through the above steps, the convolution calculation of one run of layer 1 can be completed, thereby acquiring the 1st line of data of the input feature map of layer 2. Take the convolution window of layer 2 having A2 rows and B2 columns as an example. The NPU may continue to perform the convolution calculations of further runs of layer 1 according to the above scheme until the A2 lines of data needed for the convolution calculation of layer 2 have been acquired. That is, the NPU may perform A2 runs of convolution calculation in layer 1, thereby acquiring A2 lines of data which are stored, as the input feature map of layer 2, in the A2 line caches configured for layer 2 in the local cache. The NPU may then begin performing the convolution calculation of the 1st run of layer 2, thereby acquiring the 1st line of the input feature map of layer 3 and storing it in the line cache configured for layer 3 in the local cache.
It will be appreciated that, after layer 2 completes the convolution calculation of the 1st run, in order to continue with the convolution calculation of the 2nd run of layer 2, layer 1 is required to perform the corresponding convolution calculations to obtain the new input feature map data required by the 2nd run of layer 2. For example, take the step size of layer 2 as S2. During the convolution calculation of the 1st run of layer 2, the A2 lines of data obtained through the A2 runs of layer 1 are stored in the line caches corresponding to layer 2. Then, after layer 2 completes the convolution calculation of the 1st run, the NPU may return to layer 1 and perform the calculations of the (A2+1)th to (A2+S2)th runs of layer 1, thereby acquiring S2 new lines of data and storing them in the line caches corresponding to layer 2, so that the NPU can continue to perform the calculation of the 2nd run of layer 2. By analogy, in the subsequent calculation process, each time the NPU completes the calculation of S2 runs of layer 1, the calculation of one run of layer 2 may be performed; the other layers are similar. When the calculation of one run of layer N is completed, the 1st line of data of the convolution output feature map can be obtained.
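A small sketch of this cadence follows, under the assumption of the notation used above (A2 rows in the layer 2 window, step size S2); the helper name layer1_runs_needed and the example numbers are illustrative only.

```python
# Hedged sketch of the interlayer cadence: layer 2's first run waits for A2
# runs of layer 1; every later run of layer 2 only needs S2 more runs of layer 1.

def layer1_runs_needed(layer2_runs_completed, a2, s2):
    """Total layer 1 runs that must be finished before the next layer 2 run."""
    return a2 + layer2_runs_completed * s2

A2, S2 = 3, 2
for k in range(4):
    print(f"before layer 2 run {k + 1}: {layer1_runs_needed(k, A2, S2)} "
          f"layer 1 runs must be complete")
# -> 3, 5, 7, 9 layer 1 runs with A2 = 3 and S2 = 2
```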
As an example, FIG. 4 shows a schematic flow chart of a data processing method according to an embodiment of the present application. As shown in FIG. 4, the method may include at least two computing processes (e.g., process 1 and process 2). Process 1 is the flow at the start of the neural network calculation, and process 2 is the subsequent flow after layer 2 is able to perform the calculation of one run. As shown in the figure, the process may specifically include: S401, calculating run 1 of layer 1. S402, storing the output feature map of run 1 of layer 1 into the line cache corresponding to layer 2. S403, determining that the calculation of run 1 of layer 2 cannot be performed in layer 2, and returning to layer 1 for calculation. S404, calculating run 2 of layer 1. S405, storing the output feature map of run 2 of layer 1 into the line cache corresponding to layer 2. S406, determining that the calculation of run 1 of layer 2 cannot be performed in layer 2, and returning to layer 1 for calculation. S407, calculating run 3 of layer 1. S408, storing the output feature map of run 3 of layer 1 into the line cache corresponding to layer 2.
It will be appreciated that this example is illustrated with the convolution window of layer 2 having 3 rows. Thus, performing one run of calculation of layer 2 requires layer 1 to perform 3 runs of calculation. Correspondingly, if the convolution window of layer 2 has A2 rows, then for layer 2 to perform the calculation of the 1st run, layer 1 is required to perform the calculation of A2 runs. The steps of process 1 are thus completed. It will be appreciated that in process 1, since there is no data in the line cache corresponding to layer 2 at the start of the calculation, layer 1 is required to continuously perform the calculation of 3 runs to acquire the data for one run executed by layer 2.
Process 2 is described below. Since the line caches of layer 2 already store data, each time layer 1 completes the calculation of S2 runs, S2 lines of data can be updated into the line caches of layer 2, so that layer 2 can execute the calculation of the next run. Here, S2 is the step size of layer 2 and is an integer greater than or equal to 1; in the following, S2=1 is taken as an example. S409, calculating run 1 of layer 2. S410, determining that the calculation of run 2 of layer 2 cannot be performed in layer 2, and returning to layer 1 for calculation. S411, calculating run 4 of layer 1. S412, storing the output feature map of run 4 of layer 1 into the line cache corresponding to layer 2. S413, calculating run 2 of layer 2. It can be seen that in process 2, layer 2 can perform the calculation of one run for every run of calculation performed by layer 1. Thus, a pipelined processing effect between different layers can be achieved.
In the above description, after each run of calculation is completed in layer 1, one line of calculation results corresponding to that run is stored in the line cache corresponding to layer 2. In other implementations of the present application, as described above, layer 1 may store, during the execution of one run of calculation, the data of each calculation corresponding to one convolution window position at the corresponding location in the line cache of layer 2.
It should be noted that FIG. 4 only shows the calculation logic from layer 1 to layer 2; the calculation logic of the other layers may follow the steps illustrated in the flowchart. For example, the input feature map of layer 3 may be the output feature map of layer 2. Thus, each time one calculation result is obtained during a run of layer 2, the result can be stored in the line cache of layer 3. After layer 2 completes one calculation run, one line of data has been updated in the line cache of layer 3. The NPU may then determine whether layer 3 can perform the calculation of a new run and, if so, perform that calculation run of layer 3. If not, the calculation returns to layer 2: if layer 2 can perform the calculation of its next run, that run is calculated; if not, the rollback continues toward earlier layers until a layer that can perform the calculation of its next run is found. After one run of calculation is performed for a layer, one line of data can be updated into the line cache of the next layer, and the calculation of the next run can proceed. And so on, the convolution calculation of the N calculation layers shown in FIG. 3 can be completed.
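The following Python sketch puts the whole rollback-driven pipeline together as a simplified software model; square all-ones windows, one shared step size per layer, and unbounded list-based caches are simplifying assumptions made for readability, and this is not the NPU control flow itself.

```python
# Hedged end-to-end sketch: starting from the deepest layer whose cache already
# holds enough rows, run it once, push its output row into the next layer's
# cache, and repeat until no layer can run.

def pipeline(feature_map, layers):
    """layers: list of (window_rows, stride) per layer; returns final rows."""
    caches = [[] for _ in range(len(layers) + 1)]   # caches[i] feeds layer i
    caches[0] = [list(r) for r in feature_map]      # layer 0 input (the 'DDR')
    done = [0] * len(layers)                        # runs completed per layer

    def ready(i):
        a, s = layers[i]
        return len(caches[i]) >= a + done[i] * s    # enough rows accumulated?

    while True:
        ready_layers = [k for k in range(len(layers)) if ready(k)]
        if not ready_layers:
            break
        i = max(ready_layers)              # deepest ready layer (rollback idea)
        a, s = layers[i]
        rows = caches[i][done[i] * s : done[i] * s + a]
        width = len(rows[0])
        # One run: slide a square all-ones a x a window with horizontal step s.
        out = [sum(rows[r][c + j] for r in range(a) for j in range(a))
               for c in range(0, width - a + 1, s)]
        caches[i + 1].append(out)          # write into the next layer's cache
        done[i] += 1
    return caches[-1]

fm = [[c + 6 * r for c in range(6)] for r in range(6)]   # a 6x6 input
print(pipeline(fm, [(2, 1), (3, 2)]))     # two layers, as in the example below
```

In this model the intermediate caches grow without bound for readability; in the described scheme only a window-height number of lines per layer would actually be resident, with older lines overwritten as in the refresh sketch shown earlier.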
It can be seen from the above description that, if there is no calculation of other layers after the calculation of the convolution layers is completed in the convolutional neural network model, the NPU can write each line of output data into the DDR. The NPU may write the data directly into the DDR as each piece of data is acquired, or may write a whole line into the DDR together after one line of data has been acquired. If the calculation of other layers, such as activation/pooling calculation or elementwise calculation, is performed after the calculation of the convolution layers is completed in the convolutional neural network model, the NPU may, upon acquiring the data of the output feature map of the convolution layers, write the data into the line cache corresponding to the subsequent calculation, and perform the calculation with reference to the calculation process in the convolution layers.
It will be appreciated that, based on the description of FIG. 4 above, because there is a rollback mechanism (i.e., the NPU can determine whether the current layer can perform the calculation of one run, and if not, can roll back to the previous layer for calculation), the scheme can also be used to implement the pipelined computing mechanism when the step size of a calculation layer is not 1. Thus, all the convolution calculations in each calculation layer can be implemented by configuring, for each calculation layer, only the number of line caches corresponding to the number of rows of its convolution window (such as configuring A1 line caches for layer 1). The convolution calculation results of the current layer are written directly into the line caches of the next calculation layer in the local cache and used for the calculation of that next layer, thereby forming a pipelined calculation effect. Therefore, the NPU no longer needs to read intermediate data from the DDR when calculating the next layer. As a result, in the process of performing the convolution calculation of the convolution layers shown in FIG. 3, the amount of data the NPU reads from the DDR is only the data amount of one original input feature map, and, if there is no subsequent calculation requirement, the amount of data written into the DDR is only the data amount of one convolution output feature map. Obviously, with the scheme in this example, the read-write data pressure between the NPU and the DDR can be significantly reduced, so that the power consumption overhead caused by many large-scale data reads and writes can be significantly controlled. In addition, because the NPU only needs to read A1 lines of data each time it reads the data of the original input feature map from the DDR, the situation in which the calculation efficiency of the whole system is limited by the read-write bandwidth is avoided. Compared with the scheme of slicing first and then calculating slice by slice, the data processing method provided in the embodiments of the present application requires no repeated calculation, so the calculation bandwidth and the corresponding power consumption overhead of the repeated calculation process can be saved.
In order to describe the scheme provided in the embodiments of the present application more clearly, in FIGS. 5 to 13 below, the original input feature map includes 6×6 data, i.e., i=j=6; the size of the layer 1 convolution window is 2×2, i.e., A1=B1=2, and the step size of the layer 1 convolution window is S1=1; for layer 2, A2=B2=3 and S2=2. The convolution calculation performed using the scheme provided in the embodiments of the present application is exemplarily described on this basis. Referring to FIG. 5, a schematic diagram of the 1st run of the layer 1 convolution calculation in this example is shown. During initialization, the NPU can read a_11 to a_26, i.e., two lines of data, from the DDR into line cache 1 and line cache 2 configured for layer 1 in the local cache. After the initialization is completed, the NPU may begin the convolution calculation of layer 1. For example, the NPU may perform sliding calculation on the data of line cache 1 and line cache 2 with the convolution window corresponding to layer 1, thereby completing the calculation of one run.
It can be understood that one piece of data at the corresponding position of the output feature map is obtained each time the calculation at one position of the convolution window is completed. In addition, the output feature map of layer 1 may be the input feature map of layer 2; therefore, in this example, each time one piece of data of the output feature map of layer 1 is acquired, the data may be stored at the corresponding location in the line cache configured for layer 2 in the local cache. For example, in conjunction with FIG. 6, take as an example that the calculation results for a_11 to a_26 are b_11 to b_15, respectively. The result obtained by calculating the layer 1 convolution window at the position covering a_11 to a_22 is b_11. A new result is obtained with each slide during the calculation run. For example, after b_11 is calculated and acquired, the layer 1 convolution window slides one data element to the right, and the convolution calculation continues to obtain b_12. By analogy, when the layer 1 convolution window has slid to the rightmost position, b_15 can be calculated and acquired. In this example, after acquiring b_11, the NPU may write the result into column 1 of the first line cache configured for layer 2 (e.g., line cache 3). After acquiring b_12, the NPU may write the result into column 2 of line cache 3. Similarly, after one run of calculation in layer 1 is completed, all of the data in line cache 3 (e.g., b_11 to b_15) has been written.
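Numerically, a hedged sketch of this run is given below; the input values and the all-ones 2×2 kernel are invented for illustration, and only the data flow mirrors FIG. 5 and FIG. 6.

```python
# Hedged numeric sketch of run 1 of layer 1: two buffered rows (a_11..a_26) of
# a 6x6 input, a 2x2 window with stride 1, and the five results b_11..b_15
# written one by one into layer 2's line cache 3.

input_rows = [[1, 2, 3, 4, 5, 6],       # a_11 .. a_16  (line cache 1)
              [7, 8, 9, 10, 11, 12]]    # a_21 .. a_26  (line cache 2)
kernel = [[1, 1],
          [1, 1]]                        # 2x2 layer 1 convolution window

line_cache_3 = [None] * 5                # will receive b_11 .. b_15 for layer 2

for col in range(5):                     # five window positions, stride 1
    b = sum(input_rows[i][col + j] * kernel[i][j]
            for i in range(2) for j in range(2))
    line_cache_3[col] = b                # write each result as it is produced

print(line_cache_3)                      # [18, 22, 26, 30, 34]
```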
It will be appreciated that when layer 1 performs the convolution calculations of run 1, once the calculation of the convolution window at position 1 (that is, the position covering a11 to a22) is completed, a11 no longer participates in any subsequent calculation. Thus, in this example, after completing the 1st convolution calculation of run 1, the NPU may read from the DDR the data that needs to be supplemented for the 1st convolution calculation of run 2, that is, a31, as shown in fig. 7. The NPU may store a31 in the layer 1 line cache (e.g., line cache 1) in place of a11, to facilitate the subsequent convolution calculation of run 2.
After a11 is replaced by a31, the data stored in the line caches configured for layer 1 is as shown in fig. 8. It can be seen that by the time the NPU performs the 2nd convolution calculation of layer 1 run 1, a31 needed for the 1st convolution calculation of run 2 is already stored in the line cache. In this example, the NPU reads one new data item from the DDR after completing each convolution calculation, replacing a data item that will not participate in subsequent calculation. In other implementations of the present application, the NPU may instead read multiple data items from the DDR at once after completing all convolution calculations of run 1, replacing the data in the line cache that no longer participates in the calculation. This can reduce the number of times the NPU reads data from the DDR.
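The following is a small illustrative sketch of the two replacement policies just described, assuming a simple 1-D array stands in for a line cache; the function names and the per-step policy are assumptions for illustration only and are not prescribed by the embodiment.

```python
def refill_line_cache(line_cache, col, new_value):
    """Overwrite, in place, the stale element at `col` (e.g. a11) with the
    newly read element of the next input row (e.g. a31), so the line cache
    already holds the data needed for the next run."""
    line_cache[col] = new_value
    return line_cache

def refill_line_cache_bulk(line_cache, new_row):
    """Variant: replace the whole consumed row at once after the run is done,
    trading one larger DDR burst for fewer read transactions."""
    line_cache[:] = new_row
    return line_cache
```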
After the convolution calculation of layer 1 run 1 is completed, the layer 1 line caches may store the data required to support the convolution calculation of the next run (e.g., run 2). For example, the result of the data replacement is shown in (a) of fig. 9. It will be appreciated that, in order to ensure that the calculation of layer 1 run 2 proceeds smoothly, the NPU may appropriately adjust the position of the data in the line caches so that the correct data is covered during the sliding of the convolution window. For example, the NPU may roll back the data stored in the line caches in units of rows to achieve the effect of exchanging the data stored in line cache 1 and line cache 2. That is, through the rollback, the data in the layer 1 line caches can be converted from the distribution shown in (a) of fig. 9 to the distribution shown in (b) of fig. 9.
Note that the rollback operation shown in (b) of fig. 9 is an optional step. In some implementations of the present application, this rollback processing of the data is not required. It can be understood that the essence of a convolution calculation is to multiply the data at each position of the convolution window by the data at the corresponding position of the input feature map, and then add the products to obtain the result of the convolution calculation. Therefore, during the convolution calculation, the order of the data in the line caches does not need to be adjusted, as long as the correct correspondence between the convolution window data and the input feature map data in the multiplication operations is ensured.
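This can be written out as the usual product-and-sum form of a convolution at one window position; the symbols w, x, y, the 1-based indexing and the stride S below are illustrative notation rather than symbols used in the source.

```latex
\[
  y_{i,j} \;=\; \sum_{p=1}^{A}\sum_{q=1}^{B} w_{p,q}\, x_{(i-1)S+p,\;(j-1)S+q}
\]
```

Because addition is commutative, the result does not depend on the physical order in which the rows are stored in the line caches, only on pairing each weight w_{p,q} with the correct input element.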
After the processing shown in (b) of fig. 9, all data that can be used to perform the convolution calculation of layer 1 run 2 is stored in the layer 1 line caches. Thus, by continuing with the convolution calculation of layer 1 run 2, row 2 data of the layer 2 input feature map can be acquired. For example, referring to fig. 10, the convolution window may be moved back to the leftmost side of line cache 1 and line cache 2 to begin the calculation of layer 1 run 2. After each calculation, the window slides to the right by the step size corresponding to the layer 1 convolution window to perform the next calculation, and so on, until all convolution calculations in run 2 are completed. In this way, row 2 data of the layer 2 input feature map can be acquired; for example, the NPU may write this data (e.g., b21 to b25) into line cache 4, which stores row 2 data of the layer 2 input feature map.
It will be appreciated that layer 2, when performing the calculation of run 1, needs to acquire at least 3 rows of data of its input feature map. Thus, in process 1, the NPU may perform the convolution calculations of layer 1 run 1 to layer 1 run 3 to obtain the data required for the layer 2 run 1 calculation. After acquiring the data required for the layer 2 run 1 calculation (e.g., b11 to b35), the NPU may perform the calculation of layer 2 run 1, that is, enter the calculation of process 2. The convolution calculation process in layer 2 is similar to that in layer 1. For example, referring to fig. 11, the calculation of layer 2 run 1 can obtain the input feature map data of layer 3 (if present) (e.g., c11 to c13), and the NPU may store this data in a line cache configured for layer 3 in the local cache (e.g., line cache 6).
In this example, the step size of layer 2 is 2. That is, after the calculation of layer 2 run 1 is completed, the convolution window in layer 2 slides down by 2 data rows to begin the calculation of run 2. For example, at the 1st convolution calculation of layer 2 run 1, the input feature map data covered by the layer 2 convolution window is b11 to b35. At the 1st convolution calculation of layer 2 run 2, the input feature map data covered by the layer 2 convolution window is b31 to b55. In order to obtain b31 to b55, the NPU needs to perform 2 runs of layer 1 (e.g., layer 1 run 4 and layer 1 run 5) in the calculation of process 2. That is, when the step size of the next layer is greater than 1, in the calculation of process 2 as shown in fig. 4, the current layer needs to continuously perform a number of runs corresponding to the step size of the next layer in order to obtain the input feature map data that can support 1 run of the next layer.
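The relation between the next layer's step size and the number of current-layer runs it consumes can be sketched as below; the function name and the first-run/later-run split are illustrative assumptions consistent with the example above, not definitions taken from the source.

```python
def runs_needed_in_current_layer(next_layer_stride: int, first_run: bool,
                                 next_layer_window_rows: int) -> int:
    # the very first run of the next layer needs a full window of input rows;
    # each later run only needs the rows newly uncovered by sliding the
    # window down by its stride
    return next_layer_window_rows if first_run else next_layer_stride

# With the example above (3x3 window, stride 2): 3 layer 1 runs are needed
# before layer 2 run 1, then 2 more layer 1 runs (runs 4 and 5) per later run.
assert runs_needed_in_current_layer(2, True, 3) == 3
assert runs_needed_in_current_layer(2, False, 3) == 2
```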
Illustratively, in some implementations of the present application, the calculation of layer 2 run 1 may be performed after the calculation of layer 1 run 3 is completed. Thereafter, the NPU determines that the calculation of layer 2 run 2 cannot be continued, and may fall back to the layer 1 calculation to perform the calculation of layer 1 run 4. After the calculation of layer 1 run 4 is completed, the NPU may determine whether the calculation of layer 2 run 2 can be performed. Since the step size of layer 2 is 2, the calculation of layer 2 run 2 cannot be performed with the current data. Thus, the NPU continues to fall back to the layer 1 calculation and performs the calculation of layer 1 run 5. After the calculation of layer 1 run 5 is completed, the NPU may again determine whether the calculation of layer 2 run 2 can be performed. Since the input feature map data required for the layer 2 run 2 calculation (e.g., b31 to b55) has now been acquired, the NPU may perform the calculation of layer 2 run 2. The subsequent process is similar and is not described in detail here.
In other implementations of the present application, since the step size of each calculation layer of the convolutional neural network model is already determined at the beginning of the calculation, the NPU may perform the calculation of layer 1 run 5 directly after the calculation of layer 1 run 4 is completed, thereby acquiring at one time the data that can support the layer 2 run 2 calculation. This reduces the number of times the NPU executes the decision logic, but requires more line caches to be configured for layer 1, so that the data needed for layer 1 to perform two consecutive runs can be stored in the local cache at the same time.
It will be appreciated that the above description of the calculation logic when step sizes differ is given only by taking adjacent layers 1 and 2 as an example. For a neural network model provided with more calculation layers, the calculation logic can be generalized to implementations with more layers. For example, suppose layer 2 is followed by a layer 3 calculation, and the step size of layer 3 is 3. Then, in process 2, the NPU may perform more runs of calculation in layer 1, so that more runs can be performed continuously in layer 2, and the data required for layer 3 to perform its next run can be obtained without executing the decision logic. Of course, this requires more line caches to be configured for layers 1 and 2. This scheme can be applied when the storage space margin of the local cache is sufficient; it can reduce the decision logic of the NPU and improve the system calculation efficiency.
In other cases, in order to save storage space of the local cache, the calculation may be performed according to the method involving the decision logic. For example, after layer 2 performs the calculation of 1 run, the NPU determines whether layer 3 can perform its next run; if so, the next run of layer 3 is performed. Otherwise, if the next run of layer 3 cannot be performed, the NPU falls back to the calculation of the next run of layer 2. Similarly, if the data in the line caches corresponding to layer 2 cannot support layer 2 in performing its next run, the NPU continues to fall back to the previous layer (such as layer 1) to perform the calculation of its next run.
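A sketch of this fallback decision logic generalized to N layers is given below. `can_run`, `compute_run` and `finished` are hypothetical placeholders (not functions from the source): `can_run(k)` checks whether the line caches of layer k already hold enough rows for its next run, `compute_run(k)` performs one run of layer k and appends its output row to the line caches of layer k+1, and `finished()` reports whether the final layer's output feature map is complete.

```python
def pipelined_compute(num_layers, can_run, compute_run, finished):
    layer = 0                                   # layer 0 reads fresh rows from DDR,
    while not finished():                       # so it is assumed always able to run
        compute_run(layer)
        if layer + 1 < num_layers and can_run(layer + 1):
            layer += 1                          # enough rows accumulated: go deeper
        else:
            # fall back toward earlier layers until one has enough data
            while layer > 0 and not can_run(layer):
                layer -= 1
```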
Thus, as can be understood from the foregoing description, with the data processing method provided in the embodiment of the present application, only a number of line caches corresponding to the number of rows of the kernel (such as the convolution window of a convolution kernel in a convolution calculation) needs to be configured for each calculation layer in the local cache, and the pipeline calculation effect can be obtained with reference to the method flow shown in fig. 4 and the scheme in the foregoing description. The NPU therefore neither needs to read data from the DDR a large number of times nor write data to the DDR a large number of times. As a result, the power consumption overhead introduced by reading and writing data is saved, and because the intermediate data is stored in the line caches of the local cache, the calculation efficiency can also be significantly improved.
It should be noted that the solution provided in the embodiment of the present application may also be applied to scenarios with special calculation requirements. Illustratively, in connection with fig. 3, take as an example a convolutional neural network model in which an elementwise calculation is also required. Referring to fig. 12, a schematic diagram of the calculation logic in such a convolutional neural network is shown. As shown in fig. 12, in this convolutional neural network, an elementwise calculation is required in addition to the convolution layer calculation shown in fig. 3. For example, the elementwise calculation may include an addition operation. The objects of the addition may be the convolution layer output feature map and an output feature map W obtained after the original input feature map is processed by a calculation layer W. The calculation layer W may be the same calculation layer as any one of the convolution layers, or may be a calculation layer different from all of the convolution layers. In this example, the elementwise addition may be performed in the Eltwise calculation layer. In connection with the foregoing description, the local cache of the NPU may be configured with corresponding line caches for the calculation layer W. For example, take as an example that the convolution calculation performed in the calculation layer W uses a window of Aw rows and Bw columns. Then Aw line caches may be configured in the local cache for the calculation layer W. Similarly, the Eltwise calculation layer may also be configured with corresponding line caches in the local cache. The number of these line caches may be, for example, an integer greater than or equal to 1.
Before performing the elementwise addition operation, the NPU may perform the convolution calculation of the convolution layer and the convolution calculation of the calculation layer W in a time-sharing manner. For example, the NPU may perform the convolution calculation of run 1 of the calculation layer W to obtain row 1 data of the output feature map W, and may store this row 1 data in the line cache corresponding to the Eltwise calculation layer. After that, the NPU may perform the convolution calculation of the convolution layer; when acquiring row 1 data of the convolution layer output feature map, the NPU may input it into the Eltwise calculation layer, so that, in the Eltwise calculation layer, the NPU performs an addition operation on the already stored row 1 data of the output feature map W and the row 1 data of the convolution layer output feature map, thereby acquiring row 1 data of the Eltwise output feature map. If there is no calculation of other layers, the NPU may output the row 1 data of the Eltwise output feature map to the DDR as part of one round of the convolutional neural network calculation.
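The row-wise Eltwise step itself is a simple element-by-element addition; the following is an illustrative sketch (function and variable names are assumptions, not identifiers from the source), in which one row of the output feature map W already held in the Eltwise line cache is added to the freshly produced row of the convolution layer's output feature map.

```python
import numpy as np

def eltwise_add_row(eltwise_line_cache: np.ndarray, conv_out_row: np.ndarray) -> np.ndarray:
    # element-by-element addition of two rows of equal length
    return eltwise_line_cache + conv_out_row

# Usage: row 1 of feature map W plus row 1 of the convolution layer output,
# producing row 1 of the Eltwise output feature map (then written to DDR)
row_w1 = np.array([1.0, 2.0, 3.0])
row_conv1 = np.array([0.5, 0.5, 0.5])
row_eltwise1 = eltwise_add_row(row_w1, row_conv1)
```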
After acquiring row 1 data of the Eltwise output feature map, the NPU may update the data in the line caches corresponding to the calculation layer W according to the method in the foregoing example, so as to perform the convolution calculation of run 2. In addition, the NPU may perform the convolution calculation of the convolution layer according to the foregoing method to obtain row 2 data of the convolution layer output feature map, and perform the addition operation in the Eltwise calculation layer, thereby obtaining row 2 data of the Eltwise output feature map. By analogy, the complete data of the output feature map of the convolutional neural network calculation can be obtained. It can be seen that, according to the method provided in the embodiment of the present application, when special operations including elementwise operations are performed, only line caches capable of storing the data required for 1 run of the calculation process need to be configured for the corresponding calculation layers, so that data does not need to be read from and written to the DDR a large number of times, thereby avoiding the power consumption overhead caused by reading and writing data.
In the above example, line caches are configured separately for the calculation layer W. In other implementations of the present application, the line caches of the calculation layer W may also be multiplexed with those of layer 1. Illustratively, in this example, the convolution calculation in the calculation layer W is the same as that in layer 1 in that both are convolution calculations on the data of the original input feature map; the difference is that their convolution kernels may be different.
In this example, when the convolution kernels of the calculation layer W and layer 1 are different, a common set of line caches may be configured for the calculation layer W and layer 1, and the number of line caches may be determined according to the convolution window with the larger number of rows among the convolution windows respectively corresponding to the convolution kernels of the calculation layer W and layer 1. For example, the number of rows Aw of the convolution window of the calculation layer W is 3, and the number of rows A1 of the convolution window of layer 1 is 2. Then, the calculation layer W and layer 1 may be jointly configured with 3 line caches in the local cache to support the convolution calculations of both the calculation layer W and layer 1.
It should be noted that, since the calculation layer W and layer 1 may need to perform convolution calculations on the data stored in the line caches using different convolution kernels, the data in the line caches may be updated only after the convolution calculations of both the calculation layer W and layer 1 at the corresponding positions have been completed. For example, in connection with the description of fig. 7, when only the layer 1 convolution calculation needs to be performed on the input feature map, the NPU may, as shown in fig. 7, read a31 from the DDR to replace a11 after completing the 1st convolution calculation of layer 1 run 1. In the present example, however, a11 is required not only for the 1st convolution calculation of layer 1 run 1 but also for the 1st convolution calculation of run 1 of the calculation layer W. Thus, the NPU may read a31 from the DDR to replace a11 only after a11 has completed both the 1st convolution calculation of layer 1 run 1 and the 1st convolution calculation of run 1 of the calculation layer W. In this way, the effect of line cache multiplexing can be achieved, thereby saving the storage space of the local cache.
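One simple way to express this deferred-replacement rule is to track, per line cache element, which consumer layers still need it; the class below is a hedged sketch of such bookkeeping under that assumption, and none of its names or structure comes from the source.

```python
class SharedLineCacheSlot:
    """One element of a line cache shared by layer 1 and calculation layer W."""
    def __init__(self, value, consumers=("layer1", "layerW")):
        self.value = value
        self.pending = set(consumers)      # layers that still need this element

    def mark_used(self, layer_name):
        self.pending.discard(layer_name)   # called after that layer's window passes

    def try_replace(self, new_value):
        # overwrite only after both convolution kernels have consumed the slot
        if not self.pending:
            self.value = new_value
            self.pending = {"layer1", "layerW"}   # new element awaits both layers
            return True
        return False
```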
It will be appreciated that in the above examples, the NPU is taken to be single-core for illustration. For example, in conjunction with the flow diagram shown in fig. 4, the NPU may perform process 1 and process 2 therein in time sequence. Of course, in conjunction with the description of the scheme shown in fig. 4, the execution sequence of some steps in process 1 and process 2 may differ from that of fig. 4, which is not repeated here. At present, with the improvement of chip manufacturing processes, multi-core NPUs are also commonly used. When the data processing method provided in the embodiment of the present application is used in a multi-core NPU, calculation concurrency can be realized, further improving the calculation efficiency.
Illustratively, suppose the NPU has two computing cores. Fig. 13 shows a comparison, in time sequence, of the calculation flow in a single-core scenario and in a multi-core scenario. As shown in fig. 13, in the single-core scenario, a computing core in the NPU (e.g., core 1) may perform the calculation of layer 1 run 4 at time T1, the calculation of layer 2 run 1 at time T2, and the calculation of layer 1 run 5 at time T3. Correspondingly, when the NPU is a dual-core processor (that is, in the dual-core scenario), a computing core in the NPU (e.g., core 1) may perform the calculation of layer 1 run 4 at time T1, the calculation of layer 1 run 5 at time T2, and the calculation of layer 1 run 6 at time T3, while the layer 2 calculation may be performed by another computing core of the NPU (e.g., core 2). For example, at time T2, core 2 may perform the calculation of layer 2 run 1 while core 1 performs the calculation of layer 1 run 5. At time T3, core 2 may perform the calculation of layer 2 run 2 while core 1 performs the calculation of layer 1 run 6.
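The two schedules just described can be summarized as the following illustrative timeline (a restatement of the text above, not a reproduction of fig. 13); each entry is (time slot, core, work item).

```python
single_core = [
    ("T1", "core1", "layer 1 run 4"),
    ("T2", "core1", "layer 2 run 1"),
    ("T3", "core1", "layer 1 run 5"),
]

dual_core = [
    ("T1", "core1", "layer 1 run 4"),
    ("T2", "core1", "layer 1 run 5"), ("T2", "core2", "layer 2 run 1"),
    ("T3", "core1", "layer 1 run 6"), ("T3", "core2", "layer 2 run 2"),
]
```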
Obviously, compared with the calculation flow of a single-core NPU, a multi-core NPU can achieve concurrency of multiple calculation processes: after acquiring the data required for the calculation of layer 2 run 1, the NPU can simultaneously perform the calculation of the next run of layer 1 and the calculation of layer 2 run 1, without waiting for the calculation of layer 2 run 1 to complete before falling back to the calculation of the next run of layer 1. In the above example, core 1 performs the layer 1 calculation and core 2 performs the layer 2 calculation. In this application, however, there is no fixed correspondence between computing cores and calculation layers. That is, in other implementations of the present application, the calculations of different calculation layers may also be performed in the same computing core. For example, when the computing capability of core 1 (such as identified by its throughput) is larger and the throughput of core 2 is smaller, core 1 may, in addition to completing the layer 1 calculation, process part of the layer 2 calculation in a time-sharing multiplexing manner, so as to keep the throughput consistent. In this way, the calculation bandwidth is fully utilized, and the working efficiency of the multi-core NPU is improved.
The foregoing description of the solution provided in the embodiments of the present application has been mainly presented in terms of a processor. It will be appreciated that the processor, in order to implement the above-described functions, includes corresponding hardware structures and/or software modules that perform the respective functions. Those skilled in the art will readily appreciate that the elements of the examples described in connection with the embodiments disclosed herein may be implemented as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the functional modules of the data processing device corresponding to the processor may be divided according to the above method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or in software functional modules. Optionally, the division of the modules in the embodiments of the present application is schematic, which is merely a logic function division, and other division manners may be actually implemented.
Referring to fig. 14, a schematic diagram of a data processing apparatus 1400 according to an embodiment of the present application is provided. The data processing apparatus 1400 may be applied to perform neural network calculations that include N calculation layers, where N is an integer greater than or equal to 2. The data processing apparatus 1400 is provided with a local cache. As shown in fig. 14, the data processing apparatus 1400 includes: an obtaining unit 1401 is configured to obtain first data, where the first data is used to perform a first calculation run of a first calculation layer, and the first calculation layer is any one calculation layer of the N calculation layers. A storage unit 1402, configured to store the first data in a first line cache of the first computing layer, where the first line cache of the first computing layer is included in the local cache. A calculating unit 1403, configured to calculate a first calculation run of the first calculation layer to obtain second data corresponding to the first calculation run of the first calculation layer, where the first calculation run of the first calculation layer includes a convolution calculation of one or more lines of data of the first data using a convolution window of the first calculation layer. The storage unit 1402 is further configured to store the second data in a first line cache of a second computing layer, where the first line cache of the second computing layer is included in the local cache, and the second computing layer is a computing layer after the first computing layer of the N computing layers. The calculating unit 1403 is further configured to calculate, in a case where the accumulated data stored in the first line buffer of the second calculation layer is capable of performing the first calculation pass of the second calculation layer, the first calculation pass of the second calculation layer to obtain fifth data corresponding to the first calculation pass of the second calculation layer, where the first calculation pass of the second calculation layer includes convolution calculation of one or more lines of data of the second data using a convolution window of the second calculation layer.
In a possible design, the calculating unit 1403 is further configured to calculate a second calculation run of the first calculation layer if the accumulated data cannot support the first calculation run of the second calculation layer, where the second calculation run of the first calculation layer is a calculation run after the first calculation run of the first calculation layer. In one possible design, the number of lines of the first line cache is equal to the number of lines of the convolution window of the first calculation layer. In a possible design, the obtaining unit 1401 is further configured to read the first data from an external memory, the first data being at least part of an input feature map stored in the external memory, and the external memory being a storage medium coupled to the processor. In a possible design, the first data is part of an input feature map stored in an external memory, and the obtaining unit 1401 is further configured to obtain third data from the external memory, the third data being another part of the input feature map and being used to perform a second calculation run of the first calculation layer. The third data is stored by overwriting fourth data, where the fourth data is data in the first data that no longer participates in the calculation of the first calculation layer. In one possible design, the storage unit 1402 is further configured to store, during the first calculation run of the first calculation layer, the calculation result of the convolution window of the first calculation layer at a position in the first line cache of the second calculation layer each time such a result is acquired. In one possible design, the obtaining unit 1401 is further configured to obtain fifth data corresponding to the first calculation run of the second calculation layer, and the storage unit 1402 stores the fifth data in a first line cache of a third calculation layer, the first line cache of the third calculation layer being included in the local cache; the third calculation layer is a calculation layer after the second calculation layer, and the fifth data is used for performing the convolution calculation of the third calculation layer.
All relevant content of the steps involved in the above method embodiment can be referred to in the functional descriptions of the corresponding functional modules, and is not repeated here. That is, any of the above units may be implemented in software, hardware, or a combination of both to implement the functions shown in the method. The data processing apparatus 1400 including the above units may be part of, or integrated within, the above-described processor, for example as functional hardware within the processor or functional software running within the processor. For example, any of the units may be implemented as a software module that runs on the neural network computing apparatus 200 shown in fig. 2.
Fig. 15 is a schematic structural diagram of an electronic device 1500 according to an embodiment of the present application. The electronic device 1500 may include a processor 1501 and a memory 1502. The memory 1502 is used to store computer-executable instructions. Illustratively, in some embodiments, when the processor 1501 executes the instructions stored in the memory 1502, the electronic device 1500 is caused to perform one or more of steps S401 to S413 shown in fig. 4, as well as other operations that the electronic device needs to perform. In some embodiments, the electronic device 1500 may be provided with the neural network computing apparatus 200 described in fig. 2; for the processor 1501, reference may be made to the neural network computing apparatus 200 in fig. 2.
It is to be understood that the processor in this embodiment includes, but is not limited to, one or more of the aforementioned CPU, NPU, FPGA, GPU, and DSP (digital signal processor). The above processor may be implemented as one or more chips. When the processor is integrated on a chip, the chip is also referred to as a system on a chip (SoC).
It should be noted that all relevant content of the steps involved in the above method embodiment can be referred to in the functional descriptions of the corresponding functional modules, and is not repeated here. The data processing apparatus provided in the embodiment of the present application is configured to perform the functions of the processor in the foregoing data processing method, and can therefore achieve the same effects as the data processing method.
The functions or acts or operations or steps and the like in the embodiments described above may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented using a software program, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device including one or more servers, data centers, etc. that can be integrated with the medium. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
Although the present application has been described in connection with specific features and embodiments thereof, it will be apparent that various modifications and combinations can be made without departing from the spirit and scope of the application. Accordingly, the specification and drawings are merely exemplary illustrations of the present application as defined in the appended claims and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the present application. It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to include such modifications and variations as well.

Claims (10)

  1. A data processing method, wherein the method is applied to a processor for performing calculation of a neural network, the neural network comprises N calculation layers, and N is an integer greater than or equal to 2; the processor is provided with a local cache; the method comprises the following steps:
    acquiring first data, wherein the first data is used for performing a first calculation run of a first calculation layer, and the first calculation layer is any one of the N calculation layers;
    storing the first data in a first line cache of the first calculation layer, the first line cache of the first calculation layer being included in the local cache;
    calculating a first calculation run of the first calculation layer to obtain second data corresponding to the first calculation run of the first calculation layer, wherein the first calculation run of the first calculation layer comprises a convolution calculation of one or more lines of data of the first data using a convolution window of the first calculation layer;
    storing the second data in a first line cache of a second computing layer, the first line cache of the second computing layer being included in the local cache, the second computing layer being a computing layer subsequent to the first computing layer among the N computing layers;
    and in a case that accumulated data stored in the first line cache of the second calculation layer is capable of supporting a first calculation run of the second calculation layer, calculating the first calculation run of the second calculation layer to acquire fifth data corresponding to the first calculation run of the second calculation layer, wherein the first calculation run of the second calculation layer comprises a convolution calculation of one or more lines of data of the second data using a convolution window of the second calculation layer.
  2. The method according to claim 1, wherein the method further comprises:
    and calculating a second calculation run of the first calculation layer in a case that the accumulated data cannot support the first calculation run of the second calculation layer, wherein the second calculation run of the first calculation layer is a calculation run after the first calculation run of the first calculation layer.
  3. A method according to claim 1 or 2, characterized in that,
    the number of lines of the first line cache is equal to the number of lines of the convolution window of the first calculation layer.
  4. A method according to any of claims 1-3, wherein, when the first calculation layer is the first one of the N calculation layers of the neural network, the acquiring the first data comprises:
    reading the first data from an external memory, the first data being at least part of an input feature map stored in the external memory, and the external memory being a storage medium coupled to the processor.
  5. The method of claim 4, wherein the first data is part of an input feature map stored in the external memory, and the method further comprises:
    acquiring third data from the external memory, wherein the third data is another part of the input feature map, and the third data is used for performing a second calculation run of the first calculation layer;
    and storing the third data by overwriting fourth data, wherein the fourth data is data in the first data that no longer participates in the calculation of the first calculation layer.
  6. The method of any of claims 1-5, wherein the storing the second data in a first line cache of a second compute layer comprises:
    and in the process of performing the first calculation run of the first calculation layer, storing, each time a calculation result of the convolution window of the first calculation layer at one position is obtained, the calculation result in the first line cache of the second calculation layer.
  7. The method of any of claims 1-6, wherein after obtaining fifth data corresponding to a first computing run of the second computing layer, the method further comprises:
    storing the fifth data in a first line cache of a third computing layer, the first line cache of the third computing layer included in the local cache; the third calculation layer is a calculation layer after the second calculation layer, and the fifth data is used for performing convolution calculation of the third calculation layer.
  8. A processor comprising one or more computing cores and a local cache, the processor configured to implement the data processing method of any of claims 1-7.
  9. An electronic device comprising one or more processors of claim 8 and one or more memories; the memory is coupled with the processor, and the memory stores computer instructions;
    the computer instructions, when executed by the processor, cause the electronic device to perform the data processing method of any of claims 1-7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises computer instructions which, when run, perform the data processing method according to any of claims 1-7.
CN202180077853.3A 2021-01-30 2021-01-30 Data processing method and processor Pending CN116472537A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/074548 WO2022160310A1 (en) 2021-01-30 2021-01-30 Data processing method and processor

Publications (1)

Publication Number Publication Date
CN116472537A true CN116472537A (en) 2023-07-21

Family

ID=82652937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180077853.3A Pending CN116472537A (en) 2021-01-30 2021-01-30 Data processing method and processor

Country Status (2)

Country Link
CN (1) CN116472537A (en)
WO (1) WO2022160310A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862374B (en) * 2017-10-30 2020-07-31 中国科学院计算技术研究所 Neural network processing system and processing method based on assembly line
CN111582451B (en) * 2020-05-08 2022-09-06 中国科学技术大学 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN111767986A (en) * 2020-06-24 2020-10-13 深兰人工智能芯片研究院(江苏)有限公司 Operation method and device based on neural network

Also Published As

Publication number Publication date
WO2022160310A1 (en) 2022-08-04

Similar Documents

Publication Publication Date Title
US20180032911A1 (en) Parallel information processing apparatus, information processing method and non-transitory recording medium
JP7451614B2 (en) On-chip computational network
CN111028360B (en) Data reading and writing method and system in 3D image processing, storage medium and terminal
US11163686B2 (en) Method and apparatus for accessing tensor data
US11853866B2 (en) Implementation of a neural network in multicore hardware
EP3844610B1 (en) Method and system for performing parallel computation
JP7419574B2 (en) Dilated convolution acceleration calculation method and device
WO2021244045A1 (en) Neural network data processing method and apparatus
CN116431562B (en) Multi-head attention mechanism fusion calculation distribution method based on acceleration processor
GB2599910A (en) Implementation of a neural network in multicore hardware
CN111767243A (en) Data processing method, related device and computer readable medium
CN111756802A (en) Method and system for scheduling data stream tasks on NUMA platform
CN116472537A (en) Data processing method and processor
WO2023124304A1 (en) Chip cache system, data processing method, device, storage medium, and chip
US20220138563A1 (en) Method and device with deep learning operations
Wu et al. Hetero layer fusion based architecture design and implementation for of deep learning accelerator
CN113392959A (en) Method for reconstructing architecture in computing system and computing system
WO2021120036A1 (en) Data processing apparatus and data processing method
US11544213B2 (en) Neural processor
US20220114015A1 (en) Electronic device and method with scheduling
CN116894462A (en) Data processing method and device
CN118278474A (en) Three-dimensional convolution parallel computing method, device and equipment based on multi-core processor
CN117151191A (en) Hardware accelerator, processor, chip and electronic equipment
CN114781637A (en) Convolutional neural network acceleration method, device and system
CN115952835A (en) Data processing method, readable medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination