CN114330686A - Configurable convolution processing device and convolution calculation method

Info

Publication number
CN114330686A
Authority
CN
China
Prior art keywords
data
sub
convolution
microprocessor
computing unit
Prior art date
Legal status
Pending
Application number
CN202111528821.9A
Other languages
Chinese (zh)
Inventor
张官兴
王赟
Current Assignee
Shanghai Ehua Chuangxing Technology Co.,Ltd.
Original Assignee
Shaoxing Ewa Technology Co ltd
Shanghai Ewa Intelligent Technology Co ltd
Application filed by Shaoxing Ewa Technology Co ltd, Shanghai Ewa Intelligent Technology Co ltd filed Critical Shaoxing Ewa Technology Co ltd
Priority to CN202111528821.9A
Publication of CN114330686A

Landscapes

  • Microcomputers (AREA)
  • Stored Programmes (AREA)
  • Advance Control (AREA)

Abstract

The application aims to provide a configurable convolution processing device and a convolution calculation method. Compared with the prior art, the convolution processing device determines any one of a single microprocessor array, a plurality of microprocessor arrays, a single sub-computing unit or a plurality of sub-computing units in the processing module as a target computing unit based on the structure and parameters of a convolutional neural network model. The control module obtains a preset program instruction for configuring the target computing unit and configures the target computing unit. The target computing unit then obtains, through the first on-chip interconnection network or the second on-chip interconnection network, cache data acquired via the external cache input/output interface and performs convolution calculation based on the cache data. In this way, computing modules of different sizes can be configured based on preset program instructions, convolution processing can be flexibly configured, and the device architecture can reduce computation power consumption and improve computation energy efficiency.

Description

Configurable convolution processing device and convolution calculation method
Technical Field
The present application relates to the field of convolution processing technologies, and in particular, to a configurable convolution processing technology.
Background
In the traditional von Neumann computing architecture, the arithmetic logic unit (ALU) must continuously read and write data from an external memory, and the access bandwidth and power consumption of these data transfers limit the performance of intensive computation, resulting in high computing power consumption and low computing efficiency.
Disclosure of Invention
The application aims to provide a configurable convolution processing device and a convolution calculation method.
According to an aspect of the present application, there is provided a configurable convolution processing apparatus, wherein the apparatus comprises:
the processing module comprises a plurality of microprocessor arrays arranged in a matrix form, wherein each microprocessor array comprises a plurality of sub-computing units arranged in the matrix form, at least any two sub-computing units of each microprocessor array can be coupled and connected through a first on-chip interconnection network, at least any two microprocessor arrays can be coupled and connected through a second on-chip interconnection network, and the first on-chip interconnection network and the second on-chip interconnection network are both connected with an external cache input/output interface;
and the control module is coupled with the processing module and is used for configuring any one of a single microprocessor array, a plurality of microprocessor arrays, a single sub-computing unit or a plurality of sub-computing units in the processing module based on a preset program instruction to execute convolution processing operation based on the cache data acquired by the external cache input/output interface.
Optionally, the apparatus further includes a plurality of parallel processing modules, where the plurality of parallel processing modules and the processing modules are arranged in a matrix, and any two of the processing modules may be coupled and connected through a third on-chip interconnection network.
Optionally, each of the sub-computing units includes a memory storage space, where the memory storage space of each sub-computing unit is expandable by configuring each sub-computing unit in each microprocessor array.
Further, the expansion of the memory storage space of each microprocessor array can be realized through the configuration of each microprocessor array in the processing module.
Optionally, wherein each of the sub-computing units comprises:
a multiply-add array for performing a multiply-add operation; the data flow cache is used for caching input data of the multiply-add array or an output result of the multiply-add array;
and the custom microprocessor is used for acquiring data to be processed and executing one or more custom operations based on a preset configuration instruction, wherein the acquisition of the data to be processed comprises the acquisition of an output result of the multiply-add array.
Further, the acquiring of the data to be processed further includes directly acquiring processing data from the first on-chip interconnection network or the second on-chip interconnection network, and the directly acquiring of the processing data from the first on-chip interconnection network or the second on-chip interconnection network includes any one of:
receiving/transmitting data of other sub-computing units through the first on-chip interconnection network coupling circuit;
receiving/transmitting data of other microprocessor arrays through the second on-chip interconnection network coupling circuit;
the external buffer data is received/transmitted through the first or second on-chip interconnection network coupling circuit.
Further wherein the custom operation comprises any one of: stride operation, activation operation, pooling operation, normalization operation, quantization operation and gather operation.
Optionally, wherein the custom microprocessor comprises:
the micro-instruction cache is used for caching data to be processed and one or more micro-instructions corresponding to the user-defined operation;
the microinstruction list module is used for integrating the microinstructions cached by the microinstruction cache into a microinstruction list based on a preset sequence;
the instruction counter is coupled with the micro instruction list module and is used for counting the execution times of the micro instructions corresponding to each user-defined operation;
the instruction decoder is coupled with the micro instruction list module and the instruction counter and used for determining a micro instruction address index from the instruction counter and acquiring decoding information of each user-defined operation micro instruction from the micro instruction list module and generating a control instruction and an enabling signal;
and the user-defined operation integrated circuit comprises a plurality of user-defined operation circuits and is used for controlling the corresponding user-defined operation circuits to execute corresponding operations based on the enabling signals and the control instructions.
Optionally, the first on-chip interconnection network and the second on-chip interconnection network may share an external cache input/output interface.
Optionally, the first on-chip interconnection network includes an input/output interface and a configurable switch or routing unit, and the first on-chip interconnection network may be coupled to the input/output interface of each sub-computing unit in each microprocessor array through the configurable switch or routing unit.
Optionally, the second on-chip interconnection network includes an input/output interface and a configurable switch or routing unit, and the second on-chip interconnection network may be coupled to the input/output interface of each microprocessor array through the configurable switch or routing unit.
According to another aspect of the present application, there is also provided a convolution calculation method based on the configurable convolution processing apparatus as described above, wherein the method includes:
determining any one of a single microprocessor array, a plurality of microprocessor arrays, a single sub-computing unit or a plurality of sub-computing units from the processing module as a target computing unit based on the convolutional neural network model structure and parameters;
acquiring a preset program instruction for configuring the target computing unit through a control module and configuring the target computing unit;
and the target computing unit acquires, through the first on-chip interconnection network or the second on-chip interconnection network, cache data obtained from the external cache input/output interface, and performs convolution calculation based on the cache data.
Further, if the convolution calculation includes a direct convolution operation, the performing convolution calculation based on the buffered data includes:
and the cache data is loaded into a multiply-add array through the data stream cache of the sub-computing unit or the cache data is directly loaded into the multiply-add array of the sub-computing unit to execute multiply-add operation.
Further, if the convolution calculation further includes a custom operation, the implementing convolution calculation based on the cache data further includes:
obtaining external cache data as first data and obtaining activation characteristic data output after multiplication and addition operation as second data through a custom microprocessor of the sub-computing unit;
performing the custom operation based on the first data and the second data to achieve a convolution calculation.
Further wherein said performing the custom operation based on the first data and the second data comprises:
caching the first data, the second data and a micro instruction corresponding to the custom operation in a micro instruction cache and configuring a micro instruction list;
cyclically generating a micro-instruction address index through the instruction counter based on the micro-instruction list and acquiring a micro-instruction corresponding to each user-defined operation;
after decoding by the decoder, a selection signal is generated to determine a corresponding custom operational circuit and the corresponding custom operation is executed by the custom operational circuit.
According to yet another aspect of the present application, there is also provided a computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement the operations of the method as described above.
Compared with the prior art, the application provides a configurable convolution processing device, which comprises: the processing module comprises a plurality of microprocessor arrays arranged in a matrix form, wherein each microprocessor array comprises a plurality of sub-computing units arranged in the matrix form, at least any two sub-computing units of each microprocessor array can be coupled and connected through a first on-chip interconnection network, at least any two microprocessor arrays can be coupled and connected through a second on-chip interconnection network, and the first on-chip interconnection network and the second on-chip interconnection network are both connected with an external cache input/output interface; and the control module is coupled with the processing module and is used for configuring any one of a single microprocessor array, a plurality of microprocessor arrays, a single sub-computing unit or a plurality of sub-computing units in the processing module based on preset program instructions and executing convolution processing operation on the cache data acquired by the external cache input/output interface.
In addition, in the present application, the apparatus further includes a plurality of parallel processing modules, where the plurality of parallel processing modules and the processing modules are arranged in a matrix form, and any two processing modules may be coupled and connected through a third on-chip interconnection network. By the method, the parallel processing modules can be flexibly added according to the requirements, and the expansion of the device can be flexibly realized.
In addition, each of the sub-computing units of the apparatus in the present application includes a memory storage space, wherein the expansion of the memory storage space of each sub-computing unit can be realized by configuring each sub-computing unit in each microprocessor array. The method can flexibly realize the expansion of the memory space and can improve the scalability of hardware accelerated computation.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 shows a schematic diagram of a configurable convolution processing apparatus according to an aspect of the present application;
FIG. 2 shows a schematic diagram of a configurable convolution processing apparatus according to a preferred embodiment of the present application;
FIG. 3 shows an example of a three-channel input feature map convolved with two convolution kernels, respectively;
FIG. 4 is a diagram illustrating the internal architecture of a sub-computing unit and the architecture of connections to other sub-computing units in accordance with a preferred embodiment of the present application;
FIG. 5 illustrates a convolution calculation method in accordance with another aspect of the present application;
FIG. 6 is a schematic diagram illustrating a configuration of a convolution operation local memory data load and target compute unit according to a preferred embodiment of the present application;
FIG. 7 is a diagram illustrating a parallel computing arrangement based on one input and output;
FIG. 8 shows a schematic diagram of a neural network with skip connections;
FIG. 9 illustrates a schematic diagram of a multi-stem convolutional layer gather operation;
FIG. 10 shows a schematic diagram of a U-shaped neural network with skip connections.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
To further illustrate the technical means and effects adopted by the present application, the following description clearly and completely describes the technical solution of the present application with reference to the accompanying drawings and preferred embodiments.
Fig. 1 illustrates a configurable convolution processing apparatus provided in an aspect of the present application, where the apparatus includes:
the processing module comprises a plurality of microprocessor arrays arranged in a matrix form, wherein each microprocessor array comprises a plurality of sub-computing units arranged in the matrix form, at least any two sub-computing units of each microprocessor array can be coupled and connected through a first on-chip interconnection network, at least any two microprocessor arrays can be coupled and connected through a second on-chip interconnection network, and the first on-chip interconnection network and the second on-chip interconnection network are both connected with an external cache input/output interface;
and the control module is coupled with the processing module and is used for configuring any one of a single microprocessor array, a plurality of microprocessor arrays, a single sub-computing unit or a plurality of sub-computing units in the processing module based on a preset program instruction to execute convolution processing operation based on the cache data acquired by the external cache input/output interface.
Specifically, the number of microprocessor arrays in the processing module, or the number of sub-computing units in each microprocessor array, can be configured according to actual requirements, so that the processing module can be expanded. The processing module is coupled with an off-chip memory and can receive cache data from the off-chip memory through the external cache. The control module may configure, based on a preset program instruction, part or all of the sub-computing units in the processing module to implement the convolution operation. For example, the control module may be connected to a resource scheduling module and configure the processing module according to the preset program instruction sent by the resource scheduling module, where the resource scheduling module may generate the preset program instruction based on the size of the convolution layer. Different size information may correspond to different numbers of sub-computing units; for example, the size of the convolution layer may be divided into four levels, and each level may correspond to a different number of sub-computing units, so that a single sub-computing unit, multiple sub-computing units, a single microprocessor array, or multiple microprocessor arrays in the processing module are configured to implement the convolution operation.
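For illustration only (this sketch is not part of the claimed apparatus), the following Python fragment shows one way a resource scheduling module could map a convolution layer's size to one of the four target-unit granularities described above; the function name, the workload metric and the thresholds are assumptions introduced here, not taken from the application.

```python
# Illustrative sketch (assumed names and thresholds): mapping a convolution
# layer's size to a target computing unit granularity.

def select_target_unit(kernel_h, kernel_w, in_channels, out_channels):
    """Return a coarse configuration level for the processing module."""
    macs_per_output = kernel_h * kernel_w * in_channels  # multiply-adds per output element
    workload = macs_per_output * out_channels
    if workload <= 1 << 10:
        return "single_sub_computing_unit"
    elif workload <= 1 << 14:
        return "multiple_sub_computing_units"
    elif workload <= 1 << 18:
        return "single_microprocessor_array"
    else:
        return "multiple_microprocessor_arrays"

# Example: a 3x3 kernel over 16 input channels producing 32 output channels.
print(select_target_unit(3, 3, 16, 32))
```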
Preferably, the apparatus further includes a plurality of parallel processing modules, wherein the plurality of parallel processing modules and the processing modules are arranged in a matrix, and at least any two of the processing modules may be coupled and connected through a third on-chip interconnection network. In this embodiment, in order to realize a larger-scale convolution processing operation, the apparatus may be expanded, for example, one or more parallel processing modules may be added, where the internal configurations of the parallel processing module and the processing module may be completely or partially the same or different.
Optionally, each of the sub-computing units includes a memory storage space, where the memory storage space of each sub-computing unit is expandable by configuring each sub-computing unit in each microprocessor array. In this embodiment, in order to implement data storage and data interaction more efficiently, a memory storage space may be provided in each sub-computing unit, and the memory storage spaces of at least any two sub-computing units may be integrated by configuration, so as to form a larger memory storage space.
Further, in a preferred embodiment, the expansion of the memory storage space of each microprocessor array is achieved by configuring the respective microprocessor arrays within the processing module. Because each microprocessor array is composed of one or more sub-computing units, the memory storage spaces of at least any two sub-computing units can be integrated, and the integrated memory storage space can serve as the memory storage space of the microprocessor array. Further, the memory storage spaces of at least any two microprocessor arrays can in turn be integrated, so that the memory storage space of each microprocessor array is expanded; that is, the integrated memory storage space can be used by every microprocessor array participating in the integration.
Preferably, each of the sub-computing units comprises: a multiply-add array for performing multiply-add operations; a data flow cache for caching input data of the multiply-add array or output results of the multiply-add array; and a custom microprocessor for acquiring data to be processed and executing one or more custom operations based on a preset configuration instruction, wherein acquiring the data to be processed includes acquiring the output results of the multiply-add array. Optionally, the custom operation includes, but is not limited to, any of: stride operation, activation operation, pooling operation, normalization operation, quantization operation and gather operation. The custom operations are related to the neural network structure; different neural network structures may correspond to different custom operations, and the operations listed here are only examples and are not specifically limited.
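Purely as a reading aid (not taken from the application), the following Python sketch mirrors the composition of a sub-computing unit described above, namely a multiply-add array, a data flow cache and a custom microprocessor (CIMPU); the class names and default field values are assumptions with no hardware semantics.

```python
# Illustrative structural sketch (assumed classes and defaults) of one
# sub-computing unit: PE array + data flow cache + custom microprocessor.
from dataclasses import dataclass, field

@dataclass
class MultiplyAddArray:
    rows: int = 4
    cols: int = 4

@dataclass
class CustomMicroprocessor:
    supported_ops: tuple = ("stride", "activation", "pooling",
                            "normalization", "quantization", "gather")

@dataclass
class SubComputingUnit:
    pe_array: MultiplyAddArray = field(default_factory=MultiplyAddArray)
    dataflow_cache: list = field(default_factory=list)   # PE inputs or PE results
    cimpu: CustomMicroprocessor = field(default_factory=CustomMicroprocessor)

unit = SubComputingUnit()
print(unit.cimpu.supported_ops)
```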
Preferably, the acquiring of the data to be processed further includes directly acquiring processing data from the first on-chip interconnection network or the second on-chip interconnection network, which includes any one of: receiving/transmitting data of other sub-computing units through the first on-chip interconnection network coupling circuit; receiving/transmitting data of other microprocessor arrays through the second on-chip interconnection network coupling circuit; and receiving/transmitting external cache data through the first or second on-chip interconnection network coupling circuit.
Preferably, the first on-chip interconnection network includes an input/output interface and a configurable switch or routing unit, and the first on-chip interconnection network may be coupled to the input/output interface of each sub-computation unit in each microprocessor array through the configurable switch or routing unit. The second on-chip interconnection network comprises input and output interfaces and configurable switches or routing units, and the second on-chip interconnection network can be coupled with the input and output interfaces of each microprocessor array through the configurable switches or routing units. The third on-chip interconnection network comprises an input/output interface and a configurable switch or a routing unit, and the third on-chip interconnection network can be coupled with the input/output interface of each processing module through the configurable switch or the routing unit.
Specifically, the control module may configure the configurable switches or routing units on the first on-chip interconnection network and/or the second on-chip interconnection network through configuration instructions, thereby implementing the coupling connections between sub-computing units or between microprocessor arrays. This not only enables efficient communication among the microprocessor arrays or sub-computing units, but also gives the whole array of convolution calculations more flexible scalability; for example, convolution layers of different sizes and granularities can achieve highly parallel computation by configuring combinations of different units.
Preferably, the first on-chip interconnection network and the second on-chip interconnection network can share an external cache input/output interface, and sharing the input/output interface can improve resource utilization.
Fig. 2 is a schematic diagram of a configurable convolution processing apparatus according to a preferred embodiment of the present application. The central processing unit (CPU) corresponds to the control module, the feature cache and the weight cache correspond to the external cache, the local memory corresponds to the memory storage space, the PE corresponds to the multiply-add array, and the CIMPU corresponds to the custom microprocessor, wherein the first on-chip interconnection network and the second on-chip interconnection network are coupled to the sub-computing units or microprocessor arrays through routing units.
In this way, a flexible interconnection architecture (comprising the first and second on-chip interconnection networks) and a mapping strategy are applied to realize efficient operation. For example, each microprocessor array and each sub-computing unit can, based on operation instructions, acquire data from the corresponding local memory storage space to execute the corresponding calculation, so that local data access is realized. The distributed arrangement of local memory storage and compute kernels efficiently implements data read-write and provides high bandwidth and low latency for intensive computation.
Compared with conventional external-memory read-write or internal centralized read-write computation, this design can minimize the energy consumption and bus bandwidth occupation caused by moving data among the microprocessor arrays or sub-computing units when performing partial-sum accumulation within a convolution layer or across multiple input channels of a convolutional neural network (the feature maps of a plurality of input channels are respectively convolved with the sub-convolution kernels at different depths of each convolution kernel and then accumulated into the feature data of the output channel corresponding to the current convolution kernel), thereby realizing local or near-memory data access and computation.
In addition, a plurality of compute kernels are organized by flexibly configuring the on-chip interconnection networks to realize scalable computation, where a compute kernel is not limited to any one of a single sub-computing unit, a plurality of sub-computing units, a single microprocessor array or a plurality of microprocessor arrays. For example, a plurality of sub-computing units can be organized into a first array, or a plurality of microprocessor arrays into a second array, to realize scalable computation on convolution layers of different sizes. This effectively alleviates the unstable and unbalanced computing load caused by changes of tensor dimension, depth or connection relations within or between neural network layers, so the design meets the scalable acceleration requirements of different neural network models. Meanwhile, the distributed interconnection network design avoids the computation delay and waiting caused by the limitation of the bus address space, and provides synchronous or asynchronous computation logic that is not limited by clock signals for intensive convolution operations; moreover, the interconnection topology supports various data-flow computation modes.
Fig. 3 shows an example of a three-channel input feature map convolved with two convolution kernels respectively. Each convolution kernel has a size of 3 × 3 and a depth of 3, i.e. 3 × 3 × 3. During the convolution operation, the sub-convolution kernels along the depth direction of each convolution kernel are convolved with the corresponding input feature map channel data, and the convolution results of the 3 sub-convolution kernels are then accumulated as the output of the current convolution kernel; the number of output channels of the convolution operation is consistent with the number of convolution kernels. After the output, related processing such as activation, pooling and normalization needs to be performed on each output feature respectively. The above is the most basic convolution processing mode. There are also special convolution operations, such as the 1 × 1 × 512 operation of a fully connected layer, i.e. a convolution kernel size of 1 × 1; in this case the sub-computing units or the microprocessor array need to be reconfigured for the calculation, so that scalable computation can be realized for convolution kernels or feature maps of different sizes, thereby efficiently utilizing the multiply-add circuit resources in the processing module.
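The decomposition described above can be made concrete with a short numpy sketch (illustrative only, not the claimed implementation): each depth-wise sub-convolution kernel is convolved with its own input channel, and the per-channel partial results are accumulated into one output channel per convolution kernel. All function and variable names are assumptions.

```python
# Illustrative numpy sketch: per-channel sub-kernel convolution followed by
# accumulation into one output channel per convolution kernel.
import numpy as np

def conv2d_by_subkernels(feature, kernels, stride=1):
    """feature: (C, H, W); kernels: (K, C, kh, kw) -> output (K, Ho, Wo)."""
    C, H, W = feature.shape
    K, Ck, kh, kw = kernels.shape
    assert C == Ck
    Ho, Wo = (H - kh) // stride + 1, (W - kw) // stride + 1
    out = np.zeros((K, Ho, Wo))
    for k in range(K):                      # one output channel per kernel
        for c in range(C):                  # one sub-kernel per input channel
            for i in range(Ho):
                for j in range(Wo):
                    window = feature[c, i*stride:i*stride+kh, j*stride:j*stride+kw]
                    out[k, i, j] += np.sum(window * kernels[k, c])  # accumulate partial sums
    return out

# Example matching the figure: a 3-channel input and two 3x3x3 kernels.
x = np.random.rand(3, 8, 8)
w = np.random.rand(2, 3, 3, 3)
print(conv2d_by_subkernels(x, w).shape)   # (2, 6, 6)
```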
In a preferred embodiment, the custom microprocessor of the sub-computational unit comprises:
the micro-instruction cache is used for caching data to be processed and one or more micro-instructions corresponding to the user-defined operation;
the microinstruction list module is used for integrating the microinstructions cached by the microinstruction cache into a microinstruction list based on a preset sequence;
the instruction counter is coupled with the micro instruction list module and is used for counting the execution times of the micro instructions corresponding to each user-defined operation;
the instruction decoder is coupled with the micro instruction list module and the instruction counter and used for determining a micro instruction address index from the instruction counter and acquiring decoding information of each user-defined operation micro instruction from the micro instruction list module and generating a control instruction and an enabling signal;
and the user-defined operation integrated circuit comprises a plurality of user-defined operation circuits and is used for controlling the corresponding user-defined operation circuits to execute corresponding operations based on the enabling signals and the control instructions.
Fig. 4 is a diagram illustrating the internal architecture of a sub-computing unit and its connections with other sub-computing units according to a preferred embodiment of the present application. The PE corresponds to the multiply-add array, and the CIMPU corresponds to the custom microprocessor, where the custom operation integrated circuit includes activation, pooling, normalization and gather operation circuits; the circuits shown are only examples and may also be set based on actual requirements. In conventional computation, data that has been convolved by the multiply-add array must be copied or stored to an external storage area, such as an external memory module, before subsequent processing. In this embodiment, the design of the custom microprocessor's micro-instruction cache, micro-instruction list and so on further reduces the bandwidth that the related instructions and addresses occupy on the on-chip interconnection network, and avoids the centralized processing of data to be activated or pooled when operations such as post-convolution activation and pooling are performed.
Taking a common convolution operation as an example, the activation feature data and the weight data are fed to the data stream BUFFER through the on-chip interconnection network; the PE array then performs multiply-add operations on the data according to the control module instructions, and the operation results are fed back to the data stream BUFFER or directly fed to the CIMPU for subsequent processing. Taking activation and pooling as an example, the micro-instruction cache is preloaded with the corresponding micro-instructions, and these are loaded to the instruction list when the device is initialized. In the actual operation process, based on the initial configuration information, the CIMPU reads each instruction operation cycle of the counter through the instruction decoder as a micro-instruction address index, and then acquires the corresponding activation micro-instruction decoding information from the micro-instruction list to generate activation calculation control (such as internal communication bus arbitration control, data selection control and functional operation circuit selection control) and an enable signal, so as to activate the data to be processed. When the preset cycle count of the counter is reached, the decoder indicates that the activation of the current input feature data is completed, generates the pooling instruction index based on the count, acquires the corresponding pooling instruction decoding information from the micro-instruction list, generates pooling calculation control and an enable signal, and performs the pooling operation. Further, the CIMPU may perform SIMD operations; taking 2 × 2 average pooling as an example, 4 activation outputs may be obtained per reference clock cycle, and the pooling calculation circuit performs the pooling operation asynchronously once the activation data is ready.
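As a simplified, assumption-laden sketch of the sequencing just described (not the actual micro-architecture), the following Python fragment models the micro-instruction list as ordered (operation, repeat count) pairs driven by an instruction counter, with a ReLU activation pass followed by a 2 × 2 average pooling pass; the helper names and the list format are invented for illustration.

```python
# Illustrative sketch (assumed format): an instruction counter walking a
# micro-instruction list of (operation, repeat_count) pairs.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def avg_pool_2x2(x):
    h, w = x.shape
    return x[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).mean(axis=(1, 3))

def run_microinstruction_list(data, instr_list):
    """instr_list: ordered (operation, repeat_count) pairs."""
    for op, repeats in instr_list:          # micro-instruction address index
        for _ in range(repeats):            # instruction counter loop
            data = op(data)
    return data

feature = np.random.rand(4, 4) - 0.5
result = run_microinstruction_list(feature, [(relu, 1), (avg_pool_2x2, 1)])
print(result.shape)   # (2, 2)
```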
The CIMPU is coupled with the first on-chip interconnection network and with the PE array through a skip-connection I/O and a direct data acquisition I/O respectively; when the convolution calculation process is configured in the skip-connection state, special neural network structure calculations can be achieved, including but not limited to residual network calculation and convergence calculation of different activation feature layers.
Preferably, the first on-chip interconnection network further includes an accumulation tree in addition to the data flow path configuration between the sub-computation units or the microprocessor array, so as to realize the second accumulation based on the first accumulation result of the sub-computation units or the microprocessor array.
Specifically, the convolution operation includes two accumulations. The first accumulation is performed within the PE: each sub-convolution kernel is convolved with the feature data window of its corresponding input channel and accumulated. The second accumulation sums the convolution results of sub-convolution kernels at different depths with the feature data of the corresponding input channels. In a conventional accelerator, the first accumulation is generally completed by an internal accumulator of the PE, while for the second accumulation, after each sub-convolution kernel has been convolved with the feature map of its corresponding input channel, the partial-sum data must be cached to a buffer or cache area outside the computing unit, and an external accumulator then executes the second accumulation. In this embodiment, the accumulation tree is arranged in the first on-chip interconnection network to obtain the first accumulation results of other sub-computing units or other microprocessor arrays, so as to realize the second accumulation.
Specifically, each sub-computing unit in each microprocessor array calculates the convolution result of a different sub-convolution kernel, and the accumulation tree in the first on-chip interconnection network then performs the accumulation. For example, if the depth of the current convolution kernel is 8, the accumulation tree in the first on-chip interconnection network of the first microprocessor array accumulates the convolution results of the first four sub-convolution kernels, the accumulation tree in the first on-chip interconnection network of the second microprocessor array accumulates the convolution results of the last four sub-convolution kernels, and one of the two accumulation results is then sent to the other microprocessor array for the final accumulation. This avoids the high power consumption, latency and bandwidth occupation that would be caused by caching the data in an external memory and reading it back for accumulation.
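The split of the second accumulation across two microprocessor arrays can be pictured with the following minimal numpy sketch (illustrative only; the shapes, names and the pairwise adder-tree reduction are assumptions), for a convolution kernel of depth 8 as in the example above.

```python
# Illustrative sketch: second accumulation of 8 per-channel partial sums,
# 4 reduced on one microprocessor array and 4 on another, then a final add.
import numpy as np

partial_sums = [np.random.rand(6, 6) for _ in range(8)]  # one per sub-convolution kernel

def accumulation_tree(parts):
    """Pairwise reduction of partial sums, as an on-chip adder tree might do."""
    while len(parts) > 1:
        nxt = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:
            nxt.append(parts[-1])
        parts = nxt
    return parts[0]

mpu1 = accumulation_tree(partial_sums[:4])   # first microprocessor array
mpu2 = accumulation_tree(partial_sums[4:])   # second microprocessor array
output_channel = mpu1 + mpu2                 # final accumulation on one array
print(np.allclose(output_channel, sum(partial_sums)))   # True
```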
Fig. 5 illustrates a convolution calculation method according to another aspect of the present application, the method being implemented based on the configurable convolution processing apparatus shown in fig. 1, wherein the method includes:
S11, determining any one of a single microprocessor array, a plurality of microprocessor arrays, a single sub-computing unit or a plurality of sub-computing units from the processing module as a target computing unit based on the convolutional neural network model structure and parameters;
S12, acquiring a preset program instruction for configuring the target computing unit through a control module and configuring the target computing unit;
S13, the target computing unit obtains, through the first on-chip interconnection network or the second on-chip interconnection network, the cache data acquired via the external cache input/output interface, and performs convolution computation based on the cache data.
In this embodiment, in step S11, different target computing units may be configured to implement the operation, because different neural network models differ in structure and parameters. For example, the target computing unit may be dynamically configured according to the convolution size and/or computation type. Further, in step S12, the control module performs the configuration; specifically, the preset program instruction is used to configure the configurable switches or routing units of the first on-chip interconnection network and/or the second on-chip interconnection network, and to configure the input/output ports of each sub-computing unit or microprocessor array, so as to realize the coupling connection of the sub-computing units in the target computing unit. After the configuration is completed, the convolution calculation is realized by loading the cache data acquired from the external cache input/output interface into the target computing unit.
Preferably, if the convolution calculation includes a direct convolution operation, the implementing the convolution calculation based on the buffered data includes: s131 (not shown) loads the cache data into the multiply-add array through the data stream cache of the sub-computing unit or directly loads the cache data into the multiply-add array of the sub-computing unit to perform the multiply-add operation.
Preferably, if the convolution calculation further includes a custom operation, the implementing convolution calculation based on the cache data further includes:
S132 (not shown) acquiring external cache data as first data and acquiring, as second data, the activation feature data output after the multiply-add operation, through the custom microprocessor of the sub-computing unit;
S133 (not shown) performing the custom operation based on the first data and the second data to implement the convolution calculation.
In a preferred embodiment, the custom microprocessor of the convolution processing apparatus includes a FIFO queue, wherein the obtaining, by the custom microprocessor of the sub-computation unit, external buffer data as first data and/or obtaining, as second data, activation characteristic data output after a multiply-add operation includes:
dynamically loading the first data and/or the second data into FIFO queues of the custom microprocessors of the current sub-computing unit and other sub-computing units coupled with the current sub-computing unit based on the loading state of the FIFO queues of the custom microprocessors corresponding to each sub-computing unit;
wherein performing the custom operation based on the first data and/or the second data comprises:
and each sub-computing unit executes the custom operation based on the acquired first data and/or the acquired second data.
Preferably, wherein the performing the custom operation based on the first data and the second data comprises:
S1331 (not shown) caching the first data, the second data, and the micro-instructions corresponding to the custom operation in the micro-instruction cache, and configuring the micro-instruction list;
S1332 (not shown) cyclically generating a micro-instruction address index through the instruction counter based on the micro-instruction list, and acquiring the micro-instruction corresponding to each custom operation;
S1333 (not shown) after decoding by the decoder, generating a selection signal to determine the corresponding custom operation circuit, and executing the corresponding custom operation through the custom operation circuit.
Fig. 6 is a schematic diagram illustrating convolution operation local memory data loading and target computing unit configuration according to a preferred embodiment of the present application. The local memory corresponds to the memory storage space of each sub-computing unit, and the MPU corresponds to the microprocessor array. Specifically, take as an example a convolution operation with an input feature map size of 256 × 256 × 16 (width and height of 256 and 16 input channels), a current-layer convolution kernel size of 3 × 3 × 16 × 32 (kernel width and height of 3, depth of 16, and 32 convolution kernels), a stride of 1, and computation on 16 microprocessor arrays.
In this embodiment, the processing module may be configured such that every 4 MPUs form a convolution operation macro processing unit, where each convolution operation macro processing unit processes feature data of 4 input channels, and the 4 convolution operation macro processing units complete feature map data processing of 16 input channels, that is, the target calculation unit includes 4 convolution operation macro processing units, and corresponds to 16 microprocessor arrays.
Specifically, the feature map data of the 16 channels corresponding to the current 4 convolution operation macro processing units are cached in the feature cache BUFFERs _0 to _15 respectively, and the convolution kernel weights are also loaded into the corresponding weight cache BUFFERs. Meanwhile, the second on-chip interconnection network is configured so that every 4 microprocessor arrays form a convolution operation macro processing unit for processing the feature data of the 4 corresponding input channels (in the actual calculation process, the first and second on-chip interconnection networks need to be configured according to constraints such as the parameters, structure and calculation type of the neural network model to realize different combinations). The data can then be loaded to the corresponding local memories.
Specifically, the feature data of the 4 input channels are fed to the local memories corresponding to the first, second, third and fourth sub-computing units in each microprocessor array of the current convolution operation macro processing unit, and these local memories are configured as a shared memory storage space through the second on-chip interconnection network. Each local memory caches part of the data of the corresponding input channel; for example, the local memory corresponding to the first sub-computing unit in MPU1 caches partial data channel 1_1 of the first input channel's feature data, the local memory corresponding to the first sub-computing unit in MPU2 caches partial data channel 1_2 of the first input channel's feature data, and so on. After data combination, the channels 1_1, 1_2, 1_3 and 1_4 cached in the corresponding local memories constitute part or all of the data of the first input channel (if the feature data of the input channel is small enough to fit in the shared memory space formed by the multiple local memories, all of it is cached; if the input channel feature data is large, only part of it can be cached, and the remaining feature data of the current input channel must be loaded to the corresponding local memories sequentially over multiple operation cycles until the current channel data is completely calculated).
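A minimal sketch of this sharding, under assumed shapes and capacities (the 16 K-element local memory size and the function names are invented for illustration, not specified by the application):

```python
# Illustrative sketch: splitting one input channel's feature data across the
# local memories of the four MPUs in a macro processing unit; any remainder
# is left for later operation cycles.
import numpy as np

def shard_channel(channel_data, num_local_memories, local_capacity):
    """Return per-local-memory slices plus any remainder for later cycles."""
    flat = channel_data.reshape(-1)
    shards, offset = [], 0
    for _ in range(num_local_memories):
        shards.append(flat[offset:offset + local_capacity])
        offset += local_capacity
    remainder = flat[offset:]            # loaded in subsequent cycles if non-empty
    return shards, remainder

channel = np.random.rand(256, 256)       # one 256x256 input channel
shards, rest = shard_channel(channel, num_local_memories=4, local_capacity=16 * 1024)
print([s.size for s in shards], rest.size)
```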
Meanwhile, the first on-chip interconnection networks corresponding to the four microprocessor arrays are configured to load four sub-convolution kernels of the four corresponding convolution kernels respectively; that is, the first on-chip interconnection network is configured to load the sub-convolution kernel data corresponding to the feature data of the current input channel to the corresponding sub-computing units. Fig. 7 shows a schematic diagram of this parallel computing arrangement based on one input and one output.
Specifically, each MPU completes the convolution operation between some of the sub-convolution kernels of one convolution kernel and the input feature map data (for example, with a convolution kernel depth of 16 and 32 convolution kernels in this example, convolution kernel 1 includes sub-convolution kernels 1 to 16, denoted 1_1 to 1_16, convolution kernel 2 likewise, and so on), and the multiple MPUs calculate the convolution operations between some of the sub-convolution kernels of multiple convolution kernels and the input feature map data (in this example, the MPUs calculate the convolution operations between some of the sub-convolution kernels of 4 convolution kernels and the input feature map data; that is, each convolution operation macro processing unit of 4 MPUs calculates some of the sub-convolution kernels of 4 convolution kernels).
Each sub-computing unit in each MPU completes the convolution operation between one sub-convolution kernel of a convolution kernel and the corresponding input channel feature data; the plurality of sub-computing units complete the convolution operations between the plurality of sub-convolution kernels of one convolution kernel and the corresponding input channel feature data. In this embodiment, 4 sub-computing units complete the convolution operations between 4 sub-convolution kernels of one convolution kernel and the feature data of the 4 corresponding input channels.
The PE multiply-add array in each sub-computing unit performs the multiply-accumulate operation between specific feature elements and the corresponding sub-convolution kernel weights, and obtains the first partial-sum result of the convolution between the sub-convolution kernel and the corresponding input channel data based on its internal accumulator. The accumulation tree in the first on-chip interconnection network then accumulates the multiple first partial sums of the sub-computing units of an MPU to obtain the convolution result corresponding to part of a convolution kernel (for example, 4 sub-convolution kernels). Each convolution operation macro processing unit, namely 4 MPUs, thus completes the convolution results corresponding to parts (for example, 4 sub-convolution kernels) of 4 convolution kernels.
Further, the 4 convolution operation macro processing units can complete either input-channel-priority calculation or output-channel-priority calculation. For input-channel-priority calculation, they complete the convolution results corresponding to the 16 sub-convolution kernels of 4 convolution kernels (each macro processing unit calculates 4 sub-convolution kernels of 4 convolution kernels, so the 4 macro processing units together complete the 16 sub-convolution kernels of 4 convolution kernels). For output-channel-priority calculation, they complete 4 sub-convolution kernel operations for 16 convolution kernels (each macro processing unit calculates 4 sub-convolution kernels of 4 convolution kernels, so the 4 macro processing units can be configured to calculate 4 sub-convolution kernels of 4 × 4 = 16 convolution kernels, and the convolution operations of the remaining sub-convolution kernels are then completed over multiple cycle periods).
Input-channel-priority calculation means preferentially computing the convolution of all input channels' feature data with part of the convolution kernels, and then computing the convolution of the remaining convolution kernels over multiple subsequent cycles. Output-channel-priority calculation means preferentially computing the convolutions corresponding to part of the sub-convolution kernels of the convolution kernels of all output channels, computing the remaining sub-convolution kernels of all convolution kernels over multiple subsequent cycles, and accumulating the partial convolution results across those cycles. Whether to adopt the input-channel-priority or the output-channel-priority strategy needs to be configured after comprehensive evaluation of factors such as the number of channels of each layer of the neural network model, the feature size and the number of convolution kernels.
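The two scheduling strategies can be contrasted with the following illustrative loop-order sketch (the function names, the `process` callback and the pass sizes are assumptions; only the loop nesting reflects the description above):

```python
# Illustrative sketch: loop nesting for the two priority strategies.
def input_channel_priority(num_kernels, num_channels, kernels_per_pass, process):
    # All input channels of a subset of kernels first; remaining kernels later.
    for k_start in range(0, num_kernels, kernels_per_pass):
        for k in range(k_start, min(k_start + kernels_per_pass, num_kernels)):
            for c in range(num_channels):
                process(kernel=k, channel=c)

def output_channel_priority(num_kernels, num_channels, channels_per_pass, process):
    # A subset of sub-kernels of every kernel first; remaining depths later,
    # accumulating partial results across passes.
    for c_start in range(0, num_channels, channels_per_pass):
        for k in range(num_kernels):
            for c in range(c_start, min(c_start + channels_per_pass, num_channels)):
                process(kernel=k, channel=c)

# Example: 32 kernels of depth 16, printing the first few scheduled pairs.
schedule = []
output_channel_priority(32, 16, 4, lambda kernel, channel: schedule.append((kernel, channel)))
print(schedule[:6])
```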
Further, 8, 16 and 32 convolution operation macro processing units are configured to realize convolution operations of 32, 64 and 128 sub-convolution kernels corresponding to 8, 16 and 32 convolution kernels or convolution operations of 4 sub-convolution kernels corresponding to 32, 64 and 128 convolution kernels.
No matter which configuration strategy is adopted, after the second accumulation of the convolution results, operations such as activation, pooling and normalization must be carried out on the data. To avoid the bandwidth occupation, high latency and high power consumption that these operations would otherwise bring, the custom processor is designed as follows:
The current convolution result is fed to the custom microprocessor to perform subsequent operations such as activation and pooling. Each MPU contains four custom microprocessors, while the result obtained through the accumulation tree only needs to be activated once, so four activation operation circuit units are available. To avoid wasting the custom microprocessors' hardware resources, the other custom microprocessors can also be fully utilized: operations such as activation, pooling and normalization take multiple clock cycles in actual computation, whereas the accumulation-tree accumulation of data occupies fewer clock cycles than these complex operations, so multi-way data distribution can be realized by configuring the first on-chip interconnection network. The custom microprocessors of different sub-computing units also each contain a FIFO queue for buffering the data to be activated or pooled; specifically, when data exists in the FIFO queue of the current custom microprocessor, that data is processed. The routing unit dynamically loads data into each FIFO queue by monitoring whether the FIFO queue of the custom processor corresponding to each sub-computing unit is empty, and after the computation is completed, it collects and stores each result as input to the convolution calculation of the next convolution layer. It further acquires the data to be activated in the convolution calculation and configures the on-chip interconnection networks, the MPUs and the sub-computing units according to parameters such as the size of the input to be activated and the number of output channels, so that at least one sub-computing unit completes the activation, pooling and normalization operations on the input channel data.
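A minimal sketch of the FIFO-based distribution described above, under assumed behaviour (the greedy shortest-queue policy and all names are illustrative; the application only requires that the routing unit monitor whether each FIFO is empty and load data dynamically):

```python
# Illustrative sketch: a routing unit dispatching to-be-activated data to the
# least-loaded custom microprocessor FIFO queues.
from collections import deque

class Cimpu:
    def __init__(self, name):
        self.name = name
        self.fifo = deque()

def dispatch(routing_targets, data_items):
    """Greedy distribution: prefer empty FIFOs, otherwise the shortest one."""
    for item in data_items:
        target = min(routing_targets, key=lambda c: len(c.fifo))
        target.fifo.append(item)

cimpus = [Cimpu(f"cimpu_{i}") for i in range(4)]
dispatch(cimpus, [f"partial_sum_{i}" for i in range(10)])
print({c.name: len(c.fifo) for c in cimpus})
```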
Specifically, the microinstruction opcode is loaded from the microinstruction cache to the microinstruction list according to the configuration instruction information, where, for example, the current microinstruction cache includes: an activate instruction (Sigmoid function, tanh function, Relu function, Leaky ReLU function, etc.), a pooling microinstruction (max pooling instruction, average pooling instruction, etc.), a normalize instruction, a quantize operation instruction, a gather operation instruction, etc., which are only examples herein.
If the current operation needs to sequentially execute a ReLU activation operation, a max pooling operation and a quantization operation, the current micro-instruction list is loaded according to the configuration instruction so that the corresponding instruction opcodes are loaded.
For example, the first cyclic operation instruction is initially the ReLU activation operation: the instruction decoder acquires the corresponding ReLU operation instruction from the micro-instruction list, and after decoding, selects and enables the corresponding activation operation circuit to execute the ReLU activation. The activation data is cached to the data stream cache BUFFER, to a local memory, or to a shared memory space composed of multiple local memories, for use in the subsequent pooling operation. The instruction counter counts the current number of operations; when the execution count reaches the preset number, the next micro-instruction address index is generated, the instruction decoder obtains the address index to fetch the corresponding micro-instruction, such as a pooling instruction, from the instruction list, and after decoding, selects and enables the corresponding pooling operation circuit to execute the max pooling operation.
In another preferred embodiment, the custom processor may also perform skip-connection operations, such as residual network computation, feature layer aggregation/cross-correlation, and U-shaped neural network computation. FIG. 8 is a schematic diagram of a neural network with skip connections, FIG. 9 illustrates a multi-stem convolutional layer gather operation, and FIG. 10 shows a schematic diagram of a U-shaped neural network with skip connections.
The skip-connection operation acquires data from other sub-computing units or from the external cache through the I/O port corresponding to the first on-chip interconnection network, so as to skip the PE array of the sub-computing unit and execute the corresponding computation.
Compared with the prior art, the application provides a configurable convolution processing device, which comprises: a processing module comprising a plurality of microprocessor arrays arranged in a matrix form, wherein each microprocessor array comprises a plurality of sub-computing units arranged in a matrix form, at least any two sub-computing units of each microprocessor array can be coupled through a first on-chip interconnection network, at least any two microprocessor arrays can be coupled through a second on-chip interconnection network, and the first on-chip interconnection network and the second on-chip interconnection network are both connected with an external cache input/output interface; and a control module coupled with the processing module and used for configuring any one of a single microprocessor array, a plurality of microprocessor arrays, a single sub-computing unit or a plurality of sub-computing units in the processing module based on preset program instructions, so as to execute a convolution processing operation on the cache data acquired through the external cache input/output interface.
In addition, in the present application, the apparatus further includes a plurality of parallel processing modules, where the parallel processing modules and the processing module are arranged in a matrix form, and any two processing modules may be coupled through a third on-chip interconnection network. In this way, parallel processing modules can be flexibly added as required, so that the device can be flexibly expanded.
In addition, each of the sub-computing units of the apparatus in the present application includes a memory storage space, wherein the memory storage space of each sub-computing unit can be expanded by configuring the respective sub-computing units in each microprocessor array. In this way, the memory storage space can be flexibly expanded, improving the scalability of hardware-accelerated computation.
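As a rough illustration of how a control module might choose between a single sub-computing unit, a single microprocessor array, or several arrays, the following sketch maps a layer's multiply-accumulate workload onto compute units; the layer fields, thresholds, and per-unit capacities are hypothetical and do not reflect the actual configuration instructions.

```python
from dataclasses import dataclass

@dataclass
class LayerParams:
    """Hypothetical description of one convolution layer."""
    in_channels: int
    out_channels: int
    kernel_size: int
    feature_map_size: int

def select_target_compute_unit(layer, macs_per_sub_unit=256 * 1024, sub_units_per_mpu=4):
    """A minimal sketch: estimate the multiply-accumulate count of the layer
    and map it onto a single sub-computing unit, a single MPU, or several MPUs.
    The capacity figures are assumptions, not disclosed hardware parameters."""
    macs = (layer.feature_map_size ** 2 * layer.in_channels *
            layer.out_channels * layer.kernel_size ** 2)
    sub_units_needed = max(1, -(-macs // macs_per_sub_unit))  # ceiling division
    if sub_units_needed == 1:
        return ("single_sub_unit", 1)
    if sub_units_needed <= sub_units_per_mpu:
        return ("single_mpu", sub_units_needed)
    mpus = -(-sub_units_needed // sub_units_per_mpu)
    return ("multiple_mpus", mpus)

print(select_target_compute_unit(LayerParams(64, 128, 3, 56)))
```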
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (18)

1. A configurable convolution processing apparatus, wherein the apparatus comprises:
the processing module comprises a plurality of microprocessor arrays arranged in a matrix form, wherein each microprocessor array comprises a plurality of sub-computing units arranged in the matrix form, at least any two sub-computing units of each microprocessor array can be coupled and connected through a first on-chip interconnection network, at least any two microprocessor arrays can be coupled and connected through a second on-chip interconnection network, and the first on-chip interconnection network and the second on-chip interconnection network are both connected with an external cache input/output interface;
and the control module is coupled with the processing module and is used for configuring any one of a single microprocessor array, a plurality of microprocessor arrays, a single sub-computing unit or a plurality of sub-computing units in the processing module based on a preset program instruction to execute convolution processing operation based on the cache data acquired by the external cache input/output interface.
2. The apparatus of claim 1, wherein the apparatus further comprises a plurality of parallel processing modules, wherein the plurality of parallel processing modules and the processing modules are arranged in a matrix, and at least any two processing modules can be coupled and connected through a third on-chip interconnection network.
3. The apparatus of claim 1, wherein each of the sub-computing units includes a memory storage space, wherein the expansion of the memory storage space of each sub-computing unit is accomplished by configuring the respective sub-computing units within each microprocessor array.
4. The apparatus of claim 3, wherein the expansion of the memory storage space of each microprocessor array is achieved by configuring the respective microprocessor arrays within the processing module.
5. The apparatus of any of claims 1-4, wherein each of the sub-computing units comprises:
a multiply-add array for performing a multiply-add operation; the data flow cache is used for caching input data of the multiply-add array or an output result of the multiply-add array;
and the custom microprocessor is used for acquiring data to be processed and executing one or more custom operations based on a preset configuration instruction, wherein the acquisition of the data to be processed comprises the acquisition of an output result of the multiply-add array.
6. The apparatus of claim 5, wherein the obtaining the data to be processed further comprises obtaining the processed data directly from the first on-chip interconnection network or the second on-chip interconnection network, and the obtaining the processed data directly from the first on-chip interconnection network or the second on-chip interconnection network comprises any one of:
receiving/transmitting data of other sub-computing units through the first on-chip interconnection network coupling circuit;
receiving/transmitting data of other microprocessor arrays through the second on-chip interconnection network coupling circuit;
the external buffer data is received/transmitted through the first or second on-chip interconnection network coupling circuit.
7. The apparatus of claim 5 or 6, wherein the custom operation comprises any of: stride operation, activation operation, pooling operation, normalization operation, quantization operation and convergence operation.
8. The apparatus of claim 5, wherein the custom microprocessor comprises:
the microinstruction cache is used for caching data to be processed and one or more microinstructions corresponding to the custom operations;
the microinstruction list module is used for integrating the microinstructions cached by the microinstruction cache into a microinstruction list based on a preset sequence;
the instruction counter is coupled with the microinstruction list module and is used for counting the execution times of the microinstruction corresponding to each custom operation;
the instruction decoder is coupled with the microinstruction list module and the instruction counter and is used for determining a microinstruction address index from the instruction counter, acquiring decoding information of each custom operation microinstruction from the microinstruction list module, and generating a control instruction and an enabling signal;
and the custom operation integrated circuit comprises a plurality of custom operation circuits and is used for controlling the corresponding custom operation circuit to execute the corresponding operation based on the enabling signal and the control instruction.
9. The apparatus of claim 1, wherein the first on-chip interconnect network and the second on-chip interconnect network share an external cache input output interface.
10. The apparatus of claim 1 or 9, wherein the first on-chip interconnect network comprises input and output interfaces and configurable switches or routing units, and the first on-chip interconnect network is capable of coupling with the input and output interfaces of each sub-computation unit in each microprocessor array through the configurable switches or routing units.
11. The apparatus of claim 10, wherein the second on-chip interconnect network comprises input/output interfaces and configurable switches or routing units, and the second on-chip interconnect network is couplable to the input/output interfaces of each microprocessor array via the configurable switches or routing units.
12. The apparatus of claim 10, wherein the first on-chip interconnect network further comprises an accumulation tree to enable a second accumulation based on the first accumulation results of the sub-compute units or the microprocessor array.
13. A convolution calculation method based on the convolution processing apparatus according to claim 1, wherein the method includes:
determining any one of a single microprocessor array, a plurality of microprocessor arrays, a single sub-computing unit or a plurality of sub-computing units from the processing module as a target computing unit based on the convolutional neural network model structure and parameters;
acquiring a preset program instruction for configuring the target computing unit through a control module and configuring the target computing unit;
and the target computing unit acquires cache data acquired based on an external cache input/output interface through the first on-chip interconnection network or the second on-chip interconnection network and realizes convolution computation based on the cache data.
14. The method of claim 13, wherein if the convolution calculation comprises a direct convolution operation, the performing the convolution calculation based on the buffered data comprises:
and the cache data is loaded into a multiply-add array through the data stream cache of the sub-computing unit or the cache data is directly loaded into the multiply-add array of the sub-computing unit to execute multiply-add operation.
15. The method of claim 13 or 14, wherein, if the convolution calculation further comprises a custom operation, the performing convolution calculation based on the buffered data further comprises:
obtaining external cache data as first data and/or obtaining activation characteristic data output after multiply-add operation as second data through a custom microprocessor of the sub-computing unit;
performing the custom operation based on the first data and/or the second data to implement a convolution calculation.
16. The method of claim 15, wherein the performing the custom operation based on the first data and/or the second data comprises:
caching the first data and/or the second data and a micro instruction corresponding to the custom operation in a micro instruction cache and configuring a micro instruction list;
cyclically generating, through an instruction counter, a microinstruction address index based on the microinstruction list and acquiring the microinstruction corresponding to each custom operation;
and after decoding by the instruction decoder, generating a selection signal to determine the corresponding custom operation circuit, and executing the corresponding custom operation by the custom operation circuit.
17. The method of claim 15, wherein the obtaining, by the custom microprocessor of the sub-computing unit, external cache data as the first data and/or obtaining activation characteristic data output after a multiply-add operation as the second data comprises:
dynamically loading the first data and/or the second data into FIFO queues of the custom microprocessors of the current sub-computing unit and other sub-computing units coupled with the current sub-computing unit based on the loading state of the FIFO queues of the custom microprocessors corresponding to each sub-computing unit;
wherein performing the custom operation based on the first data and/or the second data comprises:
and each sub-computing unit executes the custom operation based on the acquired first data and/or the acquired second data.
18. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the method of any one of claims 13 to 17.
CN202111528821.9A 2021-12-14 2021-12-14 Configurable convolution processing device and convolution calculation method Pending CN114330686A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111528821.9A CN114330686A (en) 2021-12-14 2021-12-14 Configurable convolution processing device and convolution calculation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111528821.9A CN114330686A (en) 2021-12-14 2021-12-14 Configurable convolution processing device and convolution calculation method

Publications (1)

Publication Number Publication Date
CN114330686A true CN114330686A (en) 2022-04-12

Family

ID=81050165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111528821.9A Pending CN114330686A (en) 2021-12-14 2021-12-14 Configurable convolution processing device and convolution calculation method

Country Status (1)

Country Link
CN (1) CN114330686A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114997392A (en) * 2022-08-03 2022-09-02 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
CN114997392B (en) * 2022-08-03 2022-10-21 成都图影视讯科技有限公司 Architecture and architectural methods for neural network computing
CN117992396A (en) * 2024-03-29 2024-05-07 深存科技(无锡)有限公司 Stream tensor processor
CN117992396B (en) * 2024-03-29 2024-05-28 深存科技(无锡)有限公司 Stream tensor processor

Similar Documents

Publication Publication Date Title
US20220391209A1 (en) Neural network processing with chained instructions
CN109102065B (en) Convolutional neural network accelerator based on PSoC
US10872290B2 (en) Neural network processor with direct memory access and hardware acceleration circuits
CN109993285B (en) Apparatus and method for performing artificial neural network forward operations
CN112579043A (en) Compute/near memory Compute (CIM) circuit architecture in memory
JP2020521195A (en) Scheduling neural network processing
CN114330686A (en) Configurable convolution processing device and convolution calculation method
CN111752879B (en) Acceleration system, method and storage medium based on convolutional neural network
CN112686379A (en) Integrated circuit device, electronic equipment, board card and calculation method
US20210326189A1 (en) Synchronization of processing elements that execute statically scheduled instructions in a machine learning accelerator
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN115081603A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
CN114567520B (en) Method for realizing collective communication, computer equipment and communication system
US11720417B2 (en) Distributed inferencing using deep learning accelerators with integrated random access memory
EP3985507A1 (en) Electronic device and method with scheduling
CN115081605A (en) Buffer memory, device and board card for temporarily storing neuron data in Winograd convolution
CN114492747A (en) Method and hardware structure capable of realizing convolution calculation in various neural networks
CN110647984A (en) Chip, integrated processing device and operation method thereof
CN115081604A (en) Buffer for temporarily storing Winograd weight, computing device, integrated circuit device and board card
CN110866598A (en) Block merging method, block dividing method, combined processing device and compiling system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240729

Address after: 200233 2nd floor, no.25-1, Hongcao Road, Xuhui District, Shanghai

Applicant after: Shanghai Ehua Chuangxing Technology Co.,Ltd.

Country or region after: China

Address before: 201306 building C, No. 888, Huanhu West 2nd Road, Lingang New Area, Pudong New Area, Shanghai

Applicant before: Shanghai EWA Intelligent Technology Co.,Ltd.

Country or region before: China

Applicant before: Shaoxing EWA Technology Co.,Ltd.